From ccf509b56f21a2329a22a86f093afcb54caba48d Mon Sep 17 00:00:00 2001
From: Alex Wu
Date: Sun, 27 Oct 2024 00:59:54 +0800
Subject: [PATCH 1/3] add information about deleting raw data in data_management.rst

Signed-off-by: Alex Wu
---
 docs/user_guide/concepts/main_concepts/data_management.rst | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/user_guide/concepts/main_concepts/data_management.rst b/docs/user_guide/concepts/main_concepts/data_management.rst
index bc492a56f8..444dc10da2 100644
--- a/docs/user_guide/concepts/main_concepts/data_management.rst
+++ b/docs/user_guide/concepts/main_concepts/data_management.rst
@@ -170,6 +170,13 @@ But for Metadata, the data should be accessible to Flyte control plane.
 
 Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
 
+Deleting Raw Data in Your Own Datastores
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flyte does not offer a direct function to delete raw data stored in external datastores like ``S3`` or ``GCS``. However, you can manage deletion by configuring a lifecycle policy within your datastore service.
+
+If caching is enabled in your Flyte ``task``, ensure that the ``max-cache-age`` is set to be shorter than the lifecycle policy in your datastore to prevent potential data inconsistency issues.
+
 Practical Example
 ~~~~~~~~~~~~~~~~~
 
From f086562a5e59a612da815a1b26b7d84a8e4d0781 Mon Sep 17 00:00:00 2001
From: Alex Wu
Date: Sun, 27 Oct 2024 01:12:55 +0800
Subject: [PATCH 2/3] fix example code error

Signed-off-by: Alex Wu
---
 .../concepts/main_concepts/data_management.rst | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/docs/user_guide/concepts/main_concepts/data_management.rst b/docs/user_guide/concepts/main_concepts/data_management.rst
index 444dc10da2..7f3a423780 100644
--- a/docs/user_guide/concepts/main_concepts/data_management.rst
+++ b/docs/user_guide/concepts/main_concepts/data_management.rst
@@ -186,20 +186,19 @@ The first task reads a file from the object store, shuffles the data, saves to l
 
 .. code-block:: python
 
-    @task()
-    def task_remove_column(input_file: FlyteFile, column_name: str) -> FlyteFile:
+    @task(container_image=basic_image, cache=True, cache_version="1.0")
+    def task_read_and_shuffle_file(input_file: FlyteFile) -> FlyteFile:
         """
-        Reads the input file as a DataFrame, removes a specified column, and outputs it as a new file.
+        Reads the input file as a DataFrame, shuffles the rows, and writes the shuffled DataFrame to a new file.
         """
         input_file.download()
         df = pd.read_csv(input_file.path)
 
-        # remove column
-        if column_name in df.columns:
-            df = df.drop(columns=[column_name])
+        # Shuffle the DataFrame rows
+        shuffled_df = df.sample(frac=1).reset_index(drop=True)
 
-        output_file_path = "data_finished.csv"
-        df.to_csv(output_file_path, index=False)
+        output_file_path = "data_shuffle.csv"
+        shuffled_df.to_csv(output_file_path, index=False)
 
         return FlyteFile(output_file_path)
     ...
From 3d881879779fc88167d12b62fd126c9e67622b6e Mon Sep 17 00:00:00 2001
From: Alex Wu
Date: Sun, 27 Oct 2024 09:54:28 +0800
Subject: [PATCH 3/3] delete example code task decorator arguments

Signed-off-by: Alex Wu
---
 docs/user_guide/concepts/main_concepts/data_management.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guide/concepts/main_concepts/data_management.rst b/docs/user_guide/concepts/main_concepts/data_management.rst
index 7f3a423780..81f86bb1c0 100644
--- a/docs/user_guide/concepts/main_concepts/data_management.rst
+++ b/docs/user_guide/concepts/main_concepts/data_management.rst
@@ -186,7 +186,7 @@ The first task reads a file from the object store, shuffles the data, saves to l
 
 .. code-block:: python
 
-    @task(container_image=basic_image, cache=True, cache_version="1.0")
+    @task()
     def task_read_and_shuffle_file(input_file: FlyteFile) -> FlyteFile:
         """
         Reads the input file as a DataFrame, shuffles the rows, and writes the shuffled DataFrame to a new file.