Document clarifying notes about the data lifecycle #5921

Closed
20 changes: 13 additions & 7 deletions docs/user_guide/concepts/main_concepts/data_management.rst
@@ -170,6 +170,13 @@ But for Metadata, the data should be accessible to Flyte control plane.

Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.

Deleting Raw Data in Your Own Datastores
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Flyte does not offer a direct function to delete raw data stored in external datastores like ``S3`` or ``GCS``. However, you can manage deletion by configuring a lifecycle policy within your datastore service.

If caching is enabled for a Flyte ``task``, set ``max-cache-age`` to a duration shorter than the expiration period of the lifecycle policy in your datastore. Otherwise, a cache hit can return a reference to raw data that the datastore has already deleted.
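
As a rough illustration of the relationship between the two settings, the sketch below builds an expiration rule and checks it against a cache age. It is not part of Flyte: the helper names ``build_expiration_rule`` and ``cache_outlives_data``, the rule ID, and the ``raw-data/`` prefix are hypothetical. The dictionary is assumed to follow the shape that boto3's ``put_bucket_lifecycle_configuration`` accepts as ``LifecycleConfiguration``; other datastores use different formats.

```python
from datetime import timedelta


def build_expiration_rule(prefix: str, expire_after_days: int) -> dict:
    """Build an S3-style lifecycle configuration that expires objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": "expire-raw-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def cache_outlives_data(max_cache_age: timedelta, lifecycle_config: dict) -> bool:
    """Return True if a cache entry could still be valid after its data has expired."""
    # Use the shortest retention among the enabled expiration rules.
    shortest_retention = min(
        (
            timedelta(days=rule["Expiration"]["Days"])
            for rule in lifecycle_config["Rules"]
            if rule["Status"] == "Enabled" and "Expiration" in rule
        ),
        default=None,
    )
    if shortest_retention is None:
        # No expiration rule applies, so the data is never deleted automatically.
        return False
    return max_cache_age >= shortest_retention


config = build_expiration_rule("raw-data/", expire_after_days=30)
print(cache_outlives_data(timedelta(days=7), config))
print(cache_outlives_data(timedelta(days=45), config))
```

With a 30-day expiration, a 7-day cache age is safe, while a 45-day cache age could leave cache entries pointing at deleted objects. A check like this could run wherever both values are configured, before the lifecycle policy is applied.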

Practical Example
~~~~~~~~~~~~~~~~~

@@ -180,19 +187,18 @@ The first task reads a file from the object store, shuffles the data, saves to l
 .. code-block:: python

     @task()
-    def task_remove_column(input_file: FlyteFile, column_name: str) -> FlyteFile:
+    def task_read_and_shuffle_file(input_file: FlyteFile) -> FlyteFile:
         """
-        Reads the input file as a DataFrame, removes a specified column, and outputs it as a new file.
+        Reads the input file as a DataFrame, shuffles the rows, and writes the shuffled DataFrame to a new file.
         """
         input_file.download()
         df = pd.read_csv(input_file.path)

-        # remove column
-        if column_name in df.columns:
-            df = df.drop(columns=[column_name])
+        # Shuffle the DataFrame rows
+        shuffled_df = df.sample(frac=1).reset_index(drop=True)

-        output_file_path = "data_finished.csv"
-        df.to_csv(output_file_path, index=False)
+        output_file_path = "data_shuffle.csv"
+        shuffled_df.to_csv(output_file_path, index=False)

         return FlyteFile(output_file_path)
     ...