[Feature Request] Ability to incrementally write a memmap tensordict #968

Open
1 task done
alexanderswerdlow opened this issue Aug 21, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@alexanderswerdlow

Motivation

A common situation in dataset generation/processing is writing many tensors to disk from many processes or nodes in parallel over a long period. Shared storage is assumed, but that storage is often slow and subject to delays (NFS caching, etc.), and many small file operations make it inefficient. In addition, letting the user flush to disk manually helps expose file I/O bottlenecks, since it is clear which call is blocking the code.

My current workflow with tensordicts is to generate one per process, periodically save it to disk (by deleting the old one and creating a new one), and finally merge all the individual tensordicts with a cat.

Solution

Support incremental writing/saving of a memmap tensordict. Writes should stay in memory until a manual flush occurs. A flush should not overwrite the entire file, so that other processes can write to other portions of the tensordict in parallel.
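
To make the request concrete, here is a minimal sketch of the behaviour described above, written with Python's standard-library mmap module rather than tensordict. Everything in it (create_store, RegionWriter, read_all, the float32 layout) is a hypothetical illustration and not an existing or proposed tensordict API. Each writer maps the same preallocated file, only touches its own index range, and pushes its changes to disk when flush() is called explicitly.

```python
import mmap
import os
import struct
import tempfile

# Assumption for illustration: the store is a flat array of little-endian float32.
ITEM = struct.Struct("<f")


def create_store(path, n_items):
    """Preallocate a zero-filled file that all writers will share."""
    with open(path, "wb") as f:
        f.truncate(n_items * ITEM.size)


class RegionWriter:
    """Writes into items [start, stop) of the shared file and nowhere else."""

    def __init__(self, path, start, stop):
        self._file = open(path, "r+b")
        # Length 0 maps the whole file; the mapping is shared, so other
        # writers mapping the same file are not overwritten by this one.
        self._map = mmap.mmap(self._file.fileno(), 0)
        self.start, self.stop = start, stop

    def write(self, index, value):
        if not self.start <= index < self.stop:
            raise IndexError(f"index {index} is outside [{self.start}, {self.stop})")
        # The write lands in the mapping (memory) first.
        ITEM.pack_into(self._map, index * ITEM.size, value)

    def flush(self):
        """Manual flush: ask the OS to write the mapped changes to disk."""
        self._map.flush()

    def close(self):
        self.flush()
        self._map.close()
        self._file.close()


def read_all(path, n_items):
    """Read the whole store back as a list of Python floats."""
    with open(path, "rb") as f:
        data = f.read()
    return [ITEM.unpack_from(data, i * ITEM.size)[0] for i in range(n_items)]
```

In a real job each RegionWriter would live in a different process or node; the sketch only shows that disjoint regions of one file can be filled incrementally without rewriting the file. How quickly other nodes see flushed data on NFS depends on the filesystem and is not addressed here.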

Checklist

  • I have checked that there is no similar issue in the repo (required)
@alexanderswerdlow alexanderswerdlow added the enhancement New feature or request label Aug 21, 2024
@vmoens
Contributor

vmoens commented Aug 27, 2024

In principle I don't see why it wouldn't be possible, but the way we work with memmap is through torch.from_file, which does not return a traditional mmap object with flush functionality.

Happy to chat about it with @mikaylagawarecki and @albanD once I'm back from my time off!
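
For reference, the "traditional mmap object with flush functionality" mentioned above looks roughly like the following in plain Python. This is only a sketch of the standard-library behaviour, not of how tensordict or torch.from_file work; the helper names flush_items and write_and_flush and the float32 item size are assumptions made for the example. It shows that mmap.flush accepts an offset and size, so only the pages covering the modified items need to be written back.

```python
import mmap
import os
import tempfile
from array import array

ITEMSIZE = 4  # assumption: items are native 4-byte floats ("f")


def flush_items(mm, first, count):
    """Flush only the pages that cover items [first, first + count)."""
    # The flush offset must be aligned to the allocation granularity.
    page = mmap.ALLOCATIONGRANULARITY
    begin = (first * ITEMSIZE // page) * page
    end = min(len(mm), (first + count) * ITEMSIZE)
    mm.flush(begin, end - begin)


def write_and_flush(path, first, values):
    """Write values at item offset `first`, then flush just that range."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        # A typed view over the mapping; both views are released before
        # the mapping is closed, otherwise close() would raise BufferError.
        with memoryview(mm) as raw, raw.cast("f") as view:
            view[first:first + len(values)] = array("f", values)
        flush_items(mm, first, len(values))
```

The file must already exist with a size that is a multiple of ITEMSIZE, since the typed view cannot be created over a partial item.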
