Is there a way to map a TensorDict? #504

Closed
NightMachinery opened this issue Jul 31, 2023 · 1 comment · Fixed by #518

Comments

@NightMachinery

I am currently using HuggingFace Datasets to load, process, and save some data. However, HF Datasets saves the data in the Arrow format and wastes a lot of time converting between Arrow and PyTorch tensors.

Could I use memory-mapped TensorDicts for this purpose instead?

How can I do a map on batches of TensorDict?

Looking through the tutorials, the closest example I found uses a DataLoader with collate_fn as the map function:

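from torch.utils.data import DataLoader

# collate_fn is called on each batch, so here it doubles as the map function
DataLoader(
    tensor_dict, batch_size=batch_size, collate_fn=map_fn,
)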

But this does not allow me to form a pipeline of map functions. I also don't know how to save and load the resulting dataset.
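One possible workaround for the single-`collate_fn` limitation, offered as a sketch rather than a tensordict feature: several map functions can be composed into one callable, which then stands in for `map_fn`. The names `compose_maps`, `scale`, and `shift` are hypothetical, and plain dicts of lists are used as a stand-in for TensorDict batches so the example runs with the standard library alone.

```python
from functools import reduce
from typing import Callable, Dict, List

# Stand-in for a TensorDict batch (assumption: any dict-like batch works the same way).
Batch = Dict[str, List[float]]
MapFn = Callable[[Batch], Batch]


def compose_maps(*fns: MapFn) -> MapFn:
    """Chain map functions left to right into a single callable."""

    def pipeline(batch: Batch) -> Batch:
        # Feed the output of each function into the next one.
        return reduce(lambda acc, fn: fn(acc), fns, batch)

    return pipeline


def scale(batch: Batch) -> Batch:
    # Hypothetical step 1: double every value.
    return {key: [2 * v for v in values] for key, values in batch.items()}


def shift(batch: Batch) -> Batch:
    # Hypothetical step 2: add one to every value.
    return {key: [v + 1 for v in values] for key, values in batch.items()}


map_fn = compose_maps(scale, shift)
result = map_fn({"obs": [1.0, 2.0, 3.0]})
```

The composed `map_fn` could then be passed as `collate_fn=map_fn` in the `DataLoader` call above. This only addresses chaining; it does not cover saving or loading the resulting dataset.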

@NightMachinery changed the title from "Is there a way to map TensorDict?" to "Is there a way to map a TensorDict?" on Jul 31, 2023
@vmoens
Contributor

vmoens commented Aug 1, 2023

This is on our radar!
I like the idea of having a (possibly multiprocessed) map to execute some transform over all the elements of a tensordict.
Stay tuned, I'll ping you once we have a PR for this.
