I am currently using HuggingFace Datasets to load, process, and save some data. However, HF Datasets saves the data in the Arrow format and wastes a lot of time converting between Arrow and PyTorch tensors.
I am wondering if I can use memory-mapped TensorDicts for this purpose?
How can I do a `map` over batches of a TensorDict?
Looking through the tutorials, the nearest example I found was using a DataLoader with `collate_fn` as the map function:
This is on our radar!
I like the idea of having a (possibly multiprocessed) map to execute some transform over all the elements of a tensordict.
Stay tuned, I'll ping you once we have a PR for this.
But the `collate_fn` approach does not let me form a pipeline of `map` functions, and I also don't know how to save and load the resulting dataset.