[FSTimeoutError] load_dataset #7175

Closed
cosmo3769 opened this issue Sep 26, 2024 · 5 comments

@cosmo3769

Describe the bug

When using load_dataset to load HuggingFaceM4/VQAv2, I get an FSTimeoutError.

Error

TimeoutError: 

The above exception was the direct cause of the following exception:

FSTimeoutError                            Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     99     if isinstance(return_result, asyncio.TimeoutError):
    100         # suppress asyncio.TimeoutError, raise FSTimeoutError
--> 101         raise FSTimeoutError from return_result
    102     elif isinstance(return_result, BaseException):
    103         raise return_result

FSTimeoutError:
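
The traceback shows an asyncio timeout being re-raised as FSTimeoutError with `raise ... from`, which is why both exceptions appear in the output. A minimal sketch of that pattern using only the standard library follows. The `FSTimeoutError` class, `slow_download`, and `run_with_timeout` here are local stand-ins made up for illustration; this is not fsspec's actual implementation.

```python
import asyncio


class FSTimeoutError(Exception):
    """Local stand-in for fsspec's exception of the same name (assumption)."""


async def slow_download(seconds):
    # Hypothetical coroutine standing in for a long-running download.
    await asyncio.sleep(seconds)
    return "done"


def run_with_timeout(coro_func, seconds, timeout):
    """Run coro_func(seconds); convert an asyncio timeout into FSTimeoutError."""

    async def runner():
        # wait_for cancels the coroutine and raises once `timeout` elapses.
        return await asyncio.wait_for(coro_func(seconds), timeout=timeout)

    try:
        return asyncio.run(runner())
    except asyncio.TimeoutError as exc:
        # Chain the original error so it is reported as the direct cause.
        raise FSTimeoutError from exc
```

Calling `run_with_timeout(slow_download, 1.0, 0.01)` raises the stand-in FSTimeoutError with the asyncio timeout attached as `__cause__`, which mirrors the "direct cause" line in the traceback above.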

It usually fails around 5-6 GB.

Screenshot 2024-09-26 at 9 10 19 PM

Steps to reproduce the bug

To reproduce it, run this in a Colab notebook:

!pip install -q -U datasets

from datasets import load_dataset
ds = load_dataset('HuggingFaceM4/VQAv2', split="train[:10%]")

Expected behavior

It should download properly.

Environment info

Using Colab Notebook.

@cosmo3769
Author

Is this FSTimeoutError caused by a network issue when downloading from the remote host that serves the data?

@crlotwhite

It seems to happen for all datasets, not just a specific one, and especially with versions from 3.0 onward (3.0.0 and 3.0.1 both have this problem).

I had the same error on a different dataset, but after downgrading to datasets==2.21.0, the problem was solved.
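
To check whether an environment falls in the range reported here, something like the following could be used. It is only a sketch: `is_reported_affected` and `installed_datasets_version` are hypothetical helper names, and treating every 3.x release as affected is an assumption extrapolated from the two versions mentioned above (3.0.0 and 3.0.1).

```python
from importlib import metadata


def is_reported_affected(version_string):
    """Return True for major version 3 or later.

    Assumption: the thread only confirms 3.0.0 and 3.0.1 as affected and
    2.21.0 as working; later releases are included here as a guess.
    """
    major = int(version_string.split(".")[0])
    return major >= 3


def installed_datasets_version():
    """Return the installed datasets version, or None if it is not installed."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None


version = installed_datasets_version()
if version is not None and is_reported_affected(version):
    print(f"datasets {version} is in the range reported in this thread")
```

If the check matches, the downgrade described above would be `!pip install -q datasets==2.21.0` in a Colab cell, followed by a runtime restart.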

@lhoestq
Member

lhoestq commented Sep 30, 2024

Same as #7164

This dataset is defined by a Python script that downloads data from somewhere other than HF, so availability depends on the original host. Ultimately, it would be nice to host this dataset's files on HF.

datasets <3.0 had lots of mechanisms that were removed after the decision to make datasets with Python loading scripts legacy, for security and maintenance reasons (we only provide very basic support now).

@cosmo3769
Author

@lhoestq Thank you for the clarification! Closing the issue.

@Epiphero

I'm getting this too, also at the 5-minute mark, but for CSTR-Edinburgh/vctk, so it's not just this dataset. It seems to be a timeout that was introduced and needs to be raised. The progress bar was moving along just fine before the timeout, and how far the download gets depends on how fast the network is.
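
The 5-minute cutoff is consistent with aiohttp's default total timeout of 300 seconds, so one thing to try is passing a longer timeout through `storage_options`. The sketch below rests on assumptions not confirmed in this thread: that the download goes through fsspec's HTTP filesystem, and that `load_dataset` forwards `client_kwargs` to the underlying aiohttp session. `make_storage_options` is a hypothetical helper name, and the one-hour value is arbitrary.

```python
def make_storage_options(total_seconds=3600):
    """Build storage options carrying a longer aiohttp total timeout.

    Assumption: these options reach fsspec's HTTP filesystem, which hands
    client_kwargs to aiohttp.ClientSession.
    """
    import aiohttp  # imported lazily so the helper can be defined without it

    timeout = aiohttp.ClientTimeout(total=total_seconds)
    return {"client_kwargs": {"timeout": timeout}}


# Usage with the call from the original report (not executed here):
# from datasets import load_dataset
# ds = load_dataset(
#     "HuggingFaceM4/VQAv2",
#     split="train[:10%]",
#     storage_options=make_storage_options(3600),
# )
```

A longer timeout only helps if the remote host keeps serving data; it does not address the availability of the original host mentioned earlier in the thread.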
