Dataset retriever fails when trying to download gated dataset #371

Open
saum7800 opened this issue Oct 12, 2023 · 5 comments

@saum7800
Collaborator

Here is the output from my prompt_spec:

Instruction:
Task description: Identify a broad class given several examples from that class

Examples:
input=
Q: The similarity among DemandBase, InfusionSoft, and HotSchedules is that they are all 
output=tech companies

input=
Q: The Architecture of Open Source Applications, Algorithms to Live By: The Computer Science of Human Decisions, and The Art of the Start: The Time-Tested, Battle-Hardened Guide for Anyone Starting Anything can be classified as 
output=Computer Science books

input=
Q: Wrike, SEMrush, and Sprinklr are all 
output=tech companies

Got the following error when trying to retrieve the dataset

FileNotFoundError: Couldn't find a dataset script at /projects/tir5/users/ssgandhi/prompt2model/bigbench/bigbench/BIG-bench/bigbench/imagenet-1k/imagenet-1k.py or any data file in the same directory. Couldn't find 'imagenet-1k' on the Hugging Face Hub either: FileNotFoundError: Dataset 'imagenet-1k' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.
@zhaochenyang20
Collaborator

@viswavi

@neubig
Collaborator

neubig commented Oct 23, 2023

Hey @saum7800, I took a look at this and if you read the error, it says that the dataset may be private or gated.
I looked at the specific dataset, and it seems that this is indeed the case: https://huggingface.co/datasets/imagenet-1k

There are two solutions to this:

  1. Follow the instructions in the error message: log in with the Hugging Face CLI (`huggingface-cli login`) and get permission to use the data.
  2. When you get this gated dataset error, gracefully proceed to using the next dataset.

"1." is a solution for this dataset, but you might always run into a new dataset that has problems, so I think "2." will need to be implemented. There are two ways that we could do this:

  1. Simply write a for loop that steps over the retrieved datasets (in the Colab notebook and CLI?) and selects the next one any time the current one fails.
  2. If there is a way to figure out whether a dataset is gated through the Hugging Face API, we could indicate this in our metadata file.

Maybe we could just go with the first option for now.
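
A minimal sketch of the first option, assuming the retriever hands back an ordered list of candidate dataset names. The function name `load_first_available`, the `loader` parameter, and the default `skip_on` tuple are illustrative and not part of prompt2model; `FileNotFoundError` is used only because it is the exception shown in the traceback above, and other versions of the `datasets` library may raise something different.

```python
from typing import Any, Callable, Iterable, List, Tuple, Type


def load_first_available(
    dataset_names: Iterable[str],
    loader: Callable[[str], Any],
    skip_on: Tuple[Type[BaseException], ...] = (FileNotFoundError,),
) -> Tuple[str, Any]:
    """Return (name, dataset) for the first candidate that loads.

    `loader` is any callable that takes a dataset name and either returns
    the dataset or raises. Exceptions listed in `skip_on` are treated as
    "this dataset is unavailable, try the next one".
    """
    failed: List[str] = []
    for name in dataset_names:
        try:
            return name, loader(name)
        except skip_on:
            # Hypothetical policy: remember the failure and keep going.
            failed.append(name)
    raise RuntimeError(f"No dataset could be loaded; tried: {failed}")
```

With the Hugging Face `datasets` library, `loader` could be `datasets.load_dataset`, so a call would look like `load_first_available(candidates, datasets.load_dataset)`. Passing the loader in keeps the loop easy to test without network access.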

@neubig changed the title from "Dataset retriever fails when trying to get config names" to "Dataset retriever fails when trying to download gated dataset" on Oct 23, 2023
@saum7800
Collaborator Author

Right, that makes sense!

In prompt2model_demo.py and the .ipynb notebook, the user manually selects the dataset number/name. Maybe it makes sense to inform the user that the dataset is gated and that they should select another one from the retrieved datasets (assuming Hugging Face lets us know programmatically that a dataset is gated). Does that sound right?
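
A rough sketch of what such a check might look like, assuming the metadata object returned by `huggingface_hub.HfApi().dataset_info(...)` exposes a `gated` attribute (falsy for open datasets, a value such as `"auto"` or `"manual"` for gated ones). That attribute, and whether it can be read without logging in, should be verified against the installed `huggingface_hub` version. The helper names `is_gated` and `fetch_dataset_info` are hypothetical.

```python
from typing import Any


def is_gated(info: Any) -> bool:
    """Report whether a dataset metadata object is marked as gated.

    Assumption: `info.gated` is falsy (False or None) for open datasets
    and truthy for gated ones. A missing attribute is treated as "not gated".
    """
    return bool(getattr(info, "gated", False))


def fetch_dataset_info(repo_id: str) -> Any:
    """Fetch dataset metadata from the Hugging Face Hub (needs network)."""
    # Imported lazily so that is_gated() can be used and tested offline.
    from huggingface_hub import HfApi

    return HfApi().dataset_info(repo_id)
```

In the demo script, something like `is_gated(fetch_dataset_info("imagenet-1k"))` could then decide whether to warn the user and ask them to pick another retrieved dataset.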

@neubig
Collaborator

neubig commented Oct 24, 2023

Yep. And in the worst case you could always catch the exception and programmatically parse the error message to see if it indicates that the dataset is gated.
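
A small sketch of that fallback, assuming the wording in the traceback above ("private or gated", "huggingface-cli login") stays stable across `datasets` versions, which is not guaranteed. `GATED_HINTS` and `looks_gated` are illustrative names.

```python
# Substrings taken from the error message quoted in this issue.
GATED_HINTS = ("private or gated", "huggingface-cli login")


def looks_gated(error: BaseException) -> bool:
    """Heuristically decide whether an exception points at a gated dataset."""
    message = str(error).lower()
    return any(hint in message for hint in GATED_HINTS)
```

Note that the quoted message is also produced when a dataset simply does not exist on the Hub, so this heuristic cannot tell "gated" apart from "missing"; it is only good enough to decide that the user should pick a different dataset.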

@ritugala
Collaborator

ritugala commented Dec 1, 2023

This will be resolved once the reranking PR is merged!
