
Paxml c4 resplit dataset permission issues #764

Closed
gramesh-amd opened this issue Aug 27, 2024 · 6 comments

Comments

@gramesh-amd

The Paxml training instructions link to a GCS bucket for the 3.0.4 resplit of C4 used for MLPerf, but I don't think it is publicly accessible.

Running gsutil -u 'gcp_project_name' -m cp 'gs://mlperf-llm-public2/c4/en/3.0.4' gives me a permission error.
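
A quick way to separate an access problem from a copy problem is to list the path before copying it. The following is only a sketch: it assumes gsutil is installed and authenticated, and it reuses the placeholder project name and bucket path from the command above. The variable names are illustrative.

```shell
#!/bin/sh
# Placeholder project name and bucket path, taken from the command above.
PROJECT="gcp_project_name"
SRC="gs://mlperf-llm-public2/c4/en/3.0.4"

if command -v gsutil >/dev/null 2>&1; then
  # "gsutil ls" exits non-zero when the caller cannot read the path.
  if gsutil -u "$PROJECT" ls "$SRC" >/dev/null 2>&1; then
    STATUS="readable"
  else
    STATUS="not readable"
  fi
else
  STATUS="gsutil not installed"
fi

echo "${SRC}: ${STATUS}"
```

If the listing fails as well, the problem is with bucket permissions rather than with the cp invocation.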

@gramesh-amd gramesh-amd changed the title c4 resplit dataset permission issues Paxml c4 resplit dataset permission issues Aug 27, 2024
@gramesh-amd
Author

cc: @ShriyaPalsamudram

@ShriyaPalsamudram
Contributor

ShriyaPalsamudram commented Aug 27, 2024

@sgpyc could you please fix the instructions so the paths point to the S3 bucket instead? This PR does the same for the megatron-lm reference.

@gramesh-amd in the meantime, can you use these instructions, which should also have the paxml versions of the data and checkpoints?

@gramesh-amd
Author

@ShriyaPalsamudram Thanks for the quick reply.

I did read through the instructions page.

They seem to point to the same GCS bucket for the paxml checkpoint (gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000) and the training dataset (gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json).

Both of these paths give me permission errors.

@ShriyaPalsamudram
Contributor

@gramesh-amd Can you specifically follow the S3 artifacts download section, which does not point to the GCS bucket?

Once you set up rclone, you can explore the mlc-training:mlcommons-training-wg-public/gpt3/ path, which should have both the paxml and megatron-lm dataset and checkpoint artifacts. Everything needed to run the references should be available in the S3 bucket.
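
Browsing that path could look roughly like the sketch below. It assumes an rclone remote named mlc-training has already been configured by following the S3 artifacts download section; the configuration itself is not reproduced here, and the variable names are illustrative.

```shell
#!/bin/sh
# Remote name and bucket path as given in the comment above.
REMOTE="mlc-training"
BUCKET_PATH="mlcommons-training-wg-public/gpt3"
SRC="${REMOTE}:${BUCKET_PATH}/"

# Only query the bucket when rclone is installed and the remote is configured.
if command -v rclone >/dev/null 2>&1 && rclone listremotes | grep -qx "${REMOTE}:"; then
  # List the top-level directories under gpt3/ to see which artifacts are present.
  rclone lsd "$SRC"
else
  echo "rclone remote '${REMOTE}' is not configured; would run: rclone lsd ${SRC}"
fi
```

Once the directory of interest is known, rclone copy with that source and a local destination downloads it.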

@gramesh-amd
Author

Thanks.
Let me go through it, and I will reopen this if there is any trouble.

@gramesh-amd
Author

@ShriyaPalsamudram @sgpyc the above steps let me download the gpt3 paxml checkpoint, but I can't access the 3.0.4 train/validation splits of C4 for MLPerf. The links mentioned on the pax page don't work.

Could you please let me know the updated links?
