How to continue training when a job fails #105

Open

viyjy opened this issue Dec 13, 2023 · 1 comment

viyjy commented Dec 13, 2023

Hi, for example, I am training a job using this yaml. How can I continue training if the job fails? Thanks.

@coryMosaicML (Collaborator)

You can add a load_path as a trainer argument in that yaml to resume a job from an earlier checkpoint.

Something like this:

trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 850000ba
  eval_interval: 10000ba
  device_train_microbatch_size: 16
  run_name: ${name}
  seed: ${seed}
  load_path: # Path to checkpoint to resume training from
  save_folder:  # Insert path to save folder or bucket
  save_interval: 10000ba
  save_overwrite: true
  autoresume: false  # set true to resume automatically from the latest checkpoint in save_folder
  fsdp_config:
    sharding_strategy: "SHARD_GRAD_OP"

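For reference, here is a minimal, self-contained sketch (not from this repo) of what that trainer section maps to in Python: train once while saving checkpoints, then build a second Trainer with load_path to continue from one of them. The toy model, data, run name, paths, and short durations below are placeholders, and the "latest-rank0.pt" name relies on Composer's default latest-checkpoint symlink.

import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

# Toy model and data so the sketch runs anywhere; the real job builds these
# from the rest of the yaml.
model = ComposerClassifier(torch.nn.Linear(8, 2), num_classes=2)
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(dataset, batch_size=16)

# First run: saves checkpoints every 5 batches (10000ba in the real yaml).
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="10ba",          # 850000ba in the real config
    device="cpu",                 # "gpu" in the real config
    run_name="resume-demo",       # hypothetical run name
    seed=17,
    save_folder="./checkpoints",  # hypothetical local folder; a bucket path works too
    save_interval="5ba",
    save_overwrite=True,
)
trainer.fit()

# Resumed run: identical config plus load_path pointing at a saved checkpoint
# (Composer also writes a "latest" symlink inside save_folder by default).
resumed = Trainer(
    model=ComposerClassifier(torch.nn.Linear(8, 2), num_classes=2),
    train_dataloader=train_dataloader,
    max_duration="20ba",
    device="cpu",
    run_name="resume-demo",
    seed=17,
    save_folder="./checkpoints",
    save_interval="5ba",
    save_overwrite=True,
    load_path="./checkpoints/latest-rank0.pt",  # checkpoint left by the earlier run
)
resumed.fit()

In the yaml workflow, the same effect comes from filling in load_path with the checkpoint saved by the failed run and resubmitting the job.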