RLHF with PPO #1005

Merged: 44 commits, Aug 5, 2024

Conversation

@SalmanMohammadi (Collaborator) commented May 19, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

#812

Background reading:

The N Implementation Details of RLHF with PPO, Huang et al.
The N+ Implementation Details of RLHF with PPO: A case study on TL;DR Summarization
The original RLHF paper - Fine-Tuning Language Models from Human Preferences, Ziegler et al.
Anthropic's RLHF paper - Training a Helpful and Harmless Assistant with Learning from Human Feedback
Training language models to follow instructions with human feedback, Ouyang et al.
Shameless plug, but I would have genuinely found this post helpful when I started out with PPO, even for skimming through some of the references - The theory of Proximal Policy Optimization implementations

Changelog:

  • Implemented LoRA PPO recipe
  • Refactored TransformerDecoder
    • Changes
      • The following changes were added in a submodule under torchtune.models.mistral:
        • Changed TransformerDecoder to TransformerDecoderWithHiddenLayer
        • Added TransformerLM which wraps an output projection around TransformerDecoderWithHiddenLayer
          • Component and model builders now return TransformerLM
        • Added TransformerLMWithValueHead, with two linear projections: one for the LM head and one for the value head.
      • Added support for checkpointing models that wrap TransformerDecoderWithHiddenLayer
      • Updated checkpointing to correctly convert HF weights to refactored models
      • Added support for checkpointing value heads.
    • Tests:
      • TODO
  • Added mistral value head models
    • Changes:
      • Added model and component builders
    • Tests
      • The implementation is identical to MistralClassifier.
  • Added PPOLoss (a generic sketch of the clipped objective is included after this changelog)
    • Tests:
      • test_ppo_loss tests for correct behaviour based on expected relative value and policy loss for different inputs.
  • Added utils.ppo_utils for various PPO utilities, with tests for all files, including:
    • _generation.py
      • Added custom_generate_next_token functions for generating with value head models, and for generating with masks and input positions.
      • Added get_causal_masks for creating masks of shape [bsz, seq_len, seq_len] which correctly mask leading padding tokens, suitable for use with scaled_dot_product_attention (see the mask sketch after this changelog).
      • Added a custom generate function which generates sequences using above functionality.
    • collate.py
      • Added support for collating input sequences by left-padding to a specified maximum sequence length.
    • rewards.py
      • Support for calculation of rewards, advantage estimation, adaptive and fixed KL controllers, and reward normalisation.
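
For illustration, here is a minimal sketch of how a padding-aware causal mask of that shape can be built for scaled_dot_product_attention. The function name and signature are assumptions made for this sketch, not the PR's actual get_causal_masks API:

import torch

def left_pad_causal_mask(padding_mask: torch.Tensor) -> torch.Tensor:
    # padding_mask: [bsz, seq_len] bool tensor, True for real tokens, False for (leading) padding.
    # Returns a [bsz, seq_len, seq_len] bool mask where True marks positions that may be attended
    # to, suitable as the attn_mask argument of torch.nn.functional.scaled_dot_product_attention.
    _, seq_len = padding_mask.shape
    # Lower-triangular causal structure, shared across the batch.
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=padding_mask.device)
    )
    # Block attention to the padding positions of each sequence.
    return causal.unsqueeze(0) & padding_mask.unsqueeze(1)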

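And here is a generic sketch of the clipped policy objective that a PPO loss typically computes, referenced from the PPOLoss item above. The names, shapes, and reduction are assumptions, and the PR's actual PPOLoss likely includes further terms (e.g. a clipped value loss) not shown here:

import torch

def clipped_policy_loss(
    pi_logprobs: torch.Tensor,   # log-probs of the taken tokens under the current policy
    old_logprobs: torch.Tensor,  # log-probs under the policy that generated the rollout
    advantages: torch.Tensor,    # advantage estimates, same shape as the log-probs
    epsilon: float = 0.2,        # clipping range
) -> torch.Tensor:
    # Importance ratio between the current and rollout policies.
    ratios = torch.exp(pi_logprobs - old_logprobs)
    unclipped = advantages * ratios
    clipped = advantages * torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon)
    # Maximise the clipped surrogate objective, i.e. minimise its negation.
    return -torch.minimum(unclipped, clipped).mean()
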
TODO:

  • Complete or remove all TODOs (@SalmanMohammadi)
  • Remove temporary changes for MPS support.
  • Upload randomly initialised models and write recipe tests for:
    • Verifying recipe checkpointing works correctly w.r.t. saving and loading base model and value head weights.
    • Verifying expected loss values.
  • Run full model training and verify loss curves
  • Add support for reward models and base models using different tokenizers.
  • Add docs to the API reference.

Adding this to open up discussion and get some feedback (@kartikayk) while I train models and verify correctness. Maybe a good place to start would be the TransformerDecoder refactor?

closes #812

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)


pytorch-bot bot commented May 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1005

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 4e6be43 with merge base 5019074:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label May 19, 2024
@SalmanMohammadi marked this pull request as ready for review June 8, 2024 15:00
@joecummings (Contributor) left a comment:

This is shaping up nicely - I very much like that we de-scoped this to focus on the full finetune first, with LoRA and QLoRA as follow-ups. A couple of high-level things to note:

  1. I'm concerned about the configs becoming too bloated and would like to discuss how to minimize storing lots of logic there.

  2. What is the largest model you can fit on an 80GB A100? I see you include configs for both 7B and 1B?

@SalmanMohammadi (Collaborator, Author) commented Jul 17, 2024

Thanks for another review.

I'm concerned about the configs becoming too bloated and would like to discuss how to minimize storing lots of logic there.

I feel this. One thing that stuck out to me when writing this: we currently need four checkpointers, two of which are solely used to point to the original weights for the policy and reward models, respectively. They're necessary because you need a reference to the original weights when resuming training, and the choice at the time came down to managing this state in the config vs. in the checkpoints.

The model definitions are also taking up a lot of space, but that's largely because I didn't see another obvious way to configure a 1B Llama2. The model definition in the Mistral config is annoying because that specific reward model uses a different vocab size. Please let me know if I can make this cleaner!

There are also ~30 lines of hyperparameters in the config. Hopefully this won't be overwhelming to the user once we include a cookbook. I could remove 5 or so of these from the config and set them as defaults in the recipe.

EDIT: I could also liberally use cfg.get to set the recipe up with default hyperparameter values, which should work for most use cases, and expose them in the recipe docs instead. A rough sketch of what this might look like is below.
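A hypothetical sketch of that cfg.get pattern; the key names and default values here are illustrative, not the recipe's actual config schema:

from omegaconf import DictConfig

def ppo_hyperparams(cfg: DictConfig) -> dict:
    # Pull optional hyperparameters from the config, falling back to recipe-side
    # defaults so the YAML stays small. Keys and defaults below are hypothetical.
    return {
        "gamma": cfg.get("gamma", 1.0),              # discount factor
        "lmbda": cfg.get("lmbda", 0.95),             # GAE lambda
        "epsilon": cfg.get("epsilon", 0.2),          # PPO clip range
        "whiten_rewards": cfg.get("whiten_rewards", False),
    }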

What is the largest model you can fit on an 80GB A100? I see you include configs for both 7B and 1B?

The training run in my above comment trained Mistral 7B on an 80GB A100.

@codecov-commenter commented Jul 18, 2024

Codecov Report

Attention: Patch coverage is 52.35546% with 445 lines in your changes missing coverage. Please review.

Project coverage is 67.96%. Comparing base (43c7332) to head (ba365a8).
Report is 4 commits behind head on main.

Files Patch % Lines
recipes/ppo_full_finetune_single_device.py 0.00% 330 Missing ⚠️
...ts/recipes/test_ppo_full_tunetune_single_device.py 16.19% 88 Missing ⚠️
torchtune/modules/rlhf/collate.py 45.00% 11 Missing ⚠️
torchtune/modules/rlhf/rewards.py 85.45% 8 Missing ⚠️
tests/recipes/utils.py 33.33% 4 Missing ⚠️
torchtune/modules/rlhf/sequence_processing.py 84.61% 2 Missing ⚠️
recipes/lora_dpo_distributed.py 0.00% 1 Missing ⚠️
recipes/lora_dpo_single_device.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1005      +/-   ##
==========================================
- Coverage   69.32%   67.96%   -1.37%     
==========================================
  Files         233      246      +13     
  Lines       10593    11434     +841     
==========================================
+ Hits         7344     7771     +427     
- Misses       3249     3663     +414     


@SalmanMohammadi (Collaborator, Author) commented:

bump bump @joecummings @ebsmothers.

My dearest reviewers,

Sorry to ping you when you're busy. In my defense, @kartikayk did tell me to. What can I do to help move this along? I'm more than happy to help reduce the review overhead if I can.

@bhack mentioned this pull request Jul 30, 2024
@SalmanMohammadi (Collaborator, Author) commented Aug 2, 2024

Outstanding discussions/tasks:

@joecummings @ebsmothers

  • @ebsmothers can you please upload my reward model to S3 so my recipe tests pass? See the PPO channel on Discord. Let me know if you need a refresh on this.
  • @joecummings pointed out that the config is a bit unwieldy. I think I could address this by removing a bunch of parameters from the config, setting sensible defaults, and documenting them properly in a recipe doc (which might look like [RFC][DOCS] Recipe [DOCS] ([DOC]umentation) #1230). Can we leave this as a follow-up, though? 🥺
  • Are you happy with my replication results here? RLHF with PPO #1005 (comment)
  • Can we also leave generalising the generation utils to a follow up? Joe has started discussion in Fix generation for bsz > 1 #1250.
  • My use of rng generator checkpointing (RLHF with PPO #1005 (comment)) is unprecedented in the codebase. Are you happy with this? I'm just adding another key to the checkpoint.
  • Do you care about generalizing the RLHF collation utils into torchtune.utils.collate? (Joe's comments here and here.) They currently aren't being used outside of the PPO recipe itself, and the DPO collation isn't being used outside the DPO recipe.

@@ -29,6 +29,7 @@
"llama2_tune": "/tmp/test-artifacts/small-ckpt-tune-03082024.pt",
"llama2_meta": "/tmp/test-artifacts/small-ckpt-meta-03082024.pt",
"llama2_hf": "/tmp/test-artifacts/small-ckpt-hf-03082024.pt",
"llama2_reward_hf": "/tmp/test-artifacts/small-ckpt-hf-reward-12072024.pt", # TODO (SalmanMohammadi)
Contributor commented:

Sorry for being an American chauvinist, but I changed the filename to small-ckpt-hf-reward-07122024.pt (really just to make it consistent with the format of the other ones). Also, I think you will need to update cache_artifacts.sh correspondingly.

Collaborator (Author) commented:

Oh how the Empire has fallen from grace.

@ebsmothers (Contributor) commented:

General comment on the checklist you left earlier: all the points look good to me, let's just file tasks for some of the more important todos that don't have them already.

Also, leaving some miscellaneous remarks here in response to several of your previous comments:

I ran the experiment on an A100 - the default config in the repo includes the memory optimization parameters needed to make this work. I used optimizer_in_bwd and PagedAdamW. Training was pretty slow at the start, but I was seeing >10x speedups once torch.compile kicked in.

Looking at the figures, it seems this is necessary even for an A100, since you are still pretty close to 80GB of allocated memory? I'm also curious whether the overall training speed is decent, as these configs can slow things down quite a bit.

Can we also leave generalising the generation utils to a follow up? Joe has started discussion in #1250.

Just want to confirm: will we actually be able to run this recipe without batched generation support?

@ebsmothers (Contributor) left a comment:

OK, a bunch more comments, but after that there are no major concerns. Home stretch here -- thanks again for your immense patience on this one.

Resolved review threads: torchtune/utils/pooling.py, torchtune/modules/rlhf/collate.py, torchtune/utils/collate.py, torchtune/modules/loss/ppo.py (two threads), recipes/ppo_full_finetune_single_device.py
Comment on lines +795 to +797
(seq_lens > 0) & (seq_lens < self._max_generated_tokens - 1),
seq_lens + 1,
seq_lens,
Contributor commented:

Sorry not sure I fully follow what the purpose of this is

Collaborator (Author) commented:

...
...
...
Dare I say..... excalidraw?

@SalmanMohammadi (Collaborator, Author) commented Aug 4, 2024:

In all seriousness, I can send you an equally confusing diagram from my notes on Discord. This took me a while to wrap my head around, and longer to explain coherently (disclaimer: this could all just be wrong, since my only reference is a single line from a Learning to Summarize implementation), so thanks for the nerd snipe.

The TL;DR: the value function estimates the return for the whole sequence at each step, which is the reward model score for the (query, truncated response) plus the per-token KL penalty. We want to use this for advantage estimation, and the advantage for the last action taken (the last valid non-padding token generated by the model) is:

[image: equation for the advantage at the final timestep]

So, we need the value estimate (return) for the sequence up to now, plus one step ahead. For the last token, this means we need to extend the padding mask out by one for the values: instead of masking everything after the last non-padding token, we mask everything one position after the last non-padding token.

These three lines do this, but add some logic so that if we're already at the end of the sequence, we don't extend the mask. A rough reconstruction of the full expression is sketched below.
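For context, under standard GAE the per-step term is δ_t = r_t + γ V_{t+1} - V_t, so the advantage of the last valid token needs a value estimate one position past the final non-padding token. Below is a hedged reconstruction of what the quoted lines appear to do; the enclosing torch.where call and the names are assumptions based on the diff context, not the recipe's exact code:

import torch

def extend_seq_lens_for_values(seq_lens: torch.Tensor, max_generated_tokens: int) -> torch.Tensor:
    # Treat one extra position past the last non-padding token as valid for the
    # value estimates, unless the sequence is empty or already at the generation limit.
    return torch.where(
        (seq_lens > 0) & (seq_lens < max_generated_tokens - 1),
        seq_lens + 1,
        seq_lens,
    )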

Contributor commented:

Thanks, I think this makes sense (though I reserve the right to be confused again later on)

Resolved review threads: recipes/ppo_full_finetune_single_device.py (three threads)
@SalmanMohammadi (Collaborator, Author) commented Aug 5, 2024

O' what glorious reviews. Thank you. I've addressed your comments.

Looking at the figures seems this is necessary even for A100? Since you are still pretty close to 80GB allocated memory. I'm also curious whether the overall training speed is decent as these configs can slow things down quite a bit.

For 7B, since we're fitting four 7B models in memory, it took a little wrangling to fit it all. The run I posted took around 3 hours. I haven't found any benchmarks on comparable hardware to estimate what speed/memory usage should look like. DeepSpeed's RLHF docs state:

Theoretically, the largest model you can train for this step is similar to the step-1 SFT finetuning if you enable

  • zero stage 3 (if you use multiple GPUs)
  • gradient checkpoint
  • LoRA
  • reference model offloading.

However, in practice, this is not always the case, and we are still investigating the reasons behind it. For now, we suggest that users use "Total-GPU-Memory-in-GB / 6" as the upper parameter bound in billions for the sum of the actor model and critical model, for safety. Nevertheless, users are welcome to try the real limit.

I'm not 100% clear on whether that upper-bound calculation assumes the specific config they list, but going by it (80 / 6 ≈ 13), their guidance caps the actor and critic models at roughly 13B parameters combined on an 80GB A100. TRL trained Pythia 6.9B with their PPOv2 trainer on 8xH100.

Just want to confirm: will we actually be able to run this recipe without batched generation support?

The generation utils I include in this PR do provide batched generation support.

@ebsmothers (Contributor) left a comment:

Thank you for your immense patience on this one. I left a couple of other follow-up comments, but none of them are blocking us from landing this. 🚀

@SalmanMohammadi merged commit c593c10 into pytorch:main Aug 5, 2024
3 checks passed
Labels: CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
Projects: none yet
Closes: [RFC] Proximal Policy Optimisation
7 participants