
Miscellaneous CI, dependency, and version fixes #1151

Merged
ebsmothers merged 10 commits into pytorch:main from ao-updates on Jul 9, 2024

Conversation

@ebsmothers (Contributor) commented Jul 9, 2024

Consolidating a bunch of small changes into this PR:

  • run unit tests on GPUs in our CI
  • define a util for ao version checks (this will come in handy once ao#485, "Make version check equal to or more than instead of more than", lands); see the sketch after this list
  • remove torchvision from pyproject.toml and add separate install instructions to fix this issue with forced downgrade of PyTorch nightlies
  • run CI on the combo of ao + PyTorch nightlies
    • Another thing we could consider is to remove ao from our pyproject.toml, similar to what I've added here for torchvision, then give the separate install instructions depending on whether users want nightly or stable versions. Otherwise it's a bit hacky because we technically install both stable and nightly versions of ao in our install flow (but nightly will supersede stable). Punting this for now since I don't think it'll break anything
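
For reference, here is a rough sketch of what such a version-check util could look like (simplified and illustrative only; the actual helper added in this PR is `_get_torchao_version` in `torchtune/modules/low_precision/_utils.py`, and its exact nightly-detection logic may differ):

```python
# Simplified, illustrative sketch of an ao version check; not the exact implementation in this PR.
from importlib.metadata import version as _pkg_version


def _get_torchao_version() -> tuple[str, bool]:
    """Return the installed torchao version string and whether it looks like a nightly build."""
    ao_version = _pkg_version("torchao")  # raises PackageNotFoundError if ao is not installed
    # Nightly wheels typically carry a date-stamped dev suffix, e.g. "0.3.0.dev20240708"
    is_nightly = "dev" in ao_version
    return ao_version, is_nightly
```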

CI should be green


pytorch-bot bot commented Jul 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1151

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9b94fa1 with merge base 26b54b2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jul 9, 2024
@ebsmothers changed the title from "[wip] CI: drop 3.8, run on ao nightly, better ao version checks" to "Miscellaneous CI, dependency, and version fixes" on Jul 9, 2024
@codecov-commenter commented:

Codecov Report

Attention: Patch coverage is 72.97297% with 10 lines in your changes missing coverage. Please review.

Project coverage is 68.01%. Comparing base (06a125e) to head (189c409).
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| torchtune/modules/low_precision/_utils.py | 64.00% | 9 Missing ⚠️ |
| torchtune/utils/quantization.py | 83.33% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff             @@
##             main    #1151       +/-   ##
===========================================
+ Coverage   26.76%   68.01%   +41.25%     
===========================================
  Files         205      213        +8     
  Lines        9301     9633      +332     
===========================================
+ Hits         2489     6552     +4063     
+ Misses       6812     3081     -3731     
```


@ebsmothers ebsmothers marked this pull request as ready for review July 9, 2024 05:29
```python
from torchtune.modules.low_precision._utils import _get_torchao_version

ao_version, is_nightly = _get_torchao_version()
print(ao_version, is_nightly)
```
Member commented:

nit: remove this before landing?

```diff
@@ -10,8 +10,6 @@ authors = [
 ]
 keywords = ["pytorch", "finetuning", "llm"]
 dependencies = [
-    # multimodality
-    "torchvision",
```
@felipemello1 (Contributor) commented Jul 9, 2024:

I wonder if there is a way to keep torchvision without causing issues with torch nightlies. Do you know if it's worth researching, or is there no way to make it work?

@ebsmothers (Contributor Author) replied:

Just chatted with @NicolasHug on this and he confirmed it's not possible, since there's no way to point `pyproject.toml` at a specific conda channel or PyPI index.

@felipemello1 (Contributor) replied Jul 9, 2024:

Got it! I am just afraid that in the long run it may cause issues, since we may want to pin the torchvision version. For example, older torchvision versions will break ClipTransforms.

@ebsmothers (Contributor Author) replied:

They should be installed in a single command, which is why I updated the README to clarify this. If the user runs `pip install torch torchvision` they will get stable versions of both torch and torchvision; if they run `pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu121` they will get torch nightly and torchvision 0.20.

I can't find the link to the original comment, but I believe it was demonstrating what happens when we install using the existing pyproject.toml. For reference, here is my `pip list` after running the first command, and here is my `pip list` after running the second command. You can see that the versions are as expected.

Btw, regarding pinning versions: we do not test on anything older than the latest stable version of PyTorch, and I don't think we want to worry about breaking folks on versions older than that. By the same logic, I don't think we should be pinning to older versions of torchvision. The best way to keep things in sync is then just to install the two together using these commands.
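
As a quick sanity check after either install path, you can confirm which builds you actually ended up with (illustrative snippet, not part of this PR):

```python
# Illustrative sanity check: print the installed torch / torchvision versions.
import torch
import torchvision

print(torch.__version__)        # a stable release string, or a dev/nightly build string
print(torchvision.__version__)  # should come from the same channel (stable vs. nightly) as torch
```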

Contributor replied:

thanks for the links and explanation!

```diff
@@ -262,6 +262,10 @@ def world_size(self) -> int:
         return 2
 
     @gpu_test(gpu_count=2)
+    @pytest.mark.skipif(
+        version.parse(torch.__version__).base_version < "2.4.0",
+        reason="torch >= 2.4 required",
```
Contributor commented:

Any particular reason we don't want to test this with torch 2.3.x?

@ebsmothers (Contributor Author) replied:

Some of the DTensor APIs we use in `load_from_full_model_state_dict` were not stable prior to 2.4. We already addressed this for the QLoRA state dict test in #1087. In this case it's OK because we are testing FSDP2 functionality, which is not available until 2.4 anyway. cc @weifengpy in case I'm missing any important points here.

```
@@ -1,4 +1,4 @@
name: Multi-GPU Recipe Tests
```
Contributor commented:

Why the rename here?

@ebsmothers (Contributor Author) replied:

It's no longer just a recipe test, right? Now it's recipe + unit test

```diff
@@ -4,6 +4,7 @@ on:
   schedule:
     # Runs at midnight every day
     - cron: '0 0 * * *'
+  workflow_dispatch:
```
Contributor commented:

yay

README.md Outdated

```
# Install stable version of PyTorch using pip
pip3 install torch torchvision
```
Contributor commented:

I don't think we need to explicitly say `pip3` anymore.

@kartikayk (Contributor) left a comment:

Did we align on AO being an optional dependency? If so, why not do what we do with the BnB example and ask users to install it manually?

README.md Outdated
```diff
@@ -156,7 +156,16 @@ You can find a full list of all our Llama3 configs [here.](recipes/configs/llama
 
 ## Installation
 
-**Step 1:** [Install PyTorch](https://pytorch.org/get-started/locally/). torchtune is tested with the latest stable PyTorch release as well as the preview nightly version.
+**Step 1:** [Install PyTorch](https://pytorch.org/get-started/locally/). torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. For multimodality
```
Contributor commented:

Suggested change:
```diff
-**Step 1:** [Install PyTorch](https://pytorch.org/get-started/locally/). torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. For multimodality
+**Step 1:** [Install PyTorch](https://pytorch.org/get-started/locally/). torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. For fine-tuning the multimodal LLMs available in the repo, you'll need to install torchvision as well
```

@ebsmothers (Contributor Author) commented:

> Did we align on AO being an optional dependency? If so, why not do what we do with the BnB example and ask users to install it manually?

@kartikayk I think we aligned on actually testing ao nightlies but not on having it as an optional dependency. So I'm doing the former here and not the latter. But cc @msaroufim @joecummings if either of you have thoughts on this
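
(For reference, if we ever did make ao optional down the line, a BnB-style guard could look roughly like the sketch below. This is illustrative only and not something this PR adds; the helper name is hypothetical.)

```python
# Illustrative sketch only (not added in this PR): a BnB-style optional-import guard for torchao.
# The helper name below is hypothetical.
try:
    import torchao  # noqa: F401

    _TORCHAO_AVAILABLE = True
except ImportError:
    _TORCHAO_AVAILABLE = False


def _assert_torchao_installed() -> None:
    """Raise a helpful error if a quantization feature is used without torchao installed."""
    if not _TORCHAO_AVAILABLE:
        raise ImportError(
            "torchao is required for this feature. Install it with `pip install torchao` "
            "(stable) or a torchao nightly build."
        )
```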

@ebsmothers ebsmothers merged commit 37636a8 into pytorch:main Jul 9, 2024
29 checks passed
@ebsmothers ebsmothers deleted the ao-updates branch July 9, 2024 19:22
maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024