Add Ascend NPU as a backend #1826
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1826. Note: links to docs will display an error until the docs builds have completed.

❌ 6 New Failures, 4 Cancelled Jobs as of commit b6332dd with merge base 73aa126. The cancelled jobs should be retried.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from ca78eca to b6332dd
Hi @ebsmothers, @RdoubleA: I hope you're doing well! Could you please help me review my code? I would really appreciate it if you could take a look and share any feedback or suggestions. Thank you so much in advance for your time and support! 😊 Best regards
Hi @noemotiovon thanks for the PR! And apologies for the delay in getting to the review here. A couple other questions I have that don't really fit neatly anywhere inline:
- Do we expect compile to work? If so, we should test that. If not, we could raise an error
- Do we expect quant-related APIs (e.g. QLoRA or QAT) from torchao to work? Same as point 1: if so we should test or possibly raise an error
- PyTorch has now released 2.5 as stable. In general we do not claim to support anything but the latest stable release of PyTorch -- do you know the contract on torch_npu releases here?
@@ -45,7 +46,7 @@ def _set_float32_precision(precision: str = "high") -> None:
 def verify_bf16_support() -> bool:
     """
     Check that bf16 is available on this hardware. Requirements:
-        - CUDA is available and supports bf16
+        - CUDA or NPU is available and supports bf16
Just to make sure I understand this, requirements for bf16 support on NPU are identical to bf16 support requirements on CUDA?
The requirements for NPU and CUDA are similar but not identical; I will adjust the code comments accordingly. Thank you for your valuable feedback!
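For illustration, a minimal sketch of what the adjusted check could look like, assuming torch_npu mirrors the CUDA API here. `torch.npu.is_bf16_supported()` and the `is_npu_available` helper path are assumptions, and the CUDA conditions shown are the usual bf16/NCCL checks, not necessarily this repo's exact code:

```python
import torch

from torchtune.utils._device_support import is_npu_available  # helper added in this PR (path assumed)


def verify_bf16_support() -> bool:
    """Sketch: bf16 is supported if either CUDA or NPU is available and supports it."""
    cuda_support = (
        torch.cuda.is_available()
        and torch.cuda.is_bf16_supported()
        and torch.distributed.is_nccl_available()
        and torch.cuda.nccl.version() >= (2, 10)
    )
    # Assumption: torch_npu mirrors the CUDA API surface for this check.
    npu_support = is_npu_available and torch.npu.is_bf16_supported()
    return cuda_support or npu_support
```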
@@ -617,14 +618,14 @@ def train(self) -> None:
         ):
             break

-        # Start tracking CUDA memory for active steps for just the first epoch
+        # Start tracking CUDA or NPU memory for active steps for just the first epoch
More of a nit, but I wonder if we should just generalize these comments to "Start tracking device memory" (otherwise if we add other devices this will start to get pretty verbose)
That’s a great suggestion! I will adjust the comments to "Start tracking CUDA-like device memory". Thank you very much!
         if (
             curr_epoch == 0
             and self.profiler_profile_memory
             and idx == self.profiler_wait_steps + self.profiler_warmup_steps
         ):
-            torch.cuda.memory._record_memory_history()
+            get_torch_device().memory._record_memory_history()
Did you also test this? I am not familiar with NPU memory snapshot APIs but would be good to make sure this works as expected too.
The NPU has these APIs, but I haven’t tested whether they function as expected, so I’ll roll back these changes for now and address them in a separate PR.
    @pytest.mark.skipif(not cuda_available, reason="The test requires GPUs to run.")
    @patch("torch.cuda.is_available", return_value=True)
    def test_get_torch_device_for_cuda(self, mock_cuda):
I wonder if we should add a similar test for NPU (with a corresponding patch)? Of course, if we don't have the device in our CI runners, maybe it's too trivial to actually be meaningful?
Currently, NPU testing has not been considered. I will look into proposing a CI-related PR and the necessary hardware later.
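If NPU CI hardware does become available, a patched test mirroring the CUDA one might look like this sketch; the `torchtune.utils.get_torch_device` import path and the `torch.npu` patch target are assumptions:

```python
from unittest.mock import patch

import pytest
import torch

from torchtune.utils import get_torch_device  # import path assumed

# Assumption: torch_npu registers the `torch.npu` module when installed.
npu_available = hasattr(torch, "npu") and torch.npu.is_available()


class TestGetTorchDevice:
    @pytest.mark.skipif(not npu_available, reason="The test requires NPUs to run.")
    @patch("torch.npu.is_available", return_value=True)
    def test_get_torch_device_for_npu(self, mock_npu):
        # Mirrors test_get_torch_device_for_cuda: expect the torch.npu module back.
        assert get_torch_device() is torch.npu
```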
@@ -87,15 +88,15 @@ def __init__(
             60  # we should not exceed this percentage of memory
         )

-        self.s0 = torch.cuda.default_stream()  # comp stream
+        self.s0 = get_torch_device().default_stream()  # comp stream
Similar comment here: do we know that activation offloading will work on NPU?
The NPU has these APIs, but I haven’t tested whether they function as expected, so I’ll roll back these changes for now and address them in a separate PR.
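For reference, a device-agnostic sketch of the stream setup this diff touches; whether NPU streams behave identically to CUDA streams here is exactly the untested part:

```python
# Sketch: device-agnostic streams for activation offloading.
# `get_torch_device` is the helper from this PR.
device_module = get_torch_device()   # torch.cuda, or torch.npu if detected
s0 = device_module.default_stream()  # computation stream
s1 = device_module.Stream()          # side stream for offload copies
with device_module.stream(s1):
    # Copies issued here can overlap with compute running on s0.
    ...
```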
    CPU = ("cpu", "CPU", "gloo")
    CUDA = ("cuda", "GPU", "nccl")
    NPU = ("npu", "NPU", "hccl")
Do we also need an item for MPS here? cc @SalmanMohammadi
Also more of a nit, but I wonder if we should use a NamedTuple abstraction here. (Alternatively can just add a comment explaining what each of the fields correspond to)
Thank you so much for your suggestions! I’ll make sure to add the relevant comments here.
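As a sketch of the NamedTuple idea (names here are illustrative, not the PR's final code):

```python
from enum import Enum
from typing import NamedTuple


class DeviceInfo(NamedTuple):
    device_type: str            # torch device string, e.g. "cuda"
    device_name: str            # human-readable name, e.g. "GPU"
    communication_backend: str  # distributed backend, e.g. "nccl"


class DeviceSupport(Enum):
    CPU = DeviceInfo("cpu", "CPU", "gloo")
    CUDA = DeviceInfo("cuda", "GPU", "nccl")
    NPU = DeviceInfo("npu", "NPU", "hccl")


# Usage: DeviceSupport.NPU.value.communication_backend -> "hccl"
```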
Distributed training seems to have problems, e.g. qat_distributed @noemotiovon

I would be very happy to! I will contact you via email.

@noemotiovon through my 126.com email, thanks. Looking forward to your email.
def is_torch_npu_available() -> bool:
    """Check the availability of NPU"""
    try:
        import torch_npu  # noqa: F401

        return torch.npu.is_available()
    except ImportError:
        return False


is_npu_available = is_torch_npu_available()
These are all redundant after the autoload mechanism landed in PyTorch 2.5.0
Thank you for your suggestion! We will make adjustments to this part once torch-npu is updated to version 2.5.0.
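A sketch of what the check might reduce to once the autoload mechanism applies; whether torch_npu registers `torch.npu` without an explicit import under autoload is an assumption here:

```python
import torch


def is_torch_npu_available() -> bool:
    # With the PyTorch >= 2.5 autoload mechanism, torch_npu should be loaded
    # automatically at `import torch`, so the explicit try/except import above
    # would no longer be needed (assumption, untested on this PR).
    return hasattr(torch, "npu") and torch.npu.is_available()
```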
def get_torch_device() -> any:
    """Return the corresponding torch attribute based on the device type string.

    Returns:
        module: The corresponding torch module, or torch.cuda if not found.
    """
    device_type = get_device_support().device_type
    try:
        return getattr(torch, device_type)
    except AttributeError:
        print(
            f"Device Module '{device_type}' not found in torch, try to load torch.cuda."
        )
        return torch.cuda
We can use torch.get_device_module() I think
Thank you for your suggestion! We will make adjustments to this part once torch-npu is updated to version 2.5.0.
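A sketch of the suggested replacement: `torch.get_device_module` accepts a device-type string and returns the matching module, so the manual `getattr` and fallback go away. The `get_device_support` import path is assumed:

```python
import torch

from torchtune.utils._device_support import get_device_support  # path assumed

# Let PyTorch resolve the device module instead of getattr(torch, device_type).
device_type = get_device_support().device_type  # e.g. "cuda" or "npu"
device_module = torch.get_device_module(device_type)
print(device_module.device_count())  # same attribute surface as torch.cuda
```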
         if device.type in ["cuda", "npu"] and local_rank is not None:
             # Ensure device index matches assigned index when distributed training
             if device.index != local_rank:
                 raise RuntimeError(
                     f"You can't specify a device index when using distributed training. \
-                    Device specified is {device} but was assigned cuda:{local_rank}"
+                    Device specified is {device} but was assigned cuda/npu:{local_rank}"
All `npu` occurrences can be replaced with `torch._C._get_privateuse1_backend_name()`, for two reasons:

- All privateuse1 backends are CUDA-like devices
- This change will benefit all out-of-tree backends

cc: @FFFrog
Thank you for your suggestion! We will make adjustments to this part once torch-npu is updated to version 2.5.0.
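A sketch of the generalization suggested above, wrapped in a hypothetical helper so it is self-contained; in the PR, `device` and `local_rank` come from the surrounding function:

```python
from typing import Optional

import torch


def _check_device_index(device: torch.device, local_rank: Optional[int]) -> None:
    # Treat any privateuse1 backend (e.g. "npu") like CUDA for the
    # distributed device-index check.
    custom_backend = torch._C._get_privateuse1_backend_name()
    if device.type in ("cuda", custom_backend) and local_rank is not None:
        if device.index != local_rank:
            raise RuntimeError(
                f"You can't specify a device index when using distributed training. "
                f"Device specified is {device} but was assigned index {local_rank}"
            )
```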
What does this PR do?
Overview
🚀 This PR enables users of `torchtune` to leverage the Ascend NPU for better performance in inference when a GPU device is not available. For more details, see [#1797].
Environment

Note
To properly install CANN, see [here] for more details. The version of `torch-npu` should match that of `torch`; see [here] for more details. In addition, `torch_npu` has a pre-release version, 2.4.0 RC1, which is also the basis for this test. For more information, please visit [here].

Examples
To start with, the library `torch_npu` should be correctly installed and imported. Part of the code is shown below, from `torchtune/utils/_device_support.py`. Beyond that, a few other places in the code may need adjusting, though not too many.
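For illustration, a quick way to exercise the new helpers once `torch_npu` is installed; the import paths are assumptions based on this PR's file layout:

```python
import torch

from torchtune.utils import get_device  # import paths assumed
from torchtune.utils._device_support import is_npu_available

if is_npu_available:
    device = get_device("npu")  # expected to resolve to torch.device("npu:0")
    x = torch.randn(2, 2, device=device, dtype=torch.bfloat16)
    print(x.device, x.dtype)
```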
Feel free to leave comments to guide me in further improvements 😊.
Tests
This PR has passed the tests shown below:
Basic Usage Test
Recipe: lora_finetune_single_device
Model: Meta-Llama-3.1-8B-Instruct
Config:
Logs:
Result: The test results demonstrate the successful completion of a single-device LoRA fine-tuning process on the Llama 3.1 8B model. The configuration included a batch size of 30, gradient accumulation over 64 steps, and one epoch of training on an NPU device using the bf16 data type. Activation checkpointing was enabled, and LoRA fine-tuning was applied to attention modules. The process utilized AdamW as the optimizer with a learning rate of 0.0003 and a cosine learning rate scheduler.