
What are the expected inference steps after I apply torchao in training? #638

Open
goldhuang opened this issue Oct 21, 2024 · 3 comments


goldhuang commented Oct 21, 2024

Hello, I have integrated torchao into my training, but it's not very clear to me what inference should look like.
Should I use the converted FP8 linear layers for inference? Is delayed scaling supposed to work at inference time?
Or should I use the original linear layers for inference?

Thanks in advance if you can help to clarify!
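
For reference, by "apply torchao in training" I mean roughly the conversion sketched below; this is a minimal, illustrative sketch assuming the torchao.float8 API (the toy model is a stand-in, not my real model):

```python
# Minimal sketch of float8 training conversion with torchao (illustrative only).
# Assumes torchao.float8.convert_to_float8_training and FP8-capable hardware.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

# Hypothetical toy model standing in for the real training model.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).cuda()

# Swap nn.Linear modules for torchao's Float8Linear so matmuls run in FP8
# during training (dynamic scaling by default).
convert_to_float8_training(model)

# Training then proceeds as usual; the question is what to do with `model`
# (or its checkpoint) at inference time.
```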

@tianyu-l (Contributor)

cc: @weifengpy @vkuzo

tianyu-l added the question (Further information is requested) label on Oct 21, 2024
kwen2501 (Contributor) commented Oct 21, 2024

Do you need distributed inference, or are you doing inference on a single GPU?

  1. For a single GPU, I think using the original model definition + loading the quantized weights should ideally "just work" (@vkuzo to confirm). If not, please file an RFC in ao.

  2. For distributed inference, we are building DTensor + Quantized Tensor support in torchchat. (We have yet to publish a demo.) There is also a simple ao + TP example in the ao repo: link.
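
For option 1, a minimal sketch of that flow (assuming the float8 checkpoint keys line up with the original nn.Linear ones; the toy model below is illustrative only, not from this issue):

```python
# Sketch of option 1 (single GPU): train with the float8-converted model, then
# load its checkpoint back into the original, unconverted model definition.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

def build_model():
    # Toy stand-in for the user's real model definition.
    return nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Training-time model: converted in place to use FP8 matmuls.
train_model = build_model().cuda()
convert_to_float8_training(train_model)
# ... training loop runs here ...
torch.save(train_model.state_dict(), "float8_checkpoint.pt")

# Inference-time model: the original definition with plain nn.Linear modules.
# If the converted model saved extra buffers (e.g. delayed-scaling amax
# history), strict=False may be needed when loading.
infer_model = build_model().cuda().eval()
infer_model.load_state_dict(torch.load("float8_checkpoint.pt"), strict=False)

with torch.inference_mode():
    out = infer_model(torch.randn(8, 4096, device="cuda"))
```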

@goldhuang (Author)

> Do you need distributed inference, or are you doing inference on a single GPU?
>
>   1. For a single GPU, I think using the original model definition + loading the quantized weights should ideally "just work" (@vkuzo to confirm). If not, please file an RFC in ao.
>   2. For distributed inference, we are building DTensor + Quantized Tensor support in torchchat. (We have yet to publish a demo.) There is also a simple ao + TP example in the ao repo: link.

If the original model definition works for single GPU, does that mean I could just keep using my current distributed inference setup? The ao + TP example looks like it uses the torchao-converted FP8 linear layers for inference, which is different from what you suggest for single GPU. To be honest, I'm a bit confused.
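
For contrast, my understanding is that torchao also has a separate inference-time quantization path, distinct from the training-time Float8Linear conversion above. A rough sketch assuming the torchao.quantization quantize_ API (I have not confirmed this is exactly what the linked example does; the toy model is illustrative only):

```python
# Rough sketch of torchao's inference-time weight quantization path, for
# contrast with training-time float8 conversion. Assumes FP8-capable hardware.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, float8_weight_only

# Toy stand-in model; illustrative only.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
model = model.to(torch.bfloat16).cuda().eval()

# Replace linear weights with FP8 weight-only quantized tensors for inference.
quantize_(model, float8_weight_only())

with torch.inference_mode():
    out = model(torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16))
```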
