I re-ran some limited testing, including some very large contexts, and did not see much, if any, improvement in either tokens/s or VRAM usage. However, I haven't been able to test it fully yet. My previous tests, for whatever reason, indicated a discernible improvement when running models with a certain architecture (I believe it was Mistral) but not Llama-based models. I'd have to revisit my prior benchmarks and update them to be certain, and I'll post when I get the time to do that.
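For what it's worth, the tokens/s comparison described above can be scripted with a small timing wrapper. This is only a minimal sketch: the model directory and the prompt-building step are placeholders, and the commented-out usage assumes the `flash_attention` constructor option on `ctranslate2.Generator` that this thread is discussing.

```python
import time

def benchmark(generate_fn, prompts):
    """Time a generation callable and return (total_tokens, tokens_per_sec).

    generate_fn takes a list of token-ID prompts and returns a list of
    output token sequences (e.g. a thin wrapper around
    ctranslate2.Generator.generate_batch).
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(seq) for seq in outputs)
    return total_tokens, (total_tokens / elapsed if elapsed > 0 else 0.0)

# Hypothetical usage against CTranslate2 ("model_dir" is a placeholder):
# import ctranslate2
# gen = ctranslate2.Generator("model_dir", device="cuda", flash_attention=True)
# def run(prompts):
#     return [r.sequences_ids[0] for r in gen.generate_batch(prompts)]
# tokens, tps = benchmark(run, prompts)
```

Running the same harness with `flash_attention=True` and `False` on identical prompts would make the per-architecture comparison repeatable.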
Flash Attention is still listed in the documentation:
https://opennmt.net/CTranslate2/python/ctranslate2.Generator.html
I'd recommend keeping it, since you don't maintain separate documentation for different versions... then just add a bullet point noting that it's only available as of version 4.3.1 (or 4.3, or whichever it is).
Not sure if it's mentioned elsewhere in the documentation; I didn't check everywhere.