I re-ran some limited testing, including some very large contexts, and did not see much, if any, improvement in either tokens/s or VRAM usage. However, I haven't been able to test it fully yet. My previous tests, for whatever reason, indicated a discernible improvement when running models with a certain architecture (I believe it was Mistral) but not Llama-based models. I'd have to revisit my prior benchmarks and update them to be certain, and I'll post when I get the time to do that.
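For what it's worth, the tokens/s comparison described above can be scripted with a small timing wrapper. This is only a minimal sketch: the model directory and the prompt-building step are placeholders, and the commented-out usage assumes the `flash_attention` constructor option on `ctranslate2.Generator` that this thread is discussing.

```python
import time

def benchmark(generate_fn, prompts):
    """Time a generation callable and return (total_tokens, tokens_per_sec).

    generate_fn takes a list of token-ID prompts and returns a list of
    output token sequences (e.g. a thin wrapper around
    ctranslate2.Generator.generate_batch).
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(seq) for seq in outputs)
    return total_tokens, (total_tokens / elapsed if elapsed > 0 else 0.0)

# Hypothetical usage against CTranslate2 ("model_dir" is a placeholder):
# import ctranslate2
# gen = ctranslate2.Generator("model_dir", device="cuda", flash_attention=True)
# def run(prompts):
#     return [r.sequences_ids[0] for r in gen.generate_batch(prompts)]
# tokens, tps = benchmark(run, prompts)
```

Running the same harness with `flash_attention=True` and `False` on identical prompts would make the per-architecture comparison repeatable.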
Flash Attention is still listed in the documentation:
https://opennmt.net/CTranslate2/python/ctranslate2.Generator.html
I'd recommend keeping it, since you don't maintain separate documentation for different versions... then just add a bullet point noting that it's only available as of version 4.3.1 (or 4.3, or whichever it is).
Not sure if it's mentioned elsewhere in the documentation; I didn't check everywhere.