[Bug] SLM running serve on known-good chat model crashes on _print_kv_cache_metadata_in_json
#1921
Comments
@MasterJH5574 Recompiling the model with the latest nightly triggered the following error:
Thank you! This was fixed just an hour ago in 9df8f03.
OK - will try it out again after the next nightly 🙏
@MasterJH5574 I hand-patched the changes and it compiled the model fine! 🙏 However, when I invoke serve again with the newly compiled model, there is still the problem with "print kv cache in JSON" failing -- it seems to be an injected argument to my simpler invocation.
Ah, just noted that you are running Mistral. Mistral support in MLC serve is still ongoing. We will get it done within one week. Meanwhile, the fix mentioned above will address the issue for compiling Mistral for chat use in Python.
That's great. Thanks @MasterJH5574 Is the Llama2 serve completed now? (or which one can I test instead) 🙏
Hmm. A quick test of Llama2-7b shows the chat model won't compile at this time (thanks to the quick new flow). Thanks, I will track GitHub for progress.
Yeah, supporting JIT compilation in serve is something we are also working on. Ideally it should be finished within one week. At this moment we still need to manually compile the models first. The serve flow for Llama2 is completed. Here is a gist on how we can launch the server for Mixtral for your reference: https://gist.github.com/MasterJH5574/ea1ba901938338a27434c18cdcd3f935 We will update the documentation soon so that we don't need to share a gist every time 😂
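A rough, hedged sketch of that manual flow (compile first, then point the server at the compiled lib); the model path, quantization suffix, and flag names below are assumptions for illustration, and the gist above has the exact commands:

```bash
# Sketch only: paths, quantization, and flag names are assumptions,
# not the exact commands from the gist.

# 1. Compile the model library ahead of time (serve does not JIT-compile yet).
mlc_chat compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
  --device cuda -o ./dist/libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so

# 2. Launch the server against the weights and the compiled lib.
python -m mlc_chat.serve.server \
  --model ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC \
  --model-lib-path ./dist/libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so
```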
Ah, that explains everything 🙏 I'll close the other issue and will wait for JIT inclusion in serve and the CLI update. I intend to update the Docker PR to use this latest flow as soon as it is ready - #1271
Just want to give a (late) update here. The serve CLI and JIT were finished in #2014, thanks to @shreygupta2809 and @Kartik14. I think we are good to conclude this issue now :-)
Awesome! Thanks @MasterJH5574 @shreygupta2809 and @Kartik14 This is amazing. I tested with Llama2, Mistral, and Gemma, and they all worked well. But it did crash with Phi as well as OpenFunction v1/v2 - tagged as a separate issue #2113
🐛 Bug
When I invoke serve.server with mistral-7b, supplying the two required arguments, it crashes when trying to start the async_engine within _print_kv_cache_metadata_in_json.
To Reproduce
Steps to reproduce the behavior:
1. Run mlc_chat chat to generate/discover the lib name for mistral-7b on your system.
2. Invoke serve.server with the two required arguments (a hedged command sketch follows these steps).
3. Serve will crash on _print_kv_cache_metadata_in_json.
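A hedged sketch of those steps, assuming local weight/lib paths and the --model / --model-lib-path flag names (the actual arguments used were not included in the report):

```bash
# Hypothetical reproduction sketch; paths and flag names are assumptions.

# 1. Chat once so the model lib for mistral-7b is generated/located.
mlc_chat chat ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC

# 2. Start the server with the same weights and the discovered lib;
#    this is where the crash in _print_kv_cache_metadata_in_json occurs.
python -m mlc_chat.serve.server \
  --model ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC \
  --model-lib-path ./dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-cuda.so
```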
Expected behavior
Server running and servicing requests.
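For example, a running server should answer an OpenAI-style chat completion request along these lines (the default port and the /v1/chat/completions route are assumptions, and the model name is a placeholder):

```bash
# Assumes the default host/port and the OpenAI-compatible route.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```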
Environment
- How you installed MLC-LLM (conda, source): conda
- How you installed TVM-Unity (pip, source): pip
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):

Additional context
This is the trace on the crashing run: