7B model OOMs on 4*A100 80GB #382
Comments
I'm getting a similar error. Did you end up solving it? @Rocky77JHxu |
I ended up switching to the ms-swift framework so I could at least get the task done. |
Oh, thanks! But I hit a similar error with ms-swift as well. The command I used was
In the .env file, set the local model deployment service's
When evaluating the
InternLM-XComposer2_5-7B
model, I get an OOM on 4*A100 80GB. I noticed that queries are fed into the model very quickly during execution; could the OOM be related to how many samples are sent in at once? At first, the memory usage on each GPU jumps wildly between roughly 20 and 70 GiB, and the OOM occurs around round 12. Yet on the same 4*A100 80GB hardware, evaluating
InternVL2-40B
works without any problem, with memory staying stable at around 45 GB. Execution is very slow, though; InternVL2-40B
seems to process just one batch at a time. This is strange. If it is a batch-size issue, how should I change it? I tried modifying the code in
${VLMEvalKit}/vlmeval/vlm/xcomposer/xcomposer2d5.py
but it had no effect, and I couldn't find where batching is implemented. If it isn't a batch-size issue, how can I get the 7B model through the evaluation?
My run command is:
Error message:
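If the crashes really do come from how many queries are fed in at once, one generic workaround is to chunk the query list into micro-batches and run them sequentially. This is only a sketch, not VLMEvalKit's actual API; `run_model` is a hypothetical placeholder for whatever callable performs inference:

```python
def micro_batches(queries, batch_size=1):
    """Yield at most `batch_size` queries at a time.

    batch_size=1 mimics the one-sample-at-a-time behaviour observed
    with InternVL2-40B, trading speed for a stable memory footprint.
    """
    for i in range(0, len(queries), batch_size):
        yield queries[i:i + batch_size]

def run_in_chunks(run_model, queries, batch_size=1):
    """`run_model` is a placeholder for the actual inference callable."""
    outputs = []
    for chunk in micro_batches(queries, batch_size):
        outputs.extend(run_model(chunk))
        # On a real GPU run, releasing cached allocator blocks between
        # chunks can reduce fragmentation:
        # torch.cuda.empty_cache()
    return outputs
```

Setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` in the environment may also help when memory usage swings this widely, since fragmentation rather than true peak usage can trigger the OOM.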
In addition, my 76B model also OOMs. I tried first deploying the 76B model as an OpenAI-compatible endpoint with LMDeploy and then plugging it into the VLMEvalKit evaluation framework, but that failed. The error is:
2024-08-13 16:48:22,299 - ChatAPI - ERROR - HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2ae04fd4b400>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)"))
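That last error is not from the model itself: the evaluation code uses tiktoken, which tries to download the cl100k_base BPE file from openaipublic.blob.core.windows.net, and the cluster evidently has no outbound DNS/internet access. As far as I know, tiktoken caches downloaded encodings in the directory named by the `TIKTOKEN_CACHE_DIR` environment variable, keyed by the SHA-1 hex digest of the source URL, so you can prime the cache on a machine with internet access and copy it over. A sketch (the cache path is a hypothetical example):

```python
import hashlib
import os

# URL taken from the error message; tiktoken names the cached file
# after the SHA-1 hex digest of this URL.
url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(url.encode()).hexdigest()

# On the offline machine: place the downloaded cl100k_base.tiktoken file at
# <cache_dir>/<cache_key>, then point tiktoken at that directory before the
# evaluation starts.
cache_dir = "/path/to/tiktoken_cache"  # hypothetical location
os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir
print(os.path.join(cache_dir, cache_key))
```

Once the file is in place, the judge/API code should load the tokenizer without any network access.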