A process in the process pool was terminated abruptly while the future was running or pending. #8234
It is from
Which version of pytorch/xla are you using? I am wondering how you managed to trigger the XRT runtime, which has been deprecated for a long time.
torch_xla 2.4.0+libtpu
I just did a simple test on a TPU v3-8 with torch_xla 2.4.0:
You should see it print 0, 1, ..., 7, which means all 8 cores are being used.
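The test itself was elided in this transcription; a minimal version of the kind of check being described (a sketch assuming the standard `xmp.spawn` entry point, runnable only on TPU hardware) would be:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process drives one TPU core; print its ordinal.
    print(xm.get_ordinal())

if __name__ == '__main__':
    # Leaving nprocs unset spawns one process per core (8 on a v3-8).
    xmp.spawn(_mp_fn)
```

On a v3-8 this should print the ordinals 0 through 7, in no particular order.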
@zpcore thank you, I can see the code running on all TPU cores on Kaggle v3-8 now. But the new problem is that after torch_xla.distributed.xla_multiprocessing.spawn(process, start_method='fork'), only one process returns successfully; all the others are broken. When I run a single process it works. Is it possible to limit it to just two cores, so I can debug faster?
On TPU you can only use all the cores, by leaving the nprocs argument unset; the only other supported value is nprocs=1, so limiting the run to two cores is not possible.
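For fast debugging, the single-core option mentioned above can be used instead of two cores. A sketch (the body of `process` here is a placeholder for the user's function):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def process(index):
    # Placeholder for the real workload; runs on one core when nprocs=1.
    print('running on core', xm.get_ordinal())

if __name__ == '__main__':
    # torch_xla only accepts nprocs=1 (single core) or None (all cores).
    xmp.spawn(process, nprocs=1, start_method='fork')
```

Once the single-core run works, dropping `nprocs=1` scales the same function to all 8 cores.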
@zpcore I figured out the problem: the StableDiffusionControlNetPipeline is very big and occupies a lot of memory. If I use 8 processes, I create 8 StableDiffusionControlNetPipeline instances, so they eat up all the memory. I have no idea how to reduce the memory; currently I use
to test, but it is not working.
Do you have any idea how to reduce the memory?
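The thread does not show a concrete fix, but a common way to shrink each worker's footprint is to load the pipeline in half precision and enable attention slicing. A sketch, assuming the standard diffusers API (the checkpoint names below are illustrative, not taken from the thread):

```python
import torch
import torch_xla.core.xla_model as xm
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

def process(index):
    device = xm.xla_device()
    # bfloat16 roughly halves the weight memory of each per-process copy.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.bfloat16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet, torch_dtype=torch.bfloat16)
    # Attention slicing trades some speed for a smaller peak memory footprint.
    pipe.enable_attention_slicing()
    pipe.to(device)
```

Even so, eight processes each materialize the weights in host RAM before moving them to the TPU, so a Kaggle notebook's host memory limit can still be the bottleneck.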
@zpcore the following code gets "Your notebook tried to allocate more memory than is available. It has restarted." and
I didn't find the complaint message about the memory. It looks like the TPU doesn't have the data. Can you try
instead of
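The two snippets contrasted here were lost in this transcription. As a hedged guess at what was suggested, "the TPU doesn't have the data" usually means the inputs were created on the CPU, so the fix is typically to build or move them onto the XLA device:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
# Move the input onto the XLA device so the TPU actually holds the data.
image = torch.rand(1, 3, 512, 512).to(device)
xm.mark_step()  # flush pending lazy operations so the transfer materializes
```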
This code is not working; it gets
Do you know the correct way to run StableDiffusionControlNetPipeline on a Kaggle v3-8 TPU?
@zpcore @JackCaoG
But the output is
The output is really not helpful. Do you have any idea how to make the code work?
❓ Questions and Help
I want to run PyTorch/XLA on a Kaggle TPU v3-8 and use all the cores on the TPU. But I always get: A process in the process pool was terminated abruptly while the future was running or pending.
Source code:
and get
A single process on a single TPU core works well, but I cannot get multiple processes to work and use all the cores on the TPU.
You can copy and test my code at https://www.kaggle.com/code/chaowenguoback/stablediffusion. Please help.