Error doing composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-512.yaml #29

Open
wangmiaowei opened this issue Jun 13, 2023 · 4 comments

Comments

@wangmiaowei

composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-512.yaml
[2023-06-13 20:29:52,077][composer.utils.reproducibility][INFO] - Setting seed to 17
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
TypeError("UNet2DConditionModel.__init__() got an unexpected keyword argument 'dual_cross_attention'")
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:composer.cli.launcher:Rank 3 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 3 (PID 40553) exited with code 1
----------Begin global rank 3 STDOUT----------
[2023-06-13 20:29:52,032][composer.utils.reproducibility][INFO] - Setting seed to 17

----------End global rank 3 STDOUT----------
----------Begin global rank 3 STDERR----------
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
TypeError("UNet2DConditionModel.__init__() got an unexpected keyword argument 'dual_cross_attention'")
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 40550) exited with code -15

@Landanjs
Contributor

Can you provide the output of pip list from the machine you are trying to run this on?
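
If the full `pip list` output is too long to read comfortably, a shorter report can be produced with the standard library. This is a minimal sketch: the function name `report_versions` is hypothetical, and the choice of packages is an assumption about what is likely relevant to this error, not something the maintainers asked for.

```python
from importlib import metadata


def report_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumption: these are the packages most likely to matter here.
    relevant = [
        "diffusers",
        "transformers",
        "torch",
        "torchvision",
        "xformers",
        "mosaicml",
        "hydra-core",
    ]
    for name, version in report_versions(relevant).items():
        print(f"{name}: {version or 'not installed'}")
```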

@wangmiaowei
Author

WARNING: Ignoring invalid distribution -orch (/root/envs/py39_dl/lib/python3.9/site-packages)
Package Version Editable project location


absl-py 1.4.0
accelerate 0.19.0
addict 2.4.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
albumentations 1.3.0
altair 4.2.2
antlr4-python3-runtime 4.9.3
anyio 3.7.0
appdirs 1.4.4
argcomplete 3.1.1
arrow 1.2.3
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
azure-core 1.27.1
azure-storage-blob 12.16.0
backcall 0.2.0
backoff 2.2.1
bcrypt 4.0.1
beautifulsoup4 4.12.2
bitsandbytes 0.39.0
blinker 1.6.2
boto3 1.26.156
botocore 1.29.156
braceexpand 0.1.7
Brotli 1.0.9
brotlipy 0.7.0
cachetools 5.3.0
certifi 2023.5.7
cffi 1.15.1
chardet 5.1.0
charset-normalizer 2.0.4
cholespy 0.1.6
circuitbreaker 1.4.0
click 8.1.3
clip 1.0
cmake 3.26.3
colorlog 6.7.0
comm 0.1.3
ConfigArgParse 1.5.3
contourpy 1.0.7
coolname 2.2.0
cos-python-sdk-v5 1.9.24
crcmod 1.7
cryptography 39.0.1
cubvh 0.1.0
cycler 0.11.0
Cython 0.29.34
dash 2.10.0
dash-core-components 2.0.0
dash-html-components 2.0.0
dash-table 5.0.0
datasets 2.12.0
debugpy 1.6.7
decorator 5.1.1
deepspeed 0.9.2
diffusers 0.17.0.dev0
diffusion 0.0.1 /root/programs_wmw/sd_train/diffusion-main
dill 0.3.6
docker 6.1.3
docker-pycreds 0.4.0
docopt 0.6.2
dominate 2.7.0
easydict 1.10
einops 0.6.1
entrypoints 0.4
exceptiongroup 1.1.1
executing 1.2.0
fastapi 0.95.2
fastjsonschema 2.17.1
ffmpy 0.3.0
filelock 3.12.0
fire 0.5.0
Flask 1.1.2
fonttools 4.39.4
frozenlist 1.3.3
fsspec 2023.5.0
ftfy 6.1.1
future 0.18.3
gdown 4.7.1
gitdb 4.0.10
GitPython 3.1.31
glfw 2.5.9
google-auth 2.18.0
google-auth-oauthlib 1.0.0
gql 3.4.1
gradio 3.32.0
gradio_client 0.2.5
graphql-core 3.2.3
grpcio 1.54.2
h11 0.14.0
hjson 3.1.0
HTML4Vision 0.4.3
httpcore 0.17.2
httpx 0.24.1
huggingface-hub 0.14.1
hydra-colorlog 1.2.0
hydra-core 1.3.2
idna 3.4
igl 2.2.1
imageio 2.28.1
imageio-ffmpeg 0.4.8
importlib-metadata 6.6.0
importlib-resources 5.12.0
ipykernel 6.23.1
ipython 8.13.2
ipywidgets 8.0.6
isodate 0.6.1
itsdangerous 2.0.1
jedi 0.18.2
Jinja2 3.0.3
jmespath 1.0.1
joblib 1.2.0
jsonschema 4.17.3
jupyter_client 8.2.0
jupyter_core 5.3.0
jupyterlab-widgets 3.0.7
kiwisolver 1.4.4
kornia 0.6.12
lazy_loader 0.2
lightning-utilities 0.8.0
linkify-it-py 2.0.2
lit 16.0.3
lpips 0.1.4
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdurl 0.1.2
mosaicml 0.15.0 /root/programs_wmw/pkgs/composer-dev
mosaicml-cli 0.4.10
mosaicml-streaming 0.5.1
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
mypy-extensions 1.0.0
nbformat 5.7.0
nest-asyncio 1.5.6
networkx 3.1
ninja 1.11.1
numpy 1.22.3
nvdiffrast 0.3.0 /root/envs/py39_dl/lib/python3.9/site-packages/nvdiffrast-0.3.0-py3.9.egg
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-ml-py3 7.352.0
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.2
oci 2.104.2
omegaconf 2.3.0
open3d 0.17.0
opencv-python 4.7.0.72
opencv-python-headless 4.7.0.72
orjson 3.8.14
packaging 22.0
pandas 2.0.1
paramiko 3.2.0
parso 0.8.3
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 23.1.2
pipreqs 0.4.13
platformdirs 3.5.1
plotly 5.14.1
prometheus-client 0.8.0
prompt-toolkit 3.0.38
protobuf 3.20.3
psutil 5.9.5
ptyprocess 0.7.0
pudb 2022.1.3
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 12.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycparser 2.21
pycryptodome 3.17
pydantic 1.10.8
pydeck 0.8.1b0
pyDeprecate 0.3.1
pydub 0.25.1
PyGLM 2.7.0
Pygments 2.15.1
pyk4a 1.5.0
pymeshlab 2022.2.post3
Pympler 1.0.1
PyNaCl 1.5.0
PyOpenGL 3.1.6
pyOpenSSL 23.0.0
pyparsing 3.0.9
pyquaternion 0.9.9
pyre-extensions 0.0.23
pyrsistent 0.19.3
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.6
python-snappy 0.6.1
pytorch-lightning 1.4.2
pytorch-ranger 0.1.1
pytz 2023.3
PyWavelets 1.4.1
PyYAML 6.0
pyzmq 25.1.0
qudida 0.0.4
questionary 1.10.0
redis 4.5.5
regex 2023.5.5
requests 2.29.0
requests-oauthlib 1.3.1
resize-right 0.0.2
responses 0.18.0
rich 13.3.5
rsa 4.9
ruamel.yaml 0.17.32
ruamel.yaml.clib 0.2.7
s3transfer 0.6.1
safetensors 0.3.1
scikit-image 0.20.0
scikit-learn 1.2.2
scipy 1.8.1
semantic-version 2.10.0
sentry-sdk 1.25.1
setproctitle 1.3.2
setuptools 66.0.0
six 1.16.0
smmap 5.0.0
smplx 0.1.28
sniffio 1.3.0
soupsieve 2.4.1
stack-data 0.6.2
starlette 0.27.0
streamlit 1.22.0
sympy 1.12
tabulate 0.9.0
taming-transformers 0.0.1
tenacity 8.2.2
tensorboard 2.13.0
tensorboard-data-server 0.7.0
tensorboardX 2.6
termcolor 2.3.0
test-tube 0.7.5
threadpoolctl 3.1.0
tifffile 2023.4.12
tokenizers 0.13.3
toml 0.10.2
toolz 0.12.0
torch 1.13.1
torch-ema 0.3
torch-fidelity 0.3.0
torch-optimizer 0.3.0
torch-scatter 2.1.1+pt113cu117
torch-sparse 0.6.17+pt113cu117
torchaudio 0.13.1
torchdata 0.6.1
torchmetrics 0.11.4
torchtext 0.14.1
torchvision 0.14.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.29.1
trimesh 3.21.6
triton 2.0.0
typing_extensions 4.5.0
typing-inspect 0.8.0
tzdata 2023.3
tzlocal 5.0.1
uc-micro-py 1.0.2
urllib3 1.26.15
urwid 2.1.2
urwid-readline 0.13
uvicorn 0.22.0
validators 0.20.0
wandb 0.15.4
watchdog 3.0.0
wcwidth 0.2.6
webdataset 0.2.48
websocket-client 1.6.0
websockets 10.4
Werkzeug 1.0.1
wheel 0.38.4
widgetsnbextension 4.0.7
xatlas 0.0.7
xformers 0.0.16
xmltodict 0.13.0
xxhash 3.2.0
yarg 0.1.9
yarl 1.9.2
zipp 3.15.0
zstd 1.5.5.1

@wangmiaowei
Author

By the way: /root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
[2023-06-21 16:23:25,908][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:25,988][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
Python 3.9.16 (you have 3.9.16)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
ERROR:composer.cli.launcher:Rank 6 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 1594036) exited with code 143
Global rank 1 (PID 1594037) exited with code 143
----------Begin global rank 1 STDOUT----------
[2023-06-21 16:23:25,799][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:25,863][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
Python 3.9.16 (you have 3.9.16)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")

----------End global rank 1 STDERR----------
Global rank 2 (PID 1594038) exited with code 143
----------Begin global rank 2 STDOUT----------
[2023-06-21 16:23:26,238][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:26,295][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
Python 3.9.16 (you have 3.9.16)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 2 STDOUT----------
----------Begin global rank 2 STDERR----------
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")

----------End global rank 2 STDERR----------
Global rank 3 (PID 1594039) exited with code 143
----------Begin global rank 3 STDOUT----------
[2023-06-21 16:23:26,143][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:26,194][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
Python 3.9.16 (you have 3.9.16)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 3 STDOUT----------
----------Begin global rank 3 STDERR----------
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")

----------End global rank 3 STDERR----------
Global rank 4 (PID 1594040) exited with code 143
----------Begin global rank 4 STDOUT----------
[2023-06-21 16:23:26,017][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:26,068][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
Python 3.9.16 (you have 3.9.16)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 4 STDOUT----------
----------Begin global rank 4 STDERR----------
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")

----------End global rank 4 STDERR----------
Global rank 5 (PID 1594041) exited with code 143
----------Begin global rank 5 STDOUT----------
[2023-06-21 16:23:25,865][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:25,922][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
Python 3.9.16 (you have 3.9.16)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 5 STDOUT----------
----------Begin global rank 5 STDERR----------
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")

----------End global rank 5 STDERR----------
Global rank 6 (PID 1594042) exited with code 1
----------Begin global rank 6 STDOUT----------
[2023-06-21 16:23:25,761][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:25,863][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
Python 3.9.16 (you have 3.9.16)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 6 STDOUT----------
----------Begin global rank 6 STDERR----------
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
RuntimeError('CUDA error: invalid device ordinal\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

----------End global rank 6 STDERR----------
Global rank 7 (PID 1594043) exited with code 143
----------Begin global rank 7 STDOUT----------
[2023-06-21 16:23:25,912][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:25,994][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
Python 3.9.16 (you have 3.9.16)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 7 STDOUT----------
----------Begin global rank 7 STDERR----------
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")

----------End global rank 7 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 1594036) exited with code 143

@Landanjs
Contributor

Hello, apologies for the delay. This seems like a setup issue, but I can't pinpoint exactly what is going wrong.

A few questions:

  • What changes did you make to the SD-2-base-512.yaml file?
  • Could you try running this with diffusers 0.16? This is the version we used for our training run, and we recently pinned it in our setup.py.
  • The second issue seems to be related to your CUDA install. Could you provide the output of nvidia-smi?
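
Alongside `nvidia-smi`, it may help to compare the number of GPUs PyTorch can see with the number of ranks being launched. "invalid device ordinal" is commonly raised when a process selects a GPU index that is not visible to it. The sketch below assumes each local rank uses the GPU whose index equals its rank; the function names are hypothetical, and the value 8 is taken from the ranks 0 to 7 in the log above.

```python
def ranks_without_device(local_world_size, visible_devices):
    """Return the local ranks whose GPU index is not among the visible devices.

    Assumes local rank i is mapped to GPU index i.
    """
    return [rank for rank in range(local_world_size) if rank >= visible_devices]


def visible_cuda_devices():
    """Return torch.cuda.device_count(), or None when PyTorch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


if __name__ == "__main__":
    count = visible_cuda_devices()
    if count is None:
        print("PyTorch is not installed in this environment")
    else:
        # 8 matches the number of ranks (0-7) in the log above.
        missing = ranks_without_device(8, count)
        print(f"visible GPUs: {count}; ranks without a GPU: {missing}")
```

An empty list would suggest the device count is not the problem, so the cause would have to be looked for elsewhere in the CUDA setup.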
