Windows CUDA unittests jobs are failing #8584

Open
NicolasHug opened this issue Aug 12, 2024 · 0 comments


NicolasHug commented Aug 12, 2024

#8552 migrated all Python 3.8 jobs to 3.9. It took a while and required a bunch of fixes. To avoid blocking it indefinitely, the PR was merged while the Windows CUDA unittests jobs were still failing (see #8552 (comment)).

So the Windows CUDA unittests jobs are still failing, and I don't know why.

logs: https://github.com/pytorch/vision/actions/runs/10699721178/job/29661914922?pr=8623

  File "C:\actions-runner\_work\vision\vision\pytorch\vision\test\smoke_test.py", line 113, in <module>
    main()
  File "C:\actions-runner\_work\vision\vision\pytorch\vision\test\smoke_test.py", line 85, in main
    print(f"{torch.ops.image._jpeg_version() = }")
  File "C:\Jenkins\Miniconda3\envs\ci\lib\site-packages\torch\_ops.py", line 1225, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'image' object has no attribute '_jpeg_version'

A more detailed error, obtained by modifying the import code to be more verbose:

torchvision\io\__init__.py:24: in <module>
    from .image import (
torchvision\io\image.py:11: in <module>
    _load_library("image")
torchvision\extension.py:89: in _load_library
    torch.ops.load_library(lib_path)
C:\Jenkins\Miniconda3\envs\ci\lib\site-packages\torch\_ops.py:1350: in load_library
    ctypes.CDLL(path)
C:\Jenkins\Miniconda3\envs\ci\lib\ctypes\__init__.py:374: in __init__
    self._handle = _dlopen(self._name, mode)
E   FileNotFoundError: Could not find module 'C:\actions-runner\_work\vision\vision\pytorch\vision\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
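
For reference, a more verbose load along those lines could look like the sketch below. This is not the actual torchvision code: `load_library_verbose` is a hypothetical helper that wraps `ctypes.CDLL` and reports whether the file itself exists, which helps tell a missing file apart from a missing dependency.

```python
import ctypes
import os


def load_library_verbose(lib_path):
    """Hypothetical helper: load a shared library, adding context on failure."""
    try:
        return ctypes.CDLL(lib_path)
    except OSError as exc:
        # If the file is present, the failure likely comes from one of its
        # dependencies rather than from the library file itself.
        if os.path.isfile(lib_path):
            hint = "file exists, so a dependency is probably missing"
        else:
            hint = "file does not exist"
        raise OSError(f"Failed to load {lib_path!r} ({hint}): {exc}") from exc
```

In the failure above the `.pyd` file is present, so a helper like this would point at the dependencies, which is what the next step checks.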

Checking the dependencies of image.pyd gives:

runneruser@EC2AMAZ-HAC74MP  /c/actions-runner/_work/vision/vision/pytorch/vision ((efd36a4...))
$ cygcheck.exe torchvision/image.pyd
C:\actions-runner\_work\vision\vision\pytorch\vision\torchvision\image.pyd
  C:\Jenkins\Miniconda3\envs\ci\Library\bin\libpng16.dll
    C:\Jenkins\Miniconda3\envs\ci\zlib.dll
      C:\Windows\system32\VCRUNTIME140.dll
        C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-runtime-l1-1-0.dll
        C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-heap-l1-1-0.dll
        C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-string-l1-1-0.dll
        C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-stdio-l1-1-0.dll
        C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-convert-l1-1-0.dll
        C:\Windows\system32\KERNEL32.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-rtlsupport-l1-1-0.dll
          C:\Windows\system32\ntdll.dll
          C:\Windows\system32\KERNELBASE.dll
            C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\api-ms-win-eventing-provider-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-processthreads-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-processthreads-l1-1-1.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-heap-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-memory-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-handle-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-synch-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-synch-l1-2-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-file-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-file-l1-2-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-namedpipe-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-datetime-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-sysinfo-l1-2-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-sysinfo-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-timezone-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-localization-l1-2-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-processenvironment-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-string-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-debug-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-errorhandling-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-fibers-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-util-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-profile-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-file-l2-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-console-l1-1-0.dll
          C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-console-l1-2-0.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-math-l1-1-0.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-filesystem-l1-1-0.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-time-l1-1-0.dll
  C:\Jenkins\Miniconda3\envs\ci\Library\bin\jpeg8.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-environment-l1-1-0.dll
  C:\Jenkins\Miniconda3\envs\ci\Library\bin\libwebp.dll
    C:\Jenkins\Miniconda3\envs\ci\Library\bin\libsharpyuv.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-utility-l1-1-0.dll
cygcheck: track_down: could not find nvjpeg64_11.dll

cygcheck: track_down: could not find c10.dll

cygcheck: track_down: could not find torch_cpu.dll

cygcheck: track_down: could not find cudart64_110.dll

cygcheck: track_down: could not find c10_cuda.dll

cygcheck: track_down: could not find torch_cuda.dll

  C:\Windows\system32\MSVCP140.dll
    C:\Windows\system32\VCRUNTIME140_1.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-locale-l1-1-0.dll
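
To compare runs like this one more easily, the "could not find" lines can be pulled out of the cygcheck output with a few lines of Python. This is only a sketch: `missing_dlls` is a hypothetical helper, and it assumes the message format shown above.

```python
import re

# Matches lines such as "cygcheck: track_down: could not find c10.dll".
_MISSING_RE = re.compile(r"^cygcheck: track_down: could not find (\S+)$")


def missing_dlls(cygcheck_output):
    """Return the DLL names reported as not found, in order, without duplicates."""
    names = []
    for line in cygcheck_output.splitlines():
        match = _MISSING_RE.match(line.strip())
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


sample = "\n".join([
    r"  C:\Jenkins\Miniconda3\envs\ci\Library\bin\libwebp.dll",
    "cygcheck: track_down: could not find nvjpeg64_11.dll",
    "",
    "cygcheck: track_down: could not find c10.dll",
])
print(missing_dlls(sample))  # ['nvjpeg64_11.dll', 'c10.dll']
```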

So it seems like nvjpeg and other CUDA dependencies cannot be found. In commit b4c05786e6a7f8f6e1a01d3f9c7ccaf7de1c6830 I removed building with nvjpeg support, and confirmed that the import failure went away.

This suggests that adding "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin" to the PATH could fix the problem. But I just tried that by ssh-ing into the machine, and I'm still getting the same error. I confirmed the PATH was OK by running cygcheck again, which now finds nvjpeg64_11.dll:

...
  C:\Jenkins\Miniconda3\envs\ci\Library\bin\libwebp.dll
    C:\Jenkins\Miniconda3\envs\ci\Library\bin\libsharpyuv.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-utility-l1-1-0.dll
  C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvjpeg64_11.dll
cygcheck: track_down: could not find c10.dll

cygcheck: track_down: could not find torch_cpu.dll

  C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\cudart64_110.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-core-interlocked-l1-1-0.dll
cygcheck: track_down: could not find c10_cuda.dll

cygcheck: track_down: could not find torch_cuda.dll

  C:\Windows\system32\MSVCP140.dll
    C:\Windows\system32\VCRUNTIME140_1.dll
    C:\Jenkins\Miniconda3\envs\ci\api-ms-win-crt-locale-l1-1-0.dll
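
One thing that may be worth checking: since Python 3.8, standard CPython builds on Windows no longer search PATH when resolving the dependencies of extension modules or of libraries loaded through ctypes. Only system paths, the directory containing the file, and directories registered with `os.add_dll_directory()` are used. cygcheck does consult PATH, which might explain why it now finds nvjpeg64_11.dll while the Python import still fails. Whether this is the actual cause here is unverified. A minimal sketch of registering the CUDA directory, where `register_dll_dir` is a hypothetical helper and the path is the one from the PATH experiment above:

```python
import os


def register_dll_dir(path):
    """Hypothetical helper: make `path` searchable for DLL dependencies.

    Returns the handle from os.add_dll_directory, or None when that call is
    unavailable (non-Windows) or the directory does not exist.
    """
    if not hasattr(os, "add_dll_directory"):
        return None  # os.add_dll_directory only exists on Windows
    if not os.path.isdir(path):
        return None
    return os.add_dll_directory(path)


# Assumed location, taken from the PATH experiment; adjust to the installed CUDA version.
cuda_bin = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin"
handle = register_dll_dir(cuda_bin)
print("registered" if handle is not None else "skipped")
```

The call would need to run before the library is loaded. The remaining "could not find" entries (c10.dll, torch_cpu.dll, and so on) would need the same treatment for whichever directory holds them.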

CC @atalman @malfet

@NicolasHug NicolasHug mentioned this issue Sep 3, 2024