
[bug] failed to install torch-tensorrt #3903

Open
3 of 6 tasks
geraldstanje opened this issue May 10, 2024 · 12 comments

geraldstanje commented May 10, 2024

Checklist

  • I've prepended issue tag with type of change: [bug]
  • (If applicable) I've attached the script to reproduce the bug
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • I've built my own container based off DLC (and I've attached the code used to build my own image)
Concise Description:
I get the error: failed to install torch-tensorrt.

Error Message:

2024-05-09T18:21:42.631Z INFO: pip is looking at multiple versions of torch-tensorrt to determine which version is compatible with other requirements. This could take a while.
2024-05-09T18:21:42.882Z Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Using cached torch_tensorrt-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB) Using cached torch-tensorrt-0.0.0.post1.tar.gz (9.0 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'error' error: subprocess-exited-with-error × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [13 lines of output] Traceback (most recent call last): File "", line 2, in File "", line 34, in File "/home/model-server/tmp/pip-install-_yc29umj/torch-tensorrt_47a74f002be54836bec3589380d28c89/setup.py", line 125, in raise RuntimeError(open("ERROR.txt", "r").read()) RuntimeError: ########################################################################################### The package you are trying to install is only a placeholder project on PyPI.org repository. To install Torch-TensorRT please run the following command: $ pip install torch-tensorrt -f https://github.com/NVIDIA/Torch-TensorRT/releases ########################################################################################### [end of output] note: This error originates from a subprocess, and is likely not a problem with pip.
2024-05-09T18:21:42.882Z error: metadata-generation-failed
2024-05-09T18:21:42.882Z × Encountered error while generating package metadata.

Entire log:

2024-05-09T17:52:56.655Z	Sagemaker TS environment variables have been set and will be used for single model endpoint.
2024-05-09T17:52:56.655Z	Collecting sagemaker-inference==1.10.1 (from -r /opt/ml/model/code/requirements.txt (line 1)) Downloading sagemaker_inference-1.10.1.tar.gz (23 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done'
2024-05-09T17:52:56.808Z	Collecting setfit==1.0.1 (from -r /opt/ml/model/code/requirements.txt (line 2)) Downloading setfit-1.0.1-py3-none-any.whl.metadata (11 kB)
2024-05-09T17:52:56.808Z	Collecting transformers==4.37.2 (from -r /opt/ml/model/code/requirements.txt (line 3)) Downloading transformers-4.37.2-py3-none-any.whl.metadata (129 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.4/129.4 kB 9.3 MB/s eta 0:00:00
2024-05-09T17:52:56.808Z	Requirement already satisfied: torch==2.1.0 in /opt/conda/lib/python3.10/site-packages (from -r /opt/ml/model/code/requirements.txt (line 4)) (2.1.0+cu118)
2024-05-09T17:52:57.059Z	Collecting optimum (from -r /opt/ml/model/code/requirements.txt (line 5)) Downloading optimum-1.19.2-py3-none-any.whl.metadata (19 kB)
2024-05-09T17:52:57.059Z	Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Downloading torch_tensorrt-1.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
2024-05-09T17:52:57.059Z	Requirement already satisfied: boto3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.28.60)
2024-05-09T17:52:57.059Z	Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.24.4)
2024-05-09T17:52:57.059Z	Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.16.0)
2024-05-09T17:52:57.059Z	Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (5.9.5)
2024-05-09T17:52:57.059Z	Requirement already satisfied: retrying<1.4,>=1.3.3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.3.4)
2024-05-09T17:52:57.059Z	Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.10.1)
2024-05-09T17:52:57.059Z	Collecting datasets>=2.3.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Downloading datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
2024-05-09T17:52:57.059Z	Collecting sentence-transformers>=2.2.1 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Downloading sentence_transformers-2.7.0-py3-none-any.whl.metadata (11 kB)
2024-05-09T17:52:57.310Z	Collecting evaluate>=0.3.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
2024-05-09T17:52:57.310Z	Collecting huggingface-hub>=0.13.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Downloading huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
2024-05-09T17:52:57.560Z	Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.10/site-packages (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) (1.1.3)
2024-05-09T17:52:57.560Z	Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (3.13.1)
2024-05-09T17:52:57.560Z	Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (23.1)
2024-05-09T17:52:58.061Z	Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (6.0)
2024-05-09T17:52:58.061Z	Collecting regex!=2019.12.17 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Downloading regex-2024.4.28-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 kB 17.1 MB/s eta 0:00:00
2024-05-09T17:52:58.311Z	Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (2.31.0)
2024-05-09T17:52:58.562Z	Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
2024-05-09T17:52:58.562Z	Collecting safetensors>=0.4.1 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Downloading safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
2024-05-09T17:52:58.562Z	Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (4.66.4)
2024-05-09T17:52:58.562Z	Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (4.9.0)
2024-05-09T17:52:58.563Z	Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (1.12)
2024-05-09T17:52:58.563Z	Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (3.2.1)
2024-05-09T17:52:58.563Z	Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (3.1.4)
2024-05-09T17:52:58.563Z	Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (2023.12.2)
2024-05-09T17:52:58.814Z	Collecting coloredlogs (from optimum->-r /opt/ml/model/code/requirements.txt (line 5)) Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
2024-05-09T17:52:58.814Z	INFO: pip is looking at multiple versions of torch-tensorrt to determine which version is compatible with other requirements. This could take a while.
2024-05-09T17:52:59.065Z	Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Downloading torch_tensorrt-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB) Downloading torch-tensorrt-0.0.0.post1.tar.gz (9.0 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'error' error: subprocess-exited-with-error × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [13 lines of output] Traceback (most recent call last): File "<string>", line 2, in <module> File "<pip-setuptools-caller>", line 34, in <module> File "/home/model-server/tmp/pip-install-ndpb_izf/torch-tensorrt_1eaee9fc2794472ca9b57c4ba02da88f/setup.py", line 125, in <module> raise RuntimeError(open("ERROR.txt", "r").read()) RuntimeError: ########################################################################################### The package you are trying to install is only a placeholder project on PyPI.org repository. To install Torch-TensorRT please run the following command: $ pip install torch-tensorrt -f https://github.com/NVIDIA/Torch-TensorRT/releases ########################################################################################### [end of output] note: This error originates from a subprocess, and is likely not a problem with pip.
2024-05-09T17:52:59.065Z	error: metadata-generation-failed
2024-05-09T17:52:59.065Z	× Encountered error while generating package metadata.
2024-05-09T17:52:59.065Z	╰─> See above for output.
2024-05-09T17:52:59.065Z	note: This is an issue with the package mentioned above, not pip.
2024-05-09T17:52:59.316Z	hint: See above for details.
2024-05-09T17:52:59.316Z	2024-05-09 17:52:59,107 - sagemaker-inference - ERROR - failed to install required packages, exiting
2024-05-09T17:52:59.316Z	Traceback (most recent call last): File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/model_server.py", line 41, in _install_requirements subprocess.check_call(pip_install_cmd) File "/opt/conda/lib/python3.10/subprocess.py", line 369, in check_call raise CalledProcessError(retcode, cmd)
2024-05-09T17:52:59.316Z	subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-m', 'pip', 'install', '-r', '/opt/ml/model/code/requirements.txt']' returned non-zero exit status 1.
2024-05-09T17:52:59.316Z	During handling of the above exception, another exception occurred:
2024-05-09T17:52:59.316Z	Traceback (most recent call last): File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module> serving.main() File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main _start_torchserve() File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f return Retrying(*dargs, **dkw).call(f, *args, **kw) File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call return attempt.get(self._wrap_exception) File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get six.reraise(self.value[0], self.value[1], self.value[2]) File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise raise value File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call attempt = Attempt(fn(*args, **kwargs), attempt_number, False) File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve torchserve.start_torchserve(handler_service=HANDLER_SERVICE) File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 79, in start_torchserve model_server._install_requirements() File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/model_server.py", line 44, in _install_requirements raise ValueError("failed to install required packages")
2024-05-09T17:53:01.977Z	ValueError: failed to install required packages
2024-05-09T17:53:02.072Z	Sagemaker TS environment variables have been set and will be used for single model endpoint.
2024-05-09T17:53:02.573Z	Collecting sagemaker-inference==1.10.1 (from -r /opt/ml/model/code/requirements.txt (line 1)) Using cached sagemaker_inference-1.10.1.tar.gz (23 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done'
2024-05-09T17:53:02.573Z	Collecting setfit==1.0.1 (from -r /opt/ml/model/code/requirements.txt (line 2)) Using cached setfit-1.0.1-py3-none-any.whl.metadata (11 kB)
2024-05-09T17:53:02.573Z	Collecting transformers==4.37.2 (from -r /opt/ml/model/code/requirements.txt (line 3)) Using cached transformers-4.37.2-py3-none-any.whl.metadata (129 kB)
2024-05-09T17:53:02.573Z	Requirement already satisfied: torch==2.1.0 in /opt/conda/lib/python3.10/site-packages (from -r /opt/ml/model/code/requirements.txt (line 4)) (2.1.0+cu118)
2024-05-09T17:53:02.573Z	Collecting optimum (from -r /opt/ml/model/code/requirements.txt (line 5)) Using cached optimum-1.19.2-py3-none-any.whl.metadata (19 kB)
2024-05-09T17:53:02.573Z	Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Using cached torch_tensorrt-1.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
2024-05-09T17:53:02.573Z	Requirement already satisfied: boto3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.28.60)
2024-05-09T17:53:02.573Z	Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.24.4)
2024-05-09T17:53:02.573Z	Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.16.0)
2024-05-09T17:53:02.573Z	Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (5.9.5)
2024-05-09T17:53:02.573Z	Requirement already satisfied: retrying<1.4,>=1.3.3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.3.4)
2024-05-09T17:53:02.824Z	Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.10.1)
2024-05-09T17:53:02.824Z	Collecting datasets>=2.3.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Using cached datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
2024-05-09T17:53:02.824Z	Collecting sentence-transformers>=2.2.1 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Using cached sentence_transformers-2.7.0-py3-none-any.whl.metadata (11 kB)
2024-05-09T17:53:02.824Z	Collecting evaluate>=0.3.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Using cached evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
2024-05-09T17:53:02.824Z	Collecting huggingface-hub>=0.13.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Using cached huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
2024-05-09T17:53:03.326Z	Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.10/site-packages (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) (1.1.3)
2024-05-09T17:53:03.326Z	Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (3.13.1)
2024-05-09T17:53:03.326Z	Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (23.1)
2024-05-09T17:53:03.576Z	Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (6.0)
2024-05-09T17:53:03.576Z	Collecting regex!=2019.12.17 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Using cached regex-2024.4.28-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
2024-05-09T17:53:03.826Z	Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (2.31.0)
2024-05-09T17:53:04.077Z	Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Using cached tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
2024-05-09T17:53:04.077Z	Collecting safetensors>=0.4.1 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Using cached safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
2024-05-09T17:53:04.077Z	Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (4.66.4)
2024-05-09T17:53:04.077Z	Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (4.9.0)
2024-05-09T17:53:04.077Z	Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (1.12)
2024-05-09T17:53:04.077Z	Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (3.2.1)
2024-05-09T17:53:04.077Z	Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (3.1.4)
2024-05-09T17:53:04.077Z	Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (2023.12.2)
2024-05-09T17:53:04.328Z	Collecting coloredlogs (from optimum->-r /opt/ml/model/code/requirements.txt (line 5)) Using cached coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
2024-05-09T17:53:04.328Z	INFO: pip is looking at multiple versions of torch-tensorrt to determine which version is compatible with other requirements. This could take a while.
2024-05-09T17:53:04.578Z	Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Using cached torch_tensorrt-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB) Using cached torch-tensorrt-0.0.0.post1.tar.gz (9.0 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'error' error: subprocess-exited-with-error × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [13 lines of output] Traceback (most recent call last): File "<string>", line 2, in <module> File "<pip-setuptools-caller>", line 34, in <module> File "/home/model-server/tmp/pip-install-ou8dudye/torch-tensorrt_f613c1ea02ee46eba6289ad76ccd02c4/setup.py", line 125, in <module> raise RuntimeError(open("ERROR.txt", "r").read()) RuntimeError: ########################################################################################### The package you are trying to install is only a placeholder project on PyPI.org repository. To install Torch-TensorRT please run the following command: $ pip install torch-tensorrt -f https://github.com/NVIDIA/Torch-TensorRT/releases ########################################################################################### [end of output] note: This error originates from a subprocess, and is likely not a problem with pip.
2024-05-09T17:53:04.578Z	error: metadata-generation-failed
2024-05-09T17:53:04.578Z	× Encountered error while generating package metadata.
2024-05-09T17:53:04.578Z	╰─> See above for output.
2024-05-09T17:53:04.578Z	note: This is an issue with the package mentioned above, not pip.
2024-05-09T17:53:04.578Z	hint: See above for details.
2024-05-09T17:53:04.578Z	2024-05-09 17:53:04,566 - sagemaker-inference - ERROR - failed to install required packages, exiting

code/requirements.txt:

sagemaker-inference==1.10.1
setfit==1.0.1
transformers==4.37.2
torch==2.1.0
optimum
torch-tensorrt
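For reference, the RuntimeError in the log suggests its own fix: point pip at the NVIDIA release page instead of resolving torch-tensorrt purely from PyPI. A sketch of a requirements.txt following that hint (whether the resulting version resolves cleanly against torch==2.1.0 is an assumption that would need verification):

```text
sagemaker-inference==1.10.1
setfit==1.0.1
transformers==4.37.2
torch==2.1.0
optimum
# requirements files accept pip options; -f (--find-links) adds the
# release page suggested by the error message as an extra package source
-f https://github.com/NVIDIA/Torch-TensorRT/releases
torch-tensorrt
```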

DLC image/dockerfile:
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310

Current behavior:
Error while installing torch-tensorrt.

Expected behavior:
torch-tensorrt installs without error.

Additional context:

Can I extend the deep learning image for SageMaker as follows, push this image to AWS ECR, and use that image to deploy my SageMaker inference endpoint? How does the model artifact (code/inference.py, code/requirements.txt, the model, etc.) get copied into the Docker container?

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310

RUN pip install torch-tensorrt -f https://github.com/NVIDIA/Torch-TensorRT/releases

I see there are two images: can I use both for SageMaker, or only the second one?

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310

vs.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker

Also, the torch-tensorrt 2.2.0 wheel file is available here: https://pypi.org/project/torch-tensorrt/2.2.0/ - why can't pip find it?

cc @tejaschumbalkar @joaopcm1996

Also, TorchServe is already at version 0.10 - how can I use that version with 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310 or 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker? cc @sirutBuasai

@sirutBuasai
Contributor

Hi @geraldstanje, we have recently updated the TorchServe version to 0.11.0. Please pull the latest images to use it.

For TensorRT, we'll need repro steps. However, we suggest taking a look at the DJL TensorRT containers if you're interested in that.

For extending DLCs, you can do so as you outlined. Model artifacts are copied into the container at runtime by the Python SDK (which I'm assuming is what you're using) through a docker run command.

For the image tags, the two images you outlined are the same image even though the tags are different. Note, though, that 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310 is in us-west-2, while 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker is in us-east-1.
If you want to look at all available tags, you can find them in the GitHub release tags and available_images.md.
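Since the two URIs differ only in region and tag alias, the layout can be illustrated with a small sketch. The URI structure here is inferred from the examples quoted in this thread; for real deployments the SageMaker Python SDK provides a supported lookup (sagemaker.image_uris.retrieve) instead of hand-building strings.

```python
# Sketch: AWS DLC image URIs follow
#   <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
# (layout inferred from the URIs quoted in this thread)
DLC_ACCOUNT = "763104351884"

def dlc_image_uri(region: str, repository: str, tag: str) -> str:
    """Compose a Deep Learning Container image URI for a region and tag."""
    return f"{DLC_ACCOUNT}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# The two images discussed above: same build, different region and tag alias
west = dlc_image_uri("us-west-2", "pytorch-inference", "2.1-gpu-py310")
east = dlc_image_uri("us-east-1", "pytorch-inference",
                     "2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker")
print(west)
print(east)
```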

@geraldstanje
Author

geraldstanje commented May 23, 2024

we have recently updated torchServe version to 0.11.0. Please pull the latest images to use them.

What's the name of that PyTorch image? e.g. 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker

Is that what you're referring to? https://github.com/aws/deep-learning-containers/tree/master/pytorch/inference/docker/2.2/py3

  • Can you also confirm that the CUDA driver matches for PyTorch 2.2 and is >= 11.8, which is also required by pytorch/TensorRT: https://github.com/pytorch/TensorRT/releases
  • Can I extend the image and install torch-tensorrt 2.2 with this new image?

For tensor-rt, we'll require a repro steps to do so. However, we suggest taking a look at DJL TensorRT containers if you would be interested in that.

Why switch to a different image? torch-tensorrt and TensorRT can be used with TorchServe...

@sirutBuasai
Contributor

sirutBuasai commented May 23, 2024

Any supported PyTorch (PT 1.13, 2.1, 2.2) inference image would work; they all have TorchServe 0.11.0. Generally, you can pull images with the following tags:

2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker
2.1-gpu-py310
2.1.0-gpu-py310

These tags pull our latest release and are moved to the newest image every time we release a patch.

However, you may see some tags such as

2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8
2.1-gpu-py310-cu118-ubuntu20.04-sagemaker-v1
2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8-2024-05-22-19-30-53

These tags represent specific patch releases, so using them pulls the exact image that was released on a certain date.
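The alias-versus-pinned distinction described above can be sketched as a check. This is a hypothetical helper: the `-vN[.M]` and timestamp suffix pattern is an assumption inferred only from the tag examples listed in this thread, not a documented naming contract.

```python
import re

# Hypothetical classifier: pinned patch-release tags appear to end in
# "-vN" or "-vN.M", optionally followed by a "-YYYY-MM-DD-HH-MM-SS"
# timestamp, while moving alias tags have no such suffix.
PINNED_SUFFIX = re.compile(r"-v\d+(\.\d+)?(-\d{4}(-\d{2}){5})?$")

def is_pinned(tag: str) -> bool:
    """Return True for a specific patch-release tag, False for a moving alias."""
    return PINNED_SUFFIX.search(tag) is not None

print(is_pinned("2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker"))       # alias
print(is_pinned("2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8"))  # pinned
```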

can you also confirm that cuda driver matches for pytorch 2.2 and is >= 11.8 - which is also required by pytorch/TensorRT: https://github.com/pytorch/TensorRT/releases

Yes, our gpu inference image is using cuda 11.8.

i can extect the image and install torch-tensorrt 2.2 with this new image?

We don't expect any installation error with TensorRT, but you're welcome to outline repro steps if you encounter issues, and we'll be happy to reproduce and assist.

why switch to a different image? torch-tensorrt and tensorrt can be used with torchServe...

DJL containers offer TensorRT out of the box, while our regular DLCs do not. DJL containers can also be extended similarly to build your own custom containers. For more information, see the DJL containers documentation.

@geraldstanje
Author

geraldstanje commented May 23, 2024

@sirutBuasai are you also going to release a new pytorch-inference image with CUDA 12.x?

@sirutBuasai
Contributor

Not for PyTorch 2.1 and 2.2 Inference.

However, we are working on PyTorch 2.3 Inference with CUDA 12.1.
Feel free to track this PR for when it will be released.

@geraldstanje
Author

geraldstanje commented May 23, 2024

@sirutBuasai is there any timeline for when PyTorch 2.3 Inference with CUDA 12.1 will be available?

Are you also updating the Triton inference image to CUDA 12.x soon?

@sirutBuasai
Contributor

We are aiming for 6/7 for PyTorch 2.3 Inference with CUDA 12.1.

Which triton image are you referring to?

@geraldstanje
Author

geraldstanje commented May 23, 2024

@sirutBuasai I mean the NVIDIA Triton Inference Server: https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only - can someone build Triton Inference Server Release 24.05?

I don't see the nvidia-triton-inference-containers image in this GitHub repo... can you send me the link?

cc @nskool

@sirutBuasai
Contributor

@nskool Could you assist with triton image questions?

@geraldstanje
Author

geraldstanje commented May 26, 2024

@sirutBuasai - if you go to the following link, it says:

Dependencies
These are the following dependencies used to verify the testcases.
Torch-TensorRT can work with other versions, but the tests are not guaranteed to pass.

Bazel 5.2.0
Libtorch 2.4.0.dev (latest nightly) (built with CUDA 12.1)
CUDA 12.1
TensorRT 10.0.1.6

https://github.com/pytorch/TensorRT

I use torch-tensorrt 2.2.0 with DLC 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.10 and get the error:

predict_fn error: backend='torch_tensorrt' raised: TypeError: pybind11::init(): factory function returned nullptr

But when I run it on EC2 with CUDA, it works fine. It seems I cannot use CUDA 11 and need CUDA 12.x for torch-tensorrt 2.2.0...
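The nullptr error appearing on the cu118 DLC but not on the CUDA 12 EC2 box is consistent with a CUDA major-version mismatch. A pre-flight check could fail fast before the endpoint crashes; cuda_major and runtime_matches below are illustrative helpers (not part of torch-tensorrt), and the CUDA 12.x requirement for the 2.2.0 wheels is as described above.

```python
def cuda_major(version: str) -> int:
    """Return the major component of a CUDA version string such as '11.8'."""
    return int(version.split(".")[0])

def runtime_matches(required: str, available: str) -> bool:
    """True when the runtime's CUDA major version meets the wheel's requirement.

    torch-tensorrt 2.2.0 wheels target CUDA 12.x, so a cu118 container
    (CUDA 11.8) is expected to fail this check.
    """
    return cuda_major(available) >= cuda_major(required)

# Inside the container you would pass torch.version.cuda as `available`, e.g.:
#   runtime_matches("12.1", torch.version.cuda)
print(runtime_matches("12.1", "11.8"))  # cu118 DLC -> False
print(runtime_matches("12.1", "12.1"))  # CUDA 12 EC2 host -> True
```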

@geraldstanje
Author

geraldstanje commented May 29, 2024

regarding NVIDIA Triton Inference Server

NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.3 driver version 545.23.08 with kernel driver version 470.182.03.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

cc @nskool @sirutBuasai

@sirutBuasai
Contributor

For the TensorRT installation error, could you provide the following:

  1. The DLC used, or any Dockerfile artifact you've built on top of our DLC, if applicable.
  2. Steps to reproduce the error including any installation commands or scripts used.
