This tutorial covers how to use Llama 3.1 405B with vLLM and what performance you should expect when running it. We are actively working with the vLLM community to upstream Llama 3.1 405B support, but until then please use the repos detailed below.
Hardware assumptions: this tutorial uses two 8xH100 instances (i.e., p5.48xlarge on AWS), but similar hardware should provide similar results.
For the steps going forward, we highly recommend using hf_transfer when downloading any of the Llama 3.1 405B checkpoints from Hugging Face to get the best throughput. On an AWS instance, the checkpoint downloads in about 20-30 minutes. In vLLM, hf_transfer should be enabled by default if the package is installed (vllm-project/vllm#3817).
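If you prefer to pre-download the weights rather than letting vLLM fetch them on first launch, a minimal sketch (assuming huggingface_hub[hf_transfer] is installed as in the setup below and your Hugging Face account has been granted access to the gated checkpoint) looks like this:
# optional pre-download sketch; adjust --local-dir to your storage path
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli login   # or export HF_TOKEN=<your token>
huggingface-cli download meta-llama/Meta-Llama-3.1-405B --local-dir ./Meta-Llama-3.1-405B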
We recommend setting up a virtual environment to isolate your dependencies and avoid potential conflicts.
On each node:
# we recommend setting up a virtual environment for this
virtualenv llama3-venv
source llama3-venv/bin/activate
# Faster checkpoint download speed via hf_transfer.
pip install "huggingface_hub[hf_transfer]"
# Install vLLM from Snowflake-Labs. This may take 5-10 minutes.
pip install git+https://github.com/Snowflake-Labs/vllm.git@llama3-staging-rebase
# Install deepspeed from Snowflake-Labs.
pip install git+https://github.com/Snowflake-Labs/DeepSpeed.git@add-fp8-gemm
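Before wiring up the Ray cluster, you can sanity-check that both forked packages import cleanly (the exact version strings will depend on the branches above):
# optional: confirm the forked vLLM and DeepSpeed builds import correctly
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import deepspeed; print('deepspeed', deepspeed.__version__)"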
On node 1:
ray start --head
On node 2:
ray start --address <node 1 ip>:6379
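Both nodes should now form a single Ray cluster. Running ray status on node 1 should report resources from both machines, i.e., 16 GPUs total for two 8xH100 instances:
# run on node 1; the output should list both nodes and 16 GPUs
ray status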
On node 1:
pip install dataclasses_json
python benchmark_trace.py \
    --backend vllm \
    --trace synth-1k.jsonl \
    --arrival-rate 1 \
    --model meta-llama/Meta-Llama-3.1-405B \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 8 \
    --enable-chunked-prefill \
    --max-num-seqs 64 \
    --max-num-batched-tokens 512 \
    --gpu-memory-utilization 0.95 \
    --use-allgather-pipeline-comm \
    --disable-log-requests
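Here, --pipeline-parallel-size 2 and --tensor-parallel-size 8 together shard the model across all 16 GPUs: two pipeline stages (one per node), with tensor parallelism across the 8 GPUs within each node. The same layout can also be used to serve the model interactively. The following is a sketch, assuming the Snowflake-Labs fork keeps vLLM's stock OpenAI-compatible server entrypoint and accepts the same engine flags used above:
# sketch: serve the model with the same parallelism settings (assumes the fork
# exposes vLLM's standard OpenAI-compatible entrypoint on the default port 8000)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-405B \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 8 \
    --enable-chunked-prefill \
    --max-num-seqs 64 \
    --max-num-batched-tokens 512 \
    --gpu-memory-utilization 0.95 \
    --use-allgather-pipeline-comm

# example request once the server is up
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3.1-405B", "prompt": "Hello, world", "max_tokens": 32}'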