GPUContainerImage schema with os, arch and cache info - move GPU container images to config #5153
base: master
Changes from 9 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,13 +9,27 @@ package components | |
previousLatestVersion?: #ContainerImagePrefetchOptimization | ||
} | ||
|
||
#OSSelector: { | ||
os: string | ||
arch: string | ||
} | ||
|
||
|
||
#ContainerImage: { | ||
downloadURL: string | ||
amd64OnlyVersions: [...string] | ||
multiArchVersionsV2: [...#VersionV2] | ||
} | ||
|
||
#GPUContainerImage: { | ||
downloadURL: string | ||
multiArchVersionsV2: [...#VersionV2] | ||
Review comment: Any recommendations for alternate names for
|
||
cached: bool | ||
osSelectors?: [...#OSSelector] | ||
} | ||
|
||
#Images: [...#ContainerImage] | ||
#GPUImages: [...#GPUContainerImage] | ||
#Packages: [...#Package] | ||
#VersionV2: { | ||
k8sVersion?: string | ||
|
@@ -67,7 +81,8 @@ package components | |
|
||
#Components: { | ||
ContainerImages: #Images | ||
Packages: #Packages | ||
Packages: #Packages | ||
GPUContainerImages?: #GPUImages | ||
} | ||
|
||
#Components |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -351,28 +351,104 @@ INSTALLED_RUNC_VERSION=$(runc --version | head -n1 | sed 's/runc version //') | |
echo " - runc version ${INSTALLED_RUNC_VERSION}" >> ${VHD_LOGS_FILEPATH} | ||
capture_benchmark "${SCRIPT_NAME}_artifact_streaming_download" | ||
|
||
if [[ $OS == $UBUNTU_OS_NAME && $(isARM64) != 1 ]]; then # no ARM64 SKU with GPU now | ||
gpu_action="copy" | ||
NVIDIA_DRIVER_IMAGE_SHA="20241008175307" | ||
export NVIDIA_DRIVER_IMAGE_TAG="550.90.12-${NVIDIA_DRIVER_IMAGE_SHA}" | ||
NVIDIA_DRIVER_IMAGE="mcr.microsoft.com/aks/aks-gpu-cuda" | ||
|
||
mkdir -p /opt/{actions,gpu} | ||
ctr -n k8s.io image pull $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG | ||
if grep -q "fullgpu" <<< "$FEATURE_FLAGS"; then | ||
bash -c "$CTR_GPU_INSTALL_CMD $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG gpuinstall /entrypoint.sh install" | ||
ret=$? | ||
if [[ "$ret" != "0" ]]; then | ||
echo "Failed to install GPU driver, exiting..." | ||
exit $ret | ||
fi | ||
gpu_action="" | ||
declare -A pulled_gpu_images | ||
|
||
# Loop over each GPUContainerImage | ||
while IFS= read -r gpuImageToBePulled; do | ||
# Extract 'cached' field and convert it to lowercase | ||
cached=$(echo "${gpuImageToBePulled}" | jq -r '.cached' | tr '[:upper:]' '[:lower:]') | ||
|
||
if [[ "$cached" != "true" ]]; then | ||
# Skip images that are not meant to be cached | ||
continue | ||
fi | ||
|
||
cat << EOF >> ${VHD_LOGS_FILEPATH} | ||
- nvidia-driver=${NVIDIA_DRIVER_IMAGE_TAG} | ||
EOF | ||
# Extract 'osSelectors' if present | ||
osSelectors=$(echo "${gpuImageToBePulled}" | jq -r '.osSelectors // empty') | ||
|
||
shouldPull=0 # Default to not pull | ||
|
||
if [[ -n "$osSelectors" ]]; then | ||
Review comment: Suggested to put this logic into a function and add unit tests to cover most of the if conditions, so that we don't need to rely on abe2e or RP-e2e to capture issues for us.
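One way to act on this suggestion is sketched below, assuming the selectors have already been flattened to `os/arch` strings before the call. The function name `shouldPullGPUImage` and the OS and architecture values in the usage lines are hypothetical and do not come from the PR. The no-selector default mirrors the `shouldPull=1` fallback in the diff.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): decides whether a GPU image
# should be pulled for the current OS and CPU architecture.
# Usage: shouldPullGPUImage <current_os> <current_arch> [os/arch ...]
# Returns 0 when the image should be pulled, 1 otherwise.
#
# A caller could flatten the JSON selectors with something like:
#   mapfile -t selectors < <(jq -r '.osSelectors[]? | "\(.os)/\(.arch)"' <<< "$gpuImageToBePulled")
shouldPullGPUImage() {
  local current_os="$1" current_arch="$2"
  shift 2

  # No selectors given: pull unconditionally, as the diff does.
  if [[ $# -eq 0 ]]; then
    return 0
  fi

  # Pull when any selector matches both OS and architecture.
  local selector
  for selector in "$@"; do
    if [[ "$selector" == "${current_os}/${current_arch}" ]]; then
      return 0
    fi
  done
  return 1
}

# Illustrative calls; "ubuntu", "mariner", "amd64" and "arm64" are assumed values.
if shouldPullGPUImage "ubuntu" "amd64" "mariner/amd64" "ubuntu/amd64"; then
  echo "pull"
fi
if ! shouldPullGPUImage "ubuntu" "arm64" "ubuntu/amd64"; then
  echo "skip"
fi
```

Because the function only touches its arguments, each branch can be covered by a plain shell test without building a VHD.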
||
# osSelectors is provided; check if current OS and arch match any entry | ||
while IFS= read -r selector; do | ||
os=$(echo "$selector" | jq -r '.os') | ||
arch=$(echo "$selector" | jq -r '.arch') | ||
|
||
if [[ "$os" == "$CURRENT_OS" ]]; then | ||
if [[ "$arch" == "$CPU_ARCH" ]]; then | ||
ganeshkumarashok marked this conversation as resolved.
|
||
shouldPull=1 | ||
break # Found a matching selector | ||
fi | ||
fi | ||
done <<< "$(echo "$osSelectors" | jq -c '.[]')" | ||
else | ||
# No osSelectors provided; decide whether to pull | ||
# Assuming we pull the image if no osSelectors are specified | ||
shouldPull=1 | ||
fi | ||
|
||
if [[ "$shouldPull" == "1" ]]; then | ||
# Extract image details | ||
downloadURL=$(echo "${gpuImageToBePulled}" | jq -r '.downloadURL') | ||
imageName=$(echo "$downloadURL" | sed 's/:.*$//') | ||
|
||
# Get the latestVersion | ||
latestVersion=$(echo "${gpuImageToBePulled}" | jq -r '.multiArchVersionsV2[0].latestVersion') | ||
Review comment: If you only put [0], later on when you add another multiArchVersionsV2[1] block in components.json, it won't capture it.
Reply: 1 example for your reference:
Actually you can reuse the function
Reply: Thanks, updated it to reuse that function
Reply: Actually, under no circumstances would we need to pull two CUDA images for the same VHD
Reply: If that's the case, maybe create a
Reply: Makes sense, agreed. Changed it to so we constrain it to one gpuImageVersion per GPUContainerImage: #GPUContainerImage: {
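With a single version per GPU image, as this thread settles on, building the image reference reduces to one small step. The sketch below is a minimal illustration, assuming `downloadURL` ends in a `:tag` suffix as the `sed` expression in the diff implies. `buildGPUImageRef` is a hypothetical name. It strips from the last colon, whereas the `sed 's/:.*$//'` in the diff strips from the first, so the two differ for URLs that contain a port.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): combines a downloadURL and a
# single version into a pullable image reference.
# Usage: buildGPUImageRef <download_url> <version>
# Prints the reference on stdout; returns 1 if the version is missing.
buildGPUImageRef() {
  local download_url="$1" version="$2"
  # Drop the trailing ":<tag>" part, keeping everything before the last colon.
  local image_name="${download_url%:*}"

  # jq -r prints the string "null" for a missing field, so check both cases.
  if [[ -z "$version" || "$version" == "null" ]]; then
    echo "Error: latestVersion not found for $image_name" >&2
    return 1
  fi

  echo "${image_name}:${version}"
}

# Illustrative call; the ":*" suffix on the URL is an assumption about components.json.
buildGPUImageRef 'mcr.microsoft.com/aks/aks-gpu-cuda:*' '550.90.12-20241008175307'
```

Returning a status instead of calling `exit` leaves the decision to abort with the caller, which also makes the error path easy to test.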
||
|
||
if [[ -z "$latestVersion" || "$latestVersion" == "null" ]]; then | ||
echo "Error: latestVersion not found for $imageName" | ||
exit 1 | ||
fi | ||
|
||
fullImage="$imageName:$latestVersion" | ||
|
||
# Pull the image | ||
echo "Pulling image: $fullImage" | ||
ctr -n k8s.io image pull "$fullImage" | ||
if [[ $? -ne 0 ]]; then | ||
echo "Failed to pull image: $fullImage" | ||
exit 1 | ||
fi | ||
|
||
# Record the pulled image | ||
pulled_gpu_images["$imageName"]="$latestVersion" | ||
|
||
# Set gpu_action if pulling the aks-gpu-cuda image | ||
if [[ "$imageName" == "mcr.microsoft.com/aks/aks-gpu-cuda" ]]; then | ||
gpu_action="copy" | ||
|
||
# Create necessary directories | ||
mkdir -p /opt/{actions,gpu} | ||
|
||
# Check for the "fullgpu" feature flag | ||
if grep -q "fullgpu" <<< "$FEATURE_FLAGS"; then | ||
Review comment: Again, avoid more than 2 level nested if. It's hard to keep track which level it is for debugging and readability.
Reply: I agree - thinking about the alternate way. But I think this approach is making it a lot more complex than the alternate PR, which is much smaller: #5138
Reply: Right. General vs flexible is always a trade-off. If it can fit your mid-future GPU images, I am fine with it too as this will be used by GPU container images.
||
echo "Installing GPU driver from image: $fullImage" | ||
bash -c "$CTR_GPU_INSTALL_CMD $fullImage gpuinstall /entrypoint.sh install" | ||
ret=$? | ||
if [[ "$ret" != "0" ]]; then | ||
Review comment: Again, avoid more than 2 level nested if. It's hard to keep track which level it is for debugging and readability.
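One possible way to flatten this branch is to move it into a function and use guard clauses, so that each condition returns early instead of nesting. The sketch below is only an illustration: `installGPUDriverIfRequested` and `GPU_ROOT` are hypothetical names (the diff hard-codes `/opt`), while `CTR_GPU_INSTALL_CMD`, `FEATURE_FLAGS` and `gpu_action` are taken from the diff and assumed to be set by the surrounding script.

```shell
#!/usr/bin/env bash

# GPU_ROOT is an assumption added so the sketch can run without root.
GPU_ROOT="${GPU_ROOT:-/opt}"
gpu_action=""

# Hypothetical refactor (not part of the PR) of the driver-install branch.
# Usage: installGPUDriverIfRequested <image_name> <full_image>
# Returns 0 on success or when there is nothing to do, else the install status.
installGPUDriverIfRequested() {
  local image_name="$1" full_image="$2"
  local ret

  # Guard 1: only the aks-gpu-cuda image triggers driver handling.
  if [[ "$image_name" != "mcr.microsoft.com/aks/aks-gpu-cuda" ]]; then
    return 0
  fi
  gpu_action="copy"
  mkdir -p "$GPU_ROOT"/{actions,gpu}

  # Guard 2: install only when the "fullgpu" feature flag is present.
  if ! grep -q "fullgpu" <<< "${FEATURE_FLAGS:-}"; then
    return 0
  fi

  echo "Installing GPU driver from image: $full_image"
  bash -c "$CTR_GPU_INSTALL_CMD $full_image gpuinstall /entrypoint.sh install"
  ret=$?
  if [[ "$ret" != "0" ]]; then
    echo "Failed to install GPU driver, exiting..."
    return "$ret"
  fi
}
```

The caller would then write something like `installGPUDriverIfRequested "$imageName" "$fullImage" || exit $?`, which keeps the loop body at a single level of nesting.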
||
echo "Failed to install GPU driver, exiting..." | ||
exit $ret | ||
fi | ||
fi | ||
fi | ||
else | ||
echo "Skipping image $imageName due to osSelector constraints or cached=false." | ||
fi | ||
done <<< "$GPUContainerImages" | ||
|
||
# Log the pulled images | ||
if [[ "${#pulled_gpu_images[@]}" -gt 0 ]]; then | ||
echo "Logging pulled GPU images to $VHD_LOGS_FILEPATH" | ||
for imageName in "${!pulled_gpu_images[@]}"; do | ||
imageVersion=${pulled_gpu_images[$imageName]} | ||
echo " - $imageName=$imageVersion" >> "$VHD_LOGS_FILEPATH" | ||
done | ||
else | ||
echo "No GPU images were pulled." | ||
fi | ||
|
||
|
||
ls -ltr /opt/gpu/* >> ${VHD_LOGS_FILEPATH} | ||
|
||
installBpftrace | ||
|
Review comment: indent