-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPUContainerImage schema with os, arch and cache info - move GPU container images to config #5153
base: master
Are you sure you want to change the base?
Conversation
…r/move_gpu_img_config
…r/gpu_img_with_details
schemas/components.cue
Outdated
#ContainerImage: { | ||
downloadURL: string | ||
amd64OnlyVersions: [...string] | ||
multiArchVersionsV2: [...#VersionV2] | ||
} | ||
|
||
#GPUContainerImage: { | ||
downloadURL: string | ||
multiArchVersionsV2: [...#VersionV2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any recommendations for alternate names for multiArchVersionsV2
? Thought of Version
but VersionV2
already exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gpuImageVersion
?
I 'd love to have GPU in capital letters but it's not following the naming convention.
imageName=$(echo "$downloadURL" | sed 's/:.*$//') | ||
|
||
# Get the latestVersion | ||
latestVersion=$(echo "${gpuImageToBePulled}" | jq -r '.multiArchVersionsV2[0].latestVersion') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you only put [0], later on when you add another multiArchVersionsV2[1] block in components.json, it won't capture it.
Suggested to loop the whole multiArchVersionsV2
array. And if you only have 1 multiArchVersionsV2 entry in components.json, it will just work like using [0].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 example for your reference:
At around ln414.
MULTI_ARCH_VERSIONS=()
updateMultiArchVersions "${imageToBePulled}"
Actually you can reuse the function updateMultiArchVersions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated it to reuse that function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, under no circumstances would we need to pull two CUDA images for the same VHD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, maybe create a gpuImageVersion
and it's just simple struct instead of an array of struct? Then it's much cleaner and explicit enough. Otherwise one day someone will ask, why we were only picking the first element of an array, but we didn't just create a non-array struct? Does that sound making sense to you? In terms of implementation, your way is absolutely right. I was just thinking 1 more step beyond for future readability and maintainability. When I was migrating to components.json, I had a lot of such questions. 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, agreed. Changed it to so we constrain it to one gpuImageVersion per GPUContainerImage:
#GPUContainerImage: {
downloadURL: string
gpuIVersion: #VersionV2
cached: bool
osSelectors?: [...#OSSelector]
}
|
||
shouldPull=0 # Default to not pull | ||
|
||
if [[ -n "$osSelectors" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested to put this logic into a function and add unit tests to cover most of the if conditions, so that we don't need to rely on abe2e or RP-e2e to capture issues for us.
One way to do that is put the function into cse_helpers.sh. There is a shellspec unit test file cse_helpers.sh which have some examples to author tests.
The root level readme has some instructions too.
mkdir -p /opt/{actions,gpu} | ||
|
||
# Check for the "fullgpu" feature flag | ||
if grep -q "fullgpu" <<< "$FEATURE_FLAGS"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, avoid more than 2 level nested if. It's hard to keep track which level it is for debugging and readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - thinking about the alternate way.
But I think this approach is making it a lot more complex than the alternate PR, which is much smaller: #5138
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. General vs flexible is always a trade-off. If it can fit your mid-future GPU images, I am fine with it too as this will be used by GPU container images.
echo "Installing GPU driver from image: $fullImage" | ||
bash -c "$CTR_GPU_INSTALL_CMD $fullImage gpuinstall /entrypoint.sh install" | ||
ret=$? | ||
if [[ "$ret" != "0" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, avoid more than 2 level nested if. It's hard to keep track which level it is for debugging and readability.
"renovateTag": "registry=https://mcr.microsoft.com, name=aks/aks-gpu-grid", | ||
"latestVersion": "535.161.08-20241021235607" | ||
} | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR moves GPU versions to a config file (components.json), so that Renovate bot can auto-update it. VHD builds will now consume the cuda version from components.json. It also adds a new schema to auto-update.
There are two new requirements:
aks-gpu-cuda container image is only downloaded for particular combo of OS and arch (Ubuntu - amd64),
aks-gpu-grid container image needs to be present in the config but is never downloaded in the VHD. It's only used in CSE, for certain SKUs.
Which issue(s) this PR fixes:
Fixes #
Requirements:
Special notes for your reviewer:
Release note: