Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPUContainerImage schema with os, arch and cache info - move GPU container images to config #5153

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

ganeshkumarashok
Copy link
Contributor

@ganeshkumarashok ganeshkumarashok commented Oct 24, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR moves GPU versions to a config file (components.json), so that Renovate bot can auto-update it. VHD builds will now consume the cuda version from components.json. It also adds a new schema to auto-update.

There are two new requirements:

aks-gpu-cuda container image is only downloaded for particular combo of OS and arch (Ubuntu - amd64),
aks-gpu-grid container image needs to be present in the config but is never downloaded in the VHD. It's only used in CSE, for certain SKUs.

Which issue(s) this PR fixes:

Fixes #

Requirements:

Special notes for your reviewer:

Release note:

none

#ContainerImage: {
downloadURL: string
amd64OnlyVersions: [...string]
multiArchVersionsV2: [...#VersionV2]
}

#GPUContainerImage: {
downloadURL: string
multiArchVersionsV2: [...#VersionV2]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any recommendations for alternate names for multiArchVersionsV2? Thought of Version but VersionV2 already exists

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gpuImageVersion?
I 'd love to have GPU in capital letters but it's not following the naming convention.

imageName=$(echo "$downloadURL" | sed 's/:.*$//')

# Get the latestVersion
latestVersion=$(echo "${gpuImageToBePulled}" | jq -r '.multiArchVersionsV2[0].latestVersion')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you only put [0], later on when you add another multiArchVersionsV2[1] block in components.json, it won't capture it.
Suggested to loop the whole multiArchVersionsV2 array. And if you only have 1 multiArchVersionsV2 entry in components.json, it will just work like using [0].

Copy link
Collaborator

@Devinwong Devinwong Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 example for your reference:
At around ln414.

  MULTI_ARCH_VERSIONS=()
  updateMultiArchVersions "${imageToBePulled}"

Actually you can reuse the function updateMultiArchVersions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated it to reuse that function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, under no circumstances would we need to pull two CUDA images for the same VHD

Copy link
Collaborator

@Devinwong Devinwong Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, maybe create a gpuImageVersion and it's just simple struct instead of an array of struct? Then it's much cleaner and explicit enough. Otherwise one day someone will ask, why we were only picking the first element of an array, but we didn't just create a non-array struct? Does that sound making sense to you? In terms of implementation, your way is absolutely right. I was just thinking 1 more step beyond for future readability and maintainability. When I was migrating to components.json, I had a lot of such questions. 😂

Copy link
Contributor Author

@ganeshkumarashok ganeshkumarashok Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, agreed. Changed it to so we constrain it to one gpuImageVersion per GPUContainerImage:

#GPUContainerImage: {
downloadURL: string
gpuIVersion: #VersionV2
cached: bool
osSelectors?: [...#OSSelector]
}

@ganeshkumarashok
Copy link
Contributor Author

This is the much simpler alternate PR we considered (without OS, arch): #5138, and the more complex alternate (adding it to ContainerImages): #5139


shouldPull=0 # Default to not pull

if [[ -n "$osSelectors" ]]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested to put this logic into a function and add unit tests to cover most of the if conditions, so that we don't need to rely on abe2e or RP-e2e to capture issues for us.
One way to do that is put the function into cse_helpers.sh. There is a shellspec unit test file cse_helpers.sh which have some examples to author tests.
The root level readme has some instructions too.

mkdir -p /opt/{actions,gpu}

# Check for the "fullgpu" feature flag
if grep -q "fullgpu" <<< "$FEATURE_FLAGS"; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, avoid more than 2 level nested if. It's hard to keep track which level it is for debugging and readability.

Copy link
Contributor Author

@ganeshkumarashok ganeshkumarashok Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - thinking about the alternate way.

But I think this approach is making it a lot more complex than the alternate PR, which is much smaller: #5138

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. General vs flexible is always a trade-off. If it can fit your mid-future GPU images, I am fine with it too as this will be used by GPU container images.

echo "Installing GPU driver from image: $fullImage"
bash -c "$CTR_GPU_INSTALL_CMD $fullImage gpuinstall /entrypoint.sh install"
ret=$?
if [[ "$ret" != "0" ]]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, avoid more than 2 level nested if. It's hard to keep track which level it is for debugging and readability.

"renovateTag": "registry=https://mcr.microsoft.com, name=aks/aks-gpu-grid",
"latestVersion": "535.161.08-20241021235607"
}
],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants