Split AI Lab Recipes from RHEL AI Images #771

Open
cooktheryan opened this issue Aug 28, 2024 · 13 comments

Comments

@cooktheryan
Collaborator

Currently it is very difficult to understand how to contribute new recipes to this repository as it has grown to include additional things outside the scope of the podman extension recipes. The idea would be to somehow split the repositories so that the various stakeholders still have what is needed while making it easy for the community or RH contributors to add content to the individual pieces important to them.

/cc @sallyom @MichaelClifford @rhatdan @Gregory-Pereira

@rhatdan
Member

rhatdan commented Aug 28, 2024

There is talk about splitting the training section out into its own repository or repositories. The question is whether there should be one or three: ai-training-amd (one repository per vendor) or ai-training/amd (a single repository with vendor subfolders).

@fabiendupont
Contributor

Yes, I have started the exercise to move the bootc images outside of ai-lab-recipes.

Here is an example GitHub org showing what it could look like: https://github.com/smgglrs-ai/.
The images are pushed to quay.io/smgglrs-ai and a sample of CentOS Stream images is already present. I have confirmed that the RHEL images build, but have not pushed them.
Fedora tends to be a bit more complicated, because none of the vendors provide RPMs for Fedora 40. So, I shelved Fedora for now.

These images are meant to be used as base images to install AI Lab recipes, so they only have the hardware enablement components and no prebaked application container images or cloud specific tools.

In my opinion, the application container images should be added as we specialize the image for a given recipe. And if we want to ship to a specific cloud, we should add the relevant packages during the final image (AMI, VHD, etc.) build, probably as an image builder feature.

@slemeur

slemeur commented Aug 29, 2024

cc @jeffmaury @benoitf

@fabiendupont
Contributor

Here is a proposal for creating new repositories under https://github.com/containers:

driver-toolkit

This container image can be used by any stack to build out-of-tree drivers for a given kernel.
The images will be tagged with the kernel version, so it's easy to know which kernel it can be used for.
To build images, one would have to pass a build argument with the kernel version. This can be found via skopeo inspect in the Makefile.
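A minimal sketch of that flow, under stated assumptions: the kernel version is read from a label in the `skopeo inspect` output of the base image, passed as a build argument, and reused as the image tag. The label name `ostree.linux`, the build argument name `KERNEL_VERSION`, and the helper names are assumptions for illustration, not something this proposal specifies.

```python
import json
import subprocess


def inspect_base_image(base_image: str) -> str:
    """Return the raw JSON printed by `skopeo inspect` for a registry image."""
    result = subprocess.run(
        ["skopeo", "inspect", f"docker://{base_image}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def kernel_version_from_inspect(inspect_json: str, label: str = "ostree.linux") -> str:
    """Extract the kernel version from the image labels.

    The default label name is an assumption; adjust it to whatever the
    base image actually exposes.
    """
    labels = json.loads(inspect_json).get("Labels") or {}
    if label not in labels:
        raise ValueError(f"label {label!r} not found in image labels")
    return labels[label]


def build_command(image_repo: str, kernel_version: str, context: str = ".") -> list[str]:
    """Build the podman command line, tagging the image with the kernel version."""
    return [
        "podman", "build",
        "--build-arg", f"KERNEL_VERSION={kernel_version}",  # hypothetical arg name
        "--tag", f"{image_repo}:{kernel_version}",
        context,
    ]
```

In a Makefile the same lookup would typically be a one-line shell call; the Python form just makes the three steps (inspect, extract, build) explicit.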

bootc-amd-rocm, bootc-intel-gaudi, bootc-nvidia-cuda

The bootc images are derived from the {fedora,centos,rhel}-bootc images. They enable the hardware accelerator for a given stack, up to the container runtime configuration.
The naming convention includes both the vendor and the stack, in order to allow multiple stacks per vendor. For example, Intel Gaudi and Falcon Shores will coexist.
The output of this repository is base images without any pre-loaded container image, letting users layer them in a separate flow, keeping the bootc-<vendor> images generic. We would keep the additional storage configuration, so that users only have to use podman pull --root /usr/lib/containers/storage.
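As a rough illustration of that separate layering flow, the sketch below only assembles the `podman pull --root /usr/lib/containers/storage` command lines for a list of application images; it does not run them. The helper name and the example image reference are hypothetical.

```python
# Storage path taken from the comment above.
BOOTC_IMAGE_STORE = "/usr/lib/containers/storage"


def preload_commands(images: list[str], root: str = BOOTC_IMAGE_STORE) -> list[list[str]]:
    """Return one `podman pull --root <root> <image>` command per image."""
    return [["podman", "pull", "--root", root, image] for image in images]


if __name__ == "__main__":
    # Hypothetical image reference, for illustration only.
    for command in preload_commands(["quay.io/example/instructlab:latest"]):
        print(" ".join(command))
```

Each resulting command could then be run during the derived image build, so the bootc-<vendor> base image itself stays free of application images.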

Cleanup

The other folders under training could be removed at this stage.
The deepspeed, instructlab, model and vllm images have been combined in instructlab, which is built from https://github.com/instructlab/instructlab, with its own lifecycle.
The ilab wrapper could be contributed to the InstructLab project as a way to hide the complexity of the podman/docker command. It is useful in general.
The upgrade-informer logic could become a standalone RPM that is used in all bootc images, if we think it's valuable. It doesn't really belong to AI Lab.
The tests should also be split into the new repositories to provide stack specific test suites.

If we need more images for specific recipes, we can create new repositories or add them to the recipes folder, based on the level of dependency of their lifecycles.
However, I think it is better to contribute to the upstream projects, including build recipes. We can contribute Containerfiles based on Fedora for bleeding edge, as well as CentOS Stream for Enterprise Linux incubation.

@rhatdan
Member

rhatdan commented Sep 4, 2024

Why such a huge proliferation of repos? Why not keep them under a single bootc-ai repo, or something similarly named?
ai-containers?

@fabiendupont
Contributor

They have different lifecycles and require different expertise. An AMD stack contributor may not be relevant for NVIDIA code reviews. And we're currently talking about splitting the repository precisely because of the proliferation of subfolders, which complicates the whole structure.

@lmilbaum
Collaborator

lmilbaum commented Sep 6, 2024

Another reason would be CI complexity. The more artifacts a repository produces, the more complex its CI becomes.

@rhatdan
Member

rhatdan commented Sep 9, 2024

But there is also interaction between these repos. In some cases we want to share content, and not force people to open the same change in three different repositories. Finally, these repos are going to be fairly tiny. Just a couple of Containerfiles?

@fabiendupont
Contributor

These repositories have a similar structure, but they don't really share much. The only thing that is identical is the update service, which could become an RPM to be shipped independently.

@jeffmaury
Collaborator

So my understanding is that this repo will keep model-servers and recipes, which is good for us (the Podman AI Lab team).

@rhatdan
Member

rhatdan commented Sep 16, 2024

Yes, just training is moving out.

@fabiendupont
Contributor

Actually, we could keep the training folder for training recipes, but move most of the current artifacts out, because they are not AI recipes.

@rhatdan
Member

rhatdan commented Sep 18, 2024

There are no "recipes" for training. This was just thrown there so that we could start the process of building an AI Training project. It can be moved out without affecting other uses of ai-lab-recipes.
