[Bug] Bubble ImagePullErr and ImagePullBackoff to the Ray CRD #2387

Open
EngHabu opened this issue Sep 17, 2024 · 7 comments
Labels
bug · crd · observability

Comments

@EngHabu

EngHabu commented Sep 17, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I deployed a RayJob with a bad image reference (the image does not exist).
The RayJob stayed in the "Initializing" phase and never surfaced the error from starting the driver Pod.

Reproduction script

TBD
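
Pending a full script, a minimal sketch along these lines should reproduce it (the layout assumes the ray.io/v1 RayJob API; the name, entrypoint, and image tag are hypothetical):

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-bad-image              # hypothetical name
spec:
  entrypoint: python -c "print('hello')"
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:this-tag-does-not-exist   # bad image reference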

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@EngHabu added the bug and triage labels on Sep 17, 2024
@andrewsykim
Collaborator

We are working on improving RayCluster observability with new conditions APIs which should hopefully surface these types of failures.

@rueian do you know if the existing implementation would surface ImagePullErr and ImagePullBackoff errors?

See https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY/edit?usp=sharing for more details.

@fiedlerNr9

Happy to see ImagePullBackOff handling covered in that doc for the Ray CRD. Are there any updates on the implementation?

@andrewsykim
Collaborator

@fiedlerNr9 can you try with KubeRay v1.2? You need to enable the feature gate for the new conditions API. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/observability.html#raycluster-status-conditions
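
For reference, a sketch of enabling the gate, assuming the operator accepts a Kubernetes-style --feature-gates flag (per the linked docs); the Helm values layout below is an assumption about the chart:

# Directly on the operator binary:
kuberay-operator --feature-gates=RayClusterStatusConditions=true

# Or via Helm, if the chart exposes a featureGates list:
helm upgrade kuberay-operator kuberay/kuberay-operator \
  --set 'featureGates[0].name=RayClusterStatusConditions' \
  --set 'featureGates[0].enabled=true'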

@fiedlerNr9

I followed these docs but still see the same behaviour.

k get pods -n jan-playground-development | grep a27tckdrxzgx2s4kc4gb
a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k-head-phwb8             0/1     ImagePullBackOff        0          2m35s
a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k-ray-gro-worker-8gxcl   0/1     Init:ImagePullBackOff   0          2m35s
k get rayjobs -n jan-playground-development
NAME                        JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
a27tckdrxzgx2s4kc4gb-n0-0                Initializing        2024-10-04T17:58:32Z              2m19s

Just so we're on the same page: I would expect the RayJob status to reflect the status of the underlying Pods.

@andrewsykim
Collaborator

At the moment we only update the RayCluster status; can you check the status there?

We should support mirroring the new conditions in the RayJob status, though.
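
For example, something like this should print the new conditions (the RayCluster name is inferred from the Pod names above):

kubectl get raycluster a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k \
  -n jan-playground-development \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" ("}{.reason}{")"}{"\n"}{end}'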

@rueian
Contributor

rueian commented Oct 4, 2024

I'm sorry for not getting back to you sooner.

> We are working on improving RayCluster observability with new conditions APIs which should hopefully surface these types of failures.
>
> @rueian do you know if the existing implementation would surface ImagePullErr and ImagePullBackoff errors?
>
> See https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY/edit?usp=sharing for more details.

The existing StatusCondition implementation only reflects errors encountered when calling the Kube API. We also haven't changed the old Status behavior, even when the RayClusterStatusConditions feature gate is enabled. Therefore, errors like ImagePullErr and ImagePullBackoff are not yet reflected or bubbled up.

Right now, we have the following status conditions for the case of ImagePullErr if the feature gate is enabled:

Status:
  Conditions:
    Last Transition Time:   2024-10-04T19:33:59Z
    Message:                containers with unready status: [ray-head]
    Reason:                 ContainersNotReady
    Status:                 False
    Type:                   HeadPodReady
    Last Transition Time:   2024-10-04T19:33:59Z
    Message:                RayCluster Pods are being provisioned for first time
    Reason:                 RayClusterPodsProvisioning
    Status:                 False
    Type:                   RayClusterProvisioned

I think we could improve this by carrying the ImagePullErr/ImagePullBackoff messages into the HeadPodReady and RayClusterProvisioned conditions and then finding a way to bubble this into RayJob.
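
For context, the raw signal to carry over already lives in the Pod's containerStatuses; a quick way to see it (Pod name taken from the earlier example):

kubectl get pod a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k-head-phwb8 \
  -n jan-playground-development \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.waiting.reason}{" - "}{.state.waiting.message}{"\n"}{end}'

This prints each container's waiting reason (e.g. ImagePullBackOff) and message, which is exactly the detail the conditions above currently drop.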

@kevin85421
Member

The RayCluster CRD can be considered a set of Kubernetes ReplicaSets (each head or worker group is similar to a ReplicaSet). Therefore, we aimed to make its observability consistent with ReplicaSets. However, Kubernetes ReplicaSets do not surface information about ImagePullBackOff errors.

For example, I created a ReplicaSet with the image nginx:1.210, which doesn't exist.

[screenshot: ReplicaSet status output, which shows no ImagePullBackOff information]

Although this is not supported by Kubernetes ReplicaSets, we have received these requests multiple times. We will take it into consideration. If we decide to support this, we should clearly define which Pod-level errors should be surfaced by the KubeRay CR.
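
For anyone who wants to reproduce that comparison, a minimal manifest along the lines of the described experiment (the name is hypothetical):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-bad-image               # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-bad-image
  template:
    metadata:
      labels:
        app: nginx-bad-image
    spec:
      containers:
        - name: nginx
          image: nginx:1.210          # nonexistent tag from the example above

kubectl describe rs nginx-bad-image should report the replica counts and Pod events, but no ImagePullBackOff detail; that only shows up on the Pods themselves.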
