[Bug] Bubble ImagePullErr and ImagePullBackoff to the Ray CRD #2387
Comments
We are working on improving RayCluster observability with the new conditions API, which should hopefully surface these types of failures. @rueian, do you know if the existing implementation would surface ImagePullErr and ImagePullBackoff errors? See https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY/edit?usp=sharing for more details.
Happy to see catching
@fiedlerNr9 can you try with KubeRay v1.2? You need to enable the feature gate for the new conditions API. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/observability.html#raycluster-status-conditions
I followed these docs but still see the same behaviour.
Just to be on the same page, I would expect the status of the RayJob to reflect the status of the underlying Pods.
At the moment we only update the RayCluster status. Can you check the status there? We should support mirroring the new conditions in the RayJob status, though.
I'm sorry for not getting back to you sooner.
The existing StatusCondition implementation only reflects errors when calling the Kube API. We also haven't changed the old Status behavior even when the RayClusterStatusConditions feature gate is enabled. Therefore, errors like ImagePullErr and ImagePullBackoff are not reflected and not bubbled yet. Right now, we have the following status conditions for the case of ImagePullErr if the feature gate is enabled:
I think we could improve this by carrying the ImagePullErr/ImagePullBackoff messages into the
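To make the idea of carrying image pull failures from a Pod into a status condition more concrete, here is a rough sketch in Python. It is not KubeRay code. It assumes the Pod is available as a plain dict shaped like the Kubernetes Pod JSON, where Kubernetes reports these failures as the container waiting reasons `ErrImagePull` and `ImagePullBackOff`. The function name `image_pull_condition` and the condition type `PodImagePullFailed` are hypothetical names made up for illustration.

```python
from typing import Optional

# Waiting reasons Kubernetes reports for image pull failures.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def image_pull_condition(pod: dict) -> Optional[dict]:
    """Return a condition-like dict for the first container stuck on an
    image pull, or None if no container is in that state."""
    status = pod.get("status") or {}
    # Init containers can fail to pull images too, so check both lists.
    container_statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for cs in container_statuses:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        reason = waiting.get("reason")
        if reason in IMAGE_PULL_REASONS:
            return {
                # Hypothetical condition type, not an existing KubeRay constant.
                "type": "PodImagePullFailed",
                "status": "True",
                "reason": reason,
                "message": f"container {cs.get('name')}: {waiting.get('message', '')}",
            }
    return None
```

An operator could run a check like this for the head, worker, and driver Pods during reconciliation and copy the resulting reason and message into whichever condition is chosen. Which condition that should be is left open here.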
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
I deployed a RayJob with a bad image reference (the image does not exist).
The RayJob stayed in the "Initializing" phase and was not updated to bubble up the error from starting the driver Pod.
Reproduction script
TBD
Anything else
No response
Are you willing to submit a PR?