add liveness probe proposal #1552
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #1552 +/- ##
==========================================
+ Coverage 47.90% 49.35% +1.44%
==========================================
Files 162 161 -1
Lines 23491 18628 -4863
==========================================
- Hits 11254 9194 -2060
+ Misses 11017 8211 -2806
- Partials 1220 1223 +3
● When both the native livenessProbe configuration and this solution's configuration exist, this solution takes precedence over the community logic.

### Release Plan
● The NodePodProbe controller in kruise-daemonset supports TCP and HTTPGET probes in addition to EXEC checks.
grpc should also be supported
done
**EnhancedLivenessProbeMapNodePodProbeController**
> To be processed by the nodePodProbe controller,
> the livenessProbe config field in the pod annotations must be converted to the nodePodProbe custom resource in the Kruise suite.
> This controller creates and updates the nodePodProbe resource.
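The create-or-update behavior described above might look roughly like the following Go sketch. All types here are simplified, hypothetical stand-ins (they are not the Kruise API types), and the sketch assumes one nodePodProbe resource per node that lists the probes of the pods on that node.

```go
package main

// All types below are simplified, hypothetical stand-ins for illustration;
// they are not the Kruise API types.

// ContainerProbe is one liveness check for one container.
type ContainerProbe struct {
	ContainerName string
	Command       []string // EXEC-style check
	PeriodSeconds int
}

// PodProbe groups the probes of a single pod.
type PodProbe struct {
	Namespace string
	Name      string
	Probes    []ContainerProbe
}

// NodePodProbe collects the pod probes for one node (assumed layout).
type NodePodProbe struct {
	NodeName  string
	PodProbes []PodProbe
}

// UpsertPodProbe adds the pod's probes to the node-level resource, or
// replaces the existing entry when the pod is already present.
// It reports whether a new entry was created.
func UpsertPodProbe(npp *NodePodProbe, pp PodProbe) bool {
	for i := range npp.PodProbes {
		if npp.PodProbes[i].Namespace == pp.Namespace && npp.PodProbes[i].Name == pp.Name {
			npp.PodProbes[i] = pp
			return false
		}
	}
	npp.PodProbes = append(npp.PodProbes, pp)
	return true
}
```

Calling `UpsertPodProbe` twice for the same pod leaves a single entry holding the latest probe config, which is the idempotent behavior a reconcile loop would typically want.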
plz explain in detail the reconcile logic for the pod creating, deleting, and exited cases, and plz illustrate the nodePodProbe in each case.
> To preserve application high availability and service protection,
> the architecture design is coordinated with the PodUnavailableBudget function in the Kruise suite and several anomaly-detection safeguards (global switch, blacklist, global limiting rule, etc.).
> These effectively protect the availability of application services before a container restart is executed,
plz explain in detail about the flow control logic
A basic token bucket algorithm is used to control the restart frequency of applications whose pods fail the livenessProbe.
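For readers unfamiliar with the technique, here is a minimal Go sketch of a token bucket used as a restart gate. It is an illustration under assumptions, not the proposal's implementation: the type and method names (`TokenBucket`, `Allow`, `Refill`, `Run`) are hypothetical, and the capacity and refill interval would be tuning parameters.

```go
package main

import (
	"sync"
	"time"
)

// TokenBucket is a minimal token bucket: it holds at most `capacity`
// tokens, and each permitted restart consumes one token.
// Names and structure are hypothetical, for illustration only.
type TokenBucket struct {
	mu       sync.Mutex
	capacity int
	tokens   int
}

// NewTokenBucket returns a bucket that starts full.
func NewTokenBucket(capacity int) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity}
}

// Allow reports whether one restart may proceed, consuming a token if so.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

// Refill adds n tokens, never exceeding the capacity.
func (b *TokenBucket) Refill(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += n
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
}

// Run adds one token per interval until stop is closed.
func (b *TokenBucket) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			b.Refill(1)
		case <-stop:
			return
		}
	}
}
```

With a capacity of 2, two restarts are allowed back to back and a third is rejected until the next refill, which bounds both the burst size and the sustained restart rate.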
### Other Notes
● For a pod without an OwnerReference, this solution provides no enhancement and degrades to the community logic: containers are restarted when a failure is detected.
Enhanced liveness can work on pods without an ownerReference if the flow control logic does not rely on the pod's workload.
Yes.
● Since probe detection is implemented by the nodePodProbe controller in Kruise-daemonset, this solution's enhanced container restart on livenessProbe failure also takes effect for pods without an ownerReference.
> the architecture design is coordinated with the PodUnavailableBudget function in the Kruise suite and several anomaly-detection safeguards (global switch, blacklist, global limiting rule, etc.).
> These effectively protect the availability of application services before a container restart is executed,
> so that batch application restarts do not make the service unavailable.
> For example, all containers may fail detection for unexpected reasons such as network problems or an incorrect probe configuration.
plz illustrate the container recreate request CR
da4b541 to f7e505d
975b33f to 826f096
● EnhancedLivenessProbeWebhook is developed first, to convert the standard livenessProbe configuration into a dedicated field in the pod annotations.

● EnhancedLivenessProbeMapNodePodProbeController is developed to convert the probe config into the nodePodProbe custom resource in Kruise.
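As a rough illustration of the webhook step, the Go sketch below serializes each container's livenessProbe into a pod annotation. It rests on several assumptions: the `Pod`, `Container`, and `Probe` types are simplified stand-ins rather than the Kubernetes API types, the annotation key `example.io/enhanced-liveness-probe` is hypothetical, and clearing the native probe after the copy is an assumption about the intended behavior.

```go
package main

import "encoding/json"

// Hypothetical annotation key, chosen for this example only.
const probeAnnotation = "example.io/enhanced-liveness-probe"

// Probe, Container, and Pod are simplified stand-ins for the
// Kubernetes API types.
type Probe struct {
	Exec          []string `json:"exec,omitempty"`
	PeriodSeconds int      `json:"periodSeconds"`
}

type Container struct {
	Name          string
	LivenessProbe *Probe
}

type Pod struct {
	Annotations map[string]string
	Containers  []Container
}

// containerProbe is the per-container record stored in the annotation.
type containerProbe struct {
	Name          string `json:"name"`
	LivenessProbe Probe  `json:"livenessProbe"`
}

// MoveLivenessProbesToAnnotation copies every container's livenessProbe
// into one JSON annotation and then clears the native probes (assumed
// behavior). Pods without any livenessProbe are left unchanged.
func MoveLivenessProbesToAnnotation(pod *Pod) error {
	var probes []containerProbe
	for _, c := range pod.Containers {
		if c.LivenessProbe != nil {
			probes = append(probes, containerProbe{Name: c.Name, LivenessProbe: *c.LivenessProbe})
		}
	}
	if len(probes) == 0 {
		return nil
	}
	raw, err := json.Marshal(probes)
	if err != nil {
		return err
	}
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[probeAnnotation] = string(raw)
	for i := range pod.Containers {
		pod.Containers[i].LivenessProbe = nil
	}
	return nil
}
```

A downstream controller could then decode the annotation with `json.Unmarshal` to build the node-level probe resource.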
The name EnhancedLivenessProbeMapNodePodProbeController is too long; consider renaming it to LivenessNodeProbeController.
### Other Notes
● Since probe detection is implemented by the nodePodProbe controller in Kruise-daemonset, this solution's enhanced container restart on livenessProbe failure also takes effect for pods without an ownerReference.

● The pod's workload needs to use the PodUnavailableBudget in Kruise, so this solution requires the PubFeature to be enabled.
PodUnavailableBudget should not be required to use the enhanced liveness probe.
● When both the native livenessProbe configuration and this solution's configuration exist, this solution overrides the community logic.

### Release Plan
Release Plan -> Implementation plan
> In the enhanced livenessProbe controller, the global flow control logic is implemented as shown in Figure 3 below.
> A basic token bucket algorithm is used to control the restart frequency of applications whose pods fail the livenessProbe.
> When a token is available, the algorithm processes the element (nodePodResource in the queue) and creates a CRR object to restart the
> failed livenessProbe pod. In this scenario, the algorithm ensures that the controller creates just one CRR object for each failed livenessProbe pod.
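The "one CRR per failed pod" guarantee described above could be sketched in Go as follows. This is a hedged illustration, not the controller's code: `RestartGate`, its `allow` callback (standing in for the token check), and the pod-key strings are all hypothetical, and the sketch only returns the pods for which a CRR would be created instead of calling any API.

```go
package main

// RestartGate combines a flow-control check with per-pod deduplication.
// All names are hypothetical and used for illustration only.
type RestartGate struct {
	allow     func() bool     // returns true when a token is available
	requested map[string]bool // pod keys that already have a CRR
}

// NewRestartGate builds a gate around the given token check.
func NewRestartGate(allow func() bool) *RestartGate {
	return &RestartGate{allow: allow, requested: make(map[string]bool)}
}

// Process walks the queue of failed pod keys ("namespace/name").
// It returns the pods for which a CRR should be created now and the
// pods that must wait for a token. A pod that already has a CRR is
// skipped before the token check, so it never consumes a token.
func (g *RestartGate) Process(queue []string) (toCreate, pending []string) {
	for _, key := range queue {
		if g.requested[key] {
			continue
		}
		if !g.allow() {
			pending = append(pending, key)
			continue
		}
		g.requested[key] = true
		toCreate = append(toCreate, key)
	}
	return toCreate, pending
}
```

Checking the dedup map before the token check is a deliberate choice here: repeated failure events for the same pod do not drain the restart budget of other pods.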
plz describe the CRR gc logic
> After the enhanced livenessProbe controller watches a state change of the nodePodProbe for pods with a failed livenessProbe,
> it immediately decides whether to perform the restart logic using a CRR (container recreate request).

> To preserve application high availability and service protection,
plz explain how to integrate enhanced livenessprobe with PodUnavailableBudget
baacbb1 to 804f141
Signed-off-by: jicheng.sk <jicheng.sk@alibaba-inc.com>
804f141 to 095ee61
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: zmberg
Signed-off-by: jicheng.sk <jicheng.sk@alibaba-inc.com>
Ⅰ. Describe what this PR does
This PR adds the proposal for the enhanced livenessProbe solution in the Kruise suite. The solution is composed of two controllers, one webhook, and a node probe detection component:
an enhanced livenessProbe webhook, an enhanced livenessProbe map nodePodProbe resource controller,
an enhanced livenessProbe controller, and a node detection component.
Ⅱ. Does this pull request fix one issue?
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews