Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add liveness probe proposal #1552

Merged

Conversation

BH4AWS
Copy link
Collaborator

@BH4AWS BH4AWS commented Mar 29, 2024

Ⅰ. Describe what this PR does

The proposal of enhanced livenessProbe solution is develop in the Kruise suite. The enhanced livenessProbe solution are composed of two controllers、one webhook and a node probe detection component,
there are an enhanced livenessProbe webhook, an enhanced livenessProbe map nodePodProbe resource controller,
an enhanced livenessProbe controller and a node detection component.

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

@kruise-bot kruise-bot added the size/L size/L: 100-499 label Mar 29, 2024
Copy link

codecov bot commented Mar 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.35%. Comparing base (912de49) to head (095ee61).
Report is 28 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1552      +/-   ##
==========================================
+ Coverage   47.90%   49.35%   +1.44%     
==========================================
  Files         162      161       -1     
  Lines       23491    18628    -4863     
==========================================
- Hits        11254     9194    -2060     
+ Misses      11017     8211    -2806     
- Partials     1220     1223       +3     
Flag Coverage Δ
unittests 49.35% <ø> (+1.44%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

● Take precedence over the community logic, both the native livenessProbe configuration and this solution configuration exist

### Release Plan
● NodePodProbe controller in kruise-daemonset supports TCP/HTTPGET probe besides EXEC checking.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grpc should also be supported

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

**EnhancedLivenessProbeMapNodePodProbeController**
> For using the nodPodProbe controller processing,
> the livenessProbe config filed in the pod annotations should be converted to the nodePodProbe custom defined resource in Kruise suite.
> This controller can create and update the nodePodProbe resource.
Copy link
Member

@furykerry furykerry Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz explain in detail about the reconcile logic in pod creating, deleting , exited case, and plz illustrate the nodepodprobe in each case.


> In order to take into account the high availability of the application and service protection,
> the architecture design is coordinated with the PodUnavailableBudget function in Kruise suite and some anomaly detection algorithms(global switch, black list, global limiting rule, etc.).
> These protect the availability of application services before executing the container restart effectively,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz explain in detail about the flow control logic

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A basic token bucket algorithm framework is released to control the restart frequency of the application with failed livenessProbe pods.



### Other Notes
● Pod without OwnerReference, no enhancement ability this solution, degradation to community logic, restarting containers when in failure detection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enhanced liveness can be working on pod without ownerreference if the flow control logic does not rely on the workload of pod

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes,
● Since the probe detection is implemented by the nodePodProbe controller in Kruise-daemonset, for the pod without ownerreference, this solution can also take effect on the enhancement ability for restarting containers when in failure livenessProbe status.

> the architecture design is coordinated with the PodUnavailableBudget function in Kruise suite and some anomaly detection algorithms(global switch, black list, global limiting rule, etc.).
> These protect the availability of application services before executing the container restart effectively,
> so that there is no service available caused by batch application restarts.
> For example, all containers fail to be fully detected due to the unexpected reasons such as network or the error probe configuration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz illustrate the container recreate request CR

@BH4AWS BH4AWS force-pushed the add_enhancedlivenessprobe_proposal branch 3 times, most recently from da4b541 to f7e505d Compare April 2, 2024 15:51
@BH4AWS BH4AWS force-pushed the add_enhancedlivenessprobe_proposal branch 2 times, most recently from 975b33f to 826f096 Compare April 22, 2024 07:16

● EnhancedLivenessProbeWebhook is developed firstly to convert the standard livenessProbe configuration in the special field in pod annotations.

● EnhancedLivenessProbeMapNodePodProbeController is developed to convert the probe config to the nodePodProbe custom resource in kruise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name(EnhancedLivenessProbeMapNodePodProbeController ) is too long , consider rename it to LivenessNodeProbeController

### Other Notes
● Since the probe detection is implemented by the nodePodProbe controller in Kruise-daemonset, for the pod without ownerreference, this solution can also take effect on the enhancement ability for restarting containers when in failure livenessProbe status.

● Pod workload need to use the PodUnavailableBudget in kruise, this solution should enable the PubFeature.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PodUnavailableBudget should not be a required to use enhanced liveness probe


● Override the community logic, both the native livenessProbe configuration and this solution configuration exist in the process.

### Release Plan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Release Plan -> Implementation plan

> In enhanced livenessProbe controller, the global flow control logic is implemented in Figure 3 as below.
> A basic token bucket algorithm framework is released to control the restart frequency of the application with failed livenessProbe pods.
> When the token is allowed usefully, the algorithm processes the element(nodePodResource in queue) to create a CRR object for restarting the
> failed livenessProbe pods. In this scenario, the algorithm ensures that the controller just create a CRR object for each failed livenessProbe pod.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz describe the CRR gc logic

> After the enhanced livenessProbe controller watches the state change of node pod probe configured the failed livenessProbe pods,
> this decides immediately whether to perform the restart logic using the CRR(container recreate request) or not.

> In order to take into account the high availability of the application and service protection,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz explain how to integrate enhanced livenessprobe with PodUnavailableBudget

@BH4AWS BH4AWS force-pushed the add_enhancedlivenessprobe_proposal branch 3 times, most recently from baacbb1 to 804f141 Compare April 23, 2024 09:35
Signed-off-by: jicheng.sk <jicheng.sk@alibaba-inc.com>
@BH4AWS BH4AWS force-pushed the add_enhancedlivenessprobe_proposal branch from 804f141 to 095ee61 Compare April 23, 2024 12:05
Copy link
Member

@furykerry furykerry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@zmberg
Copy link
Member

zmberg commented Apr 25, 2024

/approve

@kruise-bot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zmberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kruise-bot kruise-bot merged commit f32a7c8 into openkruise:master Apr 25, 2024
34 checks passed
ABNER-1 pushed a commit to ABNER-1/kruise that referenced this pull request Apr 29, 2024
Signed-off-by: jicheng.sk <jicheng.sk@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants