
feat: autorecover from stuck situations #354

Open · wants to merge 5 commits into base: main

Conversation

@gerrnot commented Jul 12, 2024

Since there was no veto on the Helm mailing list and there was positive feedback from a Rancher representative (see references), alongside the many likes in various related bug threads, I am submitting this new HIP...

@lindhe commented Jul 24, 2024

Hey, I just want to say that you are awesome for driving this HIP! 💪 I think there is a dire need for this. I've seen many people lose trust in Helm because of things getting stuck irrecoverably, so it would be great to redeem Helm. Also, this will be of great value for all the tools that depend on Helm (like Rancher).

I'll read your document and put down some comments. On a first read, it's mostly formatting things or typos, I'm guessing. I might contribute something better once I've digested your suggestion.

Tell me if I can help!

@lindhe commented Jul 24, 2024

> The --timeout parameter gets a deeper meaning.
> Previously the --timeout parameter only had an effect on the helm process running on the respective client.
> After implementation, the --timeout parameter will be stored in the helm release object (secret) in k8s and
> have an indirect impact on possible parallel processes.

I think it is really clever reusing the --timeout flag for this purpose! Nice.
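For illustration, a minimal Go sketch of how I read that idea; the struct and field names are invented for this example, not taken from the HIP or from Helm's code:

```go
package main

import (
	"fmt"
	"time"
)

// releaseLock models the hypothetical lock fields that would be persisted
// in the Helm release secret alongside the rest of the release state.
type releaseLock struct {
	SessionID  string    // unique per helm invocation
	LockedTill time.Time // client start time + --timeout
}

// newReleaseLock derives the lock deadline from --timeout, which is how the
// flag gains its "deeper meaning": it now also bounds how long a parallel
// process has to wait before it may take over.
func newReleaseLock(sessionID string, timeout time.Duration) releaseLock {
	return releaseLock{SessionID: sessionID, LockedTill: time.Now().Add(timeout)}
}

// expired reports whether another helm process may take over the release.
func (l releaseLock) expired() bool {
	return time.Now().After(l.LockedTill)
}

func main() {
	lock := newReleaseLock("4f9c2a", 5*time.Minute)
	fmt.Printf("locked till %s, expired: %v\n",
		lock.LockedTill.Format(time.RFC3339), lock.expired())
}
```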


> `helm ls -a` shows two new columns, regular `helm ls` does NOT show those:

I believe that would not be fully backwards compatible. Other programs that depend on Helm may parse the output from `helm ls -a` and expect the number of columns to be unchanged.

Not sure how bad that is, though. Every change is a breaking change to someone, and I think this seems like a reasonable change. And parsing the default (table) output is not robust anyway; if they do parse the output, they should probably use `-o json` or `-o yaml` instead.
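For example, a small Go sketch of the more robust approach; the JSON keys are assumptions about what `helm ls -o json` emits, so double-check them:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

// listedRelease holds the subset of fields this sketch assumes exist in the
// JSON output; new table columns would not break consumers written this way.
type listedRelease struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Status    string `json:"status"`
}

func main() {
	// Ask helm for machine-readable output instead of scraping the table.
	out, err := exec.Command("helm", "ls", "-a", "-o", "json").Output()
	if err != nil {
		log.Fatal(err)
	}
	var releases []listedRelease
	if err := json.Unmarshal(out, &releases); err != nil {
		log.Fatal(err)
	}
	for _, r := range releases {
		fmt.Printf("%s/%s: %s\n", r.Namespace, r.Name, r.Status)
	}
}
```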

@gerrnot (Author) replied:

I am very open to changing the behaviour regarding -a, since -a traditionally means all releases (in all states), rather than which columns are displayed.

Maybe rather tie it to the existing --debug flag, or add a new flag? I would actually prefer that; maybe call it --concurrency? It's really just that most users will never need to see it...

@lindhe : What do you think?

Side note: There is also the --short flag, but it defaults to false, so we cannot use that, and having a long flag as well could lead to funny situations, e.g. specifying both short and long...

@lindhe replied:

I'm indifferent about changing it. If this feature can get added with the modification to -a, I think that's a good change. But if that causes any roadblocks, I don't think it's worth it.

I think adding a new flag sounds a bit convoluted. If it cannot be part of -a by default, then it might make sense to tie it into --debug like you suggest. But I would like to get some input from Helm maintainers before going too far with the specifics of that implementation. If they agree that it is OK to introduce a new column by default, that's probably the cleanest solution.

@lindhe commented Jul 24, 2024

I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully.

@lindhe commented Jul 24, 2024

I think you are on the right track with your implementation, trying to keep this lock inside the Helm release object. That way, we do not introduce new objects to the Helm ecosystem, which is a great thing.

That said, I just want to mention that there is a Lease object in Kubernetes. If we were looking for alternatives, that might be a sensible thing to use.
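For reference, a minimal Go sketch of what a Lease-based lock with expiry takeover could look like; the lease naming, TTL handling, and in-cluster config are my assumptions, not part of the HIP:

```go
package main

import (
	"context"
	"fmt"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/utils/ptr"
)

// acquire takes the per-release Lease, or takes it over once the previous
// holder's lease has expired; a crashed helm process no longer blocks forever.
func acquire(ctx context.Context, cs kubernetes.Interface, ns, release, holder string, ttl time.Duration) error {
	leases := cs.CoordinationV1().Leases(ns)
	name := "helm-lock-" + release // assumed naming convention

	desired := coordinationv1.LeaseSpec{
		HolderIdentity:       ptr.To(holder),
		LeaseDurationSeconds: ptr.To(int32(ttl.Seconds())),
		RenewTime:            &metav1.MicroTime{Time: time.Now()},
	}

	existing, err := leases.Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = leases.Create(ctx, &coordinationv1.Lease{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec:       desired,
		}, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Only steal the lock once the holder's lease duration has elapsed.
	expired := existing.Spec.RenewTime == nil || existing.Spec.LeaseDurationSeconds == nil ||
		time.Now().After(existing.Spec.RenewTime.Add(time.Duration(*existing.Spec.LeaseDurationSeconds)*time.Second))
	if !expired {
		return fmt.Errorf("release %q is locked by another helm process", release)
	}
	existing.Spec = desired
	_, err = leases.Update(ctx, existing, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes running inside the cluster
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	fmt.Println(acquire(context.Background(), cs, "default", "myrelease", "session-4f9c2a", 5*time.Minute))
}
```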

@gerrnot (Author) commented Jul 24, 2024

> I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully.

There is already a mechanism to detect concurrency, since a certain helm3 version. It is based on the "pending" states in the release object. While implemented with good intentions, this has the consequence that one cannot recover anymore in bad cases. So I would not mention this in addition. Nevertheless, if you feel that additional benefits could be mentioned, feel free to suggest an edit again; I will happily merge it if it increases the chances of the HIP.
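Roughly, the existing check works like this (a standalone sketch; the status strings mirror Helm's release status constants, everything else is simplified):

```go
package main

import (
	"errors"
	"fmt"
)

var errPending = errors.New("another operation (install/upgrade/rollback) is in progress")

// checkPending mimics how helm refuses to act on a release whose last
// record is in a pending state. If the process that set the state died,
// nothing ever clears it, which is the stuck situation this HIP targets.
func checkPending(status string) error {
	switch status {
	case "pending-install", "pending-upgrade", "pending-rollback":
		return errPending
	}
	return nil
}

func main() {
	fmt.Println(checkPending("pending-upgrade")) // simulates the stuck case
}
```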


One more thing to discuss that I realized during the implementation:
Helm always uses the client time, e.g. for the creation of the UPDATED field etc.

If I were now to introduce using the k8s server time, it could lead to weird deviations between the other helm timestamp fields and the ones used for locking. While I am still convinced that server time is better, this would make most sense only if it were refactored within the entire codebase, which I guess is out of scope and would require an independent HIP. What do you think?
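If we did want something server-time-like without the big refactor, one conceivable workaround (purely my sketch, not something the HIP proposes) is to estimate the client/server clock skew from a server-assigned creationTimestamp:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// estimateSkew creates a throwaway ConfigMap and compares the local clock
// with the server-assigned creationTimestamp. The result includes request
// latency, and creationTimestamp only has second granularity, so this is a
// rough bound, not an exact offset.
func estimateSkew(ctx context.Context, cs kubernetes.Interface, ns string) (time.Duration, error) {
	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{GenerateName: "helm-timeprobe-"}}
	created, err := cs.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
	if err != nil {
		return 0, err
	}
	defer cs.CoreV1().ConfigMaps(ns).Delete(ctx, created.Name, metav1.DeleteOptions{})
	return time.Since(created.CreationTimestamp.Time), nil
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes running inside the cluster
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	skew, err := estimateSkew(context.Background(), cs, "default")
	if err != nil {
		panic(err)
	}
	fmt.Printf("approximate client/server clock skew: %s\n", skew)
}
```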

@kfox1111 commented:
Kubernetes has distributed locking primitives, I think? Couldn't one of those be used?

@gerrnot (Author) commented Jul 24, 2024

@kfox1111 I assume you mean the Lease object as suggested by @lindhe.

Yes, that could be used; it is just that the release state is then distributed:

  1. the traditional helm release secret
  2. the lock state in k8s

Since I consider the locked state an attribute of the release state, the question is whether it is worth storing it somewhere else...

Nevertheless, it could increase backwards compatibility if we treat it as something that is honored if it is there and ignored otherwise.

@lindhe commented Jul 30, 2024

> One more thing to discuss that I realized during the implementation: Helm always uses the client time, e.g. for the creation of the UPDATED field etc.
>
> If I were now to introduce using the k8s server time, it could lead to weird deviations between the other helm timestamp fields and the ones used for locking. While I am still convinced that server time is better, this would make most sense only if it were refactored within the entire codebase, which I guess is out of scope and would require an independent HIP. What do you think?

Sounds like a bummer. I agree 1000% that server time is superior. I would go so far as to say that it is the only sensible thing to use. But I also agree that it probably belongs to another PR (unfortunately).

@gerrnot (Author) commented Aug 6, 2024

original HIP draft - storing lock in k8s secret

  • A backwards compatible preview version is published as a docker image (platform linux/amd64), for use at one's own risk:
    gernotfeichter/helm-autorecover:0.0.3
  • Clarified why client time is used for LOCKED TILL.

possible change: storing the lock as a native k8s lock

If agreement is reached that such a solution is better overall, I can invest the effort of rewriting the HIP and the draft code.

benefits

  • helm ls could be left as is:
    • no breaking change in the output
    • discussions about which flag should show the new fields LOCKED TILL and SESSION ID become obsolete.
  • standardisation is always good (good library support for working with said object)
  • visibility/accessibility: the Lease object can be inspected by any user with sufficient permissions; not even a helm binary is required for that.

drawbacks

  • The release state is split (one part is stored in a k8s secret, another part, at least temporarily, lives in a Lease object).
    • Code-wise: additional libraries need to be added, and additional communication with k8s needs to happen to perform CRUD operations on the Lease object.
    • User-wise: the user needs additional knowledge when debugging concurrency issues.
  • The user under which the helm operations are performed requires CRUD permissions on the Lease object, which is a potential breaking change. We could think of a fallback though, e.g. on insufficient privileges for the Lease, log a warning and continue (see the sketch after this list), but that would reintroduce the concurrency risks this protection exists for, which was not the goal here. The goal was to keep concurrent modification protection while also allowing escaping stuck situations.
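A minimal sketch of that fallback; the helper is hypothetical, and the fake clientset is only there so the example runs without a cluster:

```go
package main

import (
	"context"
	"fmt"
	"log"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/utils/ptr"
)

// tryLeaseLock attempts to take the per-release Lease. On a Forbidden error
// it logs a warning and proceeds unlocked (the fallback discussed above),
// trading concurrency protection for backwards compatibility.
func tryLeaseLock(ctx context.Context, cs kubernetes.Interface, ns, name, holder string) (bool, error) {
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		Spec:       coordinationv1.LeaseSpec{HolderIdentity: ptr.To(holder)},
	}
	_, err := cs.CoordinationV1().Leases(ns).Create(ctx, lease, metav1.CreateOptions{})
	switch {
	case err == nil:
		return true, nil // lock acquired
	case apierrors.IsForbidden(err):
		log.Printf("warning: no Lease permission in %q; continuing without concurrency protection", ns)
		return true, nil // degrade instead of failing hard
	case apierrors.IsAlreadyExists(err):
		return false, fmt.Errorf("release is locked by another helm process")
	default:
		return false, err
	}
}

func main() {
	cs := fake.NewSimpleClientset() // in-memory client for demonstration
	ok, err := tryLeaseLock(context.Background(), cs, "default", "helm-lock-myrelease", "session-4f9c2a")
	fmt.Println(ok, err)
}
```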

Based on this, @kfox1111, do you still think the k8s lock (Lease) should be used? I think it might be OK, because I estimate that most people will run the helm client with a user that has namespace admin permissions.

So I'm only asking for a final OK from a maintainer for the Lease solution before starting further work; I personally think the benefits of the Lease slightly outweigh those of the alternative solution.

gerrnot and others added 5 commits October 15, 2024 17:39
Signed-off-by: Gernot Feichter <gernot.feichter@bearingpoint.com>
Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com>
Signed-off-by: gerrnot <gernot.feichter@bearingpoint.com>
Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com>
Signed-off-by: gerrnot <gernot.feichter@bearingpoint.com>
Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com>
Signed-off-by: gerrnot <gernot.feichter@bearingpoint.com>
Signed-off-by: Gernot Feichter <gernot.feichter@bearingpoint.com>