
[BUG] Parallel Cluster Recovery didn't work #730

Closed
pawellrus opened this issue Feb 15, 2024 · 10 comments · Fixed by #789
Labels
bug Something isn't working

Comments

@pawellrus

pawellrus commented Feb 15, 2024

Version: 2.4.0, 2.5.1
Function: Parallel Cluster Recovery

Description:
After an upgrade of the managed k8s cluster, the hosted OpenSearch cluster got stuck in the "waiting for quorum" state.

Scenario:
Performed a k8s cluster upgrade. During the upgrade all nodes were recreated and the pods were rescheduled TWICE at the same time.

Expected behaviour:
All nodes should start at the same time (the statefulset has podManagementPolicy: Parallel) and the cluster should recover.

Actual behaviour:
Only one node of the cluster is started; the other nodes are not starting, so quorum cannot be achieved. The operator does nothing about this situation. The statefulset has podManagementPolicy: OrderedReady.
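For illustration, the difference boils down to this one StatefulSet field (a minimal sketch; the name is a placeholder and required fields are elided, this is not actual operator output):

```yaml
# Minimal sketch of the relevant StatefulSet field; "my-cluster-masters"
# is a hypothetical nodepool STS name, and required fields such as
# selector/serviceName/template are elided for brevity.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-cluster-masters
spec:
  # Parallel starts all pods at once, so the cluster can form quorum.
  # With OrderedReady, pod N+1 waits for pod N to become Ready, which
  # deadlocks when readiness itself depends on quorum.
  podManagementPolicy: Parallel
  replicas: 3
```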

Thoughts:
I fixed this problem by scaling the operator deployment down to 0 and then manually recreating the STS with podManagementPolicy: Parallel, roughly as sketched below.
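A hedged sketch of the steps (the namespaces and the deployment/STS names are examples, adjust to your setup):

```shell
# Stop the operator so it does not fight the manual change
# (deployment name and namespace are examples).
kubectl -n opensearch-operator-system scale deployment \
  opensearch-operator-controller-manager --replicas=0

# Export the statefulset, then edit spec.podManagementPolicy to Parallel.
kubectl -n opensearch get sts my-cluster-masters -o yaml > sts.yaml

# podManagementPolicy is immutable, so delete the STS while orphaning
# the pods, then recreate it with the edited policy.
kubectl -n opensearch delete sts my-cluster-masters --cascade=orphan
kubectl apply -f sts.yaml

# Once the cluster has recovered, scale the operator back up.
kubectl -n opensearch-operator-system scale deployment \
  opensearch-operator-controller-manager --replicas=1
```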

I am not sure that I can reproduce this case again. Maybe the operator should have some manual trigger to perform parallel cluster restoration in cases like this?

@pawellrus pawellrus added bug Something isn't working untriaged Issues that have not yet been triaged labels Feb 15, 2024
@Links2004
Contributor

Hi, I have seen the same behaviour for an upgrade from 2.10.0 to 2.11.1: stuck in "waiting for quorum" until manual intervention.

@pstast

pstast commented Mar 8, 2024

Just had the same issue. All my k8s nodes crashed, but after the k8s recovery the OpenSearch cluster did not recover; only one pod from the statefulset started. I also had to stop the operator and manually change podManagementPolicy to Parallel so that all pods started and OpenSearch recovered.

Unfortunately I did not find anything in the logs; there was no error. The operator just did nothing and did not switch to parallel recovery.

@prudhvigodithi
Collaborator

[Triage]
There is a PR contributed by @swoehrl-mw (#366) to handle cluster recovery. When using the latest version of the operator, do you still see this issue?
@pawellrus @Links2004 @pstast

Thanks
@bbarani

@prudhvigodithi prudhvigodithi removed the untriaged Issues that have not yet been triaged label Mar 26, 2024
@pawellrus
Author

[Triage] There is a PR contributed by @swoehrl-mw (#366) to handle cluster recovery. When using the latest version of the operator, do you still see this issue? @pawellrus @Links2004 @pstast

Thanks @bbarani

Hello, @prudhvigodithi
This issue persists on 2.4.0 and 2.5.1 in my environment.

@pstast

pstast commented Mar 27, 2024

@prudhvigodithi I also encountered this problem with the latest version (2.5.1).

@prudhvigodithi
Collaborator

@swoehrl-mw can you please take a look at this? Thanks

@swoehrl-mw
Collaborator

@pstast @pawellrus @Links2004
To verify:

  • Do you have PVCs configured as storage?
  • Is each nodepool configured with at least 3 pods/replicas?
  • Is the parallelRecovery mode enabled, i.e. not disabled via helm values (see the values sketch below)?

There are some points in the logic where the operator can skip the recovery due to errors, but these should all produce some log.
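For the third point, the setting should look something like this in the operator chart's values (a hedged sketch; please verify the exact key against the chart's values.yaml):

```yaml
# Hedged sketch of the operator helm values; the key name here is an
# assumption and should be checked against the chart's values.yaml.
manager:
  # When disabled, the operator never switches a recovering cluster's
  # StatefulSets to podManagementPolicy: Parallel.
  parallelRecoveryEnabled: true
```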

@pawellrus
Author

@swoehrl-mw the answer to all questions is yes.
But Dashboards has only two replicas.
Btw, I use ArgoCD for deployment.

@swoehrl-mw
Collaborator

Hi all, I did some digging through the code plus some testing, and I think I can reproduce the problem and have found the cause:
Disaster recovery does not engage while the operator is doing an upgrade. The problem is a bug in the upgrade detection: due to ambiguous status information, the operator thinks an upgrade is still in progress although it finished long ago, and so recovery never engages.
I'll create a PR with a fix in the next few days.
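To illustrate what the ambiguous status looks like (a hypothetical reconstruction, not actual operator output): the upgrade logic records a per-nodepool entry in the cluster status but appends it twice, so a check that matches those entries against the nodepools no longer adds up and concludes an upgrade is still running:

```yaml
# Hypothetical reconstruction of an OpenSearchCluster status after an
# upgrade; field and value names are illustrative, not verbatim.
status:
  componentsStatus:
    - component: Upgrader
      status: Upgraded
      description: masters   # nodepool name
    - component: Upgrader    # duplicate entry for the same nodepool,
      status: Upgraded       # appended a second time by the upgrade logic
      description: masters
```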

cc @prudhvigodithi

@prudhvigodithi
Collaborator

Thanks @swoehrl-mw, we can target this for after the v2.6.0 release.
@bbarani @swoehrl-mw

swoehrl-mw added a commit that referenced this issue May 21, 2024
### Description
Fixes a bug where parallel recovery did not engage if the cluster had
gone through a version upgrade beforehand.
The reason was that the upgrade logic added the status for each nodepool
twice, leading the recovery logic to incorrectly detect that an upgrade
was in progress.
Also did a small refactoring of names and constants and fixed a warning
from my IDE about unused parameters.

No changes to the CRDs or functionality, just internal logic fixes.

### Issues Resolved
Fixes #730 

### Check List
- [x] Commits are signed per the DCO using --signoff 
- [x] Unittest added for the new/changed functionality and all unit
tests are successful
- [x] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:
- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm
chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and
signing off your commits, please check
[here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <sebastian.woehrl@maibornwolff.de>
swoehrl-mw added a commit that referenced this issue Jun 18, 2024
(cherry picked from commit 4f59766)