
Backups failing on one cluster without error message #1513

Open

kovaxur opened this issue Apr 10, 2024 · 0 comments

kovaxur commented Apr 10, 2024
Report

We have two clusters managed by the same operator running on Kubernetes. Daily backups are set up for both; they work for one cluster but fail for the other. All backup settings are identical and both clusters use the same S3 bucket.

More about the problem

Error message on the CRD:
some of pbm-agents were lost during the backup

The state of the backup is error.
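
For reference, this is how I inspect the failed backup object for its full status (the backup name is illustrative; psmdb-backup is the short name of the operator's PerconaServerMongoDBBackup resource):

# Dump the backup CR, including its status and error fields
kubectl get psmdb-backup daily-backup-example -o yaml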

Checking the logs of the backup-agent container in one of the pods, I can see that it writes the collections and then stops with the following error message:

2024-04-10T11:47:19.097+0000    Mux close namespace XXXXX
2024-04-10T11:47:19.097+0000    done dumping XXXX (0 documents)
2024-04-10T11:47:19.098+0000    writing XXXXX to archive on stdout
2024/04/10 11:47:21 [entrypoint] `pbm-agent` exited with code -1
2024/04/10 11:47:21 [entrypoint] restart in 5 sec
2024/04/10 11:47:26 [entrypoint] starting `pbm-agent`
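
To rule out a single misbehaving agent, the backup-agent container can be checked on every pod of the affected cluster (the label selector is an assumption; adjust it to your cluster name):

# Tail the backup-agent logs on each pod of the cluster
for pod in $(kubectl get pods -l app.kubernetes.io/instance=my-cluster -o name); do
  echo "--- $pod"
  kubectl logs "$pod" -c backup-agent --tail=50
done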

We made one change to this cluster around the time the backups stopped working: we increased the instance type from c5a.large to c5a.4xlarge. At first I thought the backup agent might be getting OOMKilled, since it now sees far more resources available, so I decreased the resources to c5a.xlarge (we no longer need the increase), but the issue is still the same.
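
One way to check the OOMKill theory is to look at each container's last termination reason (the pod name is illustrative); OOMKilled there would confirm it:

# Print each container's last termination reason
kubectl get pod my-cluster-rs0-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'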

I was not able to enable debug logging on the backup-agent; maybe it's not even possible. How could I get more details on the error?
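
One option, assuming the pbm CLI is available inside the backup-agent container (flags per the PBM 2.x CLI; pod name is illustrative), is to pull the debug-severity log entries the agents store in the cluster itself:

# Fetch the last 100 log entries at debug severity from PBM's stored logs
kubectl exec my-cluster-rs0-0 -c backup-agent -- pbm logs -t 100 -s D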

Steps to reproduce

  1. Install a cluster via the mongodb-operator
  2. Enable backups
  3. Increase cluster resources (also requests/limits)
  4. Backups will fail (?); see the status check below
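
A quick way to watch the outcome is to list the backup objects and their status (short name per the operator); failed ones end up in state error:

# Watch backup objects; the status column shows error on failure
kubectl get psmdb-backup -w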

Versions

  1. Kubernetes: 1.26.13-eks-508b6b3
  2. Operator: percona/percona-server-mongodb-operator:1.15.0
  3. Backup agent version: percona/percona-backup-mongodb:2.0.4
  4. Mongo version: percona/percona-server-mongodb:5.0.15-13

Anything else?

I also tried restarting the whole cluster, but the result is the same.

We haven't changed the resources of the other cluster and the backups are working fine there.

kovaxur added the bug label Apr 10, 2024