
Backups failing on one cluster without error message #1513

Open

kovaxur opened this issue Apr 10, 2024 · 0 comments

kovaxur commented Apr 10, 2024
Report

We have two clusters managed by the same operator running on Kubernetes. Daily backups are set up for both; they work for one cluster but fail for the other. All backup settings are identical and both clusters use the same S3 bucket.

More about the problem

Error message on the CRD:
some of pbm-agents were lost during the backup

The state of the backup is error.
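
For reference, this is how I inspect the failed backup object for its full status (the backup name is illustrative; psmdb-backup is the short name of the operator's PerconaServerMongoDBBackup resource):

# Dump the backup CR, including its status and error fields
kubectl get psmdb-backup daily-backup-example -o yaml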

Checking the logs of the backup-agent container in one of the pods, I can see that it writes the collections and then stops with the following error message:

2024-04-10T11:47:19.097+0000    Mux close namespace XXXXX
2024-04-10T11:47:19.097+0000    done dumping XXXX (0 documents)
2024-04-10T11:47:19.098+0000    writing XXXXX to archive on stdout
2024/04/10 11:47:21 [entrypoint] `pbm-agent` exited with code -1
2024/04/10 11:47:21 [entrypoint] restart in 5 sec
2024/04/10 11:47:26 [entrypoint] starting `pbm-agent`
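
To rule out a single misbehaving agent, the backup-agent container can be checked on every pod of the affected cluster (the label selector is an assumption; adjust it to your cluster name):

# Tail the backup-agent logs on each pod of the cluster
for pod in $(kubectl get pods -l app.kubernetes.io/instance=my-cluster -o name); do
  echo "--- $pod"
  kubectl logs "$pod" -c backup-agent --tail=50
done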

We made one change to this cluster around the time the backups stopped working: we increased the instance type from c5a.large to c5a.4xlarge. At first I thought the backup agent might be getting OOMKilled, since it now sees far more resources available, so I decreased the resources to c5a.xlarge (we no longer need the increase), but the issue is still the same.
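
One way to check the OOMKill theory is to look at each container's last termination reason (the pod name is illustrative); OOMKilled there would confirm it:

# Print each container's last termination reason
kubectl get pod my-cluster-rs0-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'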

I was not able to enable debug logging on the backup-agent; maybe it's not even possible. How could I get more details on the error?
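
One option, assuming the pbm CLI is available inside the backup-agent container (flags per the PBM 2.x CLI; pod name is illustrative), is to pull the debug-severity log entries the agents store in the cluster itself:

# Fetch the last 100 log entries at debug severity from PBM's stored logs
kubectl exec my-cluster-rs0-0 -c backup-agent -- pbm logs -t 100 -s D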

Steps to reproduce

  1. Install a cluster via the mongodb-operator
  2. Enable backups
  3. Increase cluster resources (also requests/limits)
  4. Backups will fail (?); see the status check below
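
A quick way to watch the outcome is to list the backup objects and their status (short name per the operator); failed ones end up in state error:

# Watch backup objects; the status column shows error on failure
kubectl get psmdb-backup -w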

Versions

  1. Kubernetes: 1.26.13-eks-508b6b3
  2. Operator: percona/percona-server-mongodb-operator:1.15.0
  3. Backup agent version: percona/percona-backup-mongodb:2.0.4
  4. Mongo version: percona/percona-server-mongodb:5.0.15-13

Anything else?

I also tried restarting the whole cluster, but the result is the same.

We haven't changed the resources of the other cluster and the backups are working fine there.

kovaxur added the bug label Apr 10, 2024