Api-rest grpc failed #1715
From the logs it seems the agent-core pod either crashed or was force-deleted. After a crash or force delete, agent-core can fail while waiting for the previous lease to time out and then crash again. This is unfortunate, as we lose the previous-previous logs, since only the previous container's logs are kept.
It seems that when agent-core crashed, the api-rest pod somehow held on to the old connection and did not release it.
What makes you say it doesn't release it? From the logs, the very next request succeeded:
We tried upgrading to 2.7.1. After one node failure, while trying to get volume/pool/node information, the api-rest pods are showing:

mayastor-2024-10-08--17-59-12-UTC.tar.gz

We tried restarting the api-rest pods (we have the replica count set to 3, one running on each node): restarting the pod that has the error
Btw, I'm not sure using a replica count of 3 on the rest-api is helping; I suggest leaving it at 1. I think the first timeout is expected:
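For reference, scaling the rest deployment down to a single replica can be done with `kubectl scale`. The namespace and deployment name below are assumptions (they depend on how the chart was installed, so verify them with `kubectl get deploy` first); the command is echoed as a dry run rather than executed:

```shell
# Hypothetical namespace/deployment names -- check yours with:
#   kubectl get deploy -n <namespace>
cmd="kubectl -n mayastor scale deployment mayastor-api-rest --replicas=1"
echo "$cmd"
```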
The agent-core was restarting (probably it was on the failed node?), and it seems DNS was also not ready yet. As for the other errors, it seems there were connection issues with etcd, possibly IO issues on the etcd storage:

Interestingly, at the same time the io-engine also has issues:
I wonder, was there any issue in the cluster at around this time? In general, etcd seems to be logging lots of warnings, as you can see, perhaps indicating a problem with the infra?
Are you running on isolated cores? https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/additional-information/performance-tips#cpu-isolation
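One quick way to tell whether cores were isolated is to look for `isolcpus` on the kernel command line of each node (this checks only the boot-parameter approach; cpuset-based isolation would not show up here):

```shell
# Report which cores, if any, are isolated via the kernel command line (Linux).
isolated=$(grep -o 'isolcpus=[^ ]*' /proc/cmdline || true)
echo "${isolated:-no isolcpus configured}"
```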
"2024-10-08T17:08:26.487185768" is around when we had to restart all the nodes hoping to resolve this issue. Scaled the api-rest to replicas=1 -- still getting timed out But restarting all nodes did not fix this issue |
Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hangs (i.e. we cannot exec into the pod). If we are already exec'd into the pod, commands such as "netstat" fail to respond. The operator-diskpool log shows API calls starting to fail intermittently, then all requests failing.
That's interesting. If you ssh into the node, can you check what the process is doing? I.e. is it sleeping or churning CPU? Also interesting: REST requests in general are taking a very long time, most likely close to the timeout:
27s, 29s... Would you be able to enable Jaeger tracing? https://github.com/openebs/mayastor-extensions/blob/44f76e6520f51ed16d8586bff85b99e109e30ea1/chart/values.yaml#L102
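From the node, the sleeping-vs-churning question can be answered by reading the process state out of `/proc`. In practice you would discover the PID with something like `pgrep -f api-rest` (that pattern is an assumption about the binary name); the sketch below uses the current shell's own PID so it is self-contained:

```shell
# Read a process's scheduler state from /proc (Linux).
# Substitute: pid=$(pgrep -f api-rest | head -n1)   # hypothetical pattern
pid=$$
state=$(awk '/^State:/ {print $2}' "/proc/$pid/status")
# S = sleeping, R = running (churning CPU), D = uninterruptible IO wait
echo "state=$state"
```

A process stuck in D state would point at IO trouble (consistent with the etcd storage warnings above), whereas R with high CPU suggests the process is spinning.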
Mayastor 2.5.1
kubectl mayastor get volume bb1da7e5-ae9c-4af4-9835-f506abedf1e2 failed to return a response.
The api-rest pod is logging:
What is causing this error and how to avoid it?
mayastor-2024-08-08--23-06-57-UTC.tar.gz