control-plane: design for improved data-plane monitoring and shard restarts #1666
jgraettinger added a commit that referenced this issue on Sep 30, 2024:
The runtime invokes a new /notify/shard-failure control-plane API which is told of shard failures that have occurred within a data-plane. At the moment, this API verifies the data-plane token and logs the failure, but takes no further action. Update the taskBase.heartbeatLoop() to perform this notification if the shard's primary loop exits with a non-cancellation error. Issue #1666
Today, we use a rather blunt means for restarting failed shards (every five minutes, we un-assign any shard which is currently FAILED).
This is simultaneously too long a delay for the first flaky failure of an otherwise healthy task, and too short a delay for a task which only ever errors. We'd like to instead track failures of a task over time and use a more graduated back-off if it continues to fail after successive restarts, likely disabling the task automatically after sustained failure.
We also now have multiple data-planes, and we want a consolidated mechanism for managing shard failures across all data-planes.