Fix concurrency issue between AbandonPendingBacklog() and CheckBacklogForTimeouts(), and remove backlog locking #2430
This feels like we're solving the wrong problem; IMO we should be fixing
whatever gap it is falling in currently. I need to look carefully at what
is going on here, but I don't see that we should need the new exception
bits.
…On Wed, 5 Apr 2023, 04:44 Kornel Pal, ***@***.***> wrote:
There is only one ProcessBacklogAsync() thread running at a time and all
current backlog locks are within that thread, so there is no need for these
locks. On the other hand, AbandonPendingBacklog() can run concurrently with
the ProcessBacklogAsync() thread, which runs CheckBacklogForTimeouts(), but
AbandonPendingBacklog() does not lock the backlog, which can result in
concurrency issues. This can result in CheckBacklogForTimeouts() leaving
the dequeued message abandoned in an uncompleted (hung) state. This fix
resolves the concurrency issue by introducing an
_abandonPendingBacklogException field that also enables removing the lock.
The "failed" message is completed with the thrown exception to make any
potential concurrency issues more visible.
Commit Summary
- 9035679 Fix concurrency issue between AbandonPendingBacklog() and CheckBacklogForTimeouts(), and remove backlog locking.
File Changes (1 file <https://github.com/StackExchange/StackExchange.Redis/pull/2430/files>)
- *M* src/StackExchange.Redis/PhysicalBridge.cs (60) <https://github.com/StackExchange/StackExchange.Redis/pull/2430/files#diff-c64610826746e4cc2aeb0edf12469d2ea64583486a9246f7493d197bc33c6af1>
Patch Links:
- https://github.com/StackExchange/StackExchange.Redis/pull/2430.patch
- https://github.com/StackExchange/StackExchange.Redis/pull/2430.diff
|
I've created this lock-free fix inspired by this comment. At one point #2397 had the following fix to the same problem using a lock, if you prefer that:
private void AbandonPendingBacklog(Exception ex)
{
    while (true)
    {
        Message? next;
        // Dequeue under the backlog lock so this cannot interleave with
        // the peek/dequeue pair inside CheckBacklogForTimeouts().
        lock (_backlog)
        {
            if (!BacklogTryDequeue(out next)) break;
        }
        // Complete the message outside the lock.
        Multiplexer?.OnMessageFaulted(next, ex);
        next.SetExceptionAndComplete(ex, this);
    }
}
|
Please can you be very explicit about what concurrency issue we're discussing? What is the actual symptom/issue that we're looking at resolving here? To understand whether this resolves it, I first need a clear picture of what "it" is. So: talk me through it; what scenario are we discussing? |
Although PhysicalBridge._backlog is a ConcurrentQueue, PhysicalBridge.CheckBacklogForTimeouts() is using it in a non-thread-safe way. The existing comment from that method describes it best:
When not all dequeuers lock the backlog, CheckBacklogForTimeouts() can dequeue a message and then abandon it without ever completing it. Code from inside the lock in CheckBacklogForTimeouts(), annotated by me for the problematic scenario:
// There is a message in the backlog, so no break.
if (!_backlog.TryPeek(out message)) break;
// The message peeked at has timed out, so no break.
if (!message.HasTimedOut(now, timeout, out var _)) break;
// Another thread, without locking the backlog, already dequeued the previous message
// between the TryPeek() and BacklogTryDequeue() calls.
// Scenario 1; there were no messages left: This is not really an issue.
// Scenario 2; another message (message2) was dequeued: It may or may not be timed out,
// but the current logic does not care; it just abandons the message, which will never be completed
// as it is not stored anywhere else. This is a problem for async messages only,
// not for sync (wait timeout), or F+F (not completed otherwise either).
if (!BacklogTryDequeue(out var message2) || (message != message2))
{
    // In both Scenario 1 and 2 the backlog processing thread fails,
    // but a new one will be started by the heartbeat or by adding a message to the backlog.
    throw new RedisException("Thread safety bug detected! A queue message disappeared while we had the backlog lock");
}
Methods dequeuing from the backlog:
Since the two methods that actually lock the backlog cannot run concurrently, the current lock is just overhead. On the other hand, not locking the backlog in AbandonPendingBacklog() can cause the concurrency issue described in the annotated code above, which leaves one task per occurrence in a hung state. |
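To make the interleaving concrete, here is a minimal standalone sketch (not code from PhysicalBridge) that replays Scenario 2 deterministically on a plain ConcurrentQueue<string>. The names PeekDequeueRace and Replay and the string "messages" are hypothetical, and the unlocked dequeue that AbandonPendingBacklog() would perform on another thread is simulated inline between the TryPeek() and TryDequeue() calls.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

public static class PeekDequeueRace
{
    // Replays the problematic interleaving on a single thread so the outcome is deterministic.
    // Returns the item the timeout check inspected and the item it actually removed.
    public static (string? Peeked, string? Dequeued) Replay()
    {
        var backlog = new ConcurrentQueue<string>();
        backlog.Enqueue("message1");
        backlog.Enqueue("message2");

        // Timeout checker: peek at the head and decide that it has timed out.
        backlog.TryPeek(out var peeked);

        // Simulated other thread (assumption): an unlocked dequeuer removes the head in between.
        backlog.TryDequeue(out _);

        // Timeout checker: dequeues what it assumes is still the peeked item.
        backlog.TryDequeue(out var dequeued);

        return (peeked, dequeued);
    }
}
```

Replay() returns ("message1", "message2"): the checker removed an item it never inspected. In the real method that mismatch triggers the "Thread safety bug detected!" exception, and the removed message is held by nobody who will complete it.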
I've added a test for the Dispose() case. It fails without the fix and succeeds with the fix. Should be possible to cause the issue for BacklogPolicy.AbortPendingOnConnectionFailure = true too, but I don't know how to simulate a connection failure with a large backlog. |
I just realized that clearing _abandonPendingBacklogException at the end of AbandonPendingBacklog() can result in CheckBacklogForTimeouts() failing when AbandonPendingBacklog() is running on multiple threads in parallel, so more complexity (like a wrapper for the backlog) would be needed for reliable bug detection in CheckBacklogForTimeouts(). |
I have one more idea, inspired by PhysicalBridge.HasPendingCallerFacingItems(): instead of removing items, CheckBacklogForTimeouts() could be changed to enumerate the items, and ProcessBridgeBacklogAsync() could be changed to ignore completed items. This way the concurrency issue would be eliminated and there would be no need for a lock or the exception field. Although it adds some compute overhead, checking for timeout in ProcessBridgeBacklogAsync() again might be simpler than adding tweaks in other places to complete timed-out sync messages and identify timed-out F+F messages (which never have a result box). Update: It might not be a good option, as it keeps all the messages in the backlog during an extended outage. |
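A rough sketch of one variant of the enumerate-and-skip idea, under stated assumptions: SketchMessage, EnumeratingBacklog, CompleteTimedOut and ProcessPending are hypothetical names, not StackExchange.Redis types, and time is reduced to plain tick counts. Here the timeout pass itself marks items completed; it only shows that a pass which never removes items has no peek-then-dequeue step to race on.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for a backlog message; not the library's Message type.
public sealed class SketchMessage
{
    private int _completed; // 0 = pending, 1 = completed

    public long CreatedTicks { get; }

    public SketchMessage(long createdTicks) => CreatedTicks = createdTicks;

    public bool IsCompleted => Volatile.Read(ref _completed) == 1;

    // Only the first caller wins, so a timeout and a normal completion cannot both apply.
    public bool TryComplete() => Interlocked.CompareExchange(ref _completed, 1, 0) == 0;
}

public static class EnumeratingBacklog
{
    // Timeout pass: enumerates without removing anything, so it cannot race with dequeuers.
    public static int CompleteTimedOut(ConcurrentQueue<SketchMessage> backlog, long nowTicks, long timeoutTicks)
    {
        int timedOut = 0;
        foreach (var message in backlog) // ConcurrentQueue enumeration is a moment-in-time snapshot
        {
            if (nowTicks - message.CreatedTicks >= timeoutTicks && message.TryComplete())
                timedOut++;
        }
        return timedOut;
    }

    // Processing pass: dequeues every item and skips those that were already completed.
    public static int ProcessPending(ConcurrentQueue<SketchMessage> backlog)
    {
        int processed = 0;
        while (backlog.TryDequeue(out var message))
        {
            if (message.TryComplete()) processed++;
        }
        return processed;
    }
}
```

With two messages created at ticks 0 and 90, a timeout pass at tick 100 with a timeout of 50 completes only the first and leaves both queued; the processing pass then dequeues both and counts only the second. This also illustrates the drawback noted in the update above: timed-out items stay in the queue until a processing pass runs.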