Add partial packet detection and fixup #2714

Wraith2 · 2024-07-24T21:04:54Z

Split out from #2608 per discussion detailed in #2608 (comment)

Adds packet multiplexer and covering tests.

Wraith2 · 2024-07-25T18:20:52Z

I've added comments to the Packet class as requested. The CI was green apart from some ubuntu legs which timed out, many other ubuntu legs succeeded so I don't see any direct inference on from that.

Ready for review @David-Engel @saurabh500 @cheenamalhotra

saurabh500 · 2024-07-30T21:41:28Z

@Wraith2 We are reviewing this and hope to get faster traction towards EOW.
Wanted to give an update, instead of maintaining radio silence.

cc @VladimirReshetnikov

cheenamalhotra · 2024-08-06T09:49:43Z

Pasting test failure for reference:

    Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections [2 m 38 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     Assert.Empty() Failure: Collection was not empty
  Collection: ["Microsoft.Data.SqlClient.SqlException (0x80131904)"···]
    Stack Trace:
       at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.RunCancelAsyncConnections(SqlConnectionStringBuilder connectionStringBuilder) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs:line 66
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections() in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs:line 31
     at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
     at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

    Standard Output Messages:
   00:00:05.8665447 True Started:8 Done:0 InFlight:8 RowsRead:39 ResultRead:3 PoisonedEnded:1 nonPoisonedExceptions:0 PoisonedCleanupExceptions:0 Count:0 Found:0
   00:00:10.8624529 True Started:12 Done:0 InFlight:12 RowsRead:832 ResultRead:64 PoisonedEnded:6 nonPoisonedExceptions:6 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:15.8646242 True Started:17 Done:0 InFlight:17 RowsRead:2327 ResultRead:179 PoisonedEnded:11 nonPoisonedExceptions:9 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:20.8677772 True Started:42 Done:6 InFlight:36 RowsRead:4810 ResultRead:370 PoisonedEnded:18 nonPoisonedExceptions:14 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:25.8731904 True Started:71 Done:12 InFlight:59 RowsRead:9126 ResultRead:702 PoisonedEnded:30 nonPoisonedExceptions:29 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:30.8714979 True Started:77 Done:14 InFlight:63 RowsRead:12207 ResultRead:939 PoisonedEnded:38 nonPoisonedExceptions:36 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:35.0004685 True Started:86 Done:25 InFlight:61 RowsRead:17173 ResultRead:1321 PoisonedEnded:49 nonPoisonedExceptions:43 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:39.9987443 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:44.9985663 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:49.9982022 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:54.9982968 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:59.9996354 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:04.9991460 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:09.9983868 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:14.9975925 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:19.9977701 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:25.0122289 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:30.0025709 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:35.0024237 True Started:98 Done:65 InFlight:33 RowsRead:32344 ResultRead:2488 PoisonedEnded:65 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:40.0025057 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:45.0002633 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:49.9986071 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:54.9998736 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:59.9957745 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:04.9985369 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:09.9982641 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:14.9983408 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:19.9988637 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:25.0003251 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:29.9988943 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:35.0000752 True Started:100 Done:99 InFlight:1 RowsRead:50830 ResultRead:3910 PoisonedEnded:99 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:38.0752114 True Started:100 Done:100 InFlight:0 RowsRead:51376 ResultRead:3952 PoisonedEnded:100 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   Microsoft.Data.SqlClient.SqlException (0x80131904): A severe error occurred on the current command.  The results, if any, should be discarded.
      at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlE

This test should be looked at carefully.

It failed on Ubuntu with .NET 6 and 8 , and also hung up on Windows when ran with Managed SNI, link to logs 1 link to logs 2.
In this use case, multiple parallel async read operations are being performed, which means connection isolation should be intact while cancellation occurs in between, but it doesn't seem to be happening.

@Wraith2 can you confirm if this is something you're able to repro in Windows with Managed SNI? Please make sure config file is configured to enable Managed SNI on Windows.

cheenamalhotra · 2024-08-06T09:57:27Z

src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.Multiplexer.cs

+                {
+                    // Do nothing with callback if closed or broken and error not 0 - callback can occur
+                    // after connection has been closed.  PROBLEM IN NETLIB - DESIGN FLAW.
+                    return;


Can you add a Debug Assert here and check if this is taking any hit?

cheenamalhotra

Test needs to be fixed, before reviewing any further.

Wraith2 · 2024-08-06T10:24:21Z

Isn't this the set of tests that @David-Engel pointed out in #2608 (comment) ? If so we discussed it at length on the teams call. I don't believe that those tests are reliable.

Setup a breakpoint or Debug.WriteLine where an exception is added to the state object and run the test. You should find that an exception is always added to the state object but that the test will usually succeed. That should not be possible, an exception if added should be thrown. The test is missing failures and if that's the case then the test is unreliable.

Wraith2 · 2024-08-06T14:41:05Z

When you work past the terrible code in SNITCPHandle and make the test run for long enough it settles into a steady state where it can't reach the end. There is no indication why yet.

00:05:45.3696502 True Started:97 Done:88 InFlight:9 RowsRead:241216 ResultRead:3769 PoisonedEnded:88 nonPoisonedExceptions:0 PoisonedCleanupExceptions:0 Count:0 Found:0
00:05:50.3638134 True Started:97 Done:88 InFlight:9 RowsRead:241216 ResultRead:3769 PoisonedEnded:88 nonPoisonedExceptions:0 PoisonedCleanupExceptions:0 Count:0 Found:0

those 9 in flight items just don't seem to complete but i don't know why.
This is going to need your help from MS side to identify what's going on here.

Wraith2 · 2024-08-06T19:26:50Z

After a few more hours investigation I know what the problem is but I have no clue what change has caused it.

In SqlDataReader when an async method is called we use a context object to contain some state and pass that context object to all the async methods that are used to implement the async read machinery. Part of this state is the TaskCompletionSource.
When running the test CancelAsyncConnections many connections are opened and then SqlCommand.Cancel is called after a brief timed wait. If the timing of the cancel operation is exact then an async operation can be in progress and between packets at the time when the cancellation is executed.
This causes the thread awaiting the async operation to wait indefinitely for a task that will never be completed. This is what causes the stuck threads. The threads as stuck so the test can never complete and it then times out.

What I don't understand is how cancel is supposed to work. I'm unable to run the tests in native sni mode because the native sni can't be initialized (can't find the sni dll). So I can't compare the managed to unmanaged implementations here. I don't believe that I have made any change that should affect cancellation. I have verified that there are no partial packets in the state objects when the async tasks get stuck.

I don't understand how async cancellation is supposed to work at all.

Wraith2 · 2024-08-10T09:28:25Z

Can someone with CI access rerun the failed legs? the failures are random or CI resources not being available as far as i can tell.

Wraith2 · 2024-08-10T18:36:15Z

The current failures are interesting. They're in the test that was failing before but they new ones are only detected because i made the test more accurate.

  [xUnit.net 00:06:40.39]     Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections [FAIL]
  [xUnit.net 00:06:40.39]       Assert.Empty() Failure: Collection was not empty
  [xUnit.net 00:06:40.39]       Collection: ["Microsoft.Data.SqlClient.SqlException (0x80131904)"···]
  [xUnit.net 00:06:40.39]       Stack Trace:
  [xUnit.net 00:06:40.39]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs(71,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.RunCancelAsyncConnections(SqlConnectionStringBuilder connectionStringBuilder)
  [xUnit.net 00:06:40.39]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs(32,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections()
  [xUnit.net 00:06:40.39]            at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
  [xUnit.net 00:06:40.39]            at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
  [xUnit.net 00:06:40.39]       Output:
  [xUnit.net 00:06:40.39]         00:00:05.4318805 True Started:21 Done:0 InFlight:21 RowsRead:117 ResultRead:9 PoisonedEnded:4 nonPoisonedExceptions:2 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:10.4374767 True Started:25 Done:0 InFlight:25 RowsRead:1469 ResultRead:113 PoisonedEnded:11 nonPoisonedExceptions:9 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:15.4529038 True Started:31 Done:1 InFlight:30 RowsRead:4732 ResultRead:364 PoisonedEnded:14 nonPoisonedExceptions:13 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:20.4568918 True Started:67 Done:12 InFlight:55 RowsRead:7852 ResultRead:604 PoisonedEnded:28 nonPoisonedExceptions:21 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:24.9990795 True Started:91 Done:32 InFlight:59 RowsRead:19955 ResultRead:1535 PoisonedEnded:43 nonPoisonedExceptions:35 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:28.1341854 True Started:100 Done:100 InFlight:0 RowsRead:52273 ResultRead:4021 PoisonedEnded:100 nonPoisonedExceptions:44 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         Microsoft.Data.SqlClient.SqlException (0x80131904): A severe error occurred on the current command.  The results, if any, should be discarded.
  [xUnit.net 00:06:40.39]            at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlE
    Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections [28 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     Assert.Empty() Failure: Collection was not empty

The previous version of the test accepted any exception when it was expecting a cancellation exception. It was passing on netfx with my previous changes because timeout exceptions were being thrown. I judged that accepting a timeout when we were supposed to be testing whether cancellation had occurred was not correct.

If we retained the previous version of the test then everything would have passed cleanly. In the current situation since the test completed correctly without hanging the result is equivalent to what we would have experienced in all test runs in the past, all started threads that we expected to be cancelled exited with an exception.

[edit]
I ran that single test locally in Net6 managed SNI mode using the vs "Run Until Failure" option. This runs the test up to 1000 times sequentially and stops if it fails. It completed 1000 runs successfully.

David-Engel · 2024-08-12T17:06:02Z

@Wraith2 You might be banging your head against an unrelated issue in the driver. IIRC, the test was only introduced to ensure we don't regress "The MARS TDS header contained errors." issue. (The test code came from the repro.)

If you isolate your test changes and run them against main code, does it still fail? Yes, the correct exception is probably "Operation cancelled by user." where the exception is being caught. But if it's unrelated to your other changes, I would leave that part of the test as it was and file a new issue with repro code. As it is, it's unclear if and how this behavior is impacting users and I wouldn't hold up your perf improvements for it.

Wraith2 · 2024-08-12T17:38:54Z

There was definitely a real problem. The results differed between main and my branch. I've solved that issue now and the current state is that we're seeing a real failure because I've made the test more sensitive. I think it's probably safe to lower the sensitivity of the test again now because the new test that I've added covers the specific scenario in the multiplexer that I had missed and everything else is pass. I'll try that and see how the CI likes it.

I think the current state on this branch is that it is as stable as live. We need to have confidence that this set of changes is correct before we can merge it. It's high risk and high complexity code. Even understanding it very deeply it has taken me a week to actively debug a very important behaviour change that I missed.

Wraith2 · 2024-08-12T19:27:55Z

Can someone re-run the failed legs? the only failing test is something to do with event counters which I've been no-where near.

Wraith2 · 2024-08-12T22:00:58Z

The failing test is EventCounter_ReclaimedConnectionsCounter_Functional. It's doing something with GC specific to net6. It's failing sporadically on net6 managed sni runs but not deterministically. I can't make it fail locally to trace what might be happening.

Wraith2 · 2024-08-19T19:36:15Z

Any thoughts?

David-Engel · 2024-08-26T21:52:58Z

I'm not seeing the failures you mentioned in EventCounter_ReclaimedConnectionsCounter_Functional [in the CI results]. I mainly see fairly consistent failures of CancelAsyncConnections on Linux. It seems to pass on Windows managed SNI, so there might be something that is Linux/Ubuntu network specific. Can you run/debug the test against a local WSL or Docker instance?

Wraith2 · 2024-08-26T23:02:03Z

If i click through the failure i get to this page https://sqlclientdrivers.visualstudio.com/public/_build/results?buildId=95784&view=ms.vss-test-web.build-test-results-tab

The cancel tests are passing now, those failed in the previous runs but not the current ones.

David-Engel · 2024-08-27T17:52:46Z

I think there is something wrong with the Tests tab. I don't usually reference it. I scroll down the summary tab to see what jobs had failures:

Then drill into the job and the task that failed to see the log:

Wraith2 · 2024-08-27T18:38:33Z

If it's AsyncCancelledConnectionsTest again then there isn't anything further I can do. That test is multithreaded and timing dependent. I've traced the individual packets through the entire call stack. I've run it for 1000 iterations successfully after fixing a reproducible error in it. If someone can isolate a reproducible problem from it then i'll investigate.

David-Engel · 2024-08-28T20:48:25Z

I chatted with @saurabh500 and I just want to add that this is definitely something we all want to see get merged. It'll just take someone finding time (could take a few days dedicated time) to get their head wrapped around the new code and be able to help repro/debug to find the issue.

Wraith2 · 2024-08-28T20:59:46Z

I'm happy to make myself available to talk through the code with anyone that needs it.

saurabh500 · 2024-09-04T20:34:26Z

@Wraith2 and @David-Engel I was looking at the lifecycle of the snapshots and something that stood out in NetCore vs NetFx is that SqlDataReader for NetCore is storing the cached snapshot with the SqlInternalConnectionTds which is a shared resource among all the SqlDataReader(s) running on a MARS connection.

private void PrepareAsyncInvocation(bool useSnapshot)
{
    // if there is already a snapshot, then the previous async command
    // completed with exception or cancellation.  We need to continue
    // with the old snapshot.
    if (useSnapshot)
    {
        Debug.Assert(!_stateObj._asyncReadWithoutSnapshot, "Can't prepare async invocation with snapshot if doing async without snapshots");

        if (_snapshot == null)
        {
            if (_connection?.InnerConnection is SqlInternalConnection sqlInternalConnection)
            {
                _snapshot = Interlocked.Exchange(ref sqlInternalConnection.CachedDataReaderSnapshot, null) ?? new Snapshot();
            }
            else
            {
                _snapshot = new Snapshot();
            }

This means that we are saving the reader snapshot on the shared resource, which can be overwritten by any other reader.
Also a reader can receive another reader's snapshot.

@Wraith2 have you had a chance to pursue this line of investigation for hanging test?

I wonder if the timing is causing the wrong cached snapshot to be provided to a SqlDataReader, causing data corruption and likely causing a hang.

saurabh500 · 2024-09-04T20:36:28Z

SqlInternalConnection.cs


#if NET6_0_OR_GREATER
        internal SqlCommand.ExecuteReaderAsyncCallContext CachedCommandExecuteReaderAsyncContext;
        internal SqlCommand.ExecuteNonQueryAsyncCallContext CachedCommandExecuteNonQueryAsyncContext;
        internal SqlCommand.ExecuteXmlReaderAsyncCallContext CachedCommandExecuteXmlReaderAsyncContext;

        internal SqlDataReader.Snapshot CachedDataReaderSnapshot;
        internal SqlDataReader.IsDBNullAsyncCallContext CachedDataReaderIsDBNullContext;
        internal SqlDataReader.ReadAsyncCallContext CachedDataReaderReadAsyncContext;
#endif

saurabh500 · 2024-09-04T20:39:04Z

@Wraith2 I see that you had made the changes in the first place. Can you try another PR where you remove the storage of these contexts and snapshots on SqlInternalConnection and with the multiplexing change, try to see if this solves the problem.

Also, I am Happy to be told that my theory is wrong, but I would like to understand how in MARS cases, the shared Cached contexts on InternalConnection is a safe design choice.

Wraith2 · 2024-10-29T08:44:18Z

Thanks both. I always have this problem. I manage the complex stuff ok but make silly errors, in this case omitting a !. I'll push that fix in a few minutes.

Use of PartialPacketContainsCompletePacket depends on the meaning. The implementation needs to be the same because the basis of the decisions need to be the same in both cases. We can't pass it out of the sni function because the parameter signature is conserved but we need to know what it is/was because it tells us whether we need to handle the packet lifetime or not.

MichealZ, is there any chance you could reset the state of the SqlClient repo? on the machine I've been using? I tried getting it up to date with the current repo when i pushed previous changes but it somehow ended up identifying changes in >100 files from netfx, not sure what went wrong...

MichelZ · 2024-10-29T08:45:46Z

MichealZ, is there any chance you could reset the state of the SqlClient repo? on the machine I've been using? I tried getting it up to date with the current repo when i pushed previous changes but it somehow ended up identifying changes in >100 files from netfx, not sure what went wrong...

Sure, give me a few minutes. I always have this problem as well, not sure why :)

MichelZ · 2024-10-29T08:53:00Z

Sure, give me a few minutes. I always have this problem as well, not sure why :)

Done

Wraith2 · 2024-10-29T09:07:51Z

Changes pushed, lets see what the CI thinks now. This includes your Wraith2#5 PR @edwardneal
I'll try and get to the multithreaded failure after work. I think I should start by investigating the same thing that showed me the partial packet problem, being stuck in a network read with a non-null _partialPacket, in theory that should be very rare.

Thanks @MichelZ !

Wraith2 · 2024-10-29T23:02:28Z

Just adding more data. I still get intermittent failures in DataReaderCancellationTest.CancellationTokenIsRespected_ReadAsync. But I also get consistent failures for SqlNotificationTest.Test_SingleDependency_AllDefaults_SqlAuth

I've just tested Test_SingleDependency_AllDefaults_SqlAuth and it's running and succeeding in a loop for me. Can you give it a test on the latest code when you get a chance please.

Wraith2 · 2024-10-30T01:22:49Z

I've just identified the problem with the AsyncCancelledConnectionsTest.CancelAsyncConnections test and I don't think it's anything to do with this PR. I enhanced the tracker dictionary that I added to the test to help me debug into it and added the connection. With the very exact timing given in #2714 (comment) I can reproduce the issue and investigate the connections' state.

Here's what I found:

The connection timeout when running the test suite is infinite. I've hit problems with that before where I've forgotten to start the sql server and the tests don't fail they just hang indefinitely (and confusingly). I believe that we've hit another manifestation of #422 where there just aren't enough resources to do everything and stuff stops working, one of the thing we've seen in 422 is that things that should succeed or error just hang. Combine that with the fact that it's linux networking and it can only be reproduced on low processor count machines and you've got enough suggestive correlation to convince me. I still cannot reproduce the failure at all on any machine I own even in ubuntu limited to 2 processors.

Can anyone else think of any reason this can't be the case? Can anyone think of a way to validate it? I can't get from the DbInternalConnection instance in the debugger to the TdsInternalConnection to inspect any of the TdsStateParserObject state to confirm that the multiplexer related state it contains is consistent. However since we're looking at login code here if there were a problem with it then it would be quite likely to be hit in every other test that requires a connection, which is everything in the manual testing project.

I've attached the updated test file with the tracker and time changed so you can try to replicate and inspect.
AsyncCancelledConnectionsTest.zip

MichelZ · 2024-10-30T07:10:25Z

Building on that, I've added a couple more tracking points: AsyncCancelledConnectionsTest.zip

I think the connection was closed because it was reset already.
What I can see from the added tracking is that it either hangs on ReadAsync() or NextResultAsync()

From the image above we can see that it has done ReadAsync() 52 times successfully.
It does not advance to "read_finished" and it does not track an exception in "exception_read" or "unexpected_exception"

All items that remain in the tracker feature the same pattern.

edwardneal · 2024-10-30T07:50:44Z

I've gathered a set of memory dumps and I'm fairly certain you're correct - it's thread pool exhaustion. I've left process 405155 running on the testbed @MichelZ, along with the various dotnet-dump/dotnet-gcdump/etc. utilities. I have a heap dump which includes the TdsParserStateObject instances, and I'm trying to load the PDBs so that I can interpret it.

In my hanging instance, I can see 8 thread pool threads. One is in use by XUnit, another is in use by the thread waiting on Parallel.ForEach; in both cases, these are waiting on the other six. I think Parallel.ForEach has started four instances, (or at least, I can see four TdsParser/TdsParserStateObjectManaged/SNITCPHandle instances) so there's not a lot of room for manoeuvre if we hit a code path which runs async-over-sync, and it looks like this is what happened - it the remaining six seem to have deadlocked on one another.

I haven't yet tested with a wider thread pool, but I'll do that after work.

MichelZ · 2024-10-30T11:01:11Z

For what it's worth, I can still reproduce it when setting MinThreads to 32, altough less Tasks seem to hang (i.e. I get 3-8 during the default MinThreads, and 1-2 with 32 MinThreads)

I can see that a callback is registered here (code executed in test):

SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs

Line 5579 in 69dc7d4

task.ContinueWith(

But the Continuation is never executed for the hanging tasks here:

SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs

Line 5390 in 69dc7d4

    
           internal static readonly Action<Task<T>, object> s_completeCallback = CompleteAsyncCallCallback;

Maybe that helps getting to the bottom of it?
(I've added simple Console.WriteLine's just before these calls)

It also seems that most calls get short-circuited and completed here, without needing a continuation:

SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs

Line 5573 in 69dc7d4

if (task.IsCompleted)

e.g. in my latest test (with 32 MinThreads) I have 3 Register Continuation, and 1 CompleteAsyncCallback, aka 2 hanging.

Wraith2 · 2024-10-30T14:36:34Z

Your traces concerned me, they indicated that the connection was somehow open which would invalidate my theory.
So i added a state change handler to the connections and dumped the connection state into the tracker on the pre and post debug lines. The output is confusing. I'll reorder it but leave it otherwise unchanged.

Key = "handler", Value = {System.Data.StateChangeEventHandler}
Key = "connection", Value = {InnerConnection = {Microsoft.Data.ProviderBase.DbConnectionClosedNeverOpened}}
Key = "doone.preopen", Value = ({10/30/2024 2:25:27 PM}, Closed)
Key = "(Closed, Open)", Value = (Closed, Open, "   at System.Environment.get_StackTrace()\n   at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.<>c__DisplayClass8_0.<DoManyAsync>b__0(Object sender, StateChangeEventArgs e) in /home/sqlclient/SqlClient/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs:line 104\n   at Microsoft.Data.SqlClient.SqlConnection.OnStateChange(StateChangeEventArgs stateChange) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs:line 1146\n   at Microsoft.Data.SqlClient.SqlConnection.SetInnerConnectionEvent(DbConnectionInternal to) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnectionHelper.cs:line 261\n   at Microsoft.Data.SqlClient.SqlConnectionFactory.SetInnerConnectionEvent(DbConnection owningObject, DbConnectionInternal to) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnectionFactory.cs:line 275\n   at Microsoft.Data.ProviderBase.DbConnectionClosedConnecting.TryOp...")
Key = "doone.postopen", Value = ({10/30/2024 2:25:28 PM}, Closed)

So the connection is initially closed at the preopen point, then transitions from closed to open (the (Closed, Open) tuple key) then doesn't transition state again but in the postopen debug point it's now in the closed state again. And no exception was handled.

MichelZ · 2024-10-30T15:52:15Z

I have no idea if I'm on the wrong track or not, but I have added the following simple code after this line here:

SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs

Line 5621 in 69dc7d4

).Unwrap();

 var timeout = DateTime.UtcNow.AddMinutes(1);
                while (!testTask.IsCompleted){
                    
                    {
                        Console.WriteLine("Task still waiting.");
                        Thread.Sleep(1000);

                        if (timeout < DateTime.UtcNow){
                            Console.WriteLine("Timeout. Cancelling.");
                            completionSource.TrySetCanceled();
                        }
                    }
                }

(also capturing the Unwrapped task with var testTask = )

Which aborts the Async operation after 1 minute.
This allows the test to complete.

It also means that the TaskCompletionSource here does never complete/fire, it's set to _stateObj._networkPacketTaskSource.
Does that mean we're waiting on the network in this instance for some reason and are not receiving any more data?
Maybe another problem with attention?

Or I could be completely off of course and it has nothing to do with that

Wraith2 · 2024-10-30T16:17:13Z

Yes, that means it waiting on network data. A way to catch that read being started in the debugger would be ideal.

David-Engel · 2024-10-30T16:27:47Z

Just adding more data. I still get intermittent failures in DataReaderCancellationTest.CancellationTokenIsRespected_ReadAsync. But I also get consistent failures for SqlNotificationTest.Test_SingleDependency_AllDefaults_SqlAuth

I've just tested Test_SingleDependency_AllDefaults_SqlAuth and it's running and succeeding in a loop for me. Can you give it a test on the latest code when you get a chance please.

I tested Test_SingleDependency_AllDefaults_SqlAuth against main in my environment and it fails there. So that one isn't an issue with this PR. It's something about my setup. You can ignore it.

MichelZ · 2024-10-30T17:26:53Z

So what I can see from a packet capture is that for "hanging" tasks, the packets are:
SQL batch
Response (with 0xFD10 aka DONE & DONE_COUNT)
Attention
Response (with 0xFD20 aka DONE & DONE_ATTN)

You can find the full capture here of all tasks:
wireshark.zip

In this case there were 3 hanging Tasks, 3, 50 and 51.
The packets are "marked" by modifying the SELECT statement slightly to include the parent id:
builder.AppendLine($"SELECT name, '{parent}' FROM sys.tables");

Here is the memory dump
I can see on the TdsParserStateObjectManaged that _attentionSent = true and _snapshotStatus = 0 (so not received?)

The _partialPacket._buffer is:
0x04 0x01 0x00 0x15 0x00 0x35 0x01 0x00 0xFD 0x20 0x00 0xFD 0x00 0x00 [...]

indicating the 0xFD20 DONE DONE_ATTN arrived.

Wraith2 · 2024-10-30T17:48:05Z

So it's stuck on a network read with a non-null partial packet?

MichelZ · 2024-10-30T19:51:20Z

So it's stuck on a network read with a non-null partial packet?

I'm not sure. I can't see any related stacks in the Parallel Stacks view that would suggest it's waiting on the network.
Additionally, _readingCount is 0, which seems to suggest it's not on a network-waiting path?

MichelZ · 2024-10-30T20:13:40Z

Here's the memory dump

Wraith2 · 2024-10-30T21:01:32Z

I can't see any tasks at all when I break into the stuck test with the debugger. It's all very complicated and very confusing.

Wraith2 · 2024-10-30T21:40:41Z

After cleaning up the debugging i've found that all the stuck connections have an _pendingCallbacks==2 which means 1 from the ctor and 1 from an async read and that there is a complete packet stuck in the _partialPacket. Now to work out how we got there...

Wraith2 · 2024-10-31T00:01:28Z

Pull and try the latest please.

Run: 100
Test run for /home/sqlclient/SqlClient/artifacts/Project/bin/Unix/Debug.AnyCPU.ManualTests/net8.0/ManualTests.dll (.NETCoreApp,Version=v8.0)
VSTest version 17.11.1 (x64)

Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
[xUnit.net 00:00:00.00] xUnit.net VSTest Adapter v2.5.5+28b46ff510 (64-bit .NET 8.0.10)
[xUnit.net 00:00:00.10]   Discovering: ManualTests (method display = ClassAndMethod, method display options = None)
App Context switch Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows enabled on Unix 5.4.0.198
[xUnit.net 00:00:00.52] ManualTests: Non-serializable data ('System.Object[]') found for 'Microsoft.Data.SqlClient.ManualTesting.Tests.CertificateTestWithTdsServer.BeginWindowsConnectionTest'; falling back to single test case.
[xUnit.net 00:00:00.52] ManualTests: Non-serializable data ('System.Object[]') found for 'Microsoft.Data.SqlClient.ManualTesting.Tests.CertificateTestWithTdsServer.BeginLinuxConnectionTest'; falling back to single test case.
[xUnit.net 00:00:00.69]   Discovered:  ManualTests (found 953 test cases)
[xUnit.net 00:00:00.70]   Starting:    ManualTests (parallel test collections = on [2 threads], stop on fail = off)
[xUnit.net 00:00:44.50]   Finished:    ManualTests
  Passed Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections [43 s]

Test Run Successful.
Total tests: 1
     Passed: 1
 Total time: 44.9758 Seconds

MichelZ · 2024-10-31T05:11:45Z

This has definitely improved things.
However, I now have this exception:


{[unexpected_exception, Microsoft.VisualStudio.TestPlatform.TestHost.DebugAssertException: Method Debug.Fail failed with '_pendingCallbacks values is invalid after decrementing: -1
', and was translated to Microsoft.VisualStudio.TestPlatform.TestHost.DebugAssertException to avoid terminating the process hosting the test.
   at Microsoft.Data.SqlClient.TdsParserStateObject.DecrementPendingCallbacks(Boolean release) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObject.netcore.cs:line 249
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadAsyncCallback(IntPtr key, PacketHandle packet, UInt32 error) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObject.netcore.cs:line 495
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSni(TaskCompletionSource`1 completion) in /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs:line 2465
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() in /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs:line 2058
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() in /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs:line 1058
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value) in /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs:line 1295
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs:line 2072
   at Microsoft.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 3341
   at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 3630
   at Microsoft.Data.SqlClient.SqlDataReader.ReadAsyncExecute(Task task, Object state) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 5032
   at Microsoft.Data.SqlClient.SqlDataReader.InvokeAsyncCall[T](SqlDataReaderBaseAsyncCallContext`1 context) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 5581
   at Microsoft.Data.SqlClient.SqlDataReader.ReadAsync(CancellationToken cancellationToken) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 5013
   at System.Data.Common.DbDataReader.ReadAsync()
   at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.RunCommand(SqlConnection connection, ConcurrentDictionary`2 tracker, String commandText, Boolean poison, Int32 parent) in /home/sqlclient/SqlClient/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs:line 250
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.UnwrapPromise`1.TrySetFromTask(Task task, Boolean lookForOce)
   at System.Threading.Tasks.Task.RunOrQueueCompletionAction(ITaskCompletionAction completionAction, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
]}

MichelZ · 2024-10-31T07:29:32Z

Additionally, I do see a callstack in the Parallel Stacks view now that is waiting here:

SqlClient/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs

Line 838 in 69dc7d4

SpinWait.SpinUntil(() => Volatile.Read(ref _readingCount) == 0);


System.Private.CoreLib.dll!System.Threading.Thread.Sleep(int millisecondsTimeout) Line 368
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Thread.cs(368)
System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnceCore(int sleep1Threshold) Line 193
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/SpinWait.cs(193)
System.Private.CoreLib.dll!System.Threading.SpinWait.SpinUntil(System.Func<bool> condition, int millisecondsTimeout) Line 324
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/SpinWait.cs(324)
System.Private.CoreLib.dll!System.Threading.SpinWait.SpinUntil(System.Func<bool> condition) Line 261
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/SpinWait.cs(261)
Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.TdsParserStateObject.DisposeCounters() Line 838
	at /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs(838)
Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SNI.TdsParserStateObjectManaged.Dispose() Line 177
	at /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObjectManaged.cs(177)
Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.TdsParser.Disconnect() Line 1262
	at /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs(1262)
Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlInternalConnectionTds.Dispose() Line 757
	at /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlInternalConnectionTds.cs(757)
Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionPool.DestroyObject(Microsoft.Data.ProviderBase.DbConnectionInternal obj) Line 1073
	at /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/ProviderBase/DbConnectionPool.cs(1073)
Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionPool.DeactivateObject(Microsoft.Data.ProviderBase.DbConnectionInternal obj) Line 1031
	at /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/ProviderBase/DbConnectionPool.cs(1031)
Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionPool.PutObject(Microsoft.Data.ProviderBase.DbConnectionInternal obj, object owningObject) Line 1800
	at /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/ProviderBase/DbConnectionPool.cs(1800)
Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionInternal.CloseConnection(System.Data.Common.DbConnection owningObject, Microsoft.Data.ProviderBase.DbConnectionFactory connectionFactory) Line 504
	at /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/ProviderBase/DbConnectionInternal.cs(504)
Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnection.CloseInnerConnection() Line 1274
	at /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs(1274)
Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnection.Close() Line 1325
	at /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs(1325)
Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnection.Dispose(bool disposing) Line 173
	at /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnectionHelper.cs(173)
System.ComponentModel.Primitives.dll!System.ComponentModel.Component.Dispose() Line 73
	at /_/src/libraries/System.ComponentModel.Primitives/src/System/ComponentModel/Component.cs(73)
ManualTests.dll!Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.DoOneAsync(Microsoft.Data.SqlClient.SqlConnection marsConnection, System.Collections.Concurrent.ConcurrentDictionary<int, System.Collections.Concurrent.ConcurrentDictionary<string, object>> tracker, string connectionString, bool poison, int parent) Line 178
	at /home/sqlclient/SqlClient/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs(178)
[Resuming Async Method]
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 179
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs(179)
System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.AsyncStateMachineBox<Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.<DoOneAsync>d__9>.MoveNext(System.Threading.Thread threadPoolThread) Line 376
	at /_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs(376)
System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox box, bool allowInlining) Line 795
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/TaskContinuation.cs(795)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(object continuationObject) Line 3456
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(3456)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.FinishSlow(bool userDelegateExecute) Line 1999
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(1999)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.TrySetException(object exceptionObject) Line 3371
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(3371)
System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.SetException(System.Exception exception, ref System.Threading.Tasks.Task<System.Threading.Tasks.VoidTaskResult> taskField) Line 516
	at /_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs(516)
[Completed] ManualTests.dll!Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.RunCommand(Microsoft.Data.SqlClient.SqlConnection connection, System.Collections.Concurrent.ConcurrentDictionary<int, System.Collections.Concurrent.ConcurrentDictionary<string, object>> tracker, string commandText, bool poison, int parent) Line 314
	at /home/sqlclient/SqlClient/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs(314)
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 179
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs(179)
System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox box, bool allowInlining) Line 795
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/TaskContinuation.cs(795)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(object continuationObject) Line 3456
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(3456)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.FinishSlow(bool userDelegateExecute) Line 1999
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(1999)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.TrySetException(object exceptionObject) Line 3371
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(3371)
System.Private.CoreLib.dll!System.Threading.Tasks.UnwrapPromise<Microsoft.Data.SqlClient.SqlDataReader>.TrySetFromTask(System.Threading.Tasks.Task task, bool lookForOce) Line 7128
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(7128)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunOrQueueCompletionAction(System.Threading.Tasks.ITaskCompletionAction completionAction, bool allowInlining) Line 3589
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(3589)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(object continuationObject) Line 3474
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(3474)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot, System.Threading.Thread threadPoolThread) Line 2362
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs(2362)
System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch() Line 989
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs(989)
System.Private.CoreLib.dll!System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() Line 102
	at /_/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.NonBrowser.cs(102)
[Native to Managed Transition]
[Async Call Stack]
[Async] ManualTests.dll!Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.DoManyAsync(int index, System.Collections.Concurrent.ConcurrentDictionary<int, System.Collections.Concurrent.ConcurrentDictionary<string, object>> tracker, Microsoft.Data.SqlClient.SqlConnectionStringBuilder connectionStringBuilder) Line 134
	at /home/sqlclient/SqlClient/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs(134)

MichelZ · 2024-10-31T08:36:24Z

So this hang seems to be cause by the Debug build from the Assert of _pendingCallbacks here:

SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObject.netcore.cs

Line 249 in 1f57d73

    
           Debug.Assert((remaining == -1 && SessionHandle.IsNull) || (0 <= remaining && remaining < 3), $"_pendingCallbacks values is invalid after decrementing: {remaining}");

This causes this call to fail:

SqlClient/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs

Line 2443 in 1f57d73

DecrementPendingCallbacks(false); // Failure - we won't receive callback!

Which means that the _readingCount is not decremented here:

SqlClient/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs

Line 2453 in 1f57d73

Interlocked.Decrement(ref _readingCount);

My fix for this is to use another try/finally block to encapsulate everything in the outer finally block - right after the Interlocked.Increment here:

SqlClient/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs

Line 2428 in 1f57d73

Interlocked.Increment(ref _readingCount);


               try
                { }
                finally
                {
                    Interlocked.Increment(ref _readingCount);

                    try {
                        handle = SessionHandle;

                        readFromNetwork = !PartialPacketContainsCompletePacket();
                        if (readFromNetwork)
                        {
                            if (!handle.IsNull)
                            {
                                IncrementPendingCallbacks();

                                readPacket = ReadAsync(handle, out error);

                                if (!(TdsEnums.SNI_SUCCESS == error || TdsEnums.SNI_SUCCESS_IO_PENDING == error))
                                {
                                    DecrementPendingCallbacks(false); // Failure - we won't receive callback!
                                }
                            }
                        }
                        else
                        {
                            readPacket = default;
                            error = TdsEnums.SNI_SUCCESS;
                        }
                    } 
                    finally 
                    {
                        Interlocked.Decrement(ref _readingCount);
                    }
                }

While this makes the test work, I wonder if that DecrementPendingCallbacks assert of _pendingCallbacks values is invalid after decrementing: -1 is a problem. In the test the Exception:

SqlClient/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs

Line 238 in 1f57d73

Assert.Fail("unexpected exception: " + ex.GetType().Name + " " +ex.Message);

gets ignored here, because it's a poison = true run:

SqlClient/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs

Line 154 in 1f57d73

if (!poison)

is this the correct way of doing it?

Wraith2 · 2024-10-31T08:51:18Z

How did we end up with an unbalanced pendingCount? all the locations I changed I attempted to maintain the invariants like that by changing as little as possible. Even with a read from the partialPacket the count should be incremented and decremented correctly.

Wraith2 · 2024-10-31T09:16:38Z

Also note the comment in the function:

        internal int DecrementPendingCallbacks(bool release)
        {
            int remaining = Interlocked.Decrement(ref _pendingCallbacks);
            SqlClientEventSource.Log.TryAdvancedTraceEvent("TdsParserStateObject.DecrementPendingCallbacks | ADV | State Object Id {0}, after decrementing _pendingCallbacks: {1}", _objectID, _pendingCallbacks);
            FreeGcHandle(remaining, release);
            // NOTE: TdsParserSessionPool may call DecrementPendingCallbacks on a TdsParserStateObject which is already disposed
            // This is not dangerous (since the stateObj is no longer in use), but we need to add a workaround in the assert for it
            Debug.Assert((remaining == -1 && SessionHandle.IsNull) || (0 <= remaining && remaining < 3), $"_pendingCallbacks values is invalid after decrementing: {remaining}");
            return remaining;
        }

Tests in this repo are not normally run in Debug mode so this could have existed and been happening for a long time and we wouldn't see it. I've got the test running through your script in a loop with this PR changes and i'll see if it gets to 100. If it does then we'll see what people think.

MichelZ · 2024-10-31T13:43:13Z

I've also managed to get a NullReferenceException here:

SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObjectManaged.cs

Line 38 in 1f57d73

return p.IsInvalid || source != null;

with this Stack trace:

at Microsoft.Data.SqlClient.SNI.TdsParserStateObjectManaged.CheckPacket(PacketHandle packet, TaskCompletionSource`1 source) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObjectManaged.cs:line 38
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadAsyncCallback(IntPtr key, PacketHandle packet, UInt32 error) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObject.netcore.cs:line 495
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSni(TaskCompletionSource`1 completion) in /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs:line 2469
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() in /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs:line 2058
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() in /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs:line 1058
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value) in /_/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs:line 1295
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs:line 2072
   at Microsoft.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 3341
   at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 3630
   at Microsoft.Data.SqlClient.SqlDataReader.ReadAsyncExecute(Task task, Object state) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 5032
   at Microsoft.Data.SqlClient.SqlDataReader.InvokeAsyncCall[T](SqlDataReaderBaseAsyncCallContext`1 context) in /_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs:line 5581
--- End of stack trace from previous location ---
   at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.RunCommand(SqlConnection connection, ConcurrentDictionary`2 tracker, String commandText, Boolean poison, Int32 parent) in /home/sqlclient/SqlClient/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs:line 256

This again should only affect Debug builds as far as I can see, but probably still worth fixing?

Wraith2 · 2024-10-31T14:44:39Z

Yes, worth fixing even if it only explains the invariants at that point in the code. Changed to:
Debug.Assert((packet.Type == 0 && PartialPacketContainsCompletePacket()) || (CheckPacket(packet, source) && source != null), "AsyncResult null on callback");

Wraith2 marked this pull request as ready for review July 25, 2024 18:20

saurabh500 self-requested a review July 29, 2024 19:29

cheenamalhotra reviewed Aug 6, 2024

View reviewed changes

cheenamalhotra requested changes Aug 6, 2024

View reviewed changes

This was referenced Aug 9, 2024

Code Cleanup | 123 == x => x == 123 etc #2766

Open

Code Cleanup | Fixing Inconsistent Modifiers #2765

Open

Wraith2 force-pushed the operation-status-part2 branch from 88fbb93 to 8b57818 Compare August 10, 2024 00:26

Wraith2 requested a review from cheenamalhotra August 28, 2024 00:29

review feedback and misc fix

69dc7d4

protect missed locations where network reads can happen

1f57d73

fix CheckPacket assertion

00c2426

Add partial packet detection and fixup #2714

Are you sure you want to change the base?

Add partial packet detection and fixup #2714

Conversation

Wraith2 commented Jul 24, 2024

Wraith2 commented Jul 25, 2024

saurabh500 commented Jul 30, 2024

cheenamalhotra commented Aug 6, 2024 • edited Loading

cheenamalhotra Aug 6, 2024

Choose a reason for hiding this comment

cheenamalhotra left a comment

Choose a reason for hiding this comment

Wraith2 commented Aug 6, 2024

Wraith2 commented Aug 6, 2024

Wraith2 commented Aug 6, 2024

Wraith2 commented Aug 10, 2024

Wraith2 commented Aug 10, 2024 • edited Loading

David-Engel commented Aug 12, 2024

Wraith2 commented Aug 12, 2024

Wraith2 commented Aug 12, 2024

Wraith2 commented Aug 12, 2024

Wraith2 commented Aug 19, 2024

David-Engel commented Aug 26, 2024 • edited Loading

Wraith2 commented Aug 26, 2024

David-Engel commented Aug 27, 2024

Wraith2 commented Aug 27, 2024

David-Engel commented Aug 28, 2024

Wraith2 commented Aug 28, 2024

saurabh500 commented Sep 4, 2024

saurabh500 commented Sep 4, 2024

saurabh500 commented Sep 4, 2024 • edited Loading

Wraith2 commented Oct 29, 2024

MichelZ commented Oct 29, 2024 • edited Loading

MichelZ commented Oct 29, 2024

Wraith2 commented Oct 29, 2024

Wraith2 commented Oct 29, 2024

Wraith2 commented Oct 30, 2024

MichelZ commented Oct 30, 2024 • edited Loading

edwardneal commented Oct 30, 2024

MichelZ commented Oct 30, 2024

Wraith2 commented Oct 30, 2024

MichelZ commented Oct 30, 2024 • edited Loading

Wraith2 commented Oct 30, 2024 • edited Loading

David-Engel commented Oct 30, 2024

MichelZ commented Oct 30, 2024 • edited Loading

Wraith2 commented Oct 30, 2024

MichelZ commented Oct 30, 2024

MichelZ commented Oct 30, 2024

Wraith2 commented Oct 30, 2024

Wraith2 commented Oct 30, 2024

Wraith2 commented Oct 31, 2024

MichelZ commented Oct 31, 2024

MichelZ commented Oct 31, 2024

MichelZ commented Oct 31, 2024

Wraith2 commented Oct 31, 2024

Wraith2 commented Oct 31, 2024

MichelZ commented Oct 31, 2024

Wraith2 commented Oct 31, 2024

cheenamalhotra commented Aug 6, 2024 •

edited

Loading

Wraith2 commented Aug 10, 2024 •

edited

Loading

David-Engel commented Aug 26, 2024 •

edited

Loading

saurabh500 commented Sep 4, 2024 •

edited

Loading

MichelZ commented Oct 29, 2024 •

edited

Loading

MichelZ commented Oct 30, 2024 •

edited

Loading

MichelZ commented Oct 30, 2024 •

edited

Loading

Wraith2 commented Oct 30, 2024 •

edited

Loading

MichelZ commented Oct 30, 2024 •

edited

Loading