RestoreTask randomly fails after upgrading to latest version #13545
Comments
We're seeing this and #13551 repeatedly in CI. Is there something we can do to help get this resolved? |
The dotnet-public feed doesn't have upstreams. |
@nkolev92 do you know what's going on? It looks like the same symptoms where the feed lists a particular version but then NuGet receives a 404 when trying to download the package. |
Looking at the code this is the only place where PackageNotFoundProtocolException is thrown: https://github.com/NuGet/NuGet.Client/blob/20f05435be385abfe74737b6433dc80fd3b3b504/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L69-L88 Following |
Also tracking in dotnet/dnceng#3100 and dotnet/runtime#103526 |
I'll reach out offline as well, but if possible, can anyone try using .NET 9 Preview 4 and see if the problems persist? |
#13551 is tracked as dotnet/runtime#103526. A retry always fixes the error, so it's intermittent, but it consistently fails a couple of builds a day. |
I worked with @embetten from the Azure Artifacts team. I was told that the
Upon checking if there are any 404s for either 'System.Text.Json.7.0.3' or 'Castle.Core.5.1.1' packages, surprisingly no records were found in telemetry and IIS/AFD logs. Emily mentioned that one potential reason why the feed responded with a 404 might be that the AzDO service was still processing the package after its push while there was a request to download it, which would have failed with a 404. For example, the 'System.Text.Json.7.0.3' package was ingested into the feed on 06/13, and the issue was reported on the same day. |
dotnet/sdk#41590 lists recent failures, in particular: https://dev.azure.com/dnceng-public/public/_build/results?buildId=711675&view=logs&j=5960db8e-22cf-5780-65a0-f69bf3efbd20&t=663b2309-e924-59c7-4a84-d59cf2e17f86&l=1171 for example:
However, it was published on 3/20: https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet9/NuGet/Roslyn.Diagnostics.Analyzers/overview/3.11.0-beta1.24170.2. So I don't think timing is the likely issue. |
Yeah, it is the same case with 'Castle.Core.5.1.1' package for |
Hey everyone, I've done some more investigation with some help from @zivkan and @jeffkl, and while we don't have anything conclusive, we do have a way forward. Investigation summary: As mentioned a few times, what's currently getting thrown is a PackageNotFoundProtocolException. The good news is that there's a lot of logging. We expect that 404s might happen, and NuGet does log all HTTP results, but it does so at normal verbosity. I did consider adding additional messages, but I believe there's plenty of information in the current logs (once we get the higher verbosity logs) to make further progress. Call to action
Beyond this, it'd also be helpful to know things like: |
@nkolev92 I'm having trouble figuring out how to increase the log verbosity for the RestoreTask, would you mind showing which property you had in mind? Note that I already posted the binlogs in the original comment in case you missed it. |
I looked at the binlogs again and I don't see a GET request for the packages which fail later on, only for the index.json. |
There's no special property, the cli argument should work. |
So you're saying this will emit more data into the binlog? |
No, my bad, I should've been explicit. I was just answering that part of the question. Checking now to see if I can understand how the Get/Cache message would never happen and yet we'd still get to throw that exception. |
The only way I can see the log message not showing, without another message such as a cancellation or retry message appearing instead, is if it somehow exited before hitting this code path: https://github.com/NuGet/NuGet.Client/blob/d1d2e260de9b8f20175e7766aa88e1ce1ece6b4e/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L258-L263. I think we'll probably need to add some custom logging here. |
I may be hitting the same issue; CI succeeded on a retry: https://weihanli.visualstudio.com/Pipelines/_build/results?buildId=6428&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=7dae026f-9e99-5d65-075e-3f7579577f94&s=96ac2280-8cb4-5df5-99de-dd2da759617d |
@nkolev92 is this being worked on? We can do some runs on AzDO with a patched nuget.client .dll if necessary |
Kindly asking for an update on this issue. This is still affecting most of our builds.
So roughly 100 hits in the last 5 days. |
I was out unexpectedly all of last week. |
Catching up here, @ViktorHofer, I think dotnet/runtime#103823 looks like it's a runtime issue and something likely unrelated to this particular problem, so I'm not tracking that one as a duplicate of this. |
Yep, we'll have to do that. |
Update: |
When multiple threads try to write to the fields, we can sometimes hit a case where `_retryCount` is set to 0. This caused the loop in `ProcessHttpSourceResultAsync()` to be skipped completely, because its condition `retry <= maxRetries` is already false on the first check: https://github.com/NuGet/NuGet.Client/blob/6c642c2d63717acd4bf92050a42691928020eb89/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L258-L328 Fix this by using `Lazy<T>`, which is thread-safe, to initialize the fields. Fixes NuGet/Home#13545
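To illustrate why a retry count of 0 means no request is ever sent (which would match the earlier observation that the binlogs show no GET for the failing packages), here is a minimal sketch. `RetryLoopSketch`, `TryDownload` and the `attempt` callback are hypothetical names, and the loop starting at `retry = 1` is an assumption inferred from the description above, not a copy of the NuGet code.

```csharp
using System;

public static class RetryLoopSketch
{
    // Hypothetical stand-in for the download loop. Returns true as soon as
    // one attempt succeeds, false if every attempt fails or none is made.
    public static bool TryDownload(int maxRetries, Func<int, bool> attempt)
    {
        // Assumption: the counter starts at 1, so maxRetries == 0 makes the
        // condition false on the very first check.
        for (var retry = 1; retry <= maxRetries; ++retry)
        {
            if (attempt(retry))
            {
                return true;
            }
        }

        // Reached without calling attempt() at all when maxRetries == 0,
        // so the caller sees a "not found" result without any HTTP traffic.
        return false;
    }
}
```

Calling `RetryLoopSketch.TryDownload(0, _ => true)` returns false without invoking the callback, whereas any positive `maxRetries` invokes it at least once.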
We figured out the root cause: NuGet/NuGet.Client#5905 I ran three |
When multiple threads try to access the `RetryCount` property, we sometimes hit a case where `RetryCount` returned 0 (*). This caused the loop in `ProcessHttpSourceResultAsync()` to be skipped completely, because its condition `retry <= maxRetries` is already false on the first check: https://github.com/NuGet/NuGet.Client/blob/6c642c2d63717acd4bf92050a42691928020eb89/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L258-L328
Fix this by using `Lazy<T>`, which is thread-safe, to initialize the fields. Fixes NuGet/Home#13545
(*) The reason is that code like `int RetryCount => _retryCount ??= 6;` gets turned into:
```csharp
int valueOrDefault = _retryCount.GetValueOrDefault();
if (!_retryCount.HasValue)
{
    valueOrDefault = 6;
    _retryCount = valueOrDefault;
    return valueOrDefault;
}
return valueOrDefault;
```
Suppose Thread A arrives first and calls `GetValueOrDefault()` (which returns 0 for an unset `int?`), but then Thread B interjects and sets the value. When Thread A resumes, `_retryCount.HasValue` is now true, so it skips the if block and returns `valueOrDefault`, i.e. 0.
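For followers who want to see the shape of the `Lazy<T>` approach, a minimal sketch follows. It is not the code from the PR: `RetrySettingsSketch`, `readOverride` and `DefaultRetryCount` are hypothetical names, and the default of 6 is taken from the `??= 6` example above.

```csharp
using System;
using System.Threading;

public sealed class RetrySettingsSketch
{
    // Default taken from the `??= 6` example; hypothetical constant name.
    private const int DefaultRetryCount = 6;

    // Lazy<int> replaces the nullable backing field. With
    // ExecutionAndPublication the factory runs at most once, and every
    // thread observes the same fully initialized value.
    private readonly Lazy<int> _retryCount;

    // readOverride is a hypothetical hook (e.g. reading a configured value);
    // returning null means "use the default".
    public RetrySettingsSketch(Func<int?> readOverride)
    {
        _retryCount = new Lazy<int>(
            () => readOverride() ?? DefaultRetryCount,
            LazyThreadSafetyMode.ExecutionAndPublication);
    }

    // No check-then-read on a shared nullable field, so a reader cannot
    // pick up the 0 from GetValueOrDefault() and return it.
    public int RetryCount => _retryCount.Value;
}
```

With this shape, `new RetrySettingsSketch(() => null).RetryCount` yields 6 no matter how many threads read it concurrently, because the read and the initialization are no longer two separate steps on the same field.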
Some more context for the followers of this issue:
NuGet Product Used
dotnet.exe
Product Version
6.11.0-rc.90
Worked before?
NuGet in dotnet sdk 9.0-preview4
Impact
It's more difficult to complete my work
Repro Steps & Context
Since we updated to a dotnet 9.0-preview6 SDK we started seeing RestoreTask failing randomly in the VMR build: dotnet/sdk#41477 (comment)
After the fix in NuGet/NuGet.Client#5845 we bumped again (i.e. we're now using a very recent nuget.client), and we now see these messages
or
I've attached some binlogs: binlogs.zip
It's happening for random packages and given we only started seeing this when bumping the dotnet SDK (and thus nuget) I think it's more likely a nuget bug than an issue with the AzDO feed.
Is there some way we can enable more logging?
/cc @nkolev92 @kartheekp-ms
Verbose Logs
No response