ScheduleNewOrchestrationInstanceAsync fails when scale controller scaling in #398
The relevant host.json settings are:
where "AzureWebJobsStorage" is a connection string to a Azure storage account: and "EventHubsConnection" is a connection string to an EventHub, as follows: |
More info: logs show that the error is raised in response to the host instance being shut down by the Scale Controller, but it appears that the orchestration actually started OK.

15/05/2024, 5:05:56.497 pm  Scheduling new IngestionsCreateAsyncOrchestrator orchestration with instance ID 'ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873' and 3140 bytes of input data.
15/05/2024, 5:05:56.504 pm  ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873: Function 'IngestionsCreateAsyncOrchestrator (Orchestrator)' scheduled. Reason: NewInstance. IsReplay: False. State: Scheduled. RuntimeStatus: Pending. HubName: SeuratAPIDurableTasks. AppName: func-srt-dev-aue. SlotName: Production. ExtensionVersion: 2.13.2. SequenceNumber: 21.
15/05/2024, 5:05:56.550 pm  An instance was removed because all functions are either idle or seeing a steady decrease in load.
15/05/2024, 5:05:56.551 pm  ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873: Function 'IngestionsCreateAsyncOrchestrator (Orchestrator)' started. IsReplay: False. Input: (12560 bytes). State: Started. RuntimeStatus: Running. HubName: SeuratAPIDurableTasks. AppName: func-srt-dev-aue. SlotName: Production. ExtensionVersion: 2.13.2. SequenceNumber: 110. TaskEventId: -1
15/05/2024, 5:05:56.551 pm  Executing 'Functions.IngestionsCreateAsyncOrchestrator' (Reason='(null)', Id=21415206-bfc1-446e-8e0c-287674404a50)
15/05/2024, 5:05:56.572 pm  DrainMode mode enabled
15/05/2024, 5:05:56.888 pm  Exception: Grpc.Core.RpcException: Status(StatusCode="Unknown", Detail="Exception was thrown by handler.")
15/05/2024, 5:05:56.888 pm  Failed to start IngestionsCreateAsyncOrchestrator with orchestrationInstanceId=ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873
This seems to me to be similar to #2454, as explained in this comment.
Hi @jason-daly: can you share what Durable Functions packages you're using? As suggested in the thread you linked, that interaction between the scale controller and DF should have been fixed. You also mentioned that you encountered this error since moving to Netherite 1.5.1. What were you using before?
Also, I think the thread you linked mostly discussed orchestrators being marked as failed when the scale controller removes a VM from the app. In this case, it appears to me that the client function is failing, not the orchestrator. Do I have that right? If so, does your client function have a cancellation token that you can inspect during these shutdowns? I suspect the cancellation token has been "set"/fired.
Ah, yes, you are correct. The exception occurs in the call to ScheduleNewOrchestrationInstanceAsync on the DurableTaskClient instance. In the latest edit of this code, a CancellationToken was supplied, and I still encountered the RpcException, although the underlying error logged by the 'framework' changed to:
Prior to Netherite, we had been using the default storage provider, Azure Storage, and had not encountered these failures. Here's the package list:
BTW, I have backed out Netherite now and gone back to Azure Storage. If you are able to figure out the issue and get it resolved, I can give Netherite another try.
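For reference, a minimal sketch of what passing the client function's cancellation token into the scheduling call looks like in the .NET isolated worker model. This is an illustration only, not the code from this app: it assumes a Functions worker SDK that exposes FunctionContext.CancellationToken, the variable names (client, input, executionContext) are placeholders, and only the orchestrator name is taken from the logs above.

// Illustrative sketch, not the actual app code from this issue.
// Assumes the .NET isolated worker model with the Durable Functions client
// (Microsoft.Azure.Functions.Worker.Extensions.DurableTask) injected via [DurableClient].
// The FunctionContext cancellation token is signalled when the host is shutting down (e.g. drain mode).
string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
    "IngestionsCreateAsyncOrchestrator",   // orchestrator name (implicitly converted to TaskName)
    input,                                 // serializable input payload
    executionContext.CancellationToken);   // lets the call observe host shutdown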
@jason-daly: are you able to provide me with your app name and the time in UTC when this issue occurred in Azure? I'd need that to be able to debug this further.
Sure. The function app name is func-srt-dev-aue (in the Australia East region); logs are in the Application Insights instance 'appi-srt-dev-aue'. The most recent occurrence was at 2024-05-16T02:58:06.6515245Z. To be honest, I kind of gave up on this already and reverted Netherite from our code. When I did this I also cleaned up the resources that it used, so this might hamper your troubleshooting efforts a bit...
Thanks. No worries, we don't need your app to be active for us to review our telemetry; it is historical, not live :)
I've recently started using Netherite (Microsoft.Azure.Functions.Worker.Extensions.DurableTask.Netherite, version 1.5.1) and have been seeing a few errors when starting orchestrations.
Orchestrations start fine most of the time, but occasionally I see the following error:
When starting an orchestration, as follows:
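As a rough illustration only (placeholder function names, not the reporter's actual code), a typical HTTP-triggered starter (client) function that schedules an orchestration in the .NET isolated worker model, following the standard Durable Functions quickstart pattern, looks like this:

using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;
using Microsoft.Extensions.Logging;

public static class IngestionsHttpStart
{
    // Sketch of a typical starter (client) function; names are placeholders.
    [Function("Ingestions_HttpStart")]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        [DurableClient] DurableTaskClient client,
        FunctionContext executionContext)
    {
        ILogger logger = executionContext.GetLogger("Ingestions_HttpStart");

        // Schedule the orchestrator; the return value is the new orchestration instance ID.
        string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
            "IngestionsCreateAsyncOrchestrator");

        logger.LogInformation("Started orchestration with ID = '{InstanceId}'.", instanceId);

        // Return a 202 Accepted response containing status-query URLs for the new instance.
        return await client.CreateCheckStatusResponseAsync(req, instanceId);
    }
}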
This exception is thrown: