ScheduleNewOrchestrationInstanceAsync fails when scale controller scaling in #398
The relevant host.json settings are:
where "AzureWebJobsStorage" is a connection string to a Azure storage account: and "EventHubsConnection" is a connection string to an EventHub, as follows: |
More info: logs show that the error is raised in response to the host instance being shut down by the Scale Controller, but it appears that the orchestration actually started OK.

15/05/2024, 5:05:56.497 pm  Scheduling new IngestionsCreateAsyncOrchestrator orchestration with instance ID 'ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873' and 3140 bytes of input data.
15/05/2024, 5:05:56.504 pm  ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873: Function 'IngestionsCreateAsyncOrchestrator (Orchestrator)' scheduled. Reason: NewInstance. IsReplay: False. State: Scheduled. RuntimeStatus: Pending. HubName: SeuratAPIDurableTasks. AppName: func-srt-dev-aue. SlotName: Production. ExtensionVersion: 2.13.2. SequenceNumber: 21.
15/05/2024, 5:05:56.550 pm  An instance was removed because all functions are either idle or seeing a steady decrease in load.
15/05/2024, 5:05:56.551 pm  ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873: Function 'IngestionsCreateAsyncOrchestrator (Orchestrator)' started. IsReplay: False. Input: (12560 bytes). State: Started. RuntimeStatus: Running. HubName: SeuratAPIDurableTasks. AppName: func-srt-dev-aue. SlotName: Production. ExtensionVersion: 2.13.2. SequenceNumber: 110. TaskEventId: -1
15/05/2024, 5:05:56.551 pm  Executing 'Functions.IngestionsCreateAsyncOrchestrator' (Reason='(null)', Id=21415206-bfc1-446e-8e0c-287674404a50)
15/05/2024, 5:05:56.572 pm  DrainMode mode enabled
15/05/2024, 5:05:56.888 pm  Exception: Grpc.Core.RpcException: Status(StatusCode="Unknown", Detail="Exception was thrown by handler.")
15/05/2024, 5:05:56.888 pm  Failed to start IngestionsCreateAsyncOrchestrator with orchestrationInstanceId=ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873
This seems to me to be similar to #2454, as explained in this comment.
Hi @jason-daly: can you share what Durable Functions packages you're using? As suggested in the thread you linked, that interaction between the scale controller and DF should have been fixed. You also mentioned that you encountered this error since moving to Netherite 1.5.1. What were you using before?
Also, I think the thread you linked mostly discussed orchestrators being marked as failed when the scale controller removes a VM from the app. In this case, it appears to me that the client function is failing, not the orchestrator. Do I have that right? If so, does your client function have a cancellation token that you can inspect during these shutdowns? I suspect the cancellation token has been "set"/fired.
Ah, yes, you are correct. The exception occurs in the call to ScheduleNewOrchestrationInstanceAsync on the DurableTaskClient instance. In the latest edit of this code, a CancellationToken was supplied, and I still encountered the RpcException, although the underlying error logged by the 'framework' changed to:
Prior to Netherite, we had been using the default storage provider, Azure Storage, and had not encountered these failures. Here's the package list:
BTW, I have backed out Netherite now and gone back to Azure Storage. If you are able to figure out the issue and get it resolved, I can give Netherite another try.
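For reference, a minimal sketch of what passing the client function's cancellation token into the scheduling call looks like in the .NET isolated worker model. This is an illustration only, not the code from this app: it assumes a Functions worker SDK that exposes FunctionContext.CancellationToken, the variable names (client, input, executionContext) are placeholders, and only the orchestrator name is taken from the logs above.

// Illustrative sketch, not the actual app code from this issue.
// Assumes the .NET isolated worker model with the Durable Functions client
// (Microsoft.Azure.Functions.Worker.Extensions.DurableTask) injected via [DurableClient].
// The FunctionContext cancellation token is signalled when the host is shutting down (e.g. drain mode).
string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
    "IngestionsCreateAsyncOrchestrator",   // orchestrator name (implicitly converted to TaskName)
    input,                                 // serializable input payload
    executionContext.CancellationToken);   // lets the call observe host shutdown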
@jason-daly: are you able to provide me with your app name and the time in UTC when this issue occurred in Azure? I'd need that to be able to debug this further.
Sure. The function app name is func-srt-dev-aue (in the Australia East region); logs are in the Application Insights instance 'appi-srt-dev-aue'. The most recent occurrence was at 2024-05-16T02:58:06.6515245Z. To be honest, I kind of gave up on this already and reverted Netherite from our code. When I did this I also cleaned up the resources that it used, so this might hamper your troubleshooting efforts a bit...
Thanks. No worries, we don't need your app to be active for us to review our telemetry; it is historical, not live :)
I've recently started using Netherite (Microsoft.Azure.Functions.Worker.Extensions.DurableTask.Netherite, version 1.5.1) and have been seeing a few errors when starting orchestrations.
Orchestrations start fine most of the time, but occasionally I see the following error:
When starting an orchestration, as follows:
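As a rough illustration only (placeholder function names, not the reporter's actual code), a typical HTTP-triggered starter (client) function that schedules an orchestration in the .NET isolated worker model, following the standard Durable Functions quickstart pattern, looks like this:

using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;
using Microsoft.Extensions.Logging;

public static class IngestionsHttpStart
{
    // Sketch of a typical starter (client) function; names are placeholders.
    [Function("Ingestions_HttpStart")]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        [DurableClient] DurableTaskClient client,
        FunctionContext executionContext)
    {
        ILogger logger = executionContext.GetLogger("Ingestions_HttpStart");

        // Schedule the orchestrator; the return value is the new orchestration instance ID.
        string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
            "IngestionsCreateAsyncOrchestrator");

        logger.LogInformation("Started orchestration with ID = '{InstanceId}'.", instanceId);

        // Return a 202 Accepted response containing status-query URLs for the new instance.
        return await client.CreateCheckStatusResponseAsync(req, instanceId);
    }
}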
This exception is thrown: