
Durable orchestrations take a long time to start their activities #2813

Closed
well0549 opened this issue May 10, 2024 · 19 comments

@well0549

Description

I have 12 durable functions that process messages from a queue. The process is quite simple:

  • The orchestration is triggered by a message on a Service Bus queue.
  • Several activities then run in sequence (usually 3 or 4).
  • Finally, the message is sent to the next queue (one of the other 11 functions).

It all runs on the Consumption plan, .NET 8, isolated worker.
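
A rough sketch of this flow in the isolated worker model (the queue name, connection name, orchestrator name, and activity names below are placeholders, not my real functions):

using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public class CleanupFunctions
{
    // 1. A Service Bus message triggers creation of the orchestration.
    [Function("QueueStart")]
    public async Task QueueStart(
        [ServiceBusTrigger("incoming-queue", Connection = "ServiceBusConnection")] string message,
        [DurableClient] DurableTaskClient client)
    {
        await client.ScheduleNewOrchestrationInstanceAsync(nameof(RunOrchestrator), message);
    }

    // 2. The orchestrator runs a few activities in sequence.
    [Function(nameof(RunOrchestrator))]
    public async Task RunOrchestrator([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        string message = context.GetInput<string>()!;
        string step1 = await context.CallActivityAsync<string>("StepOne", message);
        string step2 = await context.CallActivityAsync<string>("StepTwo", step1);

        // 3. Finally, the result is handed to the next queue (one of the other apps).
        await context.CallActivityAsync("SendToNextQueue", step2);
    }

    [Function("StepOne")]
    public string StepOne([ActivityTrigger] string input) => input;

    [Function("StepTwo")]
    public string StepTwo([ActivityTrigger] string input) => input;

    [Function("SendToNextQueue")]
    public void SendToNextQueue([ActivityTrigger] string input)
    {
        // Placeholder: a ServiceBusSender would put the message on the next queue here.
    }
}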

The problem is that the messages get consumed by the queue trigger and put into the task hub Instances table with status Pending, and some of them stay in that state for about 8 to 10 minutes with nothing happening.

Expected behavior

If a message is put on a queue to be processed by a durable orchestration, the function app could be asleep, so it could take some time to pick up the message. But once the message is consumed from the queue, the orchestration should kick in immediately and start processing the activities.

Compared to the current .NET 6.0 solution (not isolated), the processing is very slow.

Actual behavior

The orchestration is created and a record is inserted into the task hub Instances table with status Pending. It can take up to 10 minutes before the instance is picked up for processing.

Known workarounds

none

App Details

<PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.22.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.18.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.3-preview1" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" Version="1.2.1" />

I also have a question about this here: #2808

But I think this is a bug, because it behaves so much worse than on .NET 6.0.

@well0549
Author

I had a look at various diagnostics, and I think it has to do with the algorithm that is in place when the function app scales down. I think this is different from standard (non-isolated) functions. When instances scale up, everything is fine; but when they scale down, I get a 5-minute pause in processing. This can also happen on .NET 6.0 with non-isolated functions, but the behaviour there is much smoother and I get far fewer 5-minute pauses.

It seems like the scale controller scales down while the instance is still picking up a lot of messages from the queue and putting them in table storage with status Pending. Then the pending instances have to wait five minutes before they are served. That would be logical for messages that are being processed (only then should there be a lock), but Pending means waiting to be processed. Maybe the underlying mechanism should release the locks on pending messages before killing the process?

When I put in a bunch of messages, about 60 percent of them finish fast and 30 percent take a long time to complete.

Maybe there is a KQL query I could run to get the information you need.

@well0549
Author

I am willing to give the developers who will work on this access to my account so they can trigger the functions and test for themselves.

@davidmrdavid
Contributor

Hi @well0549 - can you please confirm what DurableTask package versions you're using? It's interesting that you mentioned the scale controller, as I was just looking at another thread that made a connection between .NET isolated and scale controller issues (microsoft/durabletask-netherite#398) though that was in regards to failures. Still, trying to see if there's a connection, or if it's a coincidence.

On our end, what we usually need for an investigation is:

A time range in UTC when the issue was observed on Azure, the instance ID of the affected orchestrator, and the name of the app.

@well0549
Author

It's all C#, .NET 8.0, running on Windows Consumption.

		<FrameworkReference Include="Microsoft.AspNetCore.App" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.22.0" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.1.0" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" Version="1.2.1" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.18.0" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.0" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.3-preview1" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.ApplicationInsights" Version="1.2.0" />
   		<PackageReference Include="Microsoft.ApplicationInsights.WorkerService" Version="2.22.0" />

I am currently setting up Application Insights again and will provide the details soon.

@well0549
Author

well0549 commented May 23, 2024

One other thing I noticed is that the linked issue also assigns an instance ID itself in the call to ScheduleNewOrchestrationInstanceAsync, just like I do.
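
For example (a simplified snippet; the orchestrator name and ID scheme are placeholders, and it assumes the StartOrchestrationOptions overload on DurableTaskClient):

using System;
using System.Threading.Tasks;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public static class StartWithExplicitId
{
    // Sketch: the caller supplies the instance ID instead of letting the framework generate one.
    public static Task<string> StartAsync(DurableTaskClient client, string message)
    {
        string instanceId = Guid.NewGuid().ToString("N"); // placeholder ID scheme
        return client.ScheduleNewOrchestrationInstanceAsync(
            "CleanupOrchestrator",
            message,
            new StartOrchestrationOptions(InstanceId: instanceId));
    }
}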

@well0549
Author

well0549 commented May 23, 2024

@davidmrdavid -> Regarding your question about the DurableTask version: there is no reference anywhere to DurableTask. The references in my project are:
[screenshot of the project's package references]

@well0549
Author

OK, I have deployed everything again. I stopped the function apps, because the Application Insights logging fills up my subscription quite fast.

Orchestration Instance Id = 8e6ada22112f43108bc079f7e752dbfd

Below is an example of a flow that executed as expected:
[screenshot]
The message is written to the Service Bus, the orchestration is created, and almost immediately the activities start.

This is an example where the delay is visible:
[screenshot]

The message is written to the Service Bus, the trigger creates the orchestration, and then the first activity starts after 5 minutes.
The name of the function app is: eig-eigdemo-cleanup

@well0549
Author

Here is a sample of the timeline; the initial delay of 5 minutes is visible there:

[screenshot]

@well0549
Author

Start:
[screenshot]

Telling us it is starting execution:
[screenshot]

Then a 5-minute pause, and processing starts:
[screenshot]

@well0549
Author

well0549 commented May 23, 2024

This is also visible in the Storage account of the function:

[screenshot]

@well0549
Author

Some more info:

[screenshot]

And a happy flow:
[screenshot]

@well0549
Author

well0549 commented May 23, 2024

@davidmrdavid It seems the message is picked up twice at exactly the same moment (dequeue count = 2)?
Then it waits until the lock expires before processing starts?

@davidmrdavid
Contributor

Hi @well0549: I have yet to take a detailed look, but I want to follow up on this comment first: #2813 (comment)

You mentioned that you have no reference to any DurableTask packages, but in that screenshot, inside the SNFC folder, I do see a DurableTask package (version 1.2.0-rc.1). That looks like the Durable Functions / Durable Task worker extension package, which is exactly what I wanted to see :). As you can see from the version suffix (the 'rc'), you're using a preview package, not the GA package (which is 1.1.3). Are you able to reproduce this issue with package version 1.1.3?
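
Concretely, that would mean pinning the worker extension to the GA build in your project file, along the lines of:

<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.3" />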

@well0549
Author

OK, I downgraded to 1.1.3, cleaned up all the task hubs, and removed the storage tables and queues. What struck me is that with the preview version I would end up with 3 tables in the storage account.

[screenshot]

The last one shown is a table that I don't get with 1.1.3.

@well0549
Author

well0549 commented May 24, 2024

I did several test runs with 1.1.3. Performance is more stable.

I still see some 5-minute pauses, but I think this is on par with .NET 6 non-isolated. (I think it now only happens when the instances scale down.)

The performance is much better. The reason to go to the preview was to clean up the orchestration history, but that works with 1.1.3 as well.
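
(For reference, the kind of history cleanup I mean can be done from the client along these lines. This is only a rough sketch, assuming the purge API exposed by DurableTaskClient; the 30-day cutoff is just an example value.)

using System;
using System.Threading.Tasks;
using Microsoft.DurableTask.Client;

public static class HistoryCleanup
{
    // Sketch: purge terminal-state instances older than 30 days (example window).
    public static async Task<int> PurgeOldHistoryAsync(DurableTaskClient client)
    {
        var filter = new PurgeInstancesFilter(
            CreatedFrom: DateTimeOffset.MinValue,
            CreatedTo: DateTimeOffset.UtcNow.AddDays(-30),
            Statuses: new[]
            {
                OrchestrationRuntimeStatus.Completed,
                OrchestrationRuntimeStatus.Failed,
                OrchestrationRuntimeStatus.Terminated,
            });

        PurgeResult result = await client.PurgeAllInstancesAsync(filter);
        return result.PurgedInstanceCount;
    }
}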

@well0549
Author

I guess this is no longer relevant (as long as the preview is not released).

@well0549
Author

So, closed.

@nytian
Contributor

nytian commented Jun 5, 2024

Hey @well0549, I am working on verifying the performance of the .NET isolated worker with package 1.2.0-rc.1. I queried with the instance ID 8e6ada22112f43108bc079f7e752dbfd and the storage account eig-eigdemo-cleanup you provided, but I only found a few records, which is not enough for an investigation. I guess you might have deleted the history from your storage account, so I wonder: do you have some other examples I could use to identify the root cause? Thank you!

@well0549
Author

Unfortunately I have to clean up a lot, because my Visual Studio subscription fills up with all the storage costs (and all the other costs), so I regularly delete everything and recreate it. Since I went back to the non-preview package, everything works a lot more smoothly. I am currently running Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.1.4 and, besides the occasional 5-minute hiccup, I don't see the behaviour anymore that I saw on the RC. I am willing to switch back to the RC, but then I need to know when you will investigate, because with all the costs related to these tests I clean up regularly.
