
Durable orchestrations take a long time to start their activities #2813

Closed
well0549 opened this issue May 10, 2024 · 19 comments

@well0549

Description

I have 12 durable functions that process messages from a queue. The process is quite simple:

  • The orchestration is triggered by a message on a Service Bus queue.
  • Several activities then run in sequence (usually 3 or 4).
  • Finally, the message is sent to the next queue (one of the other 11 functions).

It all runs on the Consumption plan, .NET 8, isolated worker.
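
A rough sketch of this flow in the isolated worker model (the queue name, connection name, orchestrator name, and activity names below are placeholders, not my real functions):

using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public class CleanupFunctions
{
    // 1. A Service Bus message triggers creation of the orchestration.
    [Function("QueueStart")]
    public async Task QueueStart(
        [ServiceBusTrigger("incoming-queue", Connection = "ServiceBusConnection")] string message,
        [DurableClient] DurableTaskClient client)
    {
        await client.ScheduleNewOrchestrationInstanceAsync(nameof(RunOrchestrator), message);
    }

    // 2. The orchestrator runs a few activities in sequence.
    [Function(nameof(RunOrchestrator))]
    public async Task RunOrchestrator([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        string message = context.GetInput<string>()!;
        string step1 = await context.CallActivityAsync<string>("StepOne", message);
        string step2 = await context.CallActivityAsync<string>("StepTwo", step1);

        // 3. Finally, the result is handed to the next queue (one of the other apps).
        await context.CallActivityAsync("SendToNextQueue", step2);
    }

    [Function("StepOne")]
    public string StepOne([ActivityTrigger] string input) => input;

    [Function("StepTwo")]
    public string StepTwo([ActivityTrigger] string input) => input;

    [Function("SendToNextQueue")]
    public void SendToNextQueue([ActivityTrigger] string input)
    {
        // Placeholder: a ServiceBusSender would put the message on the next queue here.
    }
}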

The problem is that the messages get consumed by the queue trigger and put into the task hub Instances table with status Pending, and some of them stay in that state for about 8 to 10 minutes with nothing happening.

Expected behavior

If a message is put on a queue to be processed by a durable orchestration, the function app could be asleep, so it could take some time to pick up the message. But once the message is consumed from the queue, the orchestration should kick in immediately and start processing the activities.

Compared to the current .NET 6.0 solution (not isolated), the processing is very slow.

Actual behavior

The orchestration is created and a record is inserted into the task hub Instances table with status Pending. It can take up to 10 minutes before the instance is picked up for processing.

Known workarounds

none

App Details

<PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.22.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.18.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.3-preview1" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" Version="1.2.1" />

I also have a question about this here: #2808

But I think this is a bug, because it behaves so much worse than on .NET 6.0.

@well0549
Author

I had a look at various diagnostics, and I think it has to do with the algorithm that is in place when the function app scales down. I think this is different from standard (non-isolated) functions. When instances scale up, everything is fine; but when they scale down, I get a 5-minute pause in processing. This can also happen on .NET 6.0 with non-isolated functions, but the behaviour there is much smoother and I get far fewer 5-minute pauses.

It seems like the scale controller scales down while the instance is still picking up a lot of messages from the queue and putting them in table storage with status Pending. Then the pending instances have to wait five minutes before they are served. That would be logical for messages that are being processed (only then should there be a lock), but Pending means waiting to be processed. Maybe the underlying mechanism should release the locks on pending messages before killing the process?

When I put in a bunch of messages, about 60 percent of them finish fast and 30 percent take a long time to complete.

Maybe there is a KQL query I could run to get the information you need.

@well0549
Author

I am willing to give the developers who will work on this access to my account so they can trigger the functions and test for themselves.

@davidmrdavid
Contributor

Hi @well0549 - can you please confirm what DurableTask package versions you're using? It's interesting that you mentioned the scale controller, as I was just looking at another thread that made a connection between .NET isolated and scale controller issues (microsoft/durabletask-netherite#398) though that was in regards to failures. Still, trying to see if there's a connection, or if it's a coincidence.

On our end, what we usually need for an investigation is:

A time range in UTC when the issue was observed on Azure, the instance ID of the affected orchestrator, and the name of the app.

@well0549
Author

It's all C#, .NET 8.0, running on Windows Consumption.

		<FrameworkReference Include="Microsoft.AspNetCore.App" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.22.0" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.1.0" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" Version="1.2.1" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.18.0" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.0" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.3-preview1" />
		<PackageReference Include="Microsoft.Azure.Functions.Worker.ApplicationInsights" Version="1.2.0" />
   		<PackageReference Include="Microsoft.ApplicationInsights.WorkerService" Version="2.22.0" />

I am currently setting up Application Insights again and will provide the details soon.

@well0549
Author

well0549 commented May 23, 2024

One other thing I noticed is that the linked issue also assigns an instance ID itself in the call to ScheduleNewOrchestrationInstanceAsync, just like I do.
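
For example (a simplified snippet; the orchestrator name and ID scheme are placeholders, and it assumes the StartOrchestrationOptions overload on DurableTaskClient):

using System;
using System.Threading.Tasks;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public static class StartWithExplicitId
{
    // Sketch: the caller supplies the instance ID instead of letting the framework generate one.
    public static Task<string> StartAsync(DurableTaskClient client, string message)
    {
        string instanceId = Guid.NewGuid().ToString("N"); // placeholder ID scheme
        return client.ScheduleNewOrchestrationInstanceAsync(
            "CleanupOrchestrator",
            message,
            new StartOrchestrationOptions(InstanceId: instanceId));
    }
}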

@well0549
Author

well0549 commented May 23, 2024

@davidmrdavid -> Regarding your question about the DurableTask version: there is no reference anywhere to DurableTask. The references in my project are:
[screenshot of the project's package references]

@well0549
Author

OK, I have deployed everything again. I stopped the function apps, because the Application Insights logging fills up my subscription quite fast.

Orchestration Instance Id = 8e6ada22112f43108bc079f7e752dbfd

Below is an example of a flow that executed as expected:
[screenshot]
The message is written to the Service Bus, the orchestration is created, and almost immediately the activities start.

This is an example where the delay is visible:
[screenshot]

The message is written to the Service Bus, the trigger creates the orchestration, and then the first activity starts after 5 minutes.
The name of the function app is: eig-eigdemo-cleanup

@well0549
Author

Here is a sample of the timeline; the initial delay of 5 minutes is visible there:

[screenshot]

@well0549
Author

Start:
[screenshot]

Telling us it is starting execution:
[screenshot]

Then a 5-minute pause, and processing starts:
[screenshot]

@well0549
Author

well0549 commented May 23, 2024

This is also visible in the Storage account of the function:

[screenshot]

@well0549
Author

Some more info:

[screenshot]

And a happy flow:
[screenshot]

@well0549
Author

well0549 commented May 23, 2024

@davidmrdavid It seems the message is picked up twice at exactly the same moment (dequeue count = 2)?
Then it waits until the lock expires before processing starts?

@davidmrdavid
Contributor

Hi @well0549: I have yet to take a detailed look, but I want to follow up on this comment first: #2813 (comment)

You mentioned that you have no reference to any DurableTask packages, but in that screenshot, inside the SNFC folder, I do see a DurableTask package (version 1.2.0-rc.1). That looks like the Durable Functions / Durable Task worker extension package, which is exactly what I wanted to see :). As you can see from the version suffix (the 'rc'), you're using a preview package, not the GA package (which is 1.1.3). Are you able to reproduce this issue with package version 1.1.3?
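
Concretely, that would mean pinning the worker extension to the GA build in your project file, along the lines of:

<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.3" />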

@well0549
Author

OK, I downgraded to 1.1.3, cleaned up all the task hubs, and removed the storage tables and queues. What struck me is that with the preview version I would end up with 3 tables in the storage account.

[screenshot]

The last one shown is a table that I don't get with 1.1.3.

@well0549
Author

well0549 commented May 24, 2024

I did several test runs with 1.1.3. Performance is more stable.

I still see some 5-minute pauses, but I think this is on par with .NET 6 non-isolated. (I think it now only happens when the instances scale down.)

The performance is much better. The reason to go to the preview was to clean up the orchestration history, but that works with 1.1.3 as well.
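
(For reference, the kind of history cleanup I mean can be done from the client along these lines. This is only a rough sketch, assuming the purge API exposed by DurableTaskClient; the 30-day cutoff is just an example value.)

using System;
using System.Threading.Tasks;
using Microsoft.DurableTask.Client;

public static class HistoryCleanup
{
    // Sketch: purge terminal-state instances older than 30 days (example window).
    public static async Task<int> PurgeOldHistoryAsync(DurableTaskClient client)
    {
        var filter = new PurgeInstancesFilter(
            CreatedFrom: DateTimeOffset.MinValue,
            CreatedTo: DateTimeOffset.UtcNow.AddDays(-30),
            Statuses: new[]
            {
                OrchestrationRuntimeStatus.Completed,
                OrchestrationRuntimeStatus.Failed,
                OrchestrationRuntimeStatus.Terminated,
            });

        PurgeResult result = await client.PurgeAllInstancesAsync(filter);
        return result.PurgedInstanceCount;
    }
}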

@well0549
Author

I guess this is no longer relevant (as long as the preview is not released).

@well0549
Author

So, closed.

@nytian
Contributor

nytian commented Jun 5, 2024

Hey @well0549, I am working on verifying the performance of the .NET isolated worker with package 1.2.0-rc.1. I queried with the instance ID 8e6ada22112f43108bc079f7e752dbfd and the storage account eig-eigdemo-cleanup you provided, but I only found a few records, which is not enough for an investigation. I guess you might have deleted the history from your storage account, so I wonder: do you have some other examples I could use to identify the root cause? Thank you!

@well0549
Author

Unfortunately I have to clean up a lot, because my Visual Studio subscription fills up with all the storage costs (and all the other costs), so I regularly delete everything and recreate it. Since I went back to the non-preview package, everything works a lot more smoothly. I am currently running Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.1.4 and, besides the occasional 5-minute hiccup, I don't see the behaviour anymore that I saw on the RC. I am willing to switch back to the RC, but then I need to know when you will investigate, because with all the costs related to these tests I clean up regularly.
