Question: behavior for large messages and/or histories #198

Open
sebastianburckhardt opened this issue Nov 3, 2022 · 4 comments
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)

Comments

@sebastianburckhardt
Member

Netherite does not impose any hard limits on the size of messages or histories. But of course, the question remains what happens as messages or histories get very large, i.e. what breaks first. I created this issue to track discussion, testing, and documentation around this question.

Some thoughts on this:

  • Everything in the system (i.e. not only the specific orchestrations that contain large messages or histories) is likely to slow down substantially when storage bandwidth or inter-partition bandwidth (e.g. Event Hubs throughput) becomes a bottleneck. The system should keep working under such circumstances but may be too slow to serve its intended purpose.

  • All in-flight messages are kept in memory (in the outboxes and the session buffers on the source and destination partitions, respectively), so we may run out of memory when large messages accumulate faster than they are processed.

  • Netherite keeps the in-memory instance cache within the configured cache limits. If workers need to process histories or messages that exceed the memory limits of the cache, the result is thrashing, which makes the system perform very poorly. It is therefore important to increase the cache size when trying to handle such situations (a host.json sketch follows below).

  • Page blobs have a maximum size of 1 TB, and all of the data in the task hub partition, including all instance states, histories, and in-flight messages, has to fit into that. Also, the FASTER log can be quite a bit larger than the data it stores because it may contain multiple versions of orchestration states. FASTER does run compaction periodically, but it remains to be determined what expansion factors are typical; I would guess something like 3x.

A lot of this is just guesswork; we need experiments to validate these statements.
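
As a concrete illustration of the cache-size point above, here is a minimal host.json sketch for raising the Netherite instance cache limit. This is only a sketch: the setting name InstanceCacheSizeMB and the 4096 MB value are assumptions, so check the Netherite settings reference for the exact names and defaults.

```json
{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection",
        "InstanceCacheSizeMB": 4096
      }
    }
  }
}
```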

@sebastianburckhardt added the documentation and question labels on Nov 3, 2022
@ericleigh007

It is worth noting that v1.4.0 now has a much more efficient algorithm for handling large messages, as I'm sure @sebastianburckhardt will attest.

@gha-zund
Copy link

Does the Event Hubs message size limit of 1 MB play a role for large inputs (entity states)?
In our application, it's not uncommon for the state to grow close to (or even beyond) 1 MB in size...

@ericleigh007

@sebastianburckhardt is correct in saying that large messages (the data that is serialized between the event hubs and the orchestrators and activities) certainly do affect latency.

In my tests with an actual application (not a benchmark, but real-world work), we trigger off changes in Cosmos DB, but for expediency we then have to merge several documents together, so our messages can balloon to many megabytes.
We then have a choice: either slow down because of the Event Hubs transit time, or go faster by using the blob "lookaside" storage, which in turn increases the load on the Durable Functions Netherite storage account that contains the task hub.

We also have some experience with querying large histories, and there we have found that status and purge-history queries can take a good deal of time. To combat this, we had to move such queries outside the critical time windows in our system. We were also concerned about what a large status, history, or purge query would do to the utilization of our task hub storage account, and we had some scant evidence suggesting that these operations interfered with the "real-time" work of running orchestrators and activities.
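
To illustrate the "outside the critical window" approach, here is a minimal sketch of how one might run the purge from a timer-triggered function during a quiet hour. It assumes the in-process Durable Functions C# model; the schedule and the 14-day retention window are just examples.

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class NightlyPurge
{
    // Runs at 03:00 UTC, outside our critical processing window (schedule is an example).
    [FunctionName("NightlyPurge")]
    public static async Task Run(
        [TimerTrigger("0 0 3 * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger log)
    {
        // Purge completed, failed, and terminated instances older than 14 days.
        var result = await client.PurgeInstanceHistoryAsync(
            DateTime.MinValue,                 // createdTimeFrom
            DateTime.UtcNow.AddDays(-14),      // createdTimeTo
            new[]                              // runtimeStatus
            {
                OrchestrationStatus.Completed,
                OrchestrationStatus.Failed,
                OrchestrationStatus.Terminated
            });

        log.LogInformation("Purged {Count} instance histories", result.InstancesDeleted);
    }
}
```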

@cticevans

What's the easiest way to see the size of a history? And is there any notion of what "large" means: KB, MB, GB?
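
Not an authoritative answer, but one rough way to inspect an individual instance's history size from a client function is to fetch the status with the history included and measure the serialized payload. A sketch assuming the in-process C# SDK (the route and the HistorySizeProbe name are made up for illustration):

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Newtonsoft.Json;

public static class HistorySizeProbe
{
    [FunctionName("HistorySizeProbe")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = "history-size/{instanceId}")] HttpRequest req,
        string instanceId,
        [DurableClient] IDurableOrchestrationClient client)
    {
        var status = await client.GetStatusAsync(
            instanceId, showHistory: true, showHistoryOutput: true, showInput: true);
        if (status == null)
        {
            return new NotFoundResult();
        }

        // status.History is a JArray of history events; its serialized length is a
        // rough lower bound on the stored history size (before any compression).
        int approxBytes = status.History?.ToString(Formatting.None).Length ?? 0;
        return new OkObjectResult(new { instanceId, approxBytes });
    }
}
```

This measures a single instance; a fleet-wide view would require aggregating over a status query, which, as noted above, is itself an expensive operation on large task hubs.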
