fix: avoid resources lock contention utilizing channel #629

Open

mpelekh wants to merge 1 commit into master from event-processing-performance-improvement
Conversation

@mpelekh commented Oct 4, 2024

Problem statement is in argoproj/argo-cd#8172 (comment)

IterateHierarchyV2 significantly improved performance and got us ~90% of the way there, but on huge clusters we still have significant lock contention.

The fix in this pull request approaches the problem differently - it avoids lock contention by utilizing a channel to process events from the cluster.

More details are in the comments.

@mpelekh (Author) commented Oct 4, 2024

The issue

In large clusters where Argo CD monitors numerous resources, the processing of watches becomes significantly slow—in our case (total k8s resources in cluster: ~400k, Pods: ~76k, ReplicaSets: ~52k), taking around 10 minutes. As a result, the Argo CD UI displays outdated information, impacting several features reliant on sync waves, like PruneLast. Eventually, the sheer volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.

To address this, we disabled the tracking of Pods and ReplicaSets, although this compromises one of the main benefits of the Argo CD UI. We also filtered out irrelevant events and tried to optimize various settings in the application controller. However, vertical scaling of the application controller had no effect, and horizontal scaling is not an option for a single cluster due to sharding limitations.

Issue causes

During the investigation, we found that the problem lies in the following:

  • resource lock contention
  • slow performance of IterateHierarchy

Patched v2.10.9

v2.10.9 was patched with the following commits.

Though the patches significantly improve performance, Argo CD still cannot handle the load from large clusters.

The screenshots below show one of our largest clusters running the v2.10.9 build patched with the commits above.

  • till 12:50, pods and replica sets are disabled from tracking
  • from 12:50 to 13:34, pods and replica sets are enabled to be tracked
  • after 13:34, pods and replica sets are disabled from tracking

As can be seen, once Pods and ReplicaSets are enabled for tracking, the cluster event count falls close to zero and reconciliation time increases drastically.

[Screenshots: cluster event count and reconciliation time metrics, 2024-08-09 20:40 and 20:51]

Number of pods in cluster: ~76k
Number of rs in cluster: ~52k

A more detailed comparison of different patched versions is added to this comment - argoproj/argo-cd#8172 (comment)

The potential reason is lock contention.

A few more metrics were added here, and it turned out that when the number of events is significant, acquiring the lock can sometimes take ~5 minutes, which delays reconciliation.
mpelekh@560ef00#diff-9c9e197d543705f08c9b1bc2dc404a55506cfc2935a988e6007d248257aadb1aR1372
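For illustration, the instrumentation essentially amounts to timing how long a goroutine waits for the write lock. This is a minimal, self-contained sketch; the helper name and log output are invented here, and the actual change is in the commit linked above.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// timedLock measures how long the caller waits before it can take the lock.
// Illustrative only: in the real change this timing wraps the cluster cache
// write lock acquired by every watch-handler goroutine.
func timedLock(mu *sync.Mutex, name string) {
	start := time.Now()
	mu.Lock()
	// Under heavy event load this wait was observed to reach several minutes.
	log.Printf("%s write lock acquired after %s", name, time.Since(start))
}

func main() {
	var mu sync.Mutex
	timedLock(&mu, "clusterCache")
	defer mu.Unlock()
	// ... apply the watch event to the cache here ...
}
```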

[Screenshot: lock acquisition wait time metrics, 2024-08-09 21:11]

The fix suggested in #602 to optimize lock usage did not resolve the issue in large clusters.

Avoid resource lock contention by utilizing a channel

Since we still have significant lock contention in massive clusters and the approaches above did not resolve the issue, another approach was considered; it is part of this PR.

When every goroutine must acquire the write lock, we cannot handle more than one event at a time. What if we introduce a channel to which all received events are sent, and a single goroutine is responsible for processing events from that channel in batches? The locking then moves from each watch goroutine into the one goroutine that processes events from the channel. With only one place where the write lock is acquired, the lock contention goes away; a sketch of the idea follows.
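A minimal sketch of this single-writer pattern is below. It is not the exact PR code: the eventMeta struct and eventMetaCh field match names used in this PR, while applyEvent and the trimmed clusterCache struct are placeholders for illustration.

```go
package cache

import (
	"sync"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/watch"
)

// eventMeta pairs a watch event with its decoded object.
type eventMeta struct {
	event watch.Event
	un    *unstructured.Unstructured
}

// clusterCache is reduced to the fields relevant to the sketch.
type clusterCache struct {
	lock        sync.RWMutex
	eventMetaCh chan eventMeta
	// ... resource maps, namespace index, handlers, etc.
}

// processEvents is started once (go c.processEvents()) and is the only
// goroutine that takes the write lock for watch updates.
func (c *clusterCache) processEvents() {
	for {
		first, ok := <-c.eventMetaCh
		if !ok {
			return // channel closed, e.g. cache invalidated
		}
		batch := []eventMeta{first}
		// Drain whatever is already buffered so a burst of events is applied
		// under a single lock acquisition instead of one acquisition per event.
	drain:
		for {
			select {
			case em, ok := <-c.eventMetaCh:
				if !ok {
					break drain
				}
				batch = append(batch, em)
			default:
				break drain
			}
		}
		c.lock.Lock()
		for _, em := range batch {
			c.applyEvent(em) // hypothetical per-event cache update
		}
		c.lock.Unlock()
	}
}

func (c *clusterCache) applyEvent(em eventMeta) {
	_ = em // update the resource tree, notify handlers, etc. (omitted)
}
```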

The fix results

[Screenshot: cluster event count metrics after deploying the fix]

As can be seen from metrics, once the fixed version was deployed and Node, ReplicaSets, and Pods were enabled for tracking, the number of cluster events was stable and didn’t go down.

Conclusions

  • The fix shows significant performance improvements. We left Nodes, ReplicaSets, and Pods enabled for tracking on large clusters.
  • The Argo CD UI works smoothly.
  • The original issue has been resolved: users can manage Pods and ReplicaSets on large clusters.

@mpelekh marked this pull request as ready for review on October 4, 2024 08:11
@crenshaw-dev (Member):

Your analysis is excruciatingly thorough, I love it! I've posted it to SIG Scalability, and we'll start analyzing ASAP. Please be patient, it'll take us a while to give it a really thorough review.

@@ -864,6 +910,8 @@ func (c *clusterCache) sync() error {
return err
}

go c.processEvents()
@crenshaw-dev (Member):

I see this in the sync function comment:

// When this function exits, the cluster cache is up to date, and the appropriate resources are being watched for
// changes.

If I understand this change correctly (and the associated test changes), by processing these events in a goroutine, we're breaking the guarantee that sync will completely update the cluster cache. Is that correct?

@mpelekh (Author):

These changes do not break the guarantee that sync will completely update the cluster cache.
The sync function populates the cluster cache when it runs:

c.setNode(c.newResource(un))

The processEvents goroutine processes future events, which are received by the watchEvents goroutine:

go c.watchEvents(ctx, api, resClient, ns, resourceVersion)

The watchEvents goroutine watches for events from the Kubernetes resource types. Once an event is received, it is processed:

case event, ok := <-w.ResultChan():

The event is sent to the channel, which is read by the processEvents goroutine, where processing is done in bulk:

c.eventMetaCh <- eventMeta{event, un}
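Using the eventMeta and clusterCache types from the sketch in the PR description, the producer side might look roughly like this (forwardWatchEvents is an invented name; in the PR this logic lives inside watchEvents):

```go
package cache

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/watch"
)

// forwardWatchEvents sketches the producer side: events received from the
// Kubernetes watch are no longer applied under the write lock here; they are
// only forwarded to eventMetaCh, where the single processEvents goroutine
// picks them up.
func (c *clusterCache) forwardWatchEvents(ctx context.Context, w watch.Interface) {
	defer w.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case event, ok := <-w.ResultChan():
			if !ok {
				return // watch closed; the caller re-establishes it
			}
			un, ok := event.Object.(*unstructured.Unstructured)
			if !ok {
				continue // ignore objects we cannot interpret
			}
			// Previously each goroutine locked the cache here; now it only sends.
			c.eventMetaCh <- eventMeta{event: event, un: un}
		}
	}
}
```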

@crenshaw-dev (Member):

@mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?

@crenshaw-dev (Member):

Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?

Signed-off-by: Mykola Pelekh <mpelekh@demonware.net>
@mpelekh force-pushed the event-processing-performance-improvement branch from db0f61d to 2f5160b on October 10, 2024 at 14:02

@mpelekh (Author) commented Oct 10, 2024

@mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?

@crenshaw-dev Yes, I’d be happy to join the SIG Scalability meeting to discuss the changes. Please let me know the time and details or if there’s anything specific I should prepare in advance.

@crenshaw-dev (Member):

Great! The event is on the Argoproj calendar, and we coordinate in CNCF Slack. The next meeting is two Wednesdays from now at 8am eastern time.

No need to prepare anything really, just be prepared to answer questions about the PR. :-)

@mpelekh (Author) commented Oct 10, 2024

Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?

@crenshaw-dev Sure. Here it is - argoproj/argo-cd#20329.
