
Set worker_processes to a static number #140

Open
wants to merge 1 commit into master

Conversation

dkeightley

On clusters with large nodes, the auto value for worker_processes can scale too high; combined with the thread count per worker (32), this can consume a large number of file handles.

4 was determined to be a reasonable default for this use case.

Related: rancher/rancher#27693
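
For context, the change amounts to pinning the worker_processes directive in the nginx.conf used by the nginx-proxy container instead of leaving it on auto. Shown here as an illustrative diff; the surrounding rke-tools template is not reproduced in this thread:

# /etc/nginx/nginx.conf inside the nginx-proxy container
-worker_processes auto;
+worker_processes 4;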

@cbron cbron requested review from superseb and kinarashah and removed request for cbron November 29, 2021 15:04
@superseb (Contributor) left a comment

The linked issue describes "allow user to configure", but this PR sets a new default for everyone. Why was it decided to change the default rather than make it configurable for the exception (which apparently is a node with a lot of (v)CPUs)? Why was 4 determined to be a reasonable default? And why won't this affect upgraded clusters that move from auto to 4?

@dkeightley (Author)

Hi @superseb, the issue is related (this isn't a direct fix), but these are the reasons I could see to set a default:

  • It's a proxy deployed by RKE that users are largely unaware of
  • It shouldn't need to be tuned like ingress-nginx, for example
  • It has a relatively consistent workload profile
  • It is a starting point and can be made configurable in the future

nginx is capable of handling many requests with a small number of workers when acting as a reverse proxy; for the purpose of proxying kubelet -> kube-apiserver traffic from a single node, 4 was determined to be adequate. Totally open to input here; the intention is not to choose a particular value, but to avoid nginx-proxy inadvertently consuming large amounts of PIDs, file handles, etc. without a way to avoid it.
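
As a quick way to see the footprint in question on a worker node, the checks below mirror the ones used later in this thread; the file-descriptor count is an extra, rough check that assumes a busybox-style shell in the container and an exec user allowed to read /proc:

# configured value and number of spawned workers
docker exec nginx-proxy grep worker_processes /etc/nginx/nginx.conf
docker exec nginx-proxy ps aux | grep -c "nginx: worker process"
# rough count of file descriptors held by all nginx processes
docker exec nginx-proxy sh -c 'for p in $(pidof nginx); do ls /proc/$p/fd; done | wc -l'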

@dkeightley dkeightley requested a review from superseb July 19, 2022 00:21
@superseb (Contributor)

@dkeightley Given the issue (which has not seen any activity so far), it is about limiting nginx-proxy worker_processes so it doesn't configure 100 processes when 100 (v)CPUs/cores are found. If we set it to 4, what are the performance implications on 2 and 4 core machines?

@superseb (Contributor) left a comment

See ^^. And rebase and push so the build can complete.

@dkeightley (Author)

Thanks @superseb, the performance impact should be minimal; on small machines this is only a modest increase in worker processes, compared to the large reduction on nodes with high core counts.

If you think it's a better fit for nginx-proxy, worker_processes could also be set lower (1-2): there are typically only 1-3 control plane nodes, and the connectivity is used primarily by the kubelet, so the use case is relatively static and could be handled by a single worker.

@superseb (Contributor)

I guess you are right, but it's very simple to test. Has this change been tested, or is it based on assumption/guessing?

The only downside to setting it statically is that we are changing it for all installs at once with no way to change it back to the old behavior.

Please rebase instead of adding a merge commit (or, as it is now, squash the commits).

@dkeightley (Author)

Testing was performed using derekdemo/rke-tools:worker_processes, an image built from commit 8816bdd.

...
Successfully tagged rancher/rke-tools:8816bdd

Tested on a node with 2 CPUs:

# cat /proc/cpuinfo | grep processor | wc -l
2

Worker node with default rke-tools image:

# docker ps | grep nginx-proxy
ca6a69f06d84   rancher/rke-tools:v0.1.80             "nginx-proxy CP_HOST…"   30 minutes ago   Up 30 minutes             nginx-proxy
# docker exec nginx-proxy ps aux | grep worker
   12 nginx     0:00 nginx: worker process
   13 nginx     0:00 nginx: worker process
# docker exec nginx-proxy grep worker_processes /etc/nginx/nginx.conf
worker_processes auto;

Worker node with updated image:

# docker ps | grep nginx-proxy
9d8ebc974a85   derekdemo/rke-tools:worker_processes   "nginx-proxy CP_HOST…"   20 seconds ago   Up 18 seconds             nginx-proxy
# docker exec nginx-proxy ps aux | grep worker
   13 nginx     0:00 nginx: worker process
   14 nginx     0:00 nginx: worker process
   15 nginx     0:00 nginx: worker process
   16 nginx     0:00 nginx: worker process
# docker exec nginx-proxy grep worker_processes /etc/nginx/nginx.conf
worker_processes 4;

No issues were observed in the kubelet logs with the new container image, and the node was active/Ready from the Kubernetes perspective.

@superseb (Contributor) left a comment

This will at least need some basic load testing, but I'm going to assume that the downside (having too many workers) is worse than the upside.

@superseb superseb requested review from a team and removed request for kinarashah August 30, 2022 09:59
@superseb (Contributor) left a comment

I'm fine with the change as it seems to resolve an issue for support. I don't think the testing was sufficient, but if QA is going to test it in bigger clusters/with more load, it should be okay.

@a-blender

Good to merge with:

  • Test template added to the issue with what's already been tested
  • @sowmyav27 QA should do basic load balancer testing and tests with a bigger cluster/larger workload to see the performance implications of worker_processes set to 4.

@dkeightley (Author)

Thanks @annablender, the testing so far has been with a drop-in replacement of the proposed rke-tools container to confirm:

A load test would be worthwhile; however, worker_processes is not expected to have ever been a performance bottleneck, given the small use case and the fact that typical node sizes already create around 4 workers (auto detects the CPU count and sets 1 worker per CPU).

Just for clarity, in this scenario rke-tools is used only to run the nginx-proxy container. This provides a reverse proxy (nginx) on each worker node [1], listening on 127.0.0.1:6443 and load balancing requests back to the control plane nodes (proxy_pass).

              worker node              ||       control plane node
kubelet -> nginx-proxy (proxy_pass)   ----->    kube-apiserver

The change only affects the kubelet's connectivity to the kube-apiserver.

Setting worker_processes to 4 exceeds the typical number of control plane nodes (2-3). Even with a spare worker, each worker can process many requests; in this case, 1024 simultaneous connections per worker (up to ~4096 in total):

# docker exec -it nginx-proxy grep worker /etc/nginx/nginx.conf
worker_processes 4;
  worker_connections 1024;

[1] Nodes that have both the controlplane and worker roles, or all roles, don't run an nginx-proxy container; the kubelet instead connects on 127.0.0.1:6443 directly to the kube-apiserver container (which binds to 6443/TCP) on the node it resides on.
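
For readers unfamiliar with the setup, a minimal sketch of what such a proxy configuration could look like, assuming a TCP stream proxy as the proxy_pass and 127.0.0.1:6443 details above suggest; the upstream addresses are placeholders, not the actual rke-tools template:

worker_processes 4;
events {
  worker_connections 1024;
}
stream {
  upstream kube_apiserver {
    server 172.16.0.10:6443;  # control plane node 1 (placeholder address)
    server 172.16.0.11:6443;  # control plane node 2 (placeholder address)
  }
  server {
    # local endpoint the kubelet connects to on each worker node
    listen 127.0.0.1:6443;
    proxy_pass kube_apiserver;
  }
}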

@dkeightley (Author)

Anything else we need to do to move forward with QA?
