
Add Endpoints for Metrics and Probes #91

Open
Cardes opened this issue Jun 14, 2024 · 3 comments
Labels
enhancement New feature or request


Cardes commented Jun 14, 2024

Use Case:

  • As an Operator, I want the orchestrator to handle restarts if the container crashes; therefore I need readiness and liveness checks.
  • As an Operator, I also want to fetch metrics from the proxy so I can visualize its state and trigger alerts on issues.

Design Proposal:

Open Points:

  • Discuss whether it is feasible to add the endpoints to the existing ports or to add a dedicated port for both endpoints

Based on this discussion

Edit: Added Open Points

s-allius (Owner) commented:

Great idea. I also found an async version of the Prometheus client library.

I think I will start with the health check...
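For reference, the text format that Prometheus scrapes from a /metrics endpoint is simple enough to sketch without any client library. This is only an illustration of the output the endpoint would serve; the metric name and value here are hypothetical placeholders, not actual proxy metrics:

```python
def render_metrics(metrics: dict) -> str:
    """Render simple metrics in the Prometheus text exposition format.

    `metrics` maps a metric name to (help text, metric type, value).
    """
    lines = []
    for name, (help_text, mtype, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical example metric for the proxy
sample = {"proxy_messages_total": ("Messages handled by the proxy", "counter", 42)}
```

In practice the async Prometheus client would generate this output for you; the sketch just shows what a scrape response looks like.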

@s-allius s-allius added the enhancement New feature or request label Jun 14, 2024

s-allius commented Jun 15, 2024

I implemented a first readiness check. It verifies that a proper config file exists and that both servers (ports 5005 and 10000) are started.
If everything is fine, http://172.16.30.7/-/ready returns HTTP code 200 and the text "Is ready". If there is a problem in the config file, it returns HTTP code 503 and the text "Not ready". It is also possible that the HTTP endpoint isn't available at all on errors or during startup.
Does this behaviour fit k8s?
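Not the actual implementation, but the described behaviour could be sketched with just the Python standard library; `check_config` and `check_servers` are hypothetical hooks into the proxy's state:

```python
from http.server import BaseHTTPRequestHandler

def check_config() -> bool:
    # Hypothetical: replace with the proxy's real config validation.
    return True

def check_servers() -> bool:
    # Hypothetical: replace with a check that ports 5005 and 10000 are listening.
    return True

def ready_response(config_ok: bool, servers_up: bool) -> tuple:
    """200 'Is ready' only when the config is valid and both servers are up,
    otherwise 503 'Not ready', as described above."""
    if config_ok and servers_up:
        return 200, "Is ready"
    return 503, "Not ready"

class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/-/ready":
            self.send_error(404)
            return
        status, body = ready_response(check_config(), check_servers())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())
```

Keeping the decision in a small pure function like `ready_response` also makes the check easy to unit-test without spinning up a server.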


For the health check, I will evaluate the processing time of the messages. This should actually be quite simple and centralized.
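The processing-time idea could be centralized in a small helper like this sketch; the names and the 2-second threshold are hypothetical choices, not part of the implementation:

```python
from collections import deque

# Sliding window of the most recent message processing durations (seconds).
_recent = deque(maxlen=100)

def record_processing(duration: float) -> None:
    """Call this wherever a message finishes processing."""
    _recent.append(duration)

def is_healthy(threshold: float = 2.0) -> bool:
    """Healthy if no recent message took longer than the threshold.
    An empty window (e.g. right after startup) counts as healthy."""
    return all(d <= threshold for d in _recent)
```

The /-/healthy endpoint would then return 200 or 503 based on `is_healthy()`, mirroring the readiness endpoint.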


At the moment I use port 8127 for the HTTP server. Or should we use port 8080 to make clear that it is an HTTP server?
My idea behind using a non-standard port is that the user normally won't have to map it to another port; port 8080 is surely used by a lot of containers. What do you think about that?


Cardes commented Jun 16, 2024

This looks good. All HTTP codes from 200 up to (but not including) 400 are considered a success, so 503 will be recognized as failed.
For startup purposes it's common to define an initialDelaySeconds or a failureThreshold, so no answer / a timeout works fine while the container is starting up.
Example:

  livenessProbe:
    httpGet:
      path: /-/healthy
      port: 8127
    initialDelaySeconds: 12 # time before the first probe
    periodSeconds: 20 # time between probes
    timeoutSeconds: 2 # how long to wait before no answer counts as a timeout (useful for high-load or low-compute environments)
    successThreshold: 1 # how many consecutive successful probes are needed to consider the container healthy
    failureThreshold: 2 # how many consecutive failed probes trigger the unhealthy state
    terminationGracePeriodSeconds: 10 # how long to wait between the shutdown signal and a forced stop of the container

The port could be any number above 1024 that's not used for a common service, so 8127 works fine I think. Host networking is often used in smaller setups, and an uncommon port helps users avoid port collisions there. In every other environment, port mapping / distinct service IPs should prevent any collisions.

Thanks for your effort, let me know if I can help test anything.

Regards,
Sebastian
