
custody probe

Jeffrey Wear edited this page May 18, 2018 · 3 revisions

This document outlines potential architectures for fixing #2: reporting the status of child (webserver) processes run under Supervisor, in addition to the status of the parent (build) processes that Supervisor monitors directly.

v1 Goals

  1. custody should report the same process state and description as Supervisor in the following states:
    • the build process is stopped
    • the build process has cleanly exited
    • the build process has crashed
    • the build process is running and the child process is also running
  2. When the child process has crashed, custody should report the child process' state ("FATAL") and description (an error message).
  3. custody should immediately report the appropriate state on launch, even if the child process crashed prior to custody's launch.
  4. custody should live-update the child process' state just like it live-updates the build process' state.

Implementation ideas

Analyze process logs

A simple way to make the statuses more accurate might be for custody to tail process logs. If a message like "process crashed" or "error" appeared as the final message in the log, custody could consider the process to be in error. If another message appeared after that, custody could consider the process to be running again.
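As a rough illustration of this idea, the sketch below classifies a process from the last non-empty line of its log. The patterns and the function name `stateFromLog` are hypothetical; real services would each need their own patterns, which is part of the drawback discussed under Cons.

```javascript
// Hypothetical error patterns; each real service would likely need its own.
const ERROR_PATTERNS = [/process crashed/i, /\berror\b/i];

// Returns 'FATAL' if the final non-empty log line looks like an error,
// otherwise 'RUNNING'. Only the last line matters: any later message is
// taken to mean that the process has come back up.
function stateFromLog(logText) {
  const lines = logText.split('\n').filter((line) => line.trim() !== '');
  if (lines.length === 0) return 'RUNNING'; // nothing logged yet, so no error seen
  const lastLine = lines[lines.length - 1];
  return ERROR_PATTERNS.some((pattern) => pattern.test(lastLine)) ? 'FATAL' : 'RUNNING';
}
```

Note that a log ending in an unrelated build message after a crash would be classified as running, which is exactly the ambiguity this approach cannot resolve.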

Pros

We wouldn't have to make any changes to our services.

Cons

This could involve adding a lot of service-specific knowledge to custody, and I'm not sure how cleanly we could derive errors from the logs. This also presumes that the service logs a message when it crashes and logs another message when it revives. This assumption probably holds, but custody can't guarantee it. In particular, I'm not sure that custody could distinguish between a post-error message signifying that the service had come back up and a message from some other part of the build process.

Use IPC between the supervised process(es) and custody

nodemon knows when the webserver has crashed: both it and gulp-nodemon emit events when the process crashes. So if the gulp processes had a way to communicate with the custody process, they could simply tell custody when they were or weren't healthy. In fact, we could use this interprocess mechanism to communicate all sorts of status updates, like "BUILDING" or "RESTARTED".

And/or, we could have the actual webserver processes message custody. https://github.com/mixmaxhq/custody/tree/enhanced_statuses and https://github.com/mixmaxhq/contacts/tree/enhanced_process_status (Mixmax-internal link) prototype this.
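A minimal sketch of the build-process side might look like the following. It is not taken from the prototypes linked above. The event-to-state mapping, the message shape, and the `send` callback (which stands in for whatever IPC transport is chosen) are all assumptions, and a plain `EventEmitter` stands in for nodemon so that the example runs on its own.

```javascript
const { EventEmitter } = require('events');

// Hypothetical mapping from monitor event names to the states sent to custody.
const STATE_FOR_EVENT = { start: 'RUNNING', restart: 'RESTARTED', crash: 'FATAL' };

// Forwards state changes from `monitor` (an EventEmitter such as nodemon)
// to `send`, which is assumed to deliver the message to custody.
function reportStates(monitor, send) {
  for (const [event, state] of Object.entries(STATE_FOR_EVENT)) {
    monitor.on(event, () => send({ state }));
  }
}

// Demonstration with a stand-in emitter and an in-memory "transport".
const fakeMonitor = new EventEmitter();
const sent = [];
reportStates(fakeMonitor, (message) => sent.push(message));
fakeMonitor.emit('start');
fakeMonitor.emit('crash');
```

Keeping the transport behind a callback means the same reporting code could be reused whichever IPC mechanism is picked.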

Pros

This approach would be very clean and generic. If we instrumented the webserver processes themselves, we could even let custody know the precise exception with which the process crashed.

Cons

We would have to instrument the process(es) rather than just making changes to custody.

If we just instrumented the build process, it wouldn't know exactly why the child process crashed unless it (or custody) engaged in log parsing.

If we just instrumented the webserver process, it wouldn't be able to report on other parts of the build process, such as reporting "building" before the webserver started up.

Also, if we instrumented the webserver process, I'm not sure we could synchronously report the error before letting the process crash, although the prototype (Mixmax-internal link) appears to work.

Since IPC messages are ephemeral, custody would not immediately report the appropriate state on launch if the child process had crashed prior to custody's launch.

Have the supervised process(es) write to custody-specific logfiles

This approach involves the build process and/or webserver process writing status messages to a custody-specific per-service logfile (as opposed to the default console logfile). This would make it much easier for custody to analyze these logs, since every update to the logfile would be a meaningful message. The file would also not have to be human-readable, so the messages could be in a format that was easy for custody to parse.

It might be ok for the supervised processes to overwrite the contents of this file with every message rather than append to it since custody's only interested in the most recent state.

Update: this was implemented as #25 / custody-probe. The logfiles are indeed overwritten with each message. The logfiles are named as NAME_OF_PROGRAM.statefile, where NAME_OF_PROGRAM is the name of the Supervisor program to which the Node process belongs.

This does require the user to hardcode the program name into the server processes, which is slightly unfortunate. We explored a model where the statefiles were named PID.statefile and custody would look up the parent process ID to find the program to which to attribute the state. This broke down when the server crashed (and the process died), since custody, which reads the statefile asynchronously, could then no longer determine the parent process. Another downside of having a statefile per process rather than per program is that the statefile folder would have to be "garbage-collected" periodically to cull files corresponding to dead processes.
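The overwrite-per-message behavior can be sketched as follows. This is an illustration only, not custody-probe's actual API: the function name `writeState`, the directory argument, and the JSON payload shape are assumptions. Only the NAME_OF_PROGRAM.statefile naming and the overwriting come from the description above.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Overwrites (rather than appends to) the statefile for `programName`.
// The JSON payload shape is an assumption made for this sketch.
function writeState(directory, programName, state, description = '') {
  const statefile = path.join(directory, `${programName}.statefile`);
  fs.writeFileSync(statefile, JSON.stringify({ state, description }));
  return statefile;
}

// Demonstration in a temporary directory: the second write replaces the first.
const directory = fs.mkdtempSync(path.join(os.tmpdir(), 'custody-'));
writeState(directory, 'app', 'RUNNING');
const statefile = writeState(directory, 'app', 'FATAL', 'Error: boom');
const latest = JSON.parse(fs.readFileSync(statefile, 'utf8'));
```

Because each write replaces the whole file, a reader only ever needs to parse the current contents to learn the most recent state.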

Pros

This approach improves on the IPC model since:

  • the webserver process could synchronously write to this logfile before crashing
  • since the logfile would be persistent, custody could report the appropriate state on launch even if the child process had crashed prior to custody's launch
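The first point could look roughly like the sketch below, which records an uncaught exception with a synchronous write before exiting. The function name `reportCrashes` and the JSON payload are assumptions, and a stand-in process object is used in the demonstration so that nothing actually exits.

```javascript
const { EventEmitter } = require('events');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Records an uncaught exception synchronously, then exits with a failure code.
// `proc` defaults to the real process; a stand-in can be passed to try this out.
function reportCrashes(statefile, proc = process) {
  proc.on('uncaughtException', (error) => {
    const description = String((error && error.stack) || error);
    // writeFileSync does not return until the write call has completed,
    // so the state is recorded before the process exits.
    fs.writeFileSync(statefile, JSON.stringify({ state: 'FATAL', description }));
    proc.exit(1);
  });
}

// Demonstration with a stand-in process object.
const fakeProcess = Object.assign(new EventEmitter(), {
  exit(code) { this.exitCode = code; },
});
const crashDirectory = fs.mkdtempSync(path.join(os.tmpdir(), 'custody-'));
const crashFile = path.join(crashDirectory, 'app.statefile');
reportCrashes(crashFile, fakeProcess);
fakeProcess.emit('uncaughtException', new Error('boom'));
const recorded = JSON.parse(fs.readFileSync(crashFile, 'utf8'));
```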

Cons

As with the IPC model, we'd have to decide which process(es) were going to report the status (build and/or webserver).

custody would have to monitor for the appearance of these logfiles, then tail them.
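On the custody side, this might be sketched as below: read every statefile once at launch, then re-read whenever the directory changes. The function names and the JSON payload are assumptions, and files are assumed to be named NAME_OF_PROGRAM.statefile. `fs.watch` behavior varies somewhat by platform, so treat `watchStates` as a starting point rather than a robust implementation.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const SUFFIX = '.statefile';

// Reads every statefile in `directory`, returning { programName: parsedState }.
function readStates(directory) {
  const states = {};
  for (const filename of fs.readdirSync(directory)) {
    if (!filename.endsWith(SUFFIX)) continue;
    const programName = filename.slice(0, -SUFFIX.length);
    try {
      states[programName] = JSON.parse(fs.readFileSync(path.join(directory, filename), 'utf8'));
    } catch (error) {
      // The file may be mid-write or malformed; skip it until the next read.
    }
  }
  return states;
}

// Reports the persisted states immediately, then again whenever a file in the
// directory appears or changes. The caller should close() the returned watcher.
function watchStates(directory, onChange) {
  onChange(readStates(directory));
  return fs.watch(directory, () => onChange(readStates(directory)));
}

// Demonstration of the launch-time read in a temporary directory.
const demoDirectory = fs.mkdtempSync(path.join(os.tmpdir(), 'custody-'));
fs.writeFileSync(
  path.join(demoDirectory, 'app.statefile'),
  JSON.stringify({ state: 'FATAL', description: 'Error: boom' })
);
fs.writeFileSync(path.join(demoDirectory, 'notes.txt'), 'ignored');
const states = readStates(demoDirectory);
```

Re-reading the whole directory on every change is simpler than tailing individual files and is sufficient here because each statefile holds only the latest state.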