Server trouble-shooting

Log files

Each server component (scheduler, feeder, transitioner, etc.) has its own log file. These files are in the log_HOSTNAME subdirectory of the project directory. Errors are logged as entries marked CRITICAL, for example:

make_work_mt.out:1601921:2022-08-16 20:36:53.1759 [CRITICAL] can't find wu wu_multi_thread_nodelete

A CRITICAL entry generally means that something that needs to happen isn't happening; you need to find the cause and fix it.
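As a starting point for finding such entries, here is a minimal Python sketch that scans every file in a log directory for the [CRITICAL] tag and prints matches in a file:line:text layout. The function names are illustrative only and are not part of BOINC; the sketch assumes each log entry is on one line, as in the sample above.

```python
from pathlib import Path

def critical_lines(lines):
    """Yield (line_number, text) for each line containing the [CRITICAL] tag."""
    for number, line in enumerate(lines, start=1):
        if "[CRITICAL]" in line:
            yield number, line.rstrip("\n")

def scan_log_dir(log_dir):
    """Yield 'file:line:text' strings for every CRITICAL entry under log_dir.

    log_dir is the project's log_HOSTNAME directory; only regular files
    directly inside it are read.
    """
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log holds odd bytes.
        with path.open(errors="replace") as handle:
            for number, text in critical_lines(handle):
                yield f"{path.name}:{number}:{text}"
```

To use it, loop over scan_log_dir("log_HOSTNAME") from the project directory, substituting your host name, and print each entry.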

In addition, log entries can describe normal events. To control the verbosity of the log files:

Scheduler: set the desired logging options.
File upload handler: set fuh_debug_level.
Daemons: pass the command-line argument "-d N" (1 = least verbose, 4 = most verbose). If you run server components with -d 4, their database queries are logged, which is useful for tracking down database-level problems.

If you're interested in the history of a particular job, grep for WU#12345 or RESULT#12345 (where 12345 represents the ID) in the log files. The html/ops pages also provide an interface for this.
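One caveat with a plain substring search is that WU#12345 also matches WU#123456. The Python sketch below filters log lines for one exact ID by refusing a match that is followed by another digit. The function names are hypothetical, and the sample lines in the usage note are invented for illustration rather than taken from a real log.

```python
import re

def job_pattern(kind, job_id):
    """Build a regex for 'WU#<id>' or 'RESULT#<id>' that rejects longer IDs."""
    if kind not in ("WU", "RESULT"):
        raise ValueError("kind must be 'WU' or 'RESULT'")
    # (?!\d) stops WU#12345 from matching inside WU#123456.
    return re.compile(rf"\b{kind}#{int(job_id)}(?!\d)")

def job_history(lines, kind, job_id):
    """Return the log lines that mention exactly this workunit or result."""
    pattern = job_pattern(kind, job_id)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

For example, job_history(open("transitioner.log"), "WU", 12345) would return the matching lines from a file named transitioner.log; that file name is an assumption, so substitute whichever log you are inspecting.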

Examining the database

The admin web interface lets you browse your project's database.

You can also use MySQL tools such as:

The mysql interpreter. The 'show processlist;' query is useful for diagnosing DB performance problems.
mytop: like 'top' for MySQL; shows running queries.
phpMyAdmin: general-purpose web interface to MySQL.

Examining shared memory

The command

bin/show_shmem

will print a textual summary of the contents of the shared-memory structure that caches jobs and information about applications.

Trouble-shooting the job pipeline

Are workunits (jobs) getting created correctly? Examine the database to see. If you're using a work generator, check its log file.
Are results (job instances) getting created? Examine the database to see. If you don't see results, check the transitioner log file.
Are jobs getting into shared memory? Use show_shmem (see above). You should see jobs. If not, check the feeder log file.
Is the scheduler sending jobs? If not, check its log file, preferably with the following log flags:
- <debug_version/>: show details of app version selection
- <debug_send/>: show details of job assignment
- <debug_quota/>: show details of quota enforcement
Are clients processing jobs correctly? Check the status and stderr output of completed jobs.
Are output files getting uploaded? Check the file upload handler log file.
Are jobs getting validated? Check the validator log file.
Are jobs getting assimilated? Check the assimilator log file.
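The checklist above is ordered: each stage depends on the ones before it, so the first stage that looks unhealthy is usually the place to start. The following Python sketch encodes that idea. The stage labels and the function name are made up for this illustration; only the "where to look" text comes from the list above.

```python
# Pipeline stages in order, each paired with where to look if it is failing.
PIPELINE = [
    ("workunits created", "database; work generator log file"),
    ("results created", "database; transitioner log file"),
    ("jobs in shared memory", "show_shmem; feeder log file"),
    ("scheduler sending jobs", "scheduler log file with debug_version, debug_send, debug_quota"),
    ("clients processing jobs", "status and stderr output of completed jobs"),
    ("output files uploaded", "file upload handler log file"),
    ("jobs validated", "validator log file"),
    ("jobs assimilated", "assimilator log file"),
]

def first_failure(observed):
    """Return (stage, where_to_look) for the earliest stage not marked healthy.

    observed maps a stage label to True (looks healthy) or False.
    Stages missing from observed count as not yet verified.
    Returns None when every stage is marked healthy.
    """
    for stage, where in PIPELINE:
        if not observed.get(stage, False):
            return stage, where
    return None
```

Treating unverified stages as failures is a deliberate choice here: it pushes you to confirm each earlier stage before spending time on a later one.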

Debugging the scheduler

If the scheduler is acting incorrectly or crashing, and you like mucking around in C++ source code, you can run it under a debugger like gdb. The scheduler is a CGI program; it reads a request from stdin and writes a reply to stdout. So you can debug it as follows:

Copy the "scheduler_request_X.xml" file from a client to the machine running the scheduler. (X = your project URL)
Run the scheduler under the debugger, giving it this file as stdin, i.e.:

gdb cgi
(set a breakpoint if desired)
r < scheduler_request_X.xml
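Because the scheduler only reads stdin and writes stdout, the same request file can also be replayed without a debugger, for instance to capture the reply or to check whether a crash is reproducible. This is a minimal Python sketch of that idea, not a BOINC tool: the function name is hypothetical, and it assumes the command is run from the directory where the scheduler expects to run.

```python
import subprocess

def run_with_request(command, request_path, cwd=None):
    """Run command with the request file as stdin and capture its output.

    command is an argument list such as ["./cgi"]; request_path is a saved
    scheduler_request_X.xml file. Returns a CompletedProcess whose stdout
    holds the reply bytes.
    """
    with open(request_path, "rb") as request:
        return subprocess.run(
            command,
            stdin=request,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            cwd=cwd,
        )
```

On POSIX systems a negative returncode on the result means the process was killed by a signal (for example, -11 for SIGSEGV), which is a quick way to confirm a crash before moving to gdb.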

You may have to doctor the database as follows to keep the scheduler from rejecting the request (N is the ID of the host that sent it; the host table's key column is id):

update host set rpc_seqno=0, rpc_time=0 where id=N;

As an alternative to this, edit sched/handle_request.cpp, and put a call to debug_sched("debug_sched"); just before sreply.write(fout, sreq);. Then, after recompiling, touch a file called 'debug_sched' in the project root directory. This will cause transcripts of all subsequent scheduler requests and replies to be written to the cgi-bin/ directory with separate small files for each request. The file names are sched_request_H_R and sched_reply_H_R where H=hostid and R=rpc sequence number. This can be turned off by deleting the 'debug_sched' file.
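With many hosts contacting the server, the cgi-bin/ directory can fill up with transcripts quickly. The Python sketch below groups transcript file names by host ID and RPC sequence number so that each request can be read next to its reply. It assumes H and R are plain decimal integers and that the names carry no extension; the function name is illustrative only.

```python
import re

# sched_request_H_R / sched_reply_H_R, with H = host ID and R = RPC sequence number.
TRANSCRIPT_NAME = re.compile(r"^sched_(request|reply)_(\d+)_(\d+)$")

def pair_transcripts(filenames):
    """Group transcript names by (hostid, rpc_seqno).

    Returns a dict mapping (hostid, rpc_seqno) to a dict with the keys
    'request' and/or 'reply'. Names that do not match are ignored.
    """
    pairs = {}
    for name in filenames:
        match = TRANSCRIPT_NAME.match(name)
        if match is None:
            continue
        kind = match.group(1)
        key = (int(match.group(2)), int(match.group(3)))
        pairs.setdefault(key, {})[kind] = name
    return pairs
```

Passing os.listdir("cgi-bin") from the project root and iterating over sorted(pairs) lists each host's exchanges in sequence order.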

To get core files for scheduler crashes, uncomment the following line in sched/sched_main.cpp, and recompile:

#define DUMP_CORE_ON_SEGV 1
