
Docker swarm troubleshooting


1. Service not starting or getting killed often

Run the command below on the docker swarm master to find the cause of the issue

docker service ps <service-name> --no-trunc

If you don't see any error above, you can check the logs of the service using the command below

docker service logs <service-name> --tail 200
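If the failure is intermittent, it can help to watch the logs live while the task restarts; both flags below are standard docker service logs options:

docker service logs <service-name> --follow --since 10m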

Known Issue 1.1: Starting container failed: Address already in use
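This typically means another process on the node is already listening on the published port. A quick check, sketched below (run it on the node where the task was scheduled; the port is a placeholder):

sudo ss -ltnp | grep <published-port>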

Known Issue 1.2: Exited (137)

Exit code 137 (128 + 9) means the container was terminated by SIGKILL (signal 9). This commonly happens when the container is killed due to an OutOfMemory error.
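To confirm whether the container was OOM-killed, inspect its state on the node where it ran; State.OOMKilled and State.ExitCode are standard docker inspect fields, and the container ID is a placeholder:

docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container-id>

If this prints true 137, the kernel OOM killer terminated the container.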

  • Docker swarm issue: https://github.com/moby/moby/issues/21083#issuecomment-239578836
  • Remediation:
    • Confirm whether it is a memory issue by looking at the memory usage metrics for this service's container in the Grafana dashboard
    • If it is a memory issue, increase the values of the reservation_memory and limit_memory deployment variables for this service
    resources:
      reservations:
        memory: <new_value_for_reservation_memory>
      limits:
        memory: <new_value_for_limit_memory>
    
    • Ensure the application heap size is around 1/3rd to 1/2 of the memory reserved; applications need memory beyond the heap for metadata and other resources (see the sizing sketch after this list)
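For reference, in a compose v3 stack file these settings sit under the service's deploy key. A minimal sketch, where the service name, image, and memory values are placeholders filled from the deployment variables:

    services:
      my-service:
        image: my-org/my-service:latest   # hypothetical image
        deploy:
          resources:
            reservations:
              memory: 512M    # <new_value_for_reservation_memory>
            limits:
              memory: 1024M   # <new_value_for_limit_memory>

Following the heap guideline above, a JVM service with a 1024M limit would run with roughly -Xmx512m.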

Known Issue 1.3: task: non-zero exit (137): dockerexec: unhealthy container

This occurs when the container health check fails. Docker swarm stops the unhealthy container and launches a new one until it becomes healthy. Check the logs for this service in Kibana to understand the root cause, and fix the health check endpoint in the service

  • Remediation:
    • If the health check timeout is too small, increase the timeout accordingly in the deployment scripts (see the sketch after this list)
    • If the health check is failing due to a failing upstream without which this service can't work, fix the upstream service issue
    • If the health check is failing due to a failing upstream without which this service can work, ensure the health check endpoint doesn't fail because of the failing upstream. Use a timeout for external service calls so the health check doesn't block indefinitely
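For reference, a health check with an explicit timeout in a compose v3 stack file looks like the sketch below; the image, endpoint, port, and values are placeholders, not this repository's actual configuration:

    my-service:
      image: my-org/my-service:latest
      healthcheck:
        test: ["CMD", "curl", "-f", "http://localhost:8080/health"]   # hypothetical endpoint
        interval: 30s
        timeout: 10s     # increase this if the check times out under load
        retries: 3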

2. Docker swarm worker node is down

A docker worker node shows as Down when executing docker node ls.

Remediation: SSH into the agent node shown as down and run sudo service docker restart

If you are unable to SSH into the server, restart the server from the Azure portal
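If you prefer a CLI over the portal, the Azure CLI can restart the VM as well (resource group and VM name are placeholders):

az vm restart --resource-group <resource-group> --name <vm-name>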

If the service restart doesn't resolve the issue, you need to make this docker agent rejoin the swarm as a worker node. Follow the steps below

  • SSH into the docker swarm master, run docker swarm join-token worker, and copy the output, which looks like docker swarm join --token <token> <master_address>
  • SSH into the docker swarm worker node and run the command copied above: docker swarm join --token <token> <master_address>
  • If you get an error as shown below
ops@swarmm-agentpublic-18950373000009:~$ docker swarm join --token SWMTKN-1-41nve2rhdm8rpa3dp93567zkaas0h94y807e6j7n8d8utzu35s-4385lo005rdp3qs395hc4ul3m 172.16.0.5:2377
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
ops@swarmm-agentpublic-18950373000009:~$ docker swarm leave
Error response from daemon: context deadline exceeded

It could be due to the issue below

Known Issue 2.1: Unable to join swarm

  • Docker swarm issue: https://github.com/moby/moby/issues/25432#issuecomment-303414091
  • Remediation:
    • SSH into the worker node and run
    sudo cp -r /var/lib/docker/swarm /tmp/swarm-backup   # back up the swarm state first
    sudo service docker stop
    sudo rm -rf /var/lib/docker/swarm                    # remove the stale swarm state
    sudo service docker start
    
    • SSH into the swarm master node and execute docker node ls. You will see two entries listed for the worker node, one of which shows as Down. Copy the ID of the node shown as Down and run docker node rm <ID>
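If you'd rather script this cleanup, docker node ls supports a --format template, so the stale entry can be removed in one line. A sketch; note that it removes every node currently shown as Down, so verify the list before running it:

docker node rm $(docker node ls --format '{{.ID}} {{.Status}}' | awk '$2 == "Down" {print $1}')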

3. Jenkins job fails while connecting to the Jenkins slave

For the error

Cannot contact <slave-name>: 
hudson.remoting.RequestAbortedException:
java.nio.channels.ClosedChannelException

Remediation: Re-trigger the job

4. Containers within the same network are unable to communicate

Eliminate the possibility of a configuration issue (a quick connectivity check is sketched below) before trying to remediate using the steps that follow
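To rule out a configuration issue, exec into one of the containers and check that the other service's name resolves via the overlay network's built-in DNS. A sketch, assuming the image ships a shell and these tools, with names and port as placeholders:

docker exec -it <container-id> sh
nslookup <other-service-name>                    # should resolve to the service VIP
wget -qO- http://<other-service-name>:<port>/    # confirms traffic actually flows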

  • Docker Swarm Issue: https://github.com/docker/swarm/issues/2161
  • Remediation:
    • Restart the containers of the service by executing
     docker service scale <service-name>=0
     docker service scale <service-name>=<expected-replica-count>
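Alternatively, docker service update --force (available since Docker 1.13) redeploys all tasks of the service without scaling it to zero, which avoids downtime when multiple replicas are running:

docker service update --force <service-name>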
    