# Docker swarm troubleshooting
Run the command below on the docker swarm master node to find the cause of the issue:

```
docker service ps <service-name> --no-trunc
```

If you don't see any error above, you can check the logs of the service using the command below:

```
docker service logs <service-name> --tail 200
```
- Docker swarm issue: https://github.com/moby/moby/issues/34163
Remediation:
- SSH into the docker swarm agent where this error is reported
- Restart the docker service using:

```
sudo service docker restart
```
Exit code 137 (128 + 9) indicates the container was killed with signal 9 (SIGKILL). This can happen when the container is killed due to an OutOfMemory error.
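The 128 + signal exit-code convention can be verified locally without a swarm: a process killed with SIGKILL exits with status 137.

```shell
# A process killed with SIGKILL (signal 9) exits with status 128 + 9 = 137,
# which is the same code docker reports for an OOM-killed container.
sh -c 'kill -9 $$'
echo "exit code: $?"   # prints: exit code: 137
```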
- Docker swarm issue: https://github.com/moby/moby/issues/21083#issuecomment-239578836
Remediation:
- Confirm whether it is a memory issue by looking at the memory usage metrics for this service's container in the Grafana dashboard
- If it is a memory issue, increase the values of `reservation_memory` and `limit_memory` for this service in the deployment variables:

```
resources:
  reservations:
    memory: <new_value_for_reservation_memory>
  limits:
    memory: <new_value_for_limit_memory>
```

- Ensure your application heap size is around one third to one half of the memory reserved. Applications need memory beyond the heap for metadata and other resources
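As an illustration of the heap-to-limit guidance above, a Compose-style stack file for a JVM service might look like the sketch below. The service name, `JAVA_OPTS` variable, and concrete values are assumptions, not part of this runbook:

```
# Hypothetical example: 512 MiB heap against a 1 GiB limit (~1/2 ratio)
services:
  example-service:
    environment:
      JAVA_OPTS: "-Xmx512m"
    deploy:
      resources:
        reservations:
          memory: 768M
        limits:
          memory: 1024M
```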
This occurs when the container health check fails. Docker swarm stops the unhealthy container and launches a new container until it is healthy. Check the logs for this service in Kibana to understand the root cause, and fix the health check endpoint in the service.
Remediation:
- If the health check timeout is too small, increase the timeout accordingly in the deployment scripts
- If the health check is failing due to a failing upstream service without which this service can't work, fix the upstream service issue
- If the health check is failing due to a failing upstream service without which this service can work, ensure the health check endpoint doesn't fail because of the failing upstream. Set a timeout on external service calls so they don't block indefinitely
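A container health check is often a curl call; giving curl an explicit timeout keeps the probe from hanging on a slow upstream. A minimal sketch, assuming a `/health` endpoint on port 8080 (URL and timeout are assumptions, not from this runbook):

```shell
# Hypothetical health probe. --max-time bounds the whole request so a hung
# upstream cannot block the check indefinitely; --fail maps HTTP error
# responses to a non-zero exit code, which docker treats as unhealthy.
check_health() {
  curl --fail --silent --max-time "${2:-5}" "$1" >/dev/null 2>&1
}

check_health "http://localhost:8080/health" && echo healthy || echo unhealthy
```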
Docker worker node shows as down on executing `docker node ls`.

Remediation: SSH into the agent node shown as down and run:

```
sudo service docker restart
```

If you are unable to SSH into the server, you will have to restart the server from the Azure portal.
If the service restart doesn't resolve the issue, you need to execute commands to make this docker agent rejoin as a worker node. Follow the steps below:
- SSH into the docker swarm master, run `docker swarm join-token worker`, and copy the output, which looks like `docker swarm join --token <token> <master_address>`
- SSH into the docker swarm worker node and run the copied command:

```
docker swarm join --token <token> <master_address>
```
- If you get an error as shown below:

```
ops@swarmm-agentpublic-18950373000009:~$ docker swarm join --token SWMTKN-1-41nve2rhdm8rpa3dp93567zkaas0h94y807e6j7n8d8utzu35s-4385lo005rdp3qs395hc4ul3m 172.16.0.5:2377
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
ops@swarmm-agentpublic-18950373000009:~$ docker swarm leave
Error response from daemon: context deadline exceeded
```

it could be due to the issue below:
- Docker swarm issue: https://github.com/moby/moby/issues/25432#issuecomment-303414091
Remediation:
- SSH into the worker node and run:

```
sudo cp -r /var/lib/docker/swarm /tmp/swarm-backup
sudo service docker stop
sudo rm -rf /var/lib/docker/swarm
sudo service docker start
```

- SSH into the swarm master node and execute `docker node ls`. You will see two entries listed for the worker node, one of which shows as down. Copy the `ID` of the node shown as down and run `docker node rm <ID>`
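The worker-side reset steps can be sketched as a small script. The `DRY_RUN` guard is an addition here, not part of the runbook, so the destructive commands can be reviewed before actually running them:

```shell
#!/bin/sh
# Sketch of the swarm-state reset steps above. With DRY_RUN=1 (the default
# here) it only prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run sudo cp -r /var/lib/docker/swarm /tmp/swarm-backup
run sudo service docker stop
run sudo rm -rf /var/lib/docker/swarm
run sudo service docker start
```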
For the error:

```
Cannot contact <slave-name>:
hudson.remoting.RequestAbortedException:
java.nio.channels.ClosedChannelException
```

Remediation: Re-trigger the job.
Eliminate the possibility of a configuration issue before trying to remediate using the steps below:
- Docker Swarm issue: https://github.com/docker/swarm/issues/2161

Remediation:
- Restart the containers of the service by executing:

```
docker service scale <service-name>=0
docker service scale <service-name>=<expected-replication>
```
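The two scale commands can be wrapped in a single helper. The function name and `DRY_RUN` guard are assumptions added for illustration; with `DRY_RUN=1` (the default here) it only prints the commands:

```shell
# Sketch: restart a service's containers by scaling to 0 and back up.
bounce_service() {
  svc="$1"; replicas="$2"
  for target in 0 "$replicas"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: docker service scale $svc=$target"
    else
      docker service scale "$svc=$target"
    fi
  done
}

bounce_service "<service-name>" "<expected-replication>"
```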