Temporary packet loss causes permanent node hang #59

Open
tv42 opened this issue Apr 4, 2013 · 1 comment

Comments

@tv42

tv42 commented Apr 4, 2013

With the firedrill 3-node setup, dropping packets for >5 seconds:

sudo iptables -I INPUT --proto udp --dport 8047 -j DROP; sleep 7; sudo iptables -D INPUT --proto udp --dport 8047 -j DROP

(whether I block packets in one direction or both doesn't seem to affect behavior)

causes one or more of the nodes to get kicked out of the cluster, but the victim doesn't realize this happened and just hangs. This is true even after network connectivity is restored.

Interestingly, temporarily blocking the node on port 8047 often causes a different node to get kicked. My latest run actually kicked the nodes on ports 8046 and 8048, thus turning a single-node temporary outage into a cluster failure (as the mailing list has told me, doozer doesn't recover from loss of quorum).

The kicked node never recovers, unless the process is restarted as a whole, but this might be related to #44.

@bernerdschaefer
Contributor

I can consistently reproduce this with the following script: https://gist.github.com/bernerdschaefer/5714719

After the script finishes, if you check the web UI you'll see that the node on 8048 never advances and is eventually kicked out of the cluster. Checking the logs, the node keeps logging lines like these indefinitely:

DOOZER 2013/06/05 15:29:50.607045 p.seqn=473 m.next=181
DOOZER 2013/06/05 15:29:50.617452 p.seqn=473 m.next=181
DOOZER 2013/06/05 15:29:50.627544 p.seqn=473 m.next=181
DOOZER 2013/06/05 15:29:50.637069 p.seqn=473 m.next=181
DOOZER 2013/06/05 15:29:50.647252 p.seqn=473 m.next=181

The problem is that during a partition, a node can fall far enough behind that it cannot be caught up from the history, but it does not attempt to recover the state (as it does at startup).
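
To make the failure mode concrete, here is a minimal sketch of the check that seems to be missing. It is not doozer's actual code: `classifyLag`, `lagState`, and the history window of 100 are hypothetical names and values chosen for illustration, and reading `p.seqn` as the sequence number seen from peers and `m.next` as the next sequence number the local node expects is an assumption based on the log lines above.

```go
package main

import "fmt"

// lagState describes how a lagging node could catch up (hypothetical type).
type lagState int

const (
	caughtUp           lagState = iota // nothing to do
	catchUpFromHistory                 // gap is still covered by peers' history
	needsSnapshot                      // gap exceeds history; full state recovery needed
)

func (s lagState) String() string {
	switch s {
	case caughtUp:
		return "caught up"
	case catchUpFromHistory:
		return "catch up from history"
	case needsSnapshot:
		return "needs snapshot"
	}
	return "unknown"
}

// classifyLag compares the next seqn the local node expects (next) with the
// seqn observed from peers (seqn). window is the assumed number of past
// entries that peers still retain.
func classifyLag(next, seqn, window int64) lagState {
	lag := seqn - next
	if lag <= 0 {
		return caughtUp
	}
	if lag <= window {
		return catchUpFromHistory
	}
	return needsSnapshot
}

func main() {
	// Values taken from the log lines above; the window of 100 is made up.
	fmt.Println(classifyLag(181, 473, 100))
}
```

With the values from the log (a gap of 292) and any window smaller than that, the node lands in the `needsSnapshot` case, which is the point where it would have to redo the state recovery it performs at startup instead of waiting for history that will never arrive.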
