single instance in cluster left standing #26

Open
mreiferson opened this issue Feb 28, 2012 · 5 comments

@mreiferson
Contributor

It seems that if a cluster of N nodes is stood up and all but one are taken down, the remaining node is no longer responsive, as demonstrated by the log output below (note: it contains additional debugging statements I added while poking around):

Reproducing this case should be relatively simple:

$ doozerd &                                                # first node
$ doozerd -l 127.0.0.1:8047 -w false -a 127.0.0.1:8046 &   # second node: listen on :8047, no web UI, attach to the first node
$ export SECONDPID=$!
$ echo -n | doozer add /ctl/cal/1                          # create a second CAL slot so the new node participates in consensus
$ kill -9 $SECONDPID                                       # hard-kill the second node

Now the remaining instance is stuck in a loop where it is unable to meet quorum and therefore unable to continue (nor will it respond to any state mutations).
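
For what it's worth, the arithmetic bears this out: a Paxos round needs agreement from a strict majority of the configured CAL members, so with a membership of 2 the lone survivor can never gather enough RSVPs/VOTEs by itself. A minimal sketch of that calculation (illustrative Go only, not doozerd's actual code):

package main

import "fmt"

// majority returns the smallest number of acceptors that constitutes a
// Paxos quorum: strictly more than half of the configured members.
func majority(members int) int {
	return members/2 + 1
}

func main() {
	for _, n := range []int{2, 3, 7} {
		fmt.Printf("members=%d majority=%d\n", n, majority(n))
	}
	// members=2 majority=2: one surviving node out of two can never win a round
}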

I'm still thinking through what I would expect the daemon to do in this case. At the very least I would expect it to respond, even if those responses were errors for most operations, allowing one to continue to administer the cluster and ideally bring it back to an operational state.

DOOZER 2012/02/27 23:14:48.799495 applying 443 TICK
DOOZER 2012/02/27 23:14:48.799504 applied m.tick 1
DOOZER 2012/02/27 23:14:48.799509 p.seqn=443 m.next=494
DOOZER 2012/02/27 23:14:49.278193 p.seqn=444 m.next=494
DOOZER 2012/02/27 23:14:49.278264 __DEBUG: acceptor.update cmd:INVITE seqn:444 crnd:2 
DOOZER 2012/02/27 23:14:49.278339 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:INVITE seqn:444 crnd:2  0
DOOZER 2012/02/27 23:14:49.278485 p.seqn=444 m.next=494
DOOZER 2012/02/27 23:14:49.278558 __DEBUG: acceptor.update cmd:RSVP seqn:444 crnd:2 vrnd:0 value:"" 
DOOZER 2012/02/27 23:14:49.278642 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:RSVP seqn:444 crnd:2 vrnd:0 value:""  1
DOOZER 2012/02/27 23:14:49.278669 p.seqn=444 m.next=494
DOOZER 2012/02/27 23:14:49.278715 __DEBUG: acceptor.update cmd:RSVP seqn:444 crnd:2 vrnd:0 value:"" 
DOOZER 2012/02/27 23:14:49.278778 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:RSVP seqn:444 crnd:2 vrnd:0 value:""  0
DOOZER 2012/02/27 23:14:49.278897 p.seqn=444 m.next=494
DOOZER 2012/02/27 23:14:49.278986 __DEBUG: acceptor.update cmd:NOMINATE seqn:444 crnd:2 value:"-1:/ctl/node/T6JCULE4RZN2CE6L/applied=443" 
DOOZER 2012/02/27 23:14:49.279087 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:NOMINATE seqn:444 crnd:2 value:"-1:/ctl/node/T6JCULE4RZN2CE6L/applied=443"  0
DOOZER 2012/02/27 23:14:49.279221 p.seqn=444 m.next=494
DOOZER 2012/02/27 23:14:49.279341 __DEBUG: acceptor.update cmd:VOTE seqn:444 vrnd:2 value:"-1:/ctl/node/T6JCULE4RZN2CE6L/applied=443" 
DOOZER 2012/02/27 23:14:49.279448 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:VOTE seqn:444 vrnd:2 value:"-1:/ctl/node/T6JCULE4RZN2CE6L/applied=443"  0
DOOZER 2012/02/27 23:14:49.279480 p.seqn=444 m.next=494
DOOZER 2012/02/27 23:14:49.279534 __DEBUG: acceptor.update cmd:VOTE seqn:444 vrnd:2 value:"-1:/ctl/node/T6JCULE4RZN2CE6L/applied=443" 
DOOZER 2012/02/27 23:14:49.279647 __DEBUG: learner.update &{2 2 2 map[-1:/ctl/node/T6JCULE4RZN2CE6L/applied=443:1] [true false]  false} cmd:VOTE seqn:444 vrnd:2 value:"-1:/ctl/node/T6JCULE4RZN2CE6L/applied=443"  1
DOOZER 2012/02/27 23:14:49.279667 learn seqn=444
DOOZER 2012/02/27 23:14:49.279873 event {444 /ctl/node/T6JCULE4RZN2CE6L/applied 443 444 -1:/ctl/node/T6JCULE4RZN2CE6L/applied=443 <nil> <node>}
DOOZER 2012/02/27 23:14:49.279956 del run 444
DOOZER 2012/02/27 23:14:49.280175 __DEBUG: isLeader T6JCULE4RZN2CE6L VBY7NDUVYVUBIJNF 0 0
DOOZER 2012/02/27 23:14:49.280219 __DEBUG: isLeader VBY7NDUVYVUBIJNF VBY7NDUVYVUBIJNF 0 1
DOOZER 2012/02/27 23:14:49.280236 add run 494
DOOZER 2012/02/27 23:14:49.280287 runs: ..................................................
DOOZER 2012/02/27 23:14:49.280296 avg tick delay: -1
DOOZER 2012/02/27 23:14:49.280304 avg fill delay: -1
DOOZER 2012/02/27 23:14:49.280320 p.seqn=444 m.next=495
DOOZER 2012/02/27 23:14:49.280335 p.seqn=444 m.next=495
DOOZER 2012/02/27 23:14:49.800186 prop &{445 [45 49 58 47 99 116 108 47 110 111 100 101 47 86 66 89 55 78 68 85 86 89 86 85 66 73 74 78 70 47 97 112 112 108 105 101 100 61 52 52 52]}
DOOZER 2012/02/27 23:14:49.800214 p.seqn=445 m.next=495
DOOZER 2012/02/27 23:14:49.800234 sched tick=1 seqn=445 t=197605
DOOZER 2012/02/27 23:14:49.800333 __DEBUG: acceptor.update cmd:PROPOSE seqn:445 value:"-1:/ctl/node/VBY7NDUVYVUBIJNF/applied=444" 
DOOZER 2012/02/27 23:14:49.800464 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:PROPOSE seqn:445 value:"-1:/ctl/node/VBY7NDUVYVUBIJNF/applied=444"  -1
DOOZER 2012/02/27 23:14:49.800548 p.seqn=445 m.next=495
DOOZER 2012/02/27 23:14:49.800643 __DEBUG: acceptor.update cmd:INVITE seqn:445 crnd:3 
DOOZER 2012/02/27 23:14:49.800715 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:INVITE seqn:445 crnd:3  1
DOOZER 2012/02/27 23:14:49.800845 p.seqn=445 m.next=495
DOOZER 2012/02/27 23:14:49.800905 __DEBUG: acceptor.update cmd:RSVP seqn:445 crnd:3 vrnd:0 value:"" 
DOOZER 2012/02/27 23:14:49.800981 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:RSVP seqn:445 crnd:3 vrnd:0 value:""  1
DOOZER 2012/02/27 23:14:49.810202 applying 445 TICK
DOOZER 2012/02/27 23:14:49.810221 applied m.tick 1
DOOZER 2012/02/27 23:14:49.810231 p.seqn=445 m.next=495
DOOZER 2012/02/27 23:14:49.810242 tick wasteful=false
DOOZER 2012/02/27 23:14:49.810261 sched tick=2 seqn=445 t=3929262
DOOZER 2012/02/27 23:14:49.810332 __DEBUG: acceptor.update cmd:TICK seqn:445 
DOOZER 2012/02/27 23:14:49.810443 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:TICK seqn:445  -1
DOOZER 2012/02/27 23:14:49.810513 p.seqn=445 m.next=495
DOOZER 2012/02/27 23:14:49.810588 __DEBUG: acceptor.update cmd:INVITE seqn:445 crnd:5 
DOOZER 2012/02/27 23:14:49.810637 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:INVITE seqn:445 crnd:5  1
DOOZER 2012/02/27 23:14:49.810749 p.seqn=445 m.next=495
DOOZER 2012/02/27 23:14:49.810814 __DEBUG: acceptor.update cmd:RSVP seqn:445 crnd:5 vrnd:0 value:"" 
DOOZER 2012/02/27 23:14:49.810865 __DEBUG: learner.update &{1 2 2 map[] [false false]  false} cmd:RSVP seqn:445 crnd:5 vrnd:0 value:""  1
DOOZER 2012/02/27 23:14:49.820312 applying 445 TICK
@mreiferson
Contributor Author

I think this succinctly describes what the expected behavior should be :)

https://cwiki.apache.org/confluence/display/ZOOKEEPER/FailureScenarios

@mreiferson
Contributor Author

More notes... it seems like there are two cases:

  1. The case where there are actual failures and those nodes are never coming back. I think that in this case one should operationally be able to issue some sort of "god" command that can mutate state without quorum. This would resolve the chicken-and-egg problem of wanting to add nodes to restore quorum while being unable to mutate state due to the lack of quorum (see the sketch after this list for one possible shape).
  2. The case where there was a network partition - no changes should need to be made here, as once the network issue is resolved the nodes should sync back up with identical IDs.
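
To make case 1 concrete, here is one possible shape for such a command, sketched in Go. None of these names exist in doozerd; node, forceApply, quorumLost, and the guard conditions are all assumptions about how an escape hatch like this could be gated:

package main

import (
	"errors"
	"fmt"
)

// node is a stand-in for a consensus participant with a locally applied store.
type node struct {
	store      map[string]string
	quorumLost bool // set when a majority of configured members is unreachable
}

// forceApply writes directly to the local store, skipping the Paxos proposer.
// It refuses to run unless quorum is actually gone and the operator has
// explicitly acknowledged that they are bypassing consensus.
func (n *node) forceApply(path, value string, operatorAck bool) error {
	if !operatorAck {
		return errors.New("refusing to bypass consensus without operator acknowledgement")
	}
	if !n.quorumLost {
		return errors.New("quorum is available; use the normal consensus path")
	}
	n.store[path] = value // no Paxos round; the operator takes responsibility
	return nil
}

func main() {
	n := &node{store: map[string]string{}, quorumLost: true}
	fmt.Println(n.forceApply("/ctl/cal/2", "", true)) // <nil>
}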

@bketelsen
Member

Is this still a valid issue?

@mreiferson
Contributor Author

Yep, but one possible resolution is that this is expected behavior (falling into read-only mode) and the "correct" way to resolve it is by re-bootstrapping the cluster.

@kr
Member

kr commented Mar 24, 2013

Seems like there are two parts to this. The first is making sure that any running doozerd process continues to be responsive regardless of what the other nodes do: whether or not it can get consensus, it should at least allow read operations and other things that don't need to go through Paxos. If that ever fails, it's simply a bug.

Then there's the question of what to do when a majority of nodes fail before new nodes can be added to replace them, causing the cluster to lose quorum. This isn't just when only one node is left, but any time there's no quorum; in other words, if you start with, say, 7 nodes, I'd expect to see this if you killed 4 of them and left 3 running. FWIW, I agree with the idea of letting an operator bypass Paxos to recover from this sort of situation. It was always on our mental todo list.
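
A rough sketch of that first split, for illustration only (Go; the server type and handler names are assumptions, not doozerd's internals):

package main

import (
	"errors"
	"fmt"
)

var errNoQuorum = errors.New("no quorum: writes cannot be committed")

// server separates operations answered from locally applied state from
// operations that must go through a Paxos round.
type server struct {
	data      map[string]string
	hasQuorum bool
}

// get reads locally applied state; it never waits on consensus, so it should
// keep working no matter what the other nodes do.
func (s *server) get(path string) (string, bool) {
	v, ok := s.data[path]
	return v, ok
}

// set requires consensus; without quorum it fails fast with an error instead
// of hanging, so clients and operators still get a response.
func (s *server) set(path, value string) error {
	if !s.hasQuorum {
		return errNoQuorum
	}
	s.data[path] = value // stand-in for a committed Paxos round
	return nil
}

func main() {
	s := &server{data: map[string]string{"/ctl/cal/0": "self"}, hasQuorum: false}
	fmt.Println(s.get("/ctl/cal/0")) // reads still answer
	fmt.Println(s.set("/foo", "x"))  // writes error out instead of stalling
}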
