
Add support for raft repl dev replace member. #546

Merged
merged 1 commit into from
Sep 25, 2024

Conversation

sanebay
Contributor

@sanebay sanebay commented Sep 11, 2024

When replacing a member, add the new member, sync the raft log for the replace, and finally remove the old member. Once we add the new member, baseline or incremental resync will start. Removing the old member causes nuraft_mesg to exit the group, and we periodically GC the destroyed group. Made the repl dev base test common so that both test files can use it. Tests by default create a repl group with num_replicas. Dynamic tests create additional spare replicas which can be added to the test dynamically by calling replace member.
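
A minimal sketch of the three steps described above (add the new member, commit a HS_CTRL_REPLACE log entry, remove the old member). MsgManager, Listener and the uuid alias are simplified stand-ins, not the actual HomeStore/nuraft_mesg API.

#include <iostream>
#include <string>

using uuid = std::string;

struct MsgManager {
    bool add_member(const std::string& group, const uuid& in) {
        std::cout << "add_member " << in << " to " << group << "\n";
        return true;
    }
    bool rem_member(const std::string& group, const uuid& out) {
        std::cout << "rem_member " << out << " from " << group << "\n";
        return true;
    }
};

struct Listener {
    void replace_member(const uuid& out, const uuid& in) {
        std::cout << "commit HS_CTRL_REPLACE out=" << out << " in=" << in << "\n";
    }
};

bool replace_member(MsgManager& msg_mgr, Listener& listener, const std::string& group,
                    const uuid& member_out, const uuid& member_in) {
    if (!msg_mgr.add_member(group, member_in)) return false;   // baseline/incremental resync of the new member starts here
    listener.replace_member(member_out, member_in);            // HS_CTRL_REPLACE log entry committed on every replica
    return msg_mgr.rem_member(group, member_out);              // old member exits; destroyed group is GC'd later
}

int main() {
    MsgManager mgr;
    Listener listener;
    replace_member(mgr, listener, "pg_group_1", "replica-3", "replica-4");
}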

Testing

  1. Bring up replica-1, replica-2, and replica-3. Replica-1 is the leader. Replace replica-3 with a new spare replica-4.

}

// Step 2. Add the new member.
return m_msg_mgr.add_member(m_group_id, member_in_uuid)
Contributor Author

Add member, append log entry and remove the old member

RD_LOGI("Raft repl replace_member member_out={} member_in={}", boost::uuids::to_string(member_out),
boost::uuids::to_string(member_in));

m_listener->replace_member(member_out, member_in);
Contributor Author

HomeObject or the listener can use this to update their PG metablks to remove the old member and add the new one.
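
A hypothetical sketch of what the listener side (e.g. HomeObject) could do with this callback: swap the member in its in-memory PG state and persist it as one metablk write. PGInfo and persist_pg_metablk() are assumptions for illustration, not the real HomeObject code.

#include <cstdint>
#include <string>
#include <vector>

struct PGInfo {
    uint64_t pg_id{0};
    std::vector<std::string> members;   // serialized into the PG metablk
};

// Assumed persistence hook; in HomeObject this would be a metablk update.
void persist_pg_metablk(const PGInfo& pg) { (void)pg; }

void on_replace_member(PGInfo& pg, const std::string& member_out, const std::string& member_in) {
    for (auto& m : pg.members) {
        if (m == member_out) { m = member_in; break; }
    }
    // Removing the old member and adding the new one land in the same single write.
    persist_pg_metablk(pg);
}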

@@ -134,11 +136,12 @@ class HSReplTestHelper : public HSTestHelper {
HSReplTestHelper(std::string const& name, std::vector< std::string > const& args, char** argv) :
name_{name}, args_{args}, argv_{argv} {}

void setup() {
void setup(uint32_t num_replicas) {
Contributor Author

setup takes the number of replicas to start.
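
Usage sketch only, with the helper internals stubbed out; the replica counts below are illustrative, not the actual test configuration.

#include <cstdint>
#include <string>
#include <vector>

struct HSReplTestHelper {
    HSReplTestHelper(std::string const& name, std::vector<std::string> const& args, char** argv) :
        name_{name}, args_{args}, argv_{argv} {}
    // Starts the requested number of replica processes for the test group.
    void setup(uint32_t num_replicas) { started_ = num_replicas; }

    std::string name_;
    std::vector<std::string> args_;
    char** argv_;
    uint32_t started_{0};
};

int main(int argc, char** argv) {
    std::vector<std::string> args(argv, argv + argc);
    HSReplTestHelper helper{"test_raft_repl_dev_dynamic", args, argv};
    // A static test would pass just num_replicas; a dynamic test passes
    // num_replicas plus the spare replicas it intends to swap in later.
    helper.setup(3 + 2);   // 3 + 2 is an assumed example, not the real default
}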

@@ -0,0 +1,629 @@
/*********************************************************************************
Contributor Author

Moved common code from test_raft_repl_dev.cpp to raft_repl_test_base.hpp. No change in code, just moved the classes.

LOGINFOMOD(replication, "[Replica={}] Read logical snapshot callback obj_id={} term={} idx={} num_items={}",
g_helper->replica_num(), snp_data->offset, s->get_last_log_term(), s->get_last_log_idx(),
kv_snapshot_data.size());
#include "test_common/raft_repl_test_base.hpp"
Contributor Author

Moved this code to the raft_repl_test_base.hpp header file.

@@ -0,0 +1,132 @@
/*********************************************************************************
Contributor Author

@sanebay sanebay Sep 11, 2024

This test creates a repl dev group with 3 replicas by default and then adds spare replicas. We couldn't make the code common because the test_raft_repl_dev test cases assume all replicas are part of the group from the beginning.
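
An illustrative-only outline of what such a dynamic test looks like, assuming gtest; the fixture name, globals and replace_member signature merely resemble the ones added in this PR.

#include <gtest/gtest.h>
#include <cstdint>

static uint32_t g_num_replicas = 3;   // base group size
static uint32_t g_num_spares = 2;     // started as spares, not in the group yet

class ReplDevDynamicTest : public ::testing::Test {
protected:
    void replace_member(uint32_t out_replica, uint32_t in_replica) {
        // In the real test this would call the repl service replace_member API,
        // wait for resync, and validate data on the newly added member.
        (void)out_replica;
        (void)in_replica;
    }
};

TEST_F(ReplDevDynamicTest, ReplaceMember) {
    // Replace the last member of the base group with the first spare replica.
    replace_member(g_num_replicas - 1, g_num_replicas);
}

int main(int argc, char** argv) {
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}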

Contributor

@JacksonYao287 JacksonYao287 left a comment

@sanebay thanks for the patch!

we need to take leader change into account, please see my comments and correct me if I am wrong


LOGINFO("Replace member added member={} to group_id={}", member_in, group_id_str());

// Step 3. Append log entry to mark the old member is out and new member is added.
Contributor

if the leader changes after line 140 and before line 147, then who can send a log entry to mark the old member as out and remove member_out from this group?

should we separate replace_member into two calls: add_member and remove_member?

Contributor Author

Yeah, valid scenario. In that case the log entry append will fail at line 157 and CM will retry with the latest leader. That's the flow we have decided with CM. It should be safe to send the same message to a different leader.

Contributor

@JacksonYao287 JacksonYao287 Sep 18, 2024

It should be safe to send the same message to a different leader.

if the leader changes after propose_to_raft but before m_msg_mgr.rem_member, then it will return ReplServiceError::RETRY_REQUEST.

now, if CM resends the request to the new leader, I think add_member will succeed on the new leader since member_in was already added to the group by the previous leader. The next step is propose_to_raft, which will create another HS_CTRL_REPLACE log and commit it.

so HS_CTRL_REPLACE will be committed twice at the other followers, one from the previous leader and one from the current leader.

we should make it safe to commit HS_CTRL_REPLACE twice (i.e. make committing HS_CTRL_REPLACE idempotent) for the same member_out and member_in, so we need to add some checks in m_listener#replace_member(member_out, member_in).

so, I think basically it will be safe, but we should add some code to make committing HS_CTRL_REPLACE idempotent.
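
A minimal sketch of the kind of check being suggested, assuming the listener keeps its current member list; apply_replace and its arguments are illustrative, not the actual listener code.

#include <algorithm>
#include <string>
#include <vector>

// Returns false (no-op) when the swap was already applied by a previous leader,
// so a second HS_CTRL_REPLACE commit for the same (out, in) pair is harmless.
bool apply_replace(std::vector<std::string>& members, const std::string& member_out,
                   const std::string& member_in) {
    bool out_gone = std::find(members.begin(), members.end(), member_out) == members.end();
    bool in_present = std::find(members.begin(), members.end(), member_in) != members.end();
    if (out_gone && in_present) return false;   // already applied: idempotent no-op
    std::replace(members.begin(), members.end(), member_out, member_in);
    return true;
}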

Collaborator

more for CM:
if the new leader is the member_in, we haven't responded to the call yet. We get the leader (in nuraft_mesg) from raft, which is always up to date, and respond to CM.

Wondering if CM expects a HB from a to-be-added SM, or expects a retry against a to-be-added SM.

Contributor Author

@sanebay sanebay Sep 18, 2024

Idempotency of HS_CTRL_REPLACE is the same as for create PG or shard. The gateway can send a request to the leader, the leader does the raft commit, and if it crashed and restarted before returning success to the gateway, the gateway will send it to the same leader again and we might create the PG again; it's a NOP in HO. In HO, in the add_replace function, if the in and out members are already applied, then it's a NOP.

@sanebay
Contributor Author

sanebay commented Sep 18, 2024

we may need add_member and remove_member if we do expansion/shrink. Our use case is only replace, and we want to update the homeobject PG metablk in a single write atomically. The member out will have its PG metablk with the destroyed flag enabled. GC will use this to delete all blobs for that PG.

@JacksonYao287
Contributor

The member out will have its PG metablk with the destroyed flag enabled. GC will use this to delete all blobs for that PG.

do we need GC for the member out? I think if we remove it from the group, we can just re-format it with homestore, then it will be clean, no?

.via(&folly::InlineExecutor::instance())
.thenValue([this, member_in_uuid, member_out_uuid](auto&& e) -> AsyncReplResult<> {
// TODO Currently we ignore the cancelled, fix nuraft_mesg to not timeout
// when adding member. Member is added to cluster config until member syncs fully
Collaborator

can we change the behavior? This is very odd, mixing member change with data syncing.

Contributor Author

@sanebay sanebay Sep 18, 2024

This is a nuraft behaviour. I had the same doubt. This is what Jungsang shared:
"""

  • there are three servers: s1, s2, s3. So quorum size is 2.
  • let's say s3 is lagging behind, but still s1 can make a commit, by forming quorum {s1, s2}.
  • now new member s4 joins the cluster.
  • if s4 joins the cluster immediately without catch-up, it is also lagging behind.
  • since it joins the cluster immediately, now quorum size becomes 3.
  • now s1 cannot make a commit, no matter which member is in the quorum: {s1, s2, s3}, {s1, s2, s4}.
"""

Contributor

this scenario makes sense.
another question: do we have a plan to have a separate add_member and rem_member?

for example, if we have a very heavy read workload for one PG, we may want to add some new members to take over some of the read workload, so we only need add_member.

Collaborator

@JacksonYao287 we don't want that as of now. We don't have the flexibility to control the replication size of each PG in the control plane. Let's revisit if we have a solid use case.

Collaborator

@sanebay that makes sense from the raft point of view, but it makes the CM->SM protocol very tricky. Once the leader SM reboots, what would be the state of the in-flight add_srv? Will it continue till success or be discarded? Who should take the responsibility to respond to CM?

Is there a possibility that we can add the new member as a learner and promote it to participant once it catches up?

Collaborator

also for a failed/forgotten add_srv, who should take the responsibility to reclaim space? If CM retries with the same PG to the same SM I believe baseline resync will erase and re-do. If CM attempts another PG on this SM I am not sure how we handle this; linking to the fixed PG size design @JacksonYao287

Contributor Author

@sanebay that makes sense from the raft point of view, but it makes the CM->SM protocol very tricky. Once the leader SM reboots, what would be the state of the in-flight add_srv? Will it continue till success or be discarded? Who should take the responsibility to respond to CM?

Is there a possibility that we can add the new member as a learner and promote it to participant once it catches up?

add_srv internally does a cluster config change with a log entry, so it needs a commit; it is treated the same as an append-entry log. If it committed and the leader crashed, CM can send it again and we return an error that the server already exists. If it didn't commit, it will roll back once another leader is elected. I haven't explored the learner option.

also for a failed/forgotten add_srv, who should take the responsibility to reclaim space?
No, as mentioned, add_srv should fail with the error SERVER_ALREADY_EXISTS. As per the design, if the new member fails after some time, manual intervention is needed and a new PG move is needed (https://github.com/eBay/NuRaft/blob/master/src/handle_join_leave.cxx#L68)

Contributor

also for a failed/forgotten add_srv, who should take the responsibility to reclaim space? If CM retries with the same PG to the same SM I believe baseline resync will erase and re-do. If CM attempts another PG on this SM I am not sure how we handle this

yes, a failed add_srv will leave some stale data on the disk of member_in; it is caused by the snapshot taken before the config change is committed.

I have gone through the raft code. I will add the solution in the doc.


LOGINFO("Replace member added member={} to group_id={}", member_in, group_id_str());

// Step 3. Append log entry to mark the old member is out and new member is added.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more for cm:
if the new leader is the member_in, we havent respond to the call yet. We get the leader (in nuraft_mesg) from raft which is always up_to_date, and respond to CM.

Wondering if CM expect HB from a to-be-added SM , or expect retry aganist a to-be-added SM.

@sanebay
Contributor Author

sanebay commented Sep 18, 2024

The member out will have its PG metablk with the destroyed flag enabled. GC will use this to delete all blobs for that PG.

do we need GC for the member out? I think if we remove it from the group, we can just re-format it with homestore, then it will be clean, no?

One SM could have multiple PGs on the same disk. We could move out some PGs for space issues or load balancing.

@JacksonYao287
Contributor

the code generally looks good to me. Please resolve the conflict.

@codecov-commenter

codecov-commenter commented Sep 19, 2024


Codecov Report

Attention: Patch coverage is 60.78431% with 20 lines in your changes missing coverage. Please review.

Project coverage is 67.27%. Comparing base (1a0cef8) to head (afb64fc).
Report is 64 commits behind head on master.

Files with missing lines                       | Patch % | Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp | 56.81%  | 18 Missing and 1 partial ⚠️
src/lib/replication/repl_dev/common.cpp        | 0.00%   | 1 Missing ⚠️


Additional details and impacted files
@@             Coverage Diff             @@
##           master     #546       +/-   ##
===========================================
+ Coverage   56.51%   67.27%   +10.76%     
===========================================
  Files         108      109        +1     
  Lines       10300    10448      +148     
  Branches     1402     1408        +6     
===========================================
+ Hits         5821     7029     +1208     
+ Misses       3894     2743     -1151     
- Partials      585      676       +91     


@@ -93,7 +93,10 @@ void RaftReplService::start() {
.with_hb_interval(HS_DYNAMIC_CONFIG(consensus.heartbeat_period_ms))
.with_max_append_size(HS_DYNAMIC_CONFIG(consensus.max_append_batch_size))
.with_log_sync_batch_size(HS_DYNAMIC_CONFIG(consensus.log_sync_batch_size))
// TODO fix the log_gap thresholds when adding new member.
Contributor Author

Shared logs with the nuraft team, but we couldn't find the root cause immediately. Will have to debug with the nuraft team when they are available. Will create a ticket to keep track of this: #561

// Step 3. Append log entry to mark the old member is out and new member is added.
auto rreq = repl_req_ptr_t(new repl_req_ctx{});
replace_members_ctx members;
std::copy(member_in_uuid.begin(), member_in_uuid.end(), members.in.begin());
Contributor

nit: we can't do member.in = std::move(members_in_uuid), can we?

Contributor Author

We need a char array because we are serializing two members. members_in_uuid is a string, so we couldn't use it directly.
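
A small sketch of the point being made: fixed-size char arrays make the two-member context a flat, serializable blob. The struct name and sizes here are assumptions for illustration, not the real replace_members_ctx.

#include <algorithm>
#include <array>
#include <string>

// Illustrative layout only: both textual UUIDs packed into one flat POD struct
// so the whole thing can be copied into the HS_CTRL_REPLACE log payload.
struct replace_members_ctx_sketch {
    std::array<char, 36> out{};   // boost uuid string form is 36 chars
    std::array<char, 36> in{};
};

replace_members_ctx_sketch make_ctx(const std::string& out_uuid, const std::string& in_uuid) {
    replace_members_ctx_sketch ctx;
    std::copy_n(out_uuid.begin(), std::min(out_uuid.size(), ctx.out.size()), ctx.out.begin());
    std::copy_n(in_uuid.begin(), std::min(in_uuid.size(), ctx.in.size()), ctx.in.begin());
    // sizeof(ctx) bytes can be serialized directly; a std::string member could not.
    return ctx;
}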


// Check if member_out exists in the group.
std::vector< nuraft::ptr< nuraft::srv_config > > configs_out;
m_msg_mgr.get_srv_config_all(m_group_id, configs_out);
Contributor

if the out member is not found, shouldn't we just move ahead and add the in member? The reason is we have retry logic, and we don't need to worry about which operation actually moved the member out given a mix of crashes in between; if one operation did manage to move the old member out, shouldn't we just go ahead, ignore that the out member is not seen in the current group, and add the in member?

Also we should check that the in member is not already in the raft group. Since we have the retry logic, this retry needs to be idempotent.

Contributor Author

Yeah, we can do that. Instead of doing it here, we can depend on the raft server itself returning nuraft::cmd_result_code::SERVER_ALREADY_EXISTS for add and nuraft::cmd_result_code::SERVER_NOT_FOUND for remove, and ignore those errors.
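
A tiny sketch of that retry-tolerant handling: treat SERVER_ALREADY_EXISTS on add and SERVER_NOT_FOUND on remove as success, so a CM retry through a new leader stays idempotent. Only the result-code names mirror nuraft::cmd_result_code; the wrapper functions are illustrative.

#include <iostream>

enum class cmd_result_code { OK, SERVER_ALREADY_EXISTS, SERVER_NOT_FOUND };

bool add_member_ok(cmd_result_code rc) {
    // Already added by a previous leader before it lost leadership: treat as success.
    return rc == cmd_result_code::OK || rc == cmd_result_code::SERVER_ALREADY_EXISTS;
}

bool rem_member_ok(cmd_result_code rc) {
    // Already removed by a previous leader: also treat as success.
    return rc == cmd_result_code::OK || rc == cmd_result_code::SERVER_NOT_FOUND;
}

int main() {
    std::cout << add_member_ok(cmd_result_code::SERVER_ALREADY_EXISTS) << "\n";   // 1
    std::cout << rem_member_ok(cmd_result_code::SERVER_NOT_FOUND) << "\n";        // 1
}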

When replacing a member, add the new member, sync the raft log
for the replace and finally remove the old member. Once we add
the new member, baseline or incremental resync will start.
Removing the old member causes nuraft_mesg to exit
the group and we periodically GC the destroyed group.
Made the repl dev base test common so that both test files
can use it. Tests by default create a repl group with num_replicas.
Dynamic tests create additional spare replicas which can be
added to the test dynamically by calling replace member.
@sanebay sanebay merged commit e76afe5 into eBay:master Sep 25, 2024
22 checks passed
@sanebay sanebay deleted the hs_replace_member branch September 25, 2024 22:32