enable home raft log store UT #515
Conversation
Codecov Report (Attention: patch coverage is …)

@@ Coverage Diff @@
##           master     #515       +/-   ##
===========================================
+ Coverage   56.51%   68.21%   +11.70%
===========================================
  Files         108      109        +1
  Lines       10300    10433      +133
  Branches     1402     1400        -2
===========================================
+ Hits         5821     7117     +1296
+ Misses       3894     2638     -1256
- Partials      585      678       +93
}
#endif

m_log_store->truncate(to_store_lsn(compact_lsn));
In compact here, we need to truncate, which will update the start_index; flush will not update start_index.
This is done purposefully: we let the resource manager do the truncation. @yamingk can confirm this.
Then the start_index will not be updated on time, which will probably be wrong for the upper layer.
From the upper layer's view, if it compacts to lsn n, then the start_index it sees after compact should be n + 1.
So I believe we should do the truncate here and not depend on the resource manager.
This needs to be discussed with @yamingk
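To make the disagreement above concrete, here is a minimal sketch (not the HomeStore API; all names are illustrative) of why compact(lsn) must truncate and advance start_index to lsn + 1, while a plain flush leaves start_index untouched:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <stdexcept>

// Hypothetical toy log store illustrating the compact-vs-flush semantics
// discussed above. Not HomeStore's real classes.
class ToyLogStore {
public:
    uint64_t start_index() const { return m_start_index; }

    void append(uint64_t /*entry*/) { m_entries.push_back(0); }

    // flush persists buffered entries but does NOT advance start_index.
    void flush() { m_flushed = m_entries.size(); }

    // compact(lsn) truncates everything up to and including lsn, so the
    // upper layer immediately observes start_index == lsn + 1.
    void compact(uint64_t lsn) {
        if (lsn < m_start_index) return;            // nothing to drop
        uint64_t drop = lsn - m_start_index + 1;
        if (drop > m_entries.size()) throw std::out_of_range("lsn beyond tail");
        m_entries.erase(m_entries.begin(),
                        m_entries.begin() + static_cast<std::ptrdiff_t>(drop));
        m_start_index = lsn + 1;
    }

private:
    std::deque<uint64_t> m_entries;
    uint64_t m_start_index{1};
    std::size_t m_flushed{0};
};
```

The sketch only shows the visible contract the reviewer is arguing about: after compact(n), a caller reading start_index() sees n + 1, which flush alone never provides.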
@@ -165,6 +165,7 @@ jobs:
      - name: Build Cache
        run: |
          pre=$([[ "${{ inputs.build-type }}" != "Debug" ]] && echo "-o sisl:prerelease=${{ inputs.prerelease }}" || echo "")
+         sudo rm -rf $ANDROID_HOME
In the failing CI, Create and Test Package is not scheduled, so sudo rm -rf /usr/local/lib/android/ will not be executed in that run, which might cause a no-space-left issue.
Here I add this to the Build Cache phase, which will definitely be scheduled in every CI pipeline.
lgtm
3 issues are solved:
- clean up the Android home so we can enlarge the disk size to 2GB
- add a for loop for flush; this makes sense based on @JacksonYao287's explanation, and it is perfect that we caught this in the re-enabled UT
- for the compact() call we do compact and move the flush into compact, to avoid flushing logs before the start_index
@@ -264,8 +264,8 @@ raft_buf_ptr_t HomeRaftLogStore::pack(ulong index, int32_t cnt) {
        [this, &out_buf, &remain_cnt]([[maybe_unused]] store_lsn_t cur, const log_buffer& entry) mutable -> bool {
            if (remain_cnt-- > 0) {
                size_t avail_size = out_buf->size() - out_buf->pos();
-               if (avail_size < entry.size()) {
-                   avail_size += std::max(out_buf->size() * 2, (size_t)entry.size());
+               if (avail_size < entry.size() + sizeof(uint32_t)) {
avail_size should be able to hold entry.size() plus the length field of this entry, which is a uint32.
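The sizing rule being fixed can be sketched as follows. This is a hedged, self-contained illustration (pack_entry and the vector buffer are stand-ins, not HomeRaftLogStore's real types): each packed entry is a uint32 length prefix followed by the payload, so the free-space check must cover entry.size() + sizeof(uint32_t):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative packer: writes [uint32 length][payload] into `out`,
// growing the buffer when free space cannot hold prefix + payload.
static void pack_entry(std::vector<uint8_t>& out, const std::vector<uint8_t>& entry) {
    size_t need = entry.size() + sizeof(uint32_t);  // prefix + payload
    if (out.capacity() - out.size() < need) {
        // grow by at least `need`, doubling as in the original heuristic
        out.reserve(out.size() + std::max(out.capacity() * 2, need));
    }
    uint32_t len = static_cast<uint32_t>(entry.size());
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
    out.insert(out.end(), p, p + sizeof(len));          // length prefix
    out.insert(out.end(), entry.begin(), entry.end());  // payload
}
```

Checking only entry.size(), as the old code did, could leave the buffer too small by exactly the 4-byte prefix for a nearly-full buffer.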
@@ -177,6 +177,7 @@ void HomeLogStore::on_log_found(logstore_seq_num_t seq_num, const logdev_key& ld

void HomeLogStore::truncate(logstore_seq_num_t upto_lsn, bool in_memory_truncate_only) {
    if (upto_lsn < m_start_lsn) { return; }
    flush();
Why do we need a flush here?
If the caller does not do an explicit flush before truncating, then m_tail_lsn will not be updated, and truncating might not be able to truncate to the expected lsn.
I add a flush here to make sure we flush before truncating. If a flush is already scheduled before truncating, then the flush here will do nothing and just return.
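The "flush first, then truncate" ordering described above can be sketched in a toy model (hypothetical names, not HomeLogStore's real implementation): flush is idempotent, so truncate can always call it to bring m_tail_lsn current before clamping the truncation point:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy store illustrating why truncate() flushes first: without the flush,
// m_tail_lsn may lag appended entries and the clamp below would truncate
// to a stale tail.
class ToyStore {
public:
    void append(int64_t lsn) { m_pending_tail = lsn; }

    // Advances m_tail_lsn; returns immediately if nothing new is pending,
    // so a redundant flush costs nothing.
    void flush() {
        if (m_pending_tail <= m_tail_lsn) return;  // already flushed: no-op
        m_tail_lsn = m_pending_tail;
    }

    void truncate(int64_t upto_lsn) {
        if (upto_lsn < m_start_lsn) { return; }    // nothing to do
        flush();                                   // make m_tail_lsn current
        m_start_lsn = std::min(upto_lsn, m_tail_lsn) + 1;
    }

    int64_t start_lsn() const { return m_start_lsn; }
    int64_t tail_lsn() const { return m_tail_lsn; }

private:
    int64_t m_start_lsn{0};
    int64_t m_tail_lsn{-1};
    int64_t m_pending_tail{-1};
};
```

If the flush were removed, truncate(10) right after append(10) would clamp against the stale tail (-1) and truncate nothing.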
@@ -878,14 +878,6 @@ TEST_F(RaftReplDevTest, BaselineTest) {
    LOGINFO("Homestore replica={} setup completed", g_helper->replica_num());
    g_helper->sync_for_test_start();
We need this as we are not truncating when raft calls compact
Let's first make a decision on this topic:
https://github.com/eBay/HomeStore/pull/515/files/c3e0ab3f3210c42a63b6b5f6a97386115f1126a9#r1724163097
Then we can revisit this. If we need to do a real truncate, then this can be removed.
@@ -1050,7 +1050,13 @@ std::pair< bool, nuraft::cb_func::ReturnCode > RaftReplDev::handle_raft_event(nu
    if (entry->get_val_type() != nuraft::log_val_type::app_log) { continue; }
    if (entry->get_buf_ptr()->size() == 0) { continue; }
    auto req = m_state_machine->localize_journal_entry_prepare(*entry);
    if (req == nullptr) {
        // TODO :: we need to identify whether this log entry should be appended to log store.
Is #1 safe here? As we don't have the term in the rreq, how can we ensure this is not a re-written entry?
We have term in the rreq: every rreq is identified by {originator, term, dsn}.
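The identity claim above can be sketched as a key type (illustrative names and types, not RaftReplDev's real structures): because term is part of the key, a re-proposed entry from a later term maps to a distinct request even if the dsn repeats:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>
#include <unordered_map>

// Hypothetical request key mirroring the {originator, term, dsn} triple
// described in the comment above.
struct RreqKey {
    int32_t originator;
    uint64_t term;
    uint64_t dsn;
    bool operator==(const RreqKey& o) const {
        return std::tie(originator, term, dsn) ==
               std::tie(o.originator, o.term, o.dsn);
    }
};

// Simple hash combiner so the key can index an unordered_map.
struct RreqKeyHash {
    std::size_t operator()(const RreqKey& k) const {
        std::size_t h = std::hash<uint64_t>{}(k.term);
        h ^= std::hash<uint64_t>{}(k.dsn) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<int32_t>{}(k.originator) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

using RreqMap = std::unordered_map<RreqKey, int, RreqKeyHash>;
```

Under this keying, a rewritten log entry (same originator and dsn, new term) cannot collide with the earlier proposal's request, which addresses the safety question above.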
@@ -264,8 +264,8 @@ raft_buf_ptr_t HomeRaftLogStore::pack(ulong index, int32_t cnt) {
        [this, &out_buf, &remain_cnt]([[maybe_unused]] store_lsn_t cur, const log_buffer& entry) mutable -> bool {
            if (remain_cnt-- > 0) {
                size_t avail_size = out_buf->size() - out_buf->pos();
-               if (avail_size < entry.size()) {
-                   avail_size += std::max(out_buf->size() * 2, (size_t)entry.size());
+               if (avail_size < entry.size() + sizeof(uint32_t)) {
Can you convert your comment to a code comment also?
Ack, will do it in a later PR.
    // destroyed forever. we need to handle this in raft_repl_dev. revisit here after making changes at
    // the raft_repl_dev side to handle this case. this is a workaround to avoid the infinite loop for now.
    if (i++ > 10 && !force_leave) {
        LOGWARN("Waiting for repl dev to get destroyed and it is leader, so do a force leave");
The log message is not accurate, as this will run on every replica, not only the leader.
Yes, it's a mistake; it can run on any replica. Will change this in a later PR.
1. Enable the home_raft_log_store UT and add it to the docker file and the nightly run.
2. Remove the Android directory in the GitHub CI in the Build Cache stage (previously in create-and-test-package), so the space in the GitHub CI VM will definitely be freed for every pipeline, since one of the CI pipelines does not have create-and-test-package.
3. Increase the disk size to 2GB for the UT.
4. Fix a log dev flush bug: add a loop to create multiple logGroups to make sure all the expected logs are flushed.
5. Fix a bug in home_raft_log_store#pack to make sure the available size of the packing buffer can hold entry.size() plus the length field of the entry.
6. Use truncation instead of flush in home_raft_log_store#compact. This is only an in-memory change; the real truncation will be scheduled by the resource manager. Also created issue #530 to handle the start_index case when recovering from a crash; we do it in a separate PR.
7. Add a commit_config log so we can see when a config change is made.
8. Fix a bug in handle_raft_event: we need to identify whether a log entry should be appended to the log store. If the req's lsn is not -1, this log has been localized and appended before, and we should skip it.
9. Add a force_leave API for now to handle the case where a follower and the leader have destroyed the raft group, but the second follower failed to receive this message and will get stuck. This is used for fixing the raft_repl_dev UT; we can revisit later if necessary.
10. Fix the raft_repl_dev UT bug when setting last_committed_idx in write_snapshot_data on the follower side: we should get the committed index from the raft_server itself. In our UT, commit_config is not taken into account when increasing the commit_count.
11. Fix a raft_repl_dev UT bug in read_snapshot on the leader. When we search for the next start lsn (the start of the data to send), we cannot use std::map::find: if the requested next_lsn is a config change, it is not put into the kv DB (lsn_index_), so std::map::find will return end() and nothing will be sent to the follower. Instead we should use lower_bound, so we get the first data kv to send and configs are skipped.
11 fix bug bug of raft_repl_dev UT in read_snap_shot of leader. when we search the next start lsn which is the start of sending data. we can not use std::map::find , since if the requested next_lsn is a config_change , it will not be put into kvDB(lsn_index_) and as a result , std::map::find wil return std::end() and nothing will be sent to follower. instead , we should use low_bound, so that we can get the first data kv to be sent , config will be skipped.