
enable home raft log store UT #515

Merged

Conversation

@JacksonYao287 (Contributor) commented Aug 20, 2024

1. Enable the home_raft_log_store UT and add it to the Dockerfile and nightly runs.
2. Remove the Android directory in the GitHub CI build-cache stage (previously done in create-and-test-package), so the space in the GitHub CI VM is definitely freed for every pipeline, since one of the CI pipelines does not have a create-and-test-package stage.
3. Increase the disk size to 2GB for the UT.
4. Fix a log dev flush bug: add a loop that creates multiple logGroups to make sure all the expected logs are flushed.
5. Fix a bug in home_raft_log_store#pack to make sure the available size of the packing buffer can hold entry.size() plus the length field of the entry.
6. Use truncation instead of flush in home_raft_log_store#compact. This is only an in-memory change; the real truncation will be scheduled by the resource manager. Also created issue #530 to handle the start_index case when recovering from a crash; we will do that in a separate PR.
7. Add a commit_config log so we can see when a config change is made.
8. Fix a bug in handle_raft_event: we need to identify whether a log entry should be appended to the log store. If req#lsn is not -1, this log has already been localized and appended before, so we should skip it.
9. Add a force_leave API for now to handle the case where a follower and the leader have destroyed the raft group, but the second follower fails to receive this message and gets stuck. This is used to fix the raft_repl_dev UT; we can revisit this later if necessary.
10. Fix a raft_repl_dev UT bug when setting last_committed_idx in write_snapshot_data on the follower side: we should get the committed_index from the raft_server itself. In our UT, commit_config is not taken into account when increasing commit_count.
11. Fix a raft_repl_dev UT bug in read_snapshot on the leader side. When we search for the next start lsn (the start of the data to send), we cannot use std::map::find: if the requested next_lsn is a config change, it is never put into the kvDB (lsn_index_), so std::map::find returns std::end() and nothing is sent to the follower. Instead, we should use std::map::lower_bound, so that we get the first data KV to send and config entries are skipped.

@codecov-commenter commented Aug 20, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 52.00000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 68.21%. Comparing base (1a0cef8) to head (65f7777).
Report is 55 commits behind head on master.

Files with missing lines Patch % Lines
src/lib/logstore/log_dev.cpp 50.00% 6 Missing and 2 partials ⚠️
src/include/homestore/replication/repl_dev.h 0.00% 2 Missing ⚠️
src/lib/replication/repl_dev/raft_repl_dev.cpp 0.00% 0 Missing and 1 partial ⚠️
src/lib/replication/repl_dev/raft_repl_dev.h 0.00% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #515       +/-   ##
===========================================
+ Coverage   56.51%   68.21%   +11.70%     
===========================================
  Files         108      109        +1     
  Lines       10300    10433      +133     
  Branches     1402     1400        -2     
===========================================
+ Hits         5821     7117     +1296     
+ Misses       3894     2638     -1256     
- Partials      585      678       +93     


}
#endif

m_log_store->truncate(to_store_lsn(compact_lsn));
JacksonYao287 (Contributor, Author) commented:

In compact here we need to truncate, which updates the start_index; flush does not update the start_index.

Contributor commented:
This is done purposefully. We let resource manager do truncation @yamingk can confirm this.

JacksonYao287 (Contributor, Author) replied:
Then the start_index will not be updated on time, which will likely look wrong to the upper layer.
From the upper layer's view, if it compacts to lsn n, the start_index it sees after compact should be n + 1.
So I believe we should truncate here, and not depend on the resource manager.

Contributor replied:
This needs to be discussed with @yamingk

@@ -165,6 +165,7 @@ jobs:
- name: Build Cache
run: |
pre=$([[ "${{ inputs.build-type }}" != "Debug" ]] && echo "-o sisl:prerelease=${{ inputs.prerelease }}" || echo "")
sudo rm -rf $ANDROID_HOME
JacksonYao287 (Contributor, Author) commented Aug 21, 2024:
In the failing CI, Create and Test Package is not scheduled, so `sudo rm -rf /usr/local/lib/android/` is not executed in that CI, which can cause a no-space-left issue.

Here I add the removal to the Build Cache phase, which is definitely scheduled in every CI pipeline.

@sanebay

xiaoxichen (Collaborator) previously approved these changes Aug 21, 2024:

lgtm. 3 issues are solved:

  1. Clean up the Android home so we can enlarge the disk size to 2GB.
  2. Add a for loop for flush; this makes sense based on @JacksonYao287's explanation, and it is great that we caught this in the re-enabled UT.
  3. For the compact() call, we do compact and move the flush into compact, to avoid flushing logs before the start_index.

@@ -264,8 +264,8 @@ raft_buf_ptr_t HomeRaftLogStore::pack(ulong index, int32_t cnt) {
[this, &out_buf, &remain_cnt]([[maybe_unused]] store_lsn_t cur, const log_buffer& entry) mutable -> bool {
if (remain_cnt-- > 0) {
size_t avail_size = out_buf->size() - out_buf->pos();
if (avail_size < entry.size()) {
avail_size += std::max(out_buf->size() * 2, (size_t)entry.size());
if (avail_size < entry.size() + sizeof(uint32_t)) {
JacksonYao287 (Contributor, Author) commented:
avail_size should be able to hold entry.size() plus the length field of this entry, which is a uint32_t.

src/lib/logstore/log_dev.cpp (conversation resolved)
@@ -177,6 +177,7 @@ void HomeLogStore::on_log_found(logstore_seq_num_t seq_num, const logdev_key& ld

void HomeLogStore::truncate(logstore_seq_num_t upto_lsn, bool in_memory_truncate_only) {
if (upto_lsn < m_start_lsn) { return; }
flush();
Contributor commented:
why do we need a flush here ?

JacksonYao287 (Contributor, Author) replied Aug 23, 2024:
If the caller does not explicitly flush before truncating, m_tail_lsn will not be updated, and truncation might not reach the expected lsn.
The flush here makes sure we flush before truncating; if a flush was already done before the truncate, this flush is a no-op and just returns.


@@ -878,14 +878,6 @@ TEST_F(RaftReplDevTest, BaselineTest) {
LOGINFO("Homestore replica={} setup completed", g_helper->replica_num());
g_helper->sync_for_test_start();

Contributor commented:

We need this as we are not truncating when raft calls compact

JacksonYao287 (Contributor, Author) replied:

Let's first make a decision on this topic:
https://github.com/eBay/HomeStore/pull/515/files/c3e0ab3f3210c42a63b6b5f6a97386115f1126a9#r1724163097

Then we can revisit this. If we need to do a real truncate, then this can be removed.

@JacksonYao287 force-pushed the enable-home-raft-log-store-UT branch 3 times, most recently from a8f5fcb to 1f85dd3 on August 26, 2024 07:57
@JacksonYao287 self-assigned this Aug 27, 2024
@JacksonYao287 force-pushed the enable-home-raft-log-store-UT branch 3 times, most recently from 65453a2 to 6b10cae on August 28, 2024 09:40
@JacksonYao287 force-pushed the enable-home-raft-log-store-UT branch 2 times, most recently from 97dfa30 to 65f7777 on August 29, 2024 08:08
@@ -1050,7 +1050,13 @@ std::pair< bool, nuraft::cb_func::ReturnCode > RaftReplDev::handle_raft_event(nu
if (entry->get_val_type() != nuraft::log_val_type::app_log) { continue; }
if (entry->get_buf_ptr()->size() == 0) { continue; }
auto req = m_state_machine->localize_journal_entry_prepare(*entry);
if (req == nullptr) {
// TODO :: we need to indentify whether this log entry should be appended to log store.
Collaborator commented:
Is #1 safe here? As we don't have the term in the rreq, how can we ensure this is not a rewritten entry?

JacksonYao287 (Contributor, Author) replied:

We do have the term in rreq: every rreq is identified by {originator, term, dsn}.

@@ -264,8 +264,8 @@ raft_buf_ptr_t HomeRaftLogStore::pack(ulong index, int32_t cnt) {
[this, &out_buf, &remain_cnt]([[maybe_unused]] store_lsn_t cur, const log_buffer& entry) mutable -> bool {
if (remain_cnt-- > 0) {
size_t avail_size = out_buf->size() - out_buf->pos();
if (avail_size < entry.size()) {
avail_size += std::max(out_buf->size() * 2, (size_t)entry.size());
if (avail_size < entry.size() + sizeof(uint32_t)) {
Contributor commented:
Can you convert your comment into a code comment as well?

JacksonYao287 (Contributor, Author) replied:
Ack, will do it in a later PR.

// destroyed for ever. we need handle this in raft_repl_dev. revisit here after making changes at
// raft_repl_dev side to hanle this case. this is a workaround to avoid the infinite loop for now.
if (i++ > 10 && !force_leave) {
LOGWARN("Waiting for repl dev to get destroyed and it is leader, so do a force leave");
Collaborator commented:
The log is not accurate, as this will run on every replica, not only the leader.

JacksonYao287 (Contributor, Author) replied:
Yes, it's a mistake; it can run on any replica. Will change this in a later PR.

@JacksonYao287 merged commit fc1e7a2 into eBay:master Aug 30, 2024
43 of 44 checks passed
@JacksonYao287 deleted the enable-home-raft-log-store-UT branch August 30, 2024 01:01
Development

Successfully merging this pull request may close these issues.

fix raft_repl_dev UT enable home raft log store UT and fix failure
5 participants