
Block splitter #4136

Open · wants to merge 26 commits into `dev`
Conversation

@Cyan4973 (Contributor) commented Sep 3, 2024

Instead of ingesting full blocks only (128 KB),
make an a priori analysis of the data,
and infer a position to split the block at a more appropriate boundary.

This can notably happen in an archive scenario, at the boundary between 2 files of different nature within the archive.

This leads to some non-trivial compression gains, for an acceptable speed cost.
The benefit is higher when there isn't already a post-splitter (as in the btopt levels and above (16+)),
but even when a post-splitter is active, there is still a small compression ratio benefit, making this strategy desirable even for higher compression modes.

However, this input analysis is not free. Therefore, it's currently reserved for higher compression strategies (currently btlazy2 and above), where the speed cost is considered negligible (< 5%).
For other modes, the analysis is skipped and replaced by a static split size, since the block size is no longer limited to 128 KB. Through tests, it appears that a static 92 KB block size brings a higher compression ratio, at a small-to-negligible compression speed loss (mostly due to the increased number of blocks, hence of block headers).
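The analysis described above can be sketched roughly as follows: maintain a byte-value fingerprint (histogram) of the data seen so far, compare it against the fingerprint of each incoming chunk, and propose a split where the two diverge. This is only an illustrative sketch under assumed names and constants (`CHUNK_SIZE`, the L1 distance metric, the threshold); the PR's actual heuristic lives in `zstd_preSplit.c` and may differ.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 8192   /* scan granularity (illustrative) */
#define NB_EVENTS  256    /* one bucket per byte value */
#define SCALE      1024   /* fixed-point scale for normalized histograms */

typedef struct { unsigned events[NB_EVENTS]; unsigned total; } Fingerprint;

static void addToFingerprint(Fingerprint* fp, const unsigned char* p, size_t size)
{
    size_t n;
    for (n = 0; n < size; n++) fp->events[p[n]]++;
    fp->total += (unsigned)size;
}

/* L1 distance between two normalized byte histograms, in [0, 2*SCALE] */
static unsigned long long fpDistance(const Fingerprint* a, const Fingerprint* b)
{
    unsigned long long dist = 0;
    int i;
    for (i = 0; i < NB_EVENTS; i++) {
        long long na = (long long)a->events[i] * SCALE / (a->total ? a->total : 1);
        long long nb = (long long)b->events[i] * SCALE / (b->total ? b->total : 1);
        dist += (unsigned long long)(na > nb ? na - nb : nb - na);
    }
    return dist;
}

/* Scan @src chunk by chunk; return a proposed split point (in bytes),
 * or @srcSize when no boundary stands out. Threshold is an assumption. */
static size_t proposeSplit(const unsigned char* src, size_t srcSize)
{
    Fingerprint past, recent;
    size_t pos = 0;
    memset(&past, 0, sizeof(past));
    while (pos + CHUNK_SIZE <= srcSize) {
        memset(&recent, 0, sizeof(recent));
        addToFingerprint(&recent, src + pos, CHUNK_SIZE);
        if (past.total && fpDistance(&past, &recent) > 400 /* illustrative */)
            return pos;   /* statistics changed: split here */
        addToFingerprint(&past, src + pos, CHUNK_SIZE);
        pos += CHUNK_SIZE;
    }
    return srcSize;
}
```

On an archive whose member files have different byte statistics, such a scan reports a split at the file boundary, which is exactly the scenario the PR description mentions.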

Here are some benchmarks, focusing on compression savings:

silesia.tar :

| level | dev | this PR | savings | note |
|------:|---------:|---------:|--------:|------|
| 1 | 73422432 | 73324476 | -97956 | static block size |
| 12 | 58213308 | 58149669 | -63639 | last static block size |
| 13 | 57994603 | 57720597 | -274006 | first dynamic block size |
| 15 | 57176243 | 56905822 | -270421 | last dynamic block size alone |
| 16 | 55338671 | 55256977 | -81694 | combine pre+post splitting |
| 19 | 52887208 | 52842629 | -44579 | idem |
| 22 | 52322760 | 52284324 | -38436 | idem |

calgary.tar :

| level | dev | this PR | savings | note |
|------:|--------:|--------:|-------:|------|
| 1 | 1143607 | 1129229 | -14378 | new static block size |
| 12 | 930831 | 924326 | -6505 | last static block size |
| 13 | 922061 | 914326 | -7735 | first dynamic size |
| 15 | 921606 | 912055 | -9951 | last dynamic size alone |
| 16 | 882353 | 881217 | -1136 | combine pre+post splitting |
| 19 | 861336 | 859737 | -1599 | idem |
| 22 | 861200 | 859578 | -1622 | idem |

Follow-up:

  • Consider multiple variants, with incremental speed / accuracy trade-offs
  • Make the choice of splitting strategy selectable via compression parameter

@Cyan4973 (Contributor, Author) commented Sep 4, 2024

mmmh,
this could be a problem:

```
test 45 : in-place decompression : Error : test (margin <= ZSTD_DECOMPRESSION_MARGIN(CNBuffSize, ZSTD_BLOCKSIZE_MAX)) failed
```

I suspect the problem is that the macro ZSTD_DECOMPRESSION_MARGIN() may assume that block sizes are always full (128 KB), and derive from that an assumption about maximum expansion, which is no longer respected when block sizes default to 92 KB (lower compression levels). In contrast, ZSTD_decompressionMargin() scans the compressed data, so it probably ends up with a different result, and now the two results differ.

I'm not entirely sure what the intended behavior is here...

edit: confirmed that, when I change the default block size to anything other than 128 KB, this test breaks.
On the other hand, the ZSTD_DECOMPRESSION_MARGIN() macro requires a blockSize parameter, and the test passes ZSTD_BLOCKSIZE_MAX, aka 128 KB, so it's pretty clear what it's expecting.
So the question is: how could this test pass "reasonably", i.e. without inner knowledge of how the internal block splitting decision is made?
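For context, the macro's estimate necessarily depends on the block size it is told to assume. The following is a paraphrase of the general shape of such a margin formula (frame header + checksum + a few header bytes per block + one block of slack), not the actual macro from zstd.h; the constants are assumptions. It shows why an estimate computed with ZSTD_BLOCKSIZE_MAX can disagree with one derived from the real 92 KB blocks.

```c
#include <stddef.h>

/* Illustrative approximation of an in-place decompression margin.
 * The real ZSTD_DECOMPRESSION_MARGIN() macro follows a similar shape,
 * but the exact constants below (frame header bound, per-block header
 * bytes) are assumptions for demonstration only. */
static size_t marginEstimate(size_t originalSize, size_t blockSize)
{
    size_t const frameHeaderMax = 18;  /* assumed frame header upper bound */
    size_t const checksum = 4;
    size_t const nbBlocks =
        originalSize ? (originalSize + blockSize - 1) / blockSize : 0;
    /* a few bytes of header per block, plus one block of slack */
    return frameHeaderMax + checksum + 3 * nbBlocks + blockSize;
}
```

Evaluating this with a 92 KB block size versus a 128 KB one yields different margins for the same input size: the per-block overhead term grows with the number of blocks while the one-block slack term shrinks, so a scan-based margin over real 92 KB blocks need not stay under the macro's 128 KB-based bound.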

Commits:
  • instead of ingesting only full blocks, make an analysis of data, and infer where to split.
  • for better portability on Linux kernel (though I really wonder if this is a property worth maintaining).
@Cyan4973 (Contributor, Author) commented Oct 17, 2024

Who knew adding a single source file (zstd_preSplit.c) would be such a big problem...

Currently blocked trying to get the single-file library builder to work: it doesn't include the new file, resulting in a link-stage error.

And then each and every build system also requires updating its own list of files, in its own format and location.

@Cyan4973 (Contributor, Author) commented Oct 17, 2024

Weird stuff:

```
error: undefined reference to '__mulodi4'
```

It only happens during compilation of the clang-asan-ubsan-fuzz32 test, i.e. the undefined-behavior sanitizer enabled for 32-bit compilation with clang (the same test with gcc compiles fine).

The failure seems to correspond to places where `*` multiplication operations on S64 (aka `long long`) variables happen.

And of course, it happens every time on GitHub CI, but not on any other system where I can test the same code and build rule.
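For reference, `__mulodi4` is a compiler-rt builtin that clang's UBSan emits for overflow-checked 64-bit signed multiplication on 32-bit targets; it is not provided by libgcc, which is why linking can fail in that one configuration. A minimal reproducer might look like the following (the build flags in the comment are the usual suspects for this failure, not taken from the PR's build rules):

```c
/* mulodi4_repro.c : minimal reproducer (illustrative).
 * Compiling with:  clang -m32 -fsanitize=undefined mulodi4_repro.c
 * can fail at link time with "undefined reference to '__mulodi4'",
 * because the overflow check for the 64-bit multiply is routed to a
 * compiler-rt builtin that the default runtime (libgcc) doesn't provide. */
#include <stdint.h>

int64_t scaledProduct(int64_t a, int64_t b)
{
    return a * b;   /* UBSan instruments this signed 64-bit multiply */
}
```

Common workarounds are linking compiler-rt explicitly (`--rtlib=compiler-rt`) or restructuring the arithmetic so the instrumented 64x64 signed multiply disappears.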

@Cyan4973 Cyan4973 marked this pull request as ready for review October 20, 2024 23:52
Commit:
  • let's fill the initial stats directly into target fingerprint
@Cyan4973 (Contributor, Author) commented:
All tests passed, ready for review

@Cyan4973 Cyan4973 changed the title Experiment : pre-splitter Block splitter Oct 21, 2024