Block splitter #4136
base: dev
Conversation
mmmh,
I suspect the problem is in that macro. I'm not completely sure what the wanted behavior is here... edit: confirmed that, when I change the default block size to anything other than 128 KB, it breaks this test.
instead of ingesting only full blocks, analyze the data and infer where to split.
for better portability on Linux kernel
though I really wonder if this is a property worth maintaining.
Who knew adding a single source file could be this involved: currently blocked trying to get the single-file library builder to work, and then each and every build system also requires updating its own list of files, in its own format and location.
short term simplification
for easier local testing
Weird stuff:
It only happens during compilation of one specific unit. The failure seems to correspond to a specific location. And of course, it happens all the time on GitHub CI, but not on any other system I can test the same code and build rule with.
ideally, this workspace would be provided from the ZSTD_CCtx* state
let's fill the initial stats directly into target fingerprint
All tests passed, ready for review
Instead of ingesting full blocks only (128 KB), make an a-priori analysis of the data, and infer a position to split the block at a more appropriate boundary. Such a boundary notably occurs in archive scenarios, at the junction between two files of different nature within the archive. This leads to some non-trivial compression gains, for a correspondingly acceptable speed cost.
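To make the idea concrete, here is a minimal sketch of what such an a-priori analysis could look like. This is an illustration only, not the PR's actual implementation: it slides over the input in fixed-size segments, maintains a byte histogram ("fingerprint") of the data seen so far, and proposes a split where the next segment's histogram diverges strongly from it (the segment size, distance metric, and threshold are all assumed values).

```c
#include <stddef.h>

#define SEGMENT_SIZE 8192   /* granularity of the analysis (assumed value) */

/* L1 distance between two normalized byte histograms, in [0, 2] */
static double histDistance(const size_t a[256], size_t aTotal,
                           const size_t b[256], size_t bTotal)
{
    double d = 0.0;
    for (int i = 0; i < 256; i++) {
        double pa = aTotal ? (double)a[i] / (double)aTotal : 0.0;
        double pb = bTotal ? (double)b[i] / (double)bTotal : 0.0;
        d += pa > pb ? pa - pb : pb - pa;
    }
    return d;
}

/* Returns a proposed split position within src (0 = no split found).
 * `threshold` controls sensitivity; ~0.5 is an arbitrary illustrative value. */
size_t proposeSplit(const unsigned char* src, size_t srcSize, double threshold)
{
    size_t seen[256] = {0};
    size_t seenTotal = 0;
    for (size_t pos = 0; pos + SEGMENT_SIZE <= srcSize; pos += SEGMENT_SIZE) {
        size_t seg[256] = {0};
        for (size_t i = 0; i < SEGMENT_SIZE; i++) seg[src[pos + i]]++;
        if (seenTotal
            && histDistance(seen, seenTotal, seg, SEGMENT_SIZE) > threshold)
            return pos;   /* data nature changed: split here */
        for (int i = 0; i < 256; i++) seen[i] += seg[i];   /* extend fingerprint */
        seenTotal += SEGMENT_SIZE;
    }
    return 0;
}
```

On a buffer made of two concatenated files of different nature (say, zeros followed by varied bytes), such a heuristic reports the junction between them as the split position.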
The benefit is higher when there isn't already a post-splitter (as in btopt levels and above, 16+), but even when a post-splitter is active, there is still a small compression ratio benefit, making this strategy desirable even for higher compression modes.
However, this input analysis is not free. Therefore, it's currently reserved for higher compression strategies (currently btlazy2 and above), where the speed cost is considered negligible (< 5%). For other modes, the analysis is skipped and replaced by a static split size, since blocks are no longer limited to 128 KB. Through tests, it appears that a static 92 KB block size brings a higher compression ratio, at a small-to-negligible compression speed loss (mostly due to the increased number of blocks, hence of block headers).
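The resulting policy can be sketched as a small decision helper. The names, enum values, and signature below are illustrative assumptions, not zstd's API; the 92 KB static size and the btlazy2 cutoff come from the description above, while the analysis branch is stubbed out to the 128 KB cap.

```c
#include <stddef.h>

#define KB (1 << 10)
#define STATIC_SPLIT_SIZE (92 * KB)   /* empirically better than 128 KB */
#define MAX_BLOCK_SIZE    (128 * KB)

/* ordering mirrors zstd's strategy ladder; names are illustrative */
typedef enum { fast, dfast, greedy, lazy, lazy2, btlazy2, btopt, btultra } strategy_e;

/* Decide the next block size: run the analysis for btlazy2 and above,
 * fall back to the static 92 KB split size for faster modes. */
size_t nextBlockSize(strategy_e strat, size_t remaining)
{
    size_t target;
    if (strat >= btlazy2) {
        /* in the real PR, the input analysis proposes a boundary here;
         * this sketch just caps at the maximum block size */
        target = MAX_BLOCK_SIZE;
    } else {
        target = STATIC_SPLIT_SIZE;
    }
    return remaining < target ? remaining : target;
}
```

Keeping the fast-mode path analysis-free preserves their speed, while still harvesting part of the splitting gain through the smaller static block size.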
Here are some benchmarks, focusing on compression savings:
silesia.tar : dev vs PR
calgary.tar : dev vs PR
Follow up: make the splitting strategy selectable via a compression parameter.