• by hannob on 5/10/2022, 2:01:53 PM

    It seems somewhat suspicious that the benchmarks don't compare to zstd.

    It's not entirely clear to me what the selling point is. "Better than bzip2" isn't exactly a convincing sales pitch given bzip2 is mostly of historic interest these days.

    Right now the modern compression field is basically covered by xz (if you mostly care about best compression ratio) and zstd (if you want decent compression and very good speed), so when someone wants to pitch a new compression they should tell me where it stands compared to those.

  • by klauspost on 5/10/2022, 12:32:32 PM

    Looks interesting, but my main objections to general adoption are the same as for bzip2, lzma and other context-modelling-based codecs - decompression speed.

    Compressing logs, for instance: a decompression speed of 23 MB/s per core is simply too slow when you need to grep through gigabytes of data. The same goes for data analysis; you don't want your input speed to be this limited when analysing gigabytes of data.

    I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manners to take over a name like this.

  • by lynguist on 5/10/2022, 10:59:35 AM

    If anyone just cares for speed instead of compression I’d recommend lz4 [1]. I only recently started using it. Its speed is almost comparable to memcpy.

    [1] https://github.com/lz4/lz4

  • by pcwalton on 5/10/2022, 6:24:33 PM

    The Burrows-Wheeler transform, which was the main innovation of bzip2 over gzip, and which this bzip3 retains, is one of the most fascinating algorithms to study: https://en.wikipedia.org/wiki/Burrows-Wheeler_transform

    It hasn't been used lately because of the computational overhead, but it's interesting and I'm glad that there's still work in this area. For anyone interested in algorithms it's a great one to wrap your head around.
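    For the curious, the transform is simple enough to sketch naively. This is a demo with hypothetical helper names, not how bzip2/bzip3 implement it - real implementations use suffix arrays rather than materialising every rotation:

    ```python
    def bwt(s: str) -> str:
        """Burrows-Wheeler transform via sorted rotations (O(n^2 log n), demo only)."""
        s += "$"  # unique end-of-string sentinel, assumed absent from the input
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)  # last column of the sorted rotation matrix

    def ibwt(r: str) -> str:
        """Invert the transform by repeatedly prepending the BWT column and re-sorting."""
        table = [""] * len(r)
        for _ in range(len(r)):
            table = sorted(r[i] + table[i] for i in range(len(r)))
        return next(row for row in table if row.endswith("$"))[:-1]

    print(bwt("banana"))        # annb$aa -- similar characters cluster together
    print(ibwt(bwt("banana")))  # banana
    ```

    The clustering of identical characters in the output is what makes the downstream entropy coder's job easier.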

  • by Klasiaster on 5/10/2022, 12:59:14 PM

    Here are some other BWT compressors in the large text compression benchmark (look for "BWT" in the "Alg" column): http://mattmahoney.net/dc/text.html

    And here is a BWT library with benchmarks: https://github.com/IlyaGrebnov/libsais#benchmarks

  • by denzquix on 5/10/2022, 9:42:27 AM

    From their own benchmarks it seems more like bzip3 is geared towards a different compression/speed trade-off than bzip2, rather than an unambiguous all-around improvement. Am I misreading it?

  • by joelthelion on 5/10/2022, 9:41:55 AM

    In the era of zstandard, do we really need this?

  • by yakubin on 5/10/2022, 2:54:58 PM

    From the "disclaimers" section:

    > Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.

    > However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.

    That got me thinking: I've always implicitly assumed that authors of lossless compression algorithms write mathematical proofs that D o C = id[1]. However, now that I've started looking, I can't seem to find that even for Deflate. What is the norm?

    [1]: C being the compression function, D being the decompression function, and o being function composition.

  • by asicsp on 5/10/2022, 8:34:09 AM

    Good work!

    I was also confused by the claims of faster speed than bzip2, and then saw the discussion in this issue: https://github.com/kspalaiologos/bzip3/issues/2

  • by williamkuszmaul on 5/10/2022, 2:15:21 PM

    One of the things that's cool about bzip is that it makes use of algorithmic techniques developed by theoretical computer scientists in order to perform the Burrows-Wheeler transform efficiently. It's a great example of theory and practice working symbiotically.
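    Concretely, the theory ingredient is suffix-array construction: the BWT is just the character preceding each suffix in sorted suffix order, so a linear-time suffix array (as computed by libsais, mentioned elsewhere in this thread) yields a linear-time BWT. A toy sketch of the relationship - the sort below is not linear-time, it only shows the correspondence:

    ```python
    def bwt_from_suffix_array(s: str) -> str:
        s += "$"  # sentinel, assumed to sort before every input character
        # Suffix array: starting indices of suffixes in lexicographic order.
        sa = sorted(range(len(s)), key=lambda i: s[i:])
        # BWT[i] is the character just before suffix sa[i] (wrapping at index 0).
        return "".join(s[i - 1] for i in sa)

    print(bwt_from_suffix_array("banana"))  # annb$aa
    ```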

  • by forgotpwd16 on 5/10/2022, 3:32:04 PM

    >better, faster

    If I'm reading the benchmarks correctly, it gets higher compression but is slower and has higher memory usage. Thus I cannot call it better.

    >spiritual successor to BZip2

    What does that mean? If it isn't related to bzip2, why choose this name?

  • by fefe23 on 5/10/2022, 1:36:19 PM

    Hmm, I see LZ77, PPM and entropy coding in the description, and obviously Burrows-Wheeler.

    Has anyone tried doing zstd at the end instead of LZ77 and entropy coding?

    Does the idea even make sense? (I'm a layman)

  • by iruoy on 5/10/2022, 12:50:41 PM

    So bzip2 and bzip3 focus on compressed size, lz4 on compression speed and zstd on decompression speed?

  • by jkbonfield on 5/13/2022, 10:33:04 AM

    It doesn't compare itself against bsc, which feels a bit poor IMO given it's using Grebnov's libsais and LZP algorithm (he's the author of libbsc).

    In my own benchmarks, it's basically comparable in size (about 0.1% smaller than bsc), with comparable encode speeds and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.

    Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet though.

  • by kstenerud on 5/10/2022, 9:58:51 AM

    There comes a point where the complexity itself becomes too much of a liability. It's important to be able to trust these algorithms as well as all popular implementations with your data.

  • by AceJohnny2 on 5/10/2022, 8:50:47 PM

    Will bzip3 be added to the Squash benchmarks?

    https://quixdb.github.io/squash-benchmark/

    I note that the "Calgary Corpus" that bzip3 prominently advertises is obsolete, dating back to the late 80s:

    https://en.wikipedia.org/wiki/Calgary_corpus

  • by the-alchemist on 5/10/2022, 6:21:31 PM

    I'm really interested in GPU-based compression / decompression.

    Anyone know what the current SOTA GPU-based algorithms are, and why they haven't taken off?

    Brotli has gotten browser support, so it seems to my naive self that a GPU-based algorithm is just waiting to take over.

  • by oefrha on 5/10/2022, 9:59:14 AM

    Interesting, this seems to be a good replacement for xz if the benchmarks are representative.

  • by joppy on 5/10/2022, 12:11:22 PM

    Why is there such a big disclaimer/warning on the front? Shouldn’t the program just check that decompress(compress(x)) = x as it goes, and then it can be sure that compress(x) has not lost any data?
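    A self-check like that is easy to bolt on, and it roughly doubles the work per file; it also only proves the round-trip on this machine with this build, which is presumably why the disclaimer stays. A sketch of the idea, with stdlib zlib standing in for the codec:

    ```python
    import zlib

    def compress_verified(data: bytes) -> bytes:
        """Compress, then immediately decompress and compare before trusting the output."""
        out = zlib.compress(data)
        if zlib.decompress(out) != data:
            raise RuntimeError("round-trip verification failed; refusing to emit output")
        return out

    blob = compress_verified(b"some precious data" * 100)
    assert zlib.decompress(blob) == b"some precious data" * 100
    ```

    Even with this check, a decompressor bug triggered later (different version, different platform) can still lose data, which is what the disclaimer is hedging against.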

  • by 72deluxe on 5/11/2022, 9:48:59 AM

    I use pbzip2 with gusto because the original bzip2 is single-threaded. I heartily recommend it to all I meet, even those in the street!

  • by rurban on 5/10/2022, 6:06:27 PM

    It could easily be improved by using the hardware CRC32 instruction; right now it's just a software CRC32.
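    One caveat (my understanding, worth double-checking against bzip3's source): the x86 SSE4.2 `crc32` instruction computes CRC-32C (Castagnoli polynomial), not the CRC-32 used by the gzip/bzip2 family, so swapping it in silently changes the checksum unless the on-disk format changes too (ARMv8 has instructions for both variants). The two disagree even on the standard "123456789" test vector:

    ```python
    import zlib

    def crc32c(data: bytes, crc: int = 0) -> int:
        """Bitwise CRC-32C (reflected polynomial 0x82F63B78), the variant the x86 crc32 instruction computes."""
        crc ^= 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    check = b"123456789"
    print(hex(zlib.crc32(check)))  # 0xcbf43926 -- standard CRC-32 check value
    print(hex(crc32c(check)))      # 0xe3069283 -- CRC-32C check value
    ```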

  • by themusicgod1 on 5/10/2022, 8:16:22 PM

    > Github link

    ...so long as this lives in NSA/Microsoft Github, it's not a 'spiritual successor' to anything.