Minimalist Guide to Lossless Compression (2019) (marksblogg.com)
99 points by marklit on Dec 13, 2021 | 22 comments


>"Also, consider that if you can only read at 100 MB/s off a mechanical drive but your CPU can decompress data at ~500 MB/s then the mechanical drive is able to provide 5x the throughput you'd otherwise expect thanks to compression."

I'd not really thought of that aspect before... My old brain is hard-coded to save CPU cycles... Time to change my ways :)
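For concreteness, a back-of-the-envelope sketch of that effect (hypothetical figures straight from the quote; the effective rate is capped by whichever of the drive or the decompressor runs out first):

    # Rough model of reading compressed data off a slow mechanical drive.
    # All numbers are the hypothetical ones from the quote above.
    disk_read_mb_s = 100        # raw sequential read speed of the drive
    decompress_mb_s = 500       # rate at which the CPU can emit decompressed data
    compression_ratio = 5.0     # uncompressed size / compressed size

    # Reading compressed bytes at 100 MB/s yields ratio * 100 MB/s of logical data,
    # but never more than the CPU can decompress.
    effective_mb_s = min(disk_read_mb_s * compression_ratio, decompress_mb_s)
    print(effective_mb_s)       # 500.0 -> 5x the raw drive throughput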


This is an older chart for ZSTD, but it is an inspiring visualisation: compression algorithm "times" transmission bandwidth => effective transmission speed.

http://fastcompression.blogspot.com/2015/01/zstd-stronger-co...

Taken from the fastcompression blog - where one could follow ZSTD's author since before ZSTD was even conceived.

"Conveniently" enough the author of the blog has written both ZSTD and LZ4, which top the chart for their respective link speed domains. (2015 data - things have improved in both ZSTD and others since then.)


This was already a common technique back in the 90's for loading asset data in games, and it mattered even more then than it does today with fast SSDs. Instead of many small files, the asset data is loaded from a single big compressed archive file. This works around terrible small-file performance (mainly in Windows file systems), and it speeds up loading because reading the compressed data and decompressing it is much faster than reading the uncompressed data directly from the slow disc.
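A minimal sketch of the pattern using only the Python standard library (zipfile with DEFLATE stands in for whatever custom archive format and codec a game engine would actually use):

    import zipfile
    from pathlib import Path

    # Pack many small asset files into one compressed archive up front...
    def pack_assets(asset_dir: str, archive_path: str) -> None:
        with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for path in Path(asset_dir).rglob("*"):
                if path.is_file():
                    zf.write(path, arcname=str(path.relative_to(asset_dir)))

    # ...so loading becomes one archive open plus sequential reads, instead of
    # thousands of per-file opens, and less data crosses the slow disk.
    def load_assets(archive_path: str) -> dict[str, bytes]:
        with zipfile.ZipFile(archive_path) as zf:
            return {name: zf.read(name) for name in zf.namelist()}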


A variation on this is also still extremely common where the ‘file system’ is a web server and the program accessing it is a web browser (CSS sprites: https://css-tricks.com/css-sprites/).

In that case, there typically isn’t additional explicit compression (1). The main gain is in decreasing the number of HTTP requests.

(1) The image itself may have inherent compression, which may improve when images with similar content are combined, and the web server may be configured to apply compression on top; but the first typically isn’t a big win, and the second is independent of this strategy.


Everything old is new again, so maybe you're just in time. Considering we now have NVMe drives that can read and write at 4-5 GB/s [1], that whole equation changes again...

[1]: https://www.anandtech.com/bench/SSD21/3017


zstd at compression level 1 can do ~2GB/s per core, and as time goes on and processors get more and more cores, compressing data by default is a valid proposition.
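A rough way to sanity-check that figure on your own hardware - a sketch assuming the third-party zstandard Python bindings, and the number will swing a lot with how compressible the data is:

    import time
    import zstandard  # third-party bindings: pip install zstandard

    def compress_throughput(data: bytes, level: int = 1) -> float:
        """Single-threaded zstd compression throughput in MB/s at the given level."""
        compressor = zstandard.ZstdCompressor(level=level)
        start = time.perf_counter()
        compressor.compress(data)
        return len(data) / (time.perf_counter() - start) / 1e6

    # ~280 MB of repetitive text as a (very compressible) sample payload.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 6_400_000
    print(f"{compress_throughput(sample, level=1):.0f} MB/s at zstd level 1")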

In fact, if you install Fedora 35 on btrfs, zstd:1 is enabled by default, using fs-level heuristics to decide when and when not to compress, reducing write amplification on SSD drives and gaining some space for free with negligible performance impact, which is nice.

My 8GB ~/src directory on encrypted btrfs on NVMe uses 6GB on disk and I can easily saturate the link while reading from it. Computers are plenty fast.


> zstd at compression level 1 can do ~2GB/s per core

So unless you can multithread that workload, it's already behind by a factor of 2.

> Computers are plenty fast.

My point was you can no longer assume the disk is significantly slower, at least for streaming workloads. You can often still win by spending CPU cycles doing clever stuff, but it's not several orders of magnitude difference like it used to be.


Most zstd implementations are multithreaded for single compression and decompression tasks. My home server has 24 cores and is using 16x PCIe 3.0 for storage (16 GB/s). Through benchmarking I found that I am storage bound from zstd-3+ and compute bound from zstd-2 and zstd-1. I run zstd-3.


Dumb question, but shouldn't it be the opposite? The lower the compression level, the faster the CPU can process the data, making storage the bottleneck. Higher compression level -> CPU becomes the bottleneck.


I believe in the TrueNAS interface I use these are negative compression levels. I could be wrong. It was just some empirical observations. I may flesh out my matrix more to paint a better picture.


There still exist network connections at 80 Kb/s, and there are still data repositories sitting behind them.


This idea is very powerful. Creating small compressed batches of transactions also lets you account for disk writes in terms of "transactions per block IO". When dealing with storage media that wears out at block granularity, this can save substantial device lifetime.
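A hedged sketch of that accounting, with zlib from the standard library standing in for the real codec and 4 KiB as an assumed block granularity:

    import json
    import zlib

    BLOCK_SIZE = 4096  # assumed block/page granularity of the device

    def transactions_per_block_io(transactions: list[dict]) -> float:
        """How many transactions land in each block IO once a batch is compressed."""
        raw = b"".join(json.dumps(tx).encode() for tx in transactions)
        compressed = zlib.compress(raw, level=6)
        blocks_written = -(-len(compressed) // BLOCK_SIZE)  # ceiling division
        return len(transactions) / blocks_written

    txs = [{"id": i, "account": i % 50, "amount": i * 3} for i in range(1000)]
    print(f"{transactions_per_block_io(txs):.0f} transactions per block IO")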


Lossless compression is apparently equivalent to general intelligence:

http://mattmahoney.net/dc/rationale.html


Yes, but the compression process can be split into two components:

1. modelling (you convert data to symbols and predict probabilities of seeing each symbol)

2. entropy coding (convert symbols + probabilities to minimum number of bits)

The second stage is relatively simple and basically solved. We went from Huffman's approximation to Arithmetic Coding, which is theoretically ideal, and now have ANS, which is close to ideal and pretty fast too.

The modelling part is indeed a completely open problem, and at this point as much art as science.
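To make the split concrete, here is a minimal sketch: the most trivial possible stage-1 model (order-0 byte frequencies) and the Shannon-entropy bit count that an ideal stage-2 coder approaches for that model. Better modelling changes the probabilities, not the coder.

    import math
    from collections import Counter

    def order0_entropy_bits(data: bytes) -> float:
        """Bits an ideal entropy coder needs, given an order-0 byte model."""
        counts = Counter(data)
        total = len(data)
        # Each occurrence of a symbol with probability p ideally costs -log2(p) bits.
        return sum(-n * math.log2(n / total) for n in counts.values())

    msg = b"abracadabra"
    print(order0_entropy_bits(msg))  # ~23.4 bits vs. 88 bits uncoded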


Most compression codecs these days have stopped using ANS because the benefits aren't worth the tradeoffs. Turns out a good arith coder is much faster.


"Turns out a good arith coder is much faster". Citation needed.


FLAC isn't turning into Skynet that soon.

Besides, humans are lossy. We forget, make mistakes, think via heuristics.


Doesn't LZFSE stem from the Finite State Entropy library that Yann Collet (LZ4, ZSTD) wrote as a base for ZSTD (together with HUFF0), with Apple just deciding to use it before it was fully mature? So shouldn't LZFSE be a predecessor of ZSTD in the tree?


Could anyone explain why LZ77 is preferred by implementors versus LZ78?

It seems important for compressibility to prepare the data for maximum self-similarity, in addition to the LZ algorithms (as evidenced by the sort in this article). Could someone point towards a good modern summary of the approaches or heuristics?
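For context, the kind of effect I mean - a toy sketch with zlib as a stand-in codec (whether reordering is acceptable depends on whether row order carries meaning):

    import random
    import zlib

    # Synthetic "rows": many near-duplicate records, shuffled so that similar
    # lines end up far apart in the byte stream.
    random.seed(0)
    rows = [f"2021-12-{day:02d},host-{host:03d},status=ok"
            for day in range(1, 29) for host in range(500)]
    random.shuffle(rows)

    unsorted_blob = "\n".join(rows).encode()
    sorted_blob = "\n".join(sorted(rows)).encode()

    # Sorting clusters near-identical lines, so LZ back-references tend to be
    # longer and closer; the sorted blob usually compresses noticeably smaller.
    print(len(zlib.compress(unsorted_blob)), len(zlib.compress(sorted_blob)))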


I was going to answer "patents" because that's what I had read in the past (that LZ '78 was encumbered by patents while LZ '77 was not), but this wikipedia talk page: https://en.wikipedia.org/wiki/Talk:LZ77_and_LZ78#preference_... suggests that it's not just patent encumbrances.


LZW just doesn't compress that well. In LZW you can only use a dictionary built gradually from length-limited scraps of the previous data, but in LZ77 you can directly reference any of the previous data, of any length (within implementation limits).
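A toy illustration of the difference - a minimal greedy LZ77-style matcher (not any particular codec's format; real implementations use hash chains rather than this O(n^2) scan), which can reference any stretch of already-seen data instead of only a dictionary of previously registered phrases:

    def lz77_tokens(data: bytes, window: int = 32 * 1024, min_match: int = 4):
        """Greedy toy LZ77: emit (offset, length) back-references or single literals."""
        i, out = 0, []
        while i < len(data):
            best_len, best_off = 0, 0
            # Scan the sliding window for the longest match starting before i.
            for j in range(max(0, i - window), i):
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]
                       and length < 255):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            if best_len >= min_match:
                out.append(("match", best_off, best_len))  # reference into *any* prior data
                i += best_len
            else:
                out.append(("literal", data[i]))
                i += 1
        return out

    print(lz77_tokens(b"abcabcabcabcXabcabc"))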


>Entropy, an Information Theory term coined by Claude Shannon in 1948, describes the minimum number of bits, on average, needed to encode a dataset.

Shannon didn't coin the term entropy. He borrowed it from the analogous definition in thermodynamics.



