This looks really awesome. According to Google Scholar it went through peer review. Improving compression (linear-time LZ77) alone would be cool, but it not only does that, it also computes search-index data structures. https://scholar.google.com/scholar?hl=en&q=Space-Efficient+C...
LZ77 was already linear time using hash tables for the search (e.g. lzop [1], LZ4 [2], LZF [3], and other similar LZ77 compressors using the same technique).
The fast LZ77 compressors I pointed to before run in linear time (O(n)) and constant space (fixed-size hash tables), i.e. O(1) space, which is even better than linear O(n) space.
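The fixed-size-hash-table idea those compressors share can be sketched in a few lines. This is a toy, not LZ4's or LZO's actual format; the table size, minimum match length, and hash constant are made up for illustration:

```python
# Toy LZ77 parser in the spirit of LZ4/LZO/LZF: a single forward scan with a
# fixed-size hash table (constant space), so the whole parse runs in O(n) time.

HASH_BITS = 12                 # 4096-slot table: O(1) space regardless of input size
MIN_MATCH = 4                  # shortest backreference worth emitting

def _hash(data: bytes, i: int) -> int:
    # Hash the 4 bytes starting at i into a table slot.
    word = int.from_bytes(data[i:i + 4], "little")
    return (word * 2654435761) >> (32 - HASH_BITS) & ((1 << HASH_BITS) - 1)

def lz77_parse(data: bytes):
    """Return a list of (literal_bytes, match_offset, match_len) phrases."""
    table = [-1] * (1 << HASH_BITS)   # fixed size: this is the O(1) space
    phrases, i, lit_start = [], 0, 0
    while i + MIN_MATCH <= len(data):
        h = _hash(data, i)
        cand = table[h]               # most recent position with the same hash
        table[h] = i
        if cand >= 0 and data[cand:cand + MIN_MATCH] == data[i:i + MIN_MATCH]:
            length = MIN_MATCH        # extend the match as far as it goes
            while i + length < len(data) and data[cand + length] == data[i + length]:
                length += 1
            phrases.append((data[lit_start:i], i - cand, length))
            i += length
            lit_start = i
        else:
            i += 1
    phrases.append((data[lit_start:], 0, 0))  # trailing literals
    return phrases

def lz77_unparse(phrases) -> bytes:
    out = bytearray()
    for lit, off, length in phrases:
        out += lit
        for _ in range(length):       # byte-by-byte copy handles overlapping matches
            out.append(out[-off])
    return bytes(out)
```

Note the trade-off mentioned above: because the table is fixed size, older positions get overwritten and some matches are simply missed; that is what buys the O(1) space and single pass per byte.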
To get an optimal LZ parsing, you need a lot of space to store backreferences, and when you increase the window size, the space grows a lot. I don't know what zstd does, but I would love to see it improved by algorithmic magic.
Hi, a note from one of the coauthors of the paper ;-)
Yes, it is a theoretical paper, and so the presentation focuses on what such a paper needs: establishing that the big-Oh complexity can be reached. Some theoretical papers will never become practical, but we believe this one can. However, this will require a fair amount of algorithm engineering: taking the important theoretical ideas and implementing them with common sense, replacing theoretically appealing but practically useless modules with others that work better, even if they do not reach the desired worst-case complexity. A promising aspect of this algorithm is that it is easily parallelizable (the batched queries). A postdoc of mine with experience in multithreading and compact data structures is already working on this. I won't dare to say how competitive the resulting LZ-parsing algorithm will be, though.
Another well-known case is the first paper on the FM-index, with constants like sigma^sigma multiplying the space, and its current realizations, e.g. Burrows-Wheeler Aligner.
In this case, I think it will not be that hard, but still it will not be a matter of just translating the algorithm into code.
Can somebody who read the paper in depth comment on whether the result is practical or not? Since it's a TCS paper, it's hard to gauge whether the constant factors are palatable or not, and there are many theoretical CS papers that give asymptotically optimal or significantly improved algorithms that have astronomical constant factors (e.g. matrix multiplication, etc.)
The compressed suffix tree (CST) supports a number of search algorithms that run in time proportional to the query length, irrespective of the size of the corpus being searched. Because the suffix tree is compressed, it often uses space not much larger than the input sequence, and this improves further for inputs with internal repetitions, which seem to be the norm in naturally arising data (i.e. almost anywhere that Zipf's law holds).
As the authors point out, existing algorithms for the construction of the compressed suffix tree require space that is proportional to the alphabet size. When we work with limited alphabets, like DNA, this factor is small and we tend to not worry about it. Many real-world data sets do have larger alphabet sizes, and so there is a need to efficiently generate the CST for them. Removing the alphabet-size bound on the space required for construction is a big deal. Existing methods require huge amounts of space to construct the CST or CSA (compressed suffix array) for large alphabets, and are barely practical to use.
In practice, results of this kind have preceded usable implementations by a year or more. Libraries like sdsl-lite have accelerated the rate at which implementations get into the hands of compressed data structure researchers, much as R did for statistical models, so hopefully we will see the benefits of this result sooner.
Unfortunately, it will take a deeper reading of the paper by someone more intimately involved in this research to describe exactly why this result may or may not be practical. A quick read does not suggest any obvious problems; the authors have a good track record and are very positive about many follow-on results (see conclusions).
SDSL-lite is a fantastic library, Simon (the main author) is one of my colleagues and spends a lot of time on making it easy to use: https://github.com/simongog/sdsl-lite
Without having read the paper in detail, SODA (the conference where it was presented) papers do have a reputation for being very theoretical, though, so it may not be implemented any time soon.
I'm assuming you meant GPL? I've got good news for you then, the next version is going to be under a BSD license: https://github.com/xxsds/sdsl-lite (currently unstable!)
As long as you can express information as a list of symbols, you need to search for patterns in it and to store it in compressed form, particularly for biological molecules, which are incredibly long (your DNA is nanometers wide and meters long!).
I used to play with suffix arrays a long time ago. I wanted to accelerate grep on a gigabyte text file. The tool was called "sary" (short for suffix array) and still exists on a forgotten SourceForge page. Good tool; it was able to find any substring in a huge file instantly.
They solve different problems. In particular, ripgrep and friends are designed to search arbitrary and potentially large directories with no pre-computation. They run in time ~linear in the size of the files to be searched.
The paper under discussion here is about a new way to create an index that also takes time ~linear in the size of the files to be searched, although presumably with a higher constant factor than just searching those files. After you have the index it's possible to search it in time linear in the length of the query rather than the files. This is much faster, but requires storing an index that is at least as large as the original file set, and keeping it up to date as things are changed etc.
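The two regimes can be illustrated with a toy suffix-array index: building it touches the whole text once, and each subsequent query is just two binary searches whose cost depends on the pattern, not the corpus. The naive construction below is far from linear time (sorting suffixes is O(n^2 log n) in the worst case, and the slicing copies suffixes); doing the construction in linear time and small space is exactly what papers like this one address:

```python
from bisect import bisect_left, bisect_right

class _SuffixView:
    """Present the suffix array as a sorted sequence of suffix strings."""
    def __init__(self, text, sa):
        self.text, self.sa = text, sa
    def __len__(self):
        return len(self.sa)
    def __getitem__(self, i):
        return self.text[self.sa[i]:]   # a real index avoids this copy

def build_suffix_array(text: str):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text: str, sa, pattern: str):
    """Sorted positions of every occurrence of pattern in text."""
    suffixes = _SuffixView(text, sa)
    lo = bisect_left(suffixes, pattern)
    # Assuming the text contains no U+FFFF, every suffix prefixed by the
    # pattern sorts strictly below pattern + "\uffff".
    hi = bisect_right(suffixes, pattern + "\uffff")
    return sorted(sa[lo:hi])
```

After the one-time build, `find_all` never scans the text linearly; it only compares the pattern against O(log n) suffixes.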
Can you retrieve the original file from the index? If so, it might be interesting for some databases to store files like that by default and retrieve the original file only when needed.
I haven't read this paper, and haven't worked with the compressed variety of suffix trees/arrays. That said, I'm confident it's possible to retrieve the original file from a normal suffix tree, although it would be pretty slow. I imagine it must be possible to retrieve the file from the compressed version as well.
If nothing else, I think it is possible to retrieve a complete file from any index that lets you search for substrings and get a full string and position in the file back. Just search for all length 1 substrings, get a map from position -> string, then reconstruct the file from that.
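That reconstruction trick is easy to sketch, assuming the index exposes a search function returning match positions (the `search` callable here is a stand-in for whatever the real index API looks like, not any actual library):

```python
# Recover a whole file from nothing but a substring-search function: query
# every possible single byte, invert the hits into a position -> byte map,
# then read the map back in position order.

def reconstruct(search, alphabet=range(256)):
    """search(b) -> list of positions where byte value b occurs."""
    pos_to_byte = {}
    for b in alphabet:
        for pos in search(b):
            pos_to_byte[pos] = b
    return bytes(pos_to_byte[i] for i in range(len(pos_to_byte)))

# A trivial linear-scan "index" standing in for the real data structure:
def make_search(data: bytes):
    return lambda b: [i for i, x in enumerate(data) if x == b]
```

As noted above, this is 256 queries each returning up to n positions, so it works but is a poor way to actually round-trip files.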
I doubt it's worth storing files this way, because turning the index back into the file sounds very slow. I'd rather just store the file and the index, the time v. space tradeoff seems like a good one if you really care about search performance. That said, I use The Silver Searcher, and it's fast enough with no index that I don't think any of this stuff is worth the effort for searching text files on a file system.
most speed gain tricks in ripgrep have nothing to do with string algorithms. you can save a lot of time by parallelizing your filesystem walks or by using some system call tricks for reading files as quickly as possible.
> most speed gain tricks in ripgrep have nothing to do with string algorithms
There are actually several tricks at the "string algorithm" level. I didn't invent any of them, but no other search tool combines all of them AFAIK. They are:
* Heuristic for choosing the skip character in Boyer-Moore. (Pick rare bytes.)
* Roll UTF-8 decoding into the regex automaton.
* Use SIMD for multiple string literal search. (This is an unpublished algorithm from Intel's Hyperscan project.)
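The first of those, picking a rare byte to scan for, can be sketched roughly like this. The frequency table below is invented for illustration, and the real heuristics in ripgrep differ in detail:

```python
# Instead of scanning for the first byte of the needle, scan for its rarest
# byte (per some background frequency table) and verify the whole needle
# around each hit. Rare bytes mean fewer false candidates to verify.

# Pretend background frequencies: letters and space are common, the rest rare.
FREQ = [1 if bytes([b]).isalpha() or b == 0x20 else 0 for b in range(256)]

def find_rare(haystack: bytes, needle: bytes) -> int:
    if not needle:
        return 0
    # Offset of the rarest byte within the needle.
    rare_off = min(range(len(needle)), key=lambda i: FREQ[needle[i]])
    rare = needle[rare_off]
    i = haystack.find(rare, rare_off)   # bytes.find is the memchr-like fast path
    while i != -1:
        start = i - rare_off
        if haystack[start:start + len(needle)] == needle:
            return start
        i = haystack.find(rare, i + 1)
    return -1
```

The payoff is that the inner loop spends almost all its time in the vectorized `find` rather than in byte-by-byte verification.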
There was also significant work required for scaling glob matching. Some popular repositories have gitignore files with over 3000 rules.
These are important because ripgrep isn't just for searching code repositories. It should be able to do most things that grep can do, but faster. (The smarter UTF-8 handling is particularly beneficial when searching non-ASCII text.)
Within the context of this thread though, and as someone who has written a linear time suffix array algorithm, the suffix array isn't particularly well suited to the problem of code search. Its constant factors are just too high. Generating an index would take a long time. (Although I haven't read the OP yet.)
> some system call tricks for reading files as quickly as possible.
ripgrep does two different types of searches. One uses memory maps. The other uses standard `read` calls. The latter is actually faster on Linux when searching code repositories. So no real tricks there. :-)
Is your suffix array implementation open source/available online anywhere? I'm always curious to see what other people's version of these sorts of things look like.
(I wrote an implementation of Ukkonen's suffix tree algorithm in C as maybe my second C program, it was pretty frustrating for a long time)
> (I wrote an implementation of Ukkonen's suffix tree algorithm in C as maybe my second C program, it was pretty frustrating for a long time)
Oh dear. You are brave. I briefly considered implementing Ukkonen's algorithm, but ran away scared. :-)
The reason why I wrote the SAIS implementation was because I am generally interested in text search, and because I was interested in exploring the idea of building an index around suffix arrays. Once I finished my implementation and realized that the cutting edge SACA algorithm (at the time) was as slow as it was, I mostly gave up that idea.
Since then, I've been itching to find use cases for it. As a former computational biologist, I'm familiar with how they are used in that field, but I've been looking for other places as well. Haven't found one yet, although I can think of some specialized (hypothetical) things where it might be useful.
It will take me a while to read the paper. I saw there is no code inside. Do they have code that improves LZ parsing as used in all general purpose compressors? If they don't have code, can their algorithm be implemented to improve performance of zstd or lzma?
Maybe it could be used as an output-code post-processor (e.g. instead of generic entropy-coding post-processing, compress the indexes with a new, specific algorithm). If this allowed good compression without an additional post-processing step (e.g. avoiding Huffman, arithmetic coding, etc.), it could be very interesting.
When theoreticians talk about LZ, they mean unlimited LZ, not LZ within a limited window. If you are only interested in data compression, the existing algorithms with limited window size should be much faster, use less memory, and achieve similar compression, except in some rare cases.
The primary motivation for unlimited LZ is text indexing. You can build text indexes based on LZ parsing and achieve ridiculous compression ratios with repetitive data (e.g. versioned documents). The query performance depends on the length of the dependency chains in the parsing, so you want to use as long windows as possible.
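A brute-force parse makes the versioned-documents point concrete: appending more near-identical copies of a document barely grows the parse at all. This O(n^2) toy has nothing to do with the paper's construction; it just counts greedy LZ77 phrases over an unlimited window:

```python
# Count the phrases of a greedy, unlimited-window LZ77 factorization.
# Each step takes the longest prefix of the remaining text that already
# occurs starting at an earlier position (overlap allowed), or one fresh
# character if nothing matches.

def lz_factor_count(s: str) -> int:
    i, count = 0, 0
    while i < len(s):
        longest = 0
        for j in range(i):                      # any earlier start position
            l = 0
            while i + l < len(s) and s[j + l] == s[i + l]:
                l += 1
            longest = max(longest, l)
        i += max(longest, 1)                    # fresh character costs one phrase
        count += 1
    return count

doc = "All work and no play makes Jack a dull boy.\n"
one = lz_factor_count(doc)
ten = lz_factor_count(doc * 10)   # ten identical "versions" of the document
```

With an unlimited window, the second through tenth copies collapse into a single phrase pointing back at the first copy, which is exactly why a limited window would ruin this on large version histories.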
I skimmed the abstract and the paper and didn't see any, but that doesn't surprise me either. This is a theoretical computer science paper. It doesn't need code to show correctness, as long as they prove their claims. I'm curious why you're seeking such code?
edit: why the downvotes? Theoretical CS academics basically do math and very few write any actual code to supplement their papers.
A combined compression and indexing scheme that actually has code is Succinct by Berkeley [0]. I remember being impressed by the tradeoffs you can dial in and its performance in a 2015 presentation [1].
Thank you. Yes, I understand that theoretical computer science is math and the result is a proof. I'm more used to papers where the interesting part is a benchmark, with a "proof of correctness" part I can safely skip. And in order to produce the graphs of the benchmark results, they have to implement it.
You can describe algorithms using normal English perfectly fine. You can then use normal English to prove the correctness of your claims. This is what almost all papers about algorithms from the theoretical CS community do.
It is very likely that this will eventually end up in sdsl-lite and/or other compressed data structures libraries https://github.com/simongog/sdsl-lite
If you can, you should upgrade from DEFLATE (the compression format in zlib, gzip, zip files, png, office docs, etc) to something more modern like zstd.
If you must preserve the ability to decompress with zlib, there are different implementations of a DEFLATE compressor that are better than zlib (either faster or with a better compression ratio).
A DEFLATE compressor needs a data structure that can answer the query: "Where in the last 32,768 bytes of input are there occurrences of the exact 3 bytes I'm looking at? If possible, check whether the 4th byte matches too, and find the longest match. Among equally long matches, prefer the closer ones." For a 32 KB window, a data structure that always finds the best match is not large, unless you're on a microcontroller. Its size only becomes a problem when the window is large.
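That query maps naturally onto hash chains, which is roughly the approach zlib takes internally. A sketch, with constants and structure that are illustrative rather than zlib's actual implementation:

```python
from collections import defaultdict

# Hash the 3 bytes at the current position, then walk a chain of earlier
# positions sharing that 3-byte prefix, nearest first, keeping the longest
# match. Ties go to the closer occurrence automatically, because the chain
# is walked nearest-first and only strictly longer matches replace the best.

WINDOW = 32 * 1024

class ChainMatcher:
    def __init__(self):
        self.chains = defaultdict(list)   # 3-byte prefix -> positions, oldest first

    def insert(self, data: bytes, i: int):
        self.chains[data[i:i + 3]].append(i)

    def best_match(self, data: bytes, i: int):
        """Return (offset, length) of the best match for data[i:], or None."""
        best = None
        for j in reversed(self.chains[data[i:i + 3]]):   # nearest candidates first
            if i - j > WINDOW:
                break                     # the rest of the chain is even older
            l = 3                         # the 3-byte prefix already matches
            while i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if best is None or l > best[1]:
                best = (i - j, l)
        return best
```

Real implementations bound how far they walk each chain (trading match quality for speed), which is also how the O(n) overall run time is kept.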
Regarding run time, zlib/gzip already use an effectively O(n) algorithm (a rolling hash over 3-byte prefixes feeding hash chains, with the chain length bounded), despite being "slow" in comparison with other "pure-hashing" LZ77 implementations (e.g. LZO, LZ4, LZF).
I don't have access to that paper, but "compressing indexes" is likely to be slow no matter how it is implemented: the fastest hashing-based LZ77 compressors rely on doing very few operations per input byte. So whether or not the hash-table space can be reduced, the speed is unlikely to improve (and you can already trade hash-table size against compression ratio, even without this new technique).