So the summary is "JPEG encoder written in assembly with NEON instructions saves images faster than Apple's encoder."
That's a cool feat, and a little damning for Accelerate.framework, although the way TechCrunch wrote it up I expected a new kind of fast cosine transform.
Don't forget that SnappyCam pumps both CPU cores when available.
The actual DCT algorithm created for and used in the app is different from the typical AAN (Arai, Agui, Nakajima) DCT algorithm that's used in JPEG codecs (at least all the ones I've seen).
It's all about doing as little work as possible to achieve the end result. That's why so much of it is implemented in asm, with carefully chosen NEON instructions for each step.
Think of it as a cross-layer optimization between algorithm and implementation... done by hand. :-)
Really interested in the nuts and bolts - are you optimizing specifically for one quality setting (in which case I'm guessing you could probably do the quantization as part of the dct and throw away some calculations)?
I played with a realtime JPEG compression implementation back in college on transputers (yes, I'm that old). Fun stuff; nice to see there are still places where going right down to the metal can make a real impact on a product...
While SnappyCam has been the most difficult, complex piece of software I've written since I started coding in my early teens, it's also been one of the most satisfying technically.
I'd love to disclose the many, many optimizations baked in, but as this is a commercial app I must keep much of it as a trade secret.
I will say, though, that a lot of precomputation was involved, both for the encoder and the decoder. I jumped at every chance to avoid computation, memory reads, etc., as much as possible. :-)