
How long does it take you to load 2 TB into SQLite? How long do queries take? I believe you, but I'm in disbelief that it could be close to as efficient as throwing it into RAM. I mean, an EMR cluster takes like 5 minutes to spin up.

Where do I learn how to do this? I've tried loading a TiB (one table one index) into SQLite on disk before, and it took forever. Granted this was a couple years ago, but I must be doing something fundamentally wrong.

I want to try this out. I've got 6TiB here of uncompressed CSV, 32 GiB ram. Is this something I could start tonight and complete a few queries before bed?

Actually, out of curiosity, I looked it up on the SQLite site. If I'm reading the docs correctly, with atomic sync turned off, I should expect about 50,000 inserts per second. So, with my data set of 50B rows, I should expect to have it all loaded in ... just under 12 days. What am I missing?
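For what it's worth, the back-of-envelope math at that quoted rate:

```python
rows = 50_000_000_000        # 50B rows
rate = 50_000                # inserts/sec, per the docs' figure
days = rows / rate / 86_400  # 86,400 seconds per day
print(f"{days:.1f} days")    # prints "11.6 days"
```
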



There was a little bit of unfair comparison. I didn't have to load the data over a busy network connection, and I didn't have to decrypt the data once it was loaded (I had the benefit of a firewall and the raw data on a USB3 hard drive). I think there was a conversion from CSV to Parquet on the cluster as well. And the engineer who set up the cluster was multitasking, so I'm sure there were some latency issues just from that. But my analysis still only took a few hours (5? Maybe 6?).

There are a handful of things that make a difference. First of all, don't use INSERT statements, use the .import command. That alone is enough to saturate all the available write bandwidth on a 7200rpm drive. It isn't transactional, so you don't have to worry about that; it's really more like a shell command that marshals data directly into the table's on-disk representation. You can also disable journaling and increase the page size for a tiny boost.
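A rough sketch of those tips using Python's stdlib sqlite3 instead of the CLI (with the CLI you'd set the pragmas, then `.mode csv` / `.import data.csv t`; the table name, column layout, and generated rows here are all made up for illustration):

```python
import os
import sqlite3
import tempfile

# fresh database file in a temp directory (hypothetical path)
path = os.path.join(tempfile.mkdtemp(), "bulk.db")
con = sqlite3.connect(path)

con.execute("PRAGMA page_size = 65536")   # must be set before the first write
con.execute("PRAGMA journal_mode = OFF")  # disable journaling
con.execute("PRAGMA synchronous = OFF")   # don't fsync on every commit
con.execute("CREATE TABLE t (id INTEGER, val TEXT)")

# stand-in for rows streamed out of a CSV reader
rows = ((i, f"row{i}") for i in range(100_000))

with con:  # one big transaction instead of one commit per INSERT
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)

print(con.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # prints 100000
```

The single enclosing transaction is doing most of the work here; per-row commits are what make naive bulk inserts crawl.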

Once imported into SQLite, you get the benefit of a binary representation, which (for my use case) really cut down on the dataset size for the read queries. I only had a single join, against a dimension table that fit in memory, so indexes were small and took insignificant time to build. A single table scan with some aggregation, and that was it.
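The shape of that read query, with a made-up fact/dimension schema (the table and column names are illustrative, not from the original workload):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact (dim_id INTEGER, value REAL);
    CREATE TABLE dim  (dim_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO dim  VALUES (1, 'a'), (2, 'b');
    INSERT INTO fact VALUES (1, 10.0), (1, 5.0), (2, 2.5);
""")

# one scan of the fact table, joined to the small dimension table,
# with the aggregation done in the same pass
query = """
    SELECT d.name, COUNT(*), SUM(f.value)
    FROM fact AS f JOIN dim AS d USING (dim_id)
    GROUP BY d.name ORDER BY d.name
"""
for name, n, total in con.execute(query):
    print(name, n, total)
```
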


Good tips. I'll give it a try!



