Hacker News

We use Apache Arrow at my company and it's fantastic. The performance is so good. We have terabytes of time-series financial data and use Arrow to store and process it.


We use Apache Arrow at my company too. It is part of a migration from an old in-house format. When it works it’s good. But there are just way too many bugs in Arrow. For example: a basic Arrow computation on strings segfaults because the result does not fit in Arrow’s string type, only the large string type. Instead of casting it or asking the user to cast it, it just segfaults. Another example: a different basic operation raises an exception complaining about negative buffer sizes when using the variable-length binary type.
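For context on the first example: Arrow's string type stores signed 32-bit offsets into one contiguous data buffer, so a single array can address at most 2^31 - 1 bytes of string data, while the large string type uses 64-bit offsets. Below is a minimal pure-Python sketch of the kind of bounds check one would hope for instead of a crash. `build_offsets` and `offset_max` are hypothetical names for illustration, not part of any Arrow API.

```python
INT32_MAX = 2**31 - 1  # largest offset the 32-bit string layout can hold


def build_offsets(values, offset_max=INT32_MAX):
    """Return Arrow-style offsets (len(values) + 1 entries) for UTF-8 strings.

    Hypothetical helper: raises instead of overflowing when the running
    byte length no longer fits the offset type.
    """
    offsets = [0]
    for v in values:
        end = offsets[-1] + len(v.encode("utf-8"))
        if end > offset_max:
            raise OverflowError(
                f"offset {end} exceeds {offset_max}; "
                "64-bit offsets (large string) are needed"
            )
        offsets.append(end)
    return offsets
```

Passing a small `offset_max` makes the failure easy to trigger without allocating gigabytes of data.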


This will obviously depend on which implementation you use. Using the Rust arrow-rs crate you at least get panics when you overflow max buffer sizes. But one of my enduring annoyances with Arrow is that they use signed integer types for buffer offsets and the like. I understand why it has to be that way, since it's intended to be cross-language and not all languages have unsigned integer types. But it does lead to lots of very weird bugs when you are working in a native language and casting back and forth between signed and unsigned types. I spent a very frustrating day tracking down this one in particular: https://github.com/apache/datafusion/issues/15967
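The signed/unsigned hazard can be shown without Arrow at all. Here is a small sketch using only Python's standard `struct` module: reinterpreting a negative signed 64-bit value as unsigned yields an enormous number, which is how a small arithmetic slip on an offset can turn into an absurd buffer length. The names are illustrative and are not taken from arrow-rs or DataFusion, and this is not meant to reproduce the linked issue.

```python
import struct


def reinterpret_i64_as_u64(value):
    """Reinterpret the bits of a signed 64-bit integer as unsigned.

    Mirrors what a cast such as `value as u64` does in Rust.
    """
    return struct.unpack("<Q", struct.pack("<q", value))[0]


# An offset difference that accidentally goes negative...
bad_length = 10 - 11
# ...becomes the largest possible unsigned 64-bit value after the cast.
as_unsigned = reinterpret_i64_as_u64(bad_length)
```

Non-negative values pass through unchanged, so the bug only shows up on the rare inputs where the subtraction underflows.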


Hey, Arrow developer here. If you get a segfault with our codebase, then please report an issue on our GitHub issue tracker.

(if you have already done so and it wasn't resolved, feel free to ping me on it)


Hey, I just got back into work after the long weekend!

I thought a colleague of mine had filed an issue but I didn’t find it. I filed it myself just now: https://github.com/apache/arrow/issues/49310


Stumbled upon it recently while optimizing Parquet writes. It worked flawlessly and 10-20x'd my throughput.



