ScyllaDB Closes $16M in Series B Funding (scylladb.com)
76 points by kermatt on March 8, 2017 | 72 comments


If I was Datastax I would be scared of Scylla. They have momentum and they are one of the best engineering teams around outside of top teams at Google/FB etc.


Maybe I'm too far out of the loop but while they both provide enterprise Cassandra they both also offer a bunch of different products so in some product categories there may be plenty of room for both of them.

I gotta say though, going to each of their websites, Scylla was very straightforward, "here's what we provide", and I understood it pretty quickly. Datastax? I...still don't entirely understand what they offer. It's all loaded with enterprise buzzwords.


Yeah they're pretty buzzword-ish.

But last I heard they bought in the titandb guy to build their own enterprise graph database on top of cassandra.

So they got that going for them.


Oh I forgot about that! That's a big thing going for them. I feel like half the projects I've worked on in the recent past could benefit from a graph database.


I wouldn't.

Enterprise companies care about performance, sure. But far, far less than they care about being on a supported platform. DataStax has done the hard yards over the years to prove themselves capable of supporting Cassandra. ScyllaDB has no street cred at all.

So ScyllaDB might make some inroads in performance critical startups and web companies. But that is likely to be about it.


Aren't you being too dismissive? All database companies have to start from somewhere, Datastax was brand new once too.

This funding is part of the story to make sure Scylla can support customers but they have a much better foundation to build upon with this database tech.


Enterprise companies also care about OPEX.. i.e. if they have an internal SLA that can be met with a 100 node C* cluster or with 10 Scylla nodes, that will matter quite a bit.


Enterprise companies spend millions on database projects and the majority of the cost comes from Professional Services and Consultancies engaged in data migration etc. So the cost of the product itself is never the driving factor, especially when the cost will already be so substantially less than Teradata or Oracle.

And the most important part of a database is confidence. You need to be able to trust that when you have a Production outage IT is able to talk to someone to fix it. DataStax has proven capable of meeting that task. It's very difficult for a second tier player to do the same especially with Cassandra being such a niche product.


I used to work for a database company and have been involved in the purchasing side of database support contracts. Professional Services/Consulting is a small revenue line (and TCO component) relative to support licensing. It also has considerably lower margins.

Scylla is an upstart... but I would not call them a second tier player. DataStax is already trying and failing [0, 1] to incorporate ideas from Scylla into their product.

Yes, support and SLAs matter. And this round of funding will go a long way toward helping Scylla build out those parts of their business.

[0] - https://issues.apache.org/jira/browse/CASSANDRA-10989

[1] - https://issues.apache.org/jira/browse/CASSANDRA-8520


Doing more with the same budget is a big thing within any department. It's a way to show 'efficiency' thus garner points for promotion.

Overall company budget for a technology group is beside the point: individual departments are budget constrained, and being able to say "I saved 30% last quarter while handling 200% more operations" is something any manager would love.


Exactly, I'm installing Cassandra right now.

But I showed Scylla to my coworker. And he was like: holy cow! We need to install that!

And I was like: bag, we are at a bank. Forget it :)


Looks like this is the same team that brought us OSv (http://osv.io/). So is OSv abandoned now?


It's not abandoned, we continue to maintain and contribute to it but sadly, it's not our business focus at present. I really want to see it win center stage one day in the future.

A few orgs do use it in production. It's really cool, but the cloud needs to change along with it. For instance, we wanted a 1-minute-granularity pricing scheme from the cloud vendors so that it would resemble a container (just with hardware-based security..). This and many other things need to happen for it.


OSv and ... KVM


Congratulations to the team!

I would love to move over from Apache Cassandra to Scylla but honestly I'm a bit afraid to do that. I have no doubt that it's much faster but I haven't seen hard numbers about consistency and availability. Apache Cassandra is a much older project with many installations and is battle tested (to a degree). How can I be sure that Scylla will be as stable as Cassandra in that regard?


We use 1.6 in production on AWS i3.16xlarges and it's great. We've seen a reduction to less than half the nodes and over 100k per node requests/sec (up from 4k req/sec for Cassandra). Very few issues and those that come up get immediate attention from core developers that actually wrote the code you're having issues with.

Absolutely 100% recommend.
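
To put those per-node figures into sizing terms, here is a rough sketch. The 4k and 100k req/sec per-node rates come from the comment above; the 400k req/sec cluster target and the function name `nodes_needed` are made up for illustration. It only models throughput, which is one reason a real cluster may shrink by "less than half" rather than by the full ratio: storage per node, replication and headroom also drive node count.

```python
import math

def nodes_needed(target_ops_per_sec, per_node_ops_per_sec):
    """Smallest node count whose combined throughput meets the target.

    Throughput-only model: ignores storage, replication and headroom.
    """
    if per_node_ops_per_sec <= 0:
        raise ValueError("per-node throughput must be positive")
    return math.ceil(target_ops_per_sec / per_node_ops_per_sec)

# Hypothetical cluster-wide target of 400k req/sec.
target = 400_000
print("at 4k req/sec per node:  ", nodes_needed(target, 4_000))
print("at 100k req/sec per node:", nodes_needed(target, 100_000))
```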


We use v1.5 in production and it's been very stable. Team is responsive on issues (github and mailing list).

They don't have 100% parity yet with all cassandra features so waiting for that to move more work over to scylla, but performance per machine and lack of tuning hassle is very nice.


We don't use it for any production services, but for internal use it's been pretty darn solid. Really fast too, we benchmarked 80k/sec/node inserts on EC2.


Does anyone else find it a bit weird that a post on ScyllaDB's blog announcing their fundraising starts with "ScyllaDB announced today that it"? It seems weirdly self referential to me. They're definitely not the only ones to do this though.

Anyways, congrats on the funding guys, certainly not trying to cast shade.


This is the standard format for a press release (note the URL of the article - scylladb.com/press-release/...)

https://en.wikipedia.org/wiki/Press_release


I guess that makes some sense. It's still pretty weird to me though. The big difference with a press release is that the press is writing about an announcement that the company made, so it makes sense to describe the company in third person. If the company is writing for themselves they should just announce what they want to announce rather than announce their announcement.


A press release is written by the company, then released for the press to further distribute and write about if necessary.


The founding team is Israeli I believe, and their team is distributed across the world, so this may not have been written by a native English speaker.


It was probably written by their PR firm.


Illeism is a device often used in PR/IR to inject a dose of impartiality into otherwise biased content.


I assume that it is a public release meant to be easily copied as is by anyone (preferably as many as possible) without the hassle of rewriting it.


I have noticed this trend as well. If I had to guess it's an SEO thing.


You know if Postgres can just get some better distributed/elastic support it could do some damage.

I switched from Cassandra to citus + pipelinedb (and I'm a JVM guy). Postgres is such an awesome platform. I'm planning on looking into some logical decoding.


Postgres is a completely different type of database, there's not much comparison with cassandra/scylla at all.

It also has a way to go with just scale-up performance before it even gets to scale-out.


I wanted a db for realtime analytics.

I agree on scale-out, but on scale-up performance I will say Cassandra is not even close to the cost-to-performance ratio of Postgres w/ extensions for real time analytics.

I have a legitimate 6 months of experience that I wasted on Druid/Cassandra and could not match Postgres in terms of performance.

I don't want to hear CAP this and that when I can do all sorts of stream processing a priori. Besides, real SQL is easier to understand and hack with than many proprietary query languages.

I only tell people so they don't waste time like I did.


I'm one of the founders at MemSQL, which is designed for real-time analytics. Take it for a spin and see if it works for you.


Thanks for the recommendation! I will check it out for sure!


We use memsql, if you're using citus + pipelinedb then memsql will likely solve your problem with a single better solution. Not open-source and uses mysql dialect instead of postgresql but definitely highly recommended.


> realtime analytics

This doesn't mean anything.

The only thing that matters is what (and how much) data you have and what queries you want to run. If a relational database can do that for you then cassandra/scylla isn't a good choice.


> This doesn't mean anything.

And neither does your practically trolling comment. You could say that about anything.

> The only thing that matters is what (and how much) data you have and what queries you want to run. If a relational database can do that for you then cassandra/scylla isn't a good choice.

Theory aside what matters is what I can get to work... so prior knowledge is a big deal.

All I said is it would be interesting if Postgres had a better story for elasticity and that it might be a good fit for many. I think many are in my camp and a relational db would fit but are all too often pushed towards NoSQL.

Again I don't want to hear about CAP this and that. You can totally take either system and make it have many of the properties that are touted for each one. People take relational databases and turn them into schemaless or columnar stores all the time with eventual consistency (often with mysql). Particularly with Postgres as it is a platform (it has a powerful extension model).

Regardless Cassandra is touted for analytics. The marketing really pushes it for that. Eventual consistency and a columnar store are a good fit for real time analytics. Lots of data points, lots of aggregation, get to play with various queries in realtime, elasticity... etc.

We will probably hit a wall with Postgres.

All in all I think Cassandra is pretty good for what it does. Other than Redis, and maybe RethinkDB, I think it's my third favorite NoSQL (and yes I know each of those guys has a sweet spot for what they are good at requirement wise).


It's not trolling - it's the fact that "realtime analytics" are just buzzwords and don't really mean anything. Analytics is just another word for queries and realtime is different for everyone. You can use anything from some in-memory code to redis to postgres or other exotic databases for this and I've yet to come across a situation where those words have helped provide much clarity.

Cassandra is wide-column (which is just another buzzword for key/value), not columnar (as in storing data in a column-oriented format) so it's actually not great with aggregations and barely supports queries like that. It is good for range scans across data in a single partition and for spreading load around the cluster if your data is also spread evenly into these partitions, but ultimately my point was that in a thread about cassandra/scylla, it doesn't make much sense to bring up a relational db because it's completely different in every way.

If it does work better for you, that's great - and it means that cassandra/scylla was never a good fit to begin with. The multi-master global replication is a key feature that will likely never be reached by postgres (which is just starting to get scale-up and some logical replication features now) and even mysql only just released the group-replication for multi-master which still only supports the concept of a single total cluster.
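
To make the wide-column point concrete, here is a toy sketch of the access pattern described above: rows live inside a partition, sorted by a clustering key, so a range scan within one partition is cheap while an aggregate across the whole table has to visit every partition. The class and method names (`WideColumnTable`, `range_scan`) and the sample keys are invented for illustration; this is not how Cassandra or Scylla store data on disk.

```python
import bisect
from collections import defaultdict

class WideColumnTable:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)  # pk -> [(clustering_key, value), ...]

    def insert(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        # (clustering_key,) sorts just before any (clustering_key, value) tuple.
        i = bisect.bisect_left(rows, (clustering_key,))
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)       # overwrite existing row
        else:
            rows.insert(i, (clustering_key, value))

    def range_scan(self, partition_key, start, end):
        """Rows with start <= clustering_key < end, from a single partition."""
        rows = self._partitions.get(partition_key, [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return rows[lo:hi]

table = WideColumnTable()
for ts, reading in [(3, "c"), (1, "a"), (2, "b"), (4, "d")]:
    table.insert("sensor-1", ts, reading)
print(table.range_scan("sensor-1", 2, 4))
```

Anything that needs all partitions (a global sum, say) gets no help from this layout, which is the "not great with aggregations" part.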


Perhaps if you had elaborated like you have now I would not have thought it was trolling :)

There is a general consensus on what realtime analytics is (memsql.com apparently tries to define it). It certainly is less nebulous than "big data".

Our biggest problem was continuous aggregates. Continuous aggregates are tough for databases (particularly for Postgres since it is MVCC). So it isn't the relational model that is the problem but the algorithms needed for consistency that conflict with constant read and write speed.
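
For anyone unfamiliar with the term, a continuous aggregate keeps a running result up to date on every write instead of recomputing it at query time. A minimal in-memory sketch, assuming per-minute buckets and a simple count/sum state (the class name `ContinuousAggregate` and the sample keys are made up, and this is not how pipelinedb or any database implements it):

```python
from collections import defaultdict

class ContinuousAggregate:
    """Running count and sum per (key, time bucket), updated on each event."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self._state = defaultdict(lambda: [0, 0.0])  # (key, bucket) -> [count, sum]

    def ingest(self, key, timestamp, value):
        # Round the timestamp down to the start of its bucket.
        bucket = int(timestamp // self.bucket_seconds) * self.bucket_seconds
        cell = self._state[(key, bucket)]
        cell[0] += 1
        cell[1] += value

    def average(self, key, bucket):
        count, total = self._state.get((key, bucket), (0, 0.0))
        return total / count if count else None

agg = ContinuousAggregate()
agg.ingest("page_load_ms", 0, 10)
agg.ingest("page_load_ms", 30, 20)
agg.ingest("page_load_ms", 61, 5)
print(agg.average("page_load_ms", 0), agg.average("page_load_ms", 60))
```

Every write touches the same small piece of state, which is easy in memory and much harder once that state has to be durable and consistent under concurrent writers.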

I did goof by saying Cassandra was column oriented (that is a loaded and confusing term) but people do use it all the time for aggregates (see Druid). Druid by the way is apparently column oriented (going back to my point how you can turn most data stores into something else).

Saying Cassandra is completely different than Postgres isn't really saying something terribly useful. I bring up Postgres not because it is a relational database but because it has some nice features and extensions that seem to be cost effective (compared to just loading everything in memory ala redis which is not cheap).

Cassandra certainly does try to offer familiar things to old school SQL guys like myself (namely CQL and various options for consistency)... again it isn't completely different.

> If it does work better for you, that's great - and it means that cassandra/scylla was never a good fit to begin with.

You are also assuming some stuff like that we didn't have to compromise. We will still need something like Cassandra as we do want to collect more data points and we do need a place to effectively warehouse this stuff across regions.

Also plain Postgres is not a good choice for continuous aggregates as I mentioned before (again I'm going to ignore theory of the relational model... the relational model fits for us because we make it fit... not the other way around). It was one of the reasons why we investigated other technologies.

> The multi-master global replication is a key feature that will likely never be reached by postgres

I'm sensing some bias here... never... maybe never for postgres core but certainly someone could build an extension or add on.


Do you mean you use pipelinedb and citus or do you mean you use citus to cluster pipelinedb?

I never thought to try mixing the two since pipelinedb has their own commercial product for clustering. You should write a blog post on that if you were able to cluster pipelinedb with citus.


I mix the two. Citus for offline warehousing and pipelinedb for realtime. I'm very pleased with the results.

I mix the two by using a message bus (Kafka + RabbitMQ)

I spent a lot of time with Cassandra. It is probably great tech, I just don't have petabyte data yet. And I know all the stats I want a priori.


This is why I love HN. People give their real-world scenarios.

Thank you for recommending pipelinedb, I haven't come across them before.

I have kafka and postgres and am in need of real-time analytics, so this may be the solution I'm looking for.

I've used citus in the past. Whilst it's excellent for data-warehousing, unless you scale up the servers it's not suitable for real-time counts. I found that lacking for my use-case.

If you see this, can you reach out at my name @ gmail dot com, I'd like to chat about issues you have come across.


If you need both citus + pipelinedb then memsql.com will solve your problem with a much more polished solution. MySQL dialect instead of PostgreSQL but that's rarely a problem for a data warehouse.


Thanks. I actually have scheduled a demo. In the meantime will download the community version and have a play.


I don't get it. Is this just a faster Cassandra? What's their competitive niche?


An order of magnitude faster Cassandra with no GC pauses (because it's C++). It's a literal drop in replacement. Even nodetool is compatible.


Their marketing points are really understandable and relatable to people who know what a pain operating Cassandra can be. The two biggest pain points IMO are read-repair and compaction - both of which must be run periodically and consume a huge amount of resources. Read Repair is especially a pain because (1) it must be run periodically or you risk losing data (2) it takes forever to complete in some deploys (ex - you must run read-repair in a certain (user-tunable) timeframe, the default of which is every 10 days - I have tables that take 7 days to complete a read-repair, meaning I have repairs pretty much running 24/7) and (3) there are no/few operational tools to manage read repair. The low-tech way is to write a cron job on every node - and even then there is no way to measure progress or detect if a job failed/completed without grepping logs - it's so bad that Spotify wrote an open source tool to manage it.
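
A back-of-the-envelope sketch of why that ends up as "repairs running 24/7". It assumes repairs for different tables run one after another on a node and must all finish inside the 10-day window mentioned above; the table names and the 2-day figures are hypothetical, only the 7-day repair and the 10-day window come from the numbers above.

```python
def repair_schedule(table_repair_days, window_days=10):
    """Summarise whether sequential repairs fit inside the repair window."""
    total = sum(table_repair_days.values())
    return {
        "total_days": total,
        "duty_cycle": total / window_days,  # fraction of the window spent repairing
        "fits": total <= window_days,
    }

# One 7-day table plus a hypothetical smaller one already fills 90% of the window.
print(repair_schedule({"events": 7, "users": 2}))
# Add one more 2-day table and the schedule no longer fits at all.
print(repair_schedule({"events": 7, "users": 2, "sessions": 2}))
```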

The solution has been to just buy more nodes (if you don't want long repairs, store less than 1TB of data per node) and faster disks. Read Repair maintenance is probably the only thing I hate about Cassandra - and seeing benchmarks that Scylla does these operations on the order of minutes rather than hours is attractive enough for most people (I don't think most deploys are even coming close to the benchmarked txn/s in real-world workloads, for both databases). Both compaction and repair tend to be CPU intensive (both work by essentially reading a ton of data), so I'd imagine the move to C++ and the thread-per-core design is more efficient.

In short, the operational efficiency is far more attractive even if you aren't pushing a trillion writes/sec.

I've been thinking about testing Scylla for a while, but unfortunately they don't support the features we need, and while our Cassandra deployment is a comparatively large cost, there are enough things on my plate right now that trading my current set of evils for other unknown ones isn't very attractive.

See this post by Discord App - https://blog.discordapp.com/how-discord-stores-billions-of-m... - where they are mentioning moving to Scylla from Cassandra for similar reasons. Performance is fine, but repair efficiency is more of the driving factor.

I'd also add that Cassandra advertises itself as a relatively high performance database for distributed workloads. If something like a faster Cassandra doesn't entice you, chances are you'd be better served by something like Postgres anyways.


What you keep calling "read repair" is actually just "repair" or the longer phrase "anti-entropy repair". "Read repair" is something different.


A faster, well engineered, drop in replacement for Cassandra is a pretty competitive niche in and of itself, don't you think?


I guess but I don't really know. Lots of people have Cassandra in production. I suppose there is a certain segment of that market that needs a super high performance version, but is it really that big?


Cassandra is supposed to be a high-performance database, except it has some fundamental issues that keep it from being what it can be. ScyllaDB fixes those fundamental issues so anyone using cassandra can benefit from using scylla instead.


For many users, that "super high performance" really just means lower request processing latency, which is a very desirable feature to have in various market segments. Look at the various benchmarks available (or run the benchmarks yourself) and you will see that Scylla has a very consistent, low latency that's a direct result of its non-blocking, shared-nothing architecture and implementation in a non-managed language, which gives us more control.
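
As a rough illustration of what "shared-nothing" means here: each core owns a disjoint slice of the data, and a request is routed to the owning shard instead of taking a lock on shared state. The sketch below is single-threaded and only shows the ownership rule; the names (`Shard`, `ShardedStore`) and the CRC32-based routing are assumptions for illustration, not Scylla's actual token-to-shard mapping.

```python
import zlib

class Shard:
    """State owned by exactly one core; nothing else touches it, so no locks."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.data = {}

class ShardedStore:
    def __init__(self, num_cores):
        self.shards = [Shard(i) for i in range(num_cores)]

    def shard_for(self, key):
        # Deterministic hash so a given key always lands on the same shard.
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def put(self, key, value):
        self.shard_for(key).data[key] = value

    def get(self, key):
        return self.shard_for(key).data.get(key)

store = ShardedStore(num_cores=4)
for i in range(8):
    store.put(f"user:{i}", i)
print([len(s.data) for s in store.shards])
```

In a real thread-per-core engine each shard would run on its own pinned thread and receive requests as messages; the point of the sketch is only that no two shards ever share a key.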


Is there any change to the data model? Guessing still no secondary indexes for example.


They're working on getting to feature parity with cassandra by version 2.0

http://www.scylladb.com/technology/status/


Unfortunately that doesn't seem to give any indication of how long that might be. I just saw a post from one of their engineers about a year ago that said secondary indices would only take about 2 weeks to implement.


Where did you see that kind of estimate?

We are currently working on first finishing materialized view support, which will hopefully be completed in the upcoming months. Secondary indices will be implemented after that and we're hoping to reuse MV infrastructure for that. So I personally expect both features to land into a release later this year.



Thanks for the pointer!

Please note that @glommer is talking about the classic secondary index implementation in Cassandra, which is very simple but also broken. I don't know the details but we probably did have "half-ready" code for that. We decided against going forward with it because Cassandra had already moved to SASI (which is also much more complex). As I said, we're currently focusing on materialized views, and tackling secondary indices after that.

Btw, I highly recommend subscribing to our user mailing list or engaging on Github for questions and comments about features. You'll get better and up-to-date answers there.

Update: reading @glommer's reply carefully, he explicitly says that he's unsure if we'll move forward with that specific implementation: "_if_ we do implement it, it should land in our main version in a couple of weeks" (emphasis mine).


update: we didn't move forward with that implementation.

Secondary indexes will be implemented on top of Materialized Views. Patches for Materialized Views already exist, and are soon to appear in preview releases.
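
For readers wondering how an index can sit on top of a materialized view, here is a toy sketch of the general idea: the "view" is a second table keyed by the indexed column and kept in sync on every write to the base table, and an index lookup is just a read of that view followed by a fetch from the base table. All names (`IndexedTable`, `find_by_index`, the sample rows) are hypothetical, and this is an in-memory illustration, not Scylla's implementation.

```python
from collections import defaultdict

class IndexedTable:
    """Base table plus a view keyed by one indexed column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.base = {}                 # primary key -> row dict
        self.view = defaultdict(set)   # indexed value -> set of primary keys

    def upsert(self, pk, row):
        # Assumes every row contains the indexed column.
        old = self.base.get(pk)
        if old is not None:
            # Drop the stale view entry before writing the new one.
            self.view[old[self.indexed_column]].discard(pk)
        self.base[pk] = row
        self.view[row[self.indexed_column]].add(pk)

    def find_by_index(self, value):
        return [self.base[pk] for pk in sorted(self.view.get(value, ()))]

users = IndexedTable(indexed_column="city")
users.upsert(1, {"name": "ann", "city": "Oslo"})
users.upsert(2, {"name": "bob", "city": "Oslo"})
users.upsert(1, {"name": "ann", "city": "Rome"})  # moves ann out of the Oslo entry
print(users.find_by_index("Oslo"))
```

The hard part in a distributed database is keeping the view update consistent with the base write across nodes, which this sketch does not attempt.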


Thanks for the explanation


I like ScyllaDB, but I am not sure why it needs to run on XFS only?


The entire database is designed for performance with message-passing thread-per-core async architecture. XFS is the only filesystem that has good async support.

http://www.scylladb.com/2016/02/09/qualifying-filesystems/

http://www.scylladb.com/technology/architecture/


Yes, exactly. Scylla needs proper AIO/DIO support at the filesystem level and XFS has so far been the one that implements it best. There are fixes to Scylla to make it also run well on ext4 (by working around some of its limitations), but that's not part of any release yet.


Congratulations to the team.

Any word on native JSON types? That would be another reason to drop Cassandra for Scylla.


JSON types are actually not a very frequently requested feature, so we have not prioritized them very highly. I suspect that's because most people just store their JSON data as text and do processing in their application.

There's an open issue about it on Github:

https://github.com/scylladb/scylla/issues/2058

Please feel free to upvote and comment on the issue to voice your interest in the feature.


Some clarity - that issue is about interfacing with a row within a CQL table as a piece of json, not actually storing a schemaless json document (like in a special postgres-style json datatype).

Cassandra is actually schemaless but since the shift to CQL from Thrift, it's unlikely that it'll go back to a schemaless model again.

In the meantime, the Keen.IO crew has a nice model for storing lots of arbitrary json if that's something that's needed. It takes some work but it's a very clever strategy and they've made it work well.


Business plan.

1. Let someone else solve the hard distributed-system problems.

2. Re-implement the local pieces for higher performance.

3. Profit!


No one stopped that someone from working on higher performance and profiting themselves.


Nobody stopped them; they were just busy doing other things that made the effort worthwhile. It's worth keeping that in mind, to make sure that competitive claims about performance don't drown out proper credit for the true innovators.

Note that I'm not saying anything is wrong here. Reimplementations of existing ideas are a time honored tradition, and often lead to their own innovations. Linux was a reimplementation of UNIX, and seems to have been good for a lot of people. Most web servers and browsers are reimplementations of things that had existed previously. From compilers and databases to filesystems and hypervisors, a lot of software we all rely on today - especially in open source - is a reimplementation of something or other. I'm pointing out an opportunity, not a flaw.


From their front page: "allow for perfect scale-up linear performance of up to 1,000,000 read/write operations per node."

What happens after 1M operations? The nodes catch on fire?


I know nothing about database architecture but I'm gonna assume that the individual operations take longer until you distribute the load across more nodes.


Every system has a maximum capacity, and after that is reached, the requests just see increased latency without increased throughput. At this point you scale up (by replacing it with better hardware) or out (by adding more nodes).
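
A toy queueing model makes the saturation point concrete. This is a sketch assuming an M/M/1 queue (random arrivals, a single server, an unbounded queue), which is an assumption made here and not something taken from ScyllaDB's material. The 1,000,000 ops/sec per-node figure quoted above is used as the service rate; the function name is made up.

```python
def mean_latency_seconds(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: 1 / (service_rate - arrival_rate).

    At or beyond capacity the queue grows without bound, so latency is infinite.
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)

capacity = 1_000_000  # ops/sec per node, the figure quoted from the front page
for load in (500_000, 900_000, 990_000, 999_000, 1_000_000):
    latency_us = mean_latency_seconds(load, capacity) * 1e6
    print(f"{load:>9} ops/sec -> mean latency {latency_us} microseconds")
```

Throughput tops out at the capacity while latency keeps climbing as load approaches it, which is the "increased latency without increased throughput" behaviour described above.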


Garbage collection has been the pain point for java based tech like hadoop. Interesting to see databases being written in Golang, which I imagine would have the same issues.


ScyllaDB is written in C++.



