MLlib [1] is definitely worth checking out as a Spark library for
common ML algorithms - it's crazy how little code is needed to
implement efficient, distributed inference algorithms.
Spark's primitives (especially the RDD abstraction) make it feel like
a DSL for distributed ML algorithms - for example, check out the
implementation of distributed alternating least squares matrix
factorization at [2] - ~100 lines of Scala code once stripped of the
Java interoperability boilerplate.
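To make the ALS idea concrete locally (a toy sketch on plain Scala arrays, not the distributed MLlib implementation; the matrix and factor sizes are made up): each half-step is a closed-form least-squares solve with the other factor held fixed.

```scala
// Toy rank-1 alternating least squares on a small dense matrix,
// using plain Scala collections. Each half-step solves a closed-form
// least-squares problem with the other factor fixed.
val m = Array(Array(1.0, 2.0), Array(2.0, 4.0)) // rank-1: (1,2) outer (1,2)
var rows = Array(1.0, 1.0) // row factors
var cols = Array(1.0, 1.0) // column factors

for (_ <- 1 to 20) {
  // Fix cols, solve each row factor: r_i = sum_j m(i)(j)*c_j / sum_j c_j^2
  val cNorm = cols.map(c => c * c).sum
  rows = m.map(row => row.zip(cols).map { case (v, c) => v * c }.sum / cNorm)
  // Fix rows, solve each column factor symmetrically
  val rNorm = rows.map(r => r * r).sum
  cols = m(0).indices.map { j =>
    m.zip(rows).map { case (row, r) => row(j) * r }.sum / rNorm
  }.toArray
}

// Reconstruction error of the rank-1 model
val maxErr = (for (i <- m.indices; j <- m(0).indices)
  yield math.abs(rows(i) * cols(j) - m(i)(j))).max
```

In the Spark version, each of these solves is distributed over partitions of the ratings; the structure of the algorithm is the same.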
On the weekend, I took a crack at implementing some distributed ML
algorithms using the Alternating Direction Method of Multipliers (ADMM) in Spark,
following [3]. In a few hours and ~400 lines of Scala code, it was
possible to implement distributed versions of L^1 regularized logistic
regression, ridge regression, and SVMs [4].
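The shape of consensus ADMM (as in [3]) can be sketched locally; this toy minimizes a sum of scalar quadratics, with plain Scala collections standing in for RDDs, and the objectives and rho made up for illustration:

```scala
// Toy consensus ADMM: each "partition" i minimizes (x - a_i)^2 subject
// to all x_i agreeing on a global z. In the Spark version the x- and
// u-updates are a map over partitions and the z-update is a reduce on
// the driver; plain Scala collections stand in for RDDs here.
val a = Vector(1.0, 2.0, 3.0, 6.0) // one local objective per "partition"
val rho = 1.0
var z = 0.0
var x = a.map(_ => 0.0)
var u = a.map(_ => 0.0)

for (_ <- 1 to 100) {
  // x-update: argmin_x (x - a_i)^2 + (rho/2)(x - z + u_i)^2, closed form
  x = a.indices.map(i => (2 * a(i) + rho * (z - u(i))) / (2 + rho)).toVector
  // z-update: average of x_i + u_i (no extra regularizer in this toy)
  z = x.indices.map(i => x(i) + u(i)).sum / x.size
  // dual update
  u = u.indices.map(i => u(i) + x(i) - z).toVector
}
```

Here z converges to the minimizer of the summed objective, the mean of the a_i; swapping in logistic, ridge, or hinge losses for the local solves gives the distributed versions above.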
It's a very impressive framework, and very much empowering.
I believe the answer to that question is yes. I've raved about Spark many times on HN before. We're using it and absolutely love it.
Not sure why the author thinks there's no built-in support for iteration. The support is there, in native Scala/Java, though you may have to collect results to the driver/master to check for convergence, for example. Their simple logistic regression example doesn't check for convergence; it hard-codes the number of iterations.
http://spark.incubator.apache.org/examples.html
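A sketch of what iterating to convergence looks like (plain Scala standing in for the RDD; in Spark the gradient would be computed with points.map(...).reduce(_ + _) and brought back to the driver each pass; the data, step size, and tolerance here are made up):

```scala
// Iterating until convergence rather than for a fixed count.
// In Spark, `points` would be an RDD and the gradient a
// map(...).reduce(_ + _) collected to the driver each iteration;
// plain Scala stands in here.
val points = Seq((1.0, 1.0), (2.0, 1.0), (-1.0, 0.0), (-2.0, 0.0)) // (x, label)
val lr = 0.5       // step size (arbitrary, for this toy)
val lambda = 0.1   // L2 regularization, keeps the optimum finite
val tol = 1e-6
var w = 0.0
var delta = Double.MaxValue
var iters = 0
while (delta > tol && iters < 1000) {
  // logistic loss gradient, averaged over the data, plus ridge term
  val gradient = points.map { case (xi, yi) =>
    val p = 1.0 / (1.0 + math.exp(-w * xi))
    (p - yi) * xi
  }.sum / points.size + lambda * w
  val wNew = w - lr * gradient
  delta = math.abs(wNew - w)
  w = wNew
  iters += 1
}
```

The loop itself runs on the driver, which is exactly the collect-to-check-convergence pattern the example omits.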
A couple of quirks I've learned: when doing groupBys or other join operations, it helps to specify the number of partitions -- actually this is true in general, but more so for join-based operations; otherwise you can run into memory/GC issues. The Hadoop equivalent is, of course, setting the number of reducers high enough that you won't run out of heap space.
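To see why the partition count matters, here is a local sketch that mirrors Spark's default hash partitioning (the key counts and partition counts are arbitrary):

```scala
// Local sketch of how Spark's default HashPartitioner assigns keys:
// partition = key.hashCode mod numPartitions, made non-negative.
// Fewer partitions means more keys (and their grouped values)
// concentrated per partition, which is what drives the memory/GC
// pressure during groupBys and joins.
def partitionOf(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

val keys = (1 to 1000).map(i => s"user-$i")
val fewPartitions  = keys.groupBy(k => partitionOf(k, 4))
val manyPartitions = keys.groupBy(k => partitionOf(k, 64))

// The largest partition shrinks as the partition count grows.
val maxFew  = fewPartitions.values.map(_.size).max
val maxMany = manyPartitions.values.map(_.size).max
```

During a groupBy, everything in a partition's largest groups has to fit in one executor's heap at once, hence raising the partition count as the fix.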
Because Spark just serializes the closures that transform your data and re-applies them on demand, if you don't cache an RDD (at least to disk, using persist), you'll waste cycles recomputing it every time you iterate over or re-use it.
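A local analogy for the cost (a lazy Scala view stands in for an uncached RDD; both are recipes that re-run on every traversal):

```scala
// An uncached RDD is like a lazy Scala view: a recipe that is
// re-executed by every action. Materializing it once (persist/cache
// in Spark, toVector here) stops the recomputation.
var evaluations = 0
val recipe = (1 to 100).view.map { x => evaluations += 1; x * 2 }

val sum1 = recipe.sum            // first "action": runs the map
val sum2 = recipe.sum            // second "action": runs it all again
val uncachedEvals = evaluations  // 200 element evaluations

evaluations = 0
val cached = recipe.toVector     // analogous to rdd.persist() + an action
val c1 = cached.sum
val c2 = cached.sum
val cachedEvals = evaluations    // 100: computed once, reused
```

In an iterative job the "recipe" can be a long lineage of transformations, so the wasted cycles multiply with every pass.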
Beyond that, as an experienced scala developer, I've found that Spark feels incredibly natural. And being able to run things on the repl cannot be appreciated enough.
Spark cleans up some abstractions found in the Hadoop ecosystem but I would hesitate to call it the "next big thing" because it doesn't really address any of the core weaknesses of Hadoop. In Big Data, there are three touchstone applications that the current generation of platforms largely do poorly: real-time, geospatial, graph. Spark does not really address any of these. It might be more accurate to say that Spark is the "next big thing" in the Hadoop universe but still retains many of the same limitations on expressiveness, thereby creating opportunities for a more general Next Big Thing in Big Data.
Specifically, "faster batch + in-memory" is basically a simple patch on the batch mode problem but does not really address continuous flow real-time models. At large scales it is quite difficult to get robustly efficient behavior out of a parallel system that is batching things on 5 second intervals if the data flow is anything but trivial.
For geospatial and graph analysis, Spark appears to retain the same limitation of Hadoop in that it cannot deal with data models and operations without an a priori optimal partitioning function. Static hash and range partitioning won't cut it, particularly if the streaming data sources are actually real-time. The ability to generate uniform partitioning of complex data models with inherently unpredictable data distributions is critical to parallelizing some important analysis types but there is no obvious support for such mechanisms in Spark.
One could look at Spark as Hadoop done right, more or less.
Spark does look very promising. It is great being able to experiment with jobs on distributed collections from a Scala repl. Consequently it is very quick to get started.
My first impressions are that there seems to be a fair amount of abstraction leakage. Some things that compile and look valid fail at runtime - e.g. referring to other RDDs from within a filter predicate or map function. More complex jobs cause the nodes to fail, and it gets into infinite loops of restarting the nodes, replaying the job, and then dying again.
I hope that once I have a better understanding of what is going on underneath, these failures will make more sense.
More complex jobs cause the nodes to fail and it gets
into infinite loops of restarting the nodes, replaying
the job, and then dying again.
This is probably a bug in your code. Debugging these cluster applications does take some getting used to. You'll want to look at the stderr output of the failing executor, and you'll probably see that it's dying due to some kind of exception. You can do this by visiting port 8080 of the master node over HTTP, i.e. http://mymaster:8080. Feel free to email the Spark users list if you have any questions: https://spark.incubator.apache.org/mailing-lists.html
It's true that you do have to understand the programming model and some details of how it's implemented to use Spark effectively. However, any abstraction that was "pure" and perfectly non-leaky would necessarily sacrifice some performance and transparency to achieve that goal. Spark aims to be both high-level and high-performance.
Full disclosure: I'm on the Spark team at the UC Berkeley AMPLab.
The best way to try out Spark & related tools is to follow the two-day exercises from AMP Camp 3. They have a script to launch the whole cluster (the Berkeley Data Analytics Stack) on EC2 very easily, so you can get your feet wet without too much effort.
I helped write the AMPCamp exercises (and parts of Spark) - the exercises were done in the context of a two-day long seminar held at Berkeley where the days were mostly filled with talks/training about the stack. The actual exercises were designed to take a couple of hours each day, with plenty of time for questions/etc. You can probably get through them in an evening, and I've seen people do it in less!
In the next year I can imagine most people deploying Spark will be doing it on Hadoop, since Cloudera 5 will support Spark. It's a natural fit: most people don't hate HDFS, but their use cases don't naturally fit the MR programming model.
We've been using it on HDP2 for about a month now. Everything works fairly well, and it was super easy to set up because of YARN (and the work the Spark team put in).
You don't happen to work at Cloudera, do you? I noticed you have some submissions about Impala and about Oracle being evil, which seems to be a pretty common view among the ex-Oracle DBAs there.
I do happen to work at Cloudera (hence the Impala submissions), although I'm neither an ex-Oracle DBA nor a huge believer that they're evil. I really don't have a lot of first-hand experience with Oracle as a company - which as you'll see, is why my submission was actually a question about the community's perception and why that's a common view.
Hey, I've recently begun the interview process at Cloudera, would you mind sending me an email? My address is in my profile, and I'd love to ask you a couple questions.
Which areas would you say HDFS most needs improvement in? Just so you know, HDFS is still very actively developed and keeps introducing features and improving functionality (e.g. native NFS, in-memory caching, short-circuit reads, high availability, namespace federation) on a pretty regular basis.
Feel free to suggest new features, or contribute to the project yourself.
I know of the JIRA, thanks. I'm aware that it is being actively developed, but I don't necessarily believe that activity is progress. Keeping a few different players in the mix helps keep everyone focused on progress.
It is no coincidence that we're seeing the theme of decentralization cropping up everywhere. Consider an investment approach based on the assumption that decentralization is getting ready to eat the world.
[1]: http://spark.incubator.apache.org/docs/latest/mllib-guide.ht...
[2]: https://github.com/apache/incubator-spark/blob/fdaabdc673875...
[3]: http://www.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pd...
[4]: https://github.com/ajtulloch/admmlrspark/tree/master/src/mai...