How SMACK makes big data faster
SMACK stands for Spark, Mesos, Akka, Cassandra and Kafka - a combination that's being adopted for 'fast data' problems, as Patrick McFadin of DataStax explains
If there's one thing that everyone agrees on, surely, it's that the world of technology desperately needs a new acronym. Fortunately one has just arrived - and it's a cracker.
SMACK stands for Spark, Mesos, Akka, Cassandra and Kafka. All five are open source, and all but Akka are Apache Software Foundation projects. The term was coined by Mesosphere, a company that bundles these technologies together in its Infinity package, which was designed in collaboration with Cisco to help developers solve big data and IoT-type challenges where speed of response is essential, such as real-time recommendation engines or fraud detection.
Patrick McFadin, chief evangelist for Apache Cassandra at DataStax, calls these use cases "fast data" problems. We asked him to explain more.
Computing: Why no Hadoop? Is that because SCHMACK doesn't work as well?
McFadin: Hadoop fits in the "slow data" space, where the size, scope and completeness of the data you are looking at are more important than the speed of the response. For example, a data lake consisting of large amounts of stored data would fall under this. Hadoop just didn't have anything to offer for fast data problems, so it wasn't added.
How do the technologies in the SMACK stack work together?
Kafka is the ingestion point for data, possibly on the application layer. It takes data from one or more applications and streams it to the next points in the stack. Akka, Spark and Cassandra take data from Kafka into the data layer: Cassandra handles the operational data, while Spark provides near-real-time analysis of that data. Mesos orchestrates the components and manages the resources used by each of them. To complete the round trip, Cassandra can then serve data back to the application layer.
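To make that flow concrete, here is a minimal sketch in Python. It is a toy, not a deployment: every class and function name below is hypothetical, the components are in-memory stand-ins rather than the real Kafka, Spark or Cassandra APIs, and the events and the per-user total are invented for illustration. The loop that moves events from the log to the store loosely plays the part of the consumers that take data from Kafka into the data layer, and Mesos has no equivalent here because everything runs in a single process.

```python
from collections import deque


class IngestLog:
    """Stand-in for a Kafka topic: a first-in, first-out queue of events."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        self._events.append(event)

    def drain(self):
        # Hand events to the data layer in the order they arrived.
        while self._events:
            yield self._events.popleft()


class OperationalStore:
    """Stand-in for Cassandra: rows of operational data grouped by key."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows.setdefault(key, []).append(value)

    def read(self, key):
        # Serve data back to the application layer.
        return list(self.rows.get(key, []))


def analyse(store):
    """Stand-in for a Spark job: total amount per user over stored rows."""
    return {user: sum(amounts) for user, amounts in store.rows.items()}


def run_pipeline(events):
    log = IngestLog()
    store = OperationalStore()
    for event in events:          # application layer -> ingestion point
        log.publish(event)
    for event in log.drain():     # ingestion point -> data layer
        store.write(event["user"], event["amount"])
    return store, analyse(store)  # analysis over the operational data


if __name__ == "__main__":
    sample = [
        {"user": "alice", "amount": 10},
        {"user": "bob", "amount": 5},
        {"user": "alice", "amount": 7},
    ]
    store, totals = run_pipeline(sample)
    print(totals)
    print(store.read("alice"))
```

In a real SMACK deployment each stand-in would be a separate distributed service, and the hand-offs between them would cross the network rather than a function call.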
What are the business advantages of using the SMACK stack over other approaches?
Standardising on a set of tooling reduces the problems of trying to integrate products on an individual basis. If a company needs what SMACK can offer, it would be foolish to try to reinvent the wheel over and over.
Since many large enterprises are already using Cassandra as an operational data store, they already have some value from one of the components involved. Adding the full SMACK stack may shorten processing times of data analysis, which can add value to a previously installed Cassandra cluster.
However, any business considering this path still needs to analyse the potential RoI [return on investment]. For example, some Hadoop adopters have been left questioning the true RoI of their projects after running them for a while; the fault lies not with the technology itself, but with whether it has been applied in the right way. SMACK has to be looked at in the same way.
Is this mix of technologies still at the developer phase, or are enterprises looking at it for production use cases?
Mesosphere most likely has some large customers using Infinity. It's definitely beyond the hacker/developer phase at this point.
Large-scale companies are using a variation of this stack in production as well, particularly those teams that are looking at how to take their big data projects forward. The likes of William Hill are using a stack based on SMACK, for example. Alongside this, Cassandra adoption in the enterprise has been on a serious uptick over the past two years and the release of v3.0 should continue this growth.
Apache Spark is beginning to attract more large software vendors to support it as it fits a different need to Hadoop. However, I think that more of these vendors will graduate up to supporting stacks like SMACK in future as well, as companies want to build up a complete picture of processing for their data. This will become a production requirement for companies as they move from initial pilot phases into relying on big data for their revenues.
What are other approaches you could take to achieve a similar end?
Alternatives would be to replace the individual components. For example, Yarn could be used as the orchestration tool instead of Mesos, while Apache Flink would be a suitable batch and stream processing alternative to Akka. There are going to be subtle alternatives to SMACK. The basic premise is still secure though: an end-to-end data pipeline needs these types of components, and they must all interact in a way that is simple to integrate and quick to get up and running, rather than requiring huge amounts of effort to get the tools to play nicely with each other.
Glossary of terms
Akka: a toolkit and runtime aimed at simplifying the construction of concurrent and distributed applications on the Java Virtual Machine (JVM). The only element of the stack that is not an Apache Software Foundation project.
Cassandra: a NoSQL database management system
Kafka: a distributed messaging system originally developed by LinkedIn
Mesos: a cluster management system co-created by Matei Zaharia, who also created Spark
Spark: a general-purpose big data processing platform that generally runs in-memory