Last month, big data company Cloudera purchased Myrrix, a London-based start-up whose software builds on Apache Hadoop and the machine-learning library Apache Mahout.
Myrrix is a large-scale machine-learning platform. To many people, the most familiar application of machine learning is in recommenders, the technology used by the likes of Amazon to cross-sell products based on purchase history and customer demographics. Recommenders work by quantifying the often loose correlations between variables, such as a customer's age and gender on the one hand and their musical taste on the other. As a rule, the more data recommenders have to work with, the better they perform.
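As a rough illustration of what quantifying correlations from purchase history can look like, here is a minimal Python sketch that counts how often pairs of items turn up in the same customer's history and recommends the items most often bought alongside those a customer already owns. It is not Myrrix's or Amazon's method, and the function names and sample data are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(histories):
    """Count how often each pair of items appears in the same purchase history."""
    counts = defaultdict(Counter)
    for items in histories:
        # Use a set so a repeat purchase of one item is only counted once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, owned, top_n=3):
    """Rank items the customer does not own by how often they co-occur with owned ones."""
    scores = Counter()
    for item in owned:
        for other, n in counts.get(item, {}).items():
            if other not in owned:
                scores[other] += n
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical purchase histories, one list per customer.
histories = [
    ["album_a", "album_b"],
    ["album_a", "album_b", "album_c"],
    ["album_b", "album_c"],
    ["album_a", "album_d"],
]
counts = build_cooccurrence(histories)
suggestions = recommend(counts, {"album_a"}, top_n=1)
```

With only four histories the counts are little more than noise, which is the point made above: the correlations become usable only when there is a great deal of data behind them.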
Myrrix founder Sean Owen, now director of data science at Cloudera, is quick to state that machine learning and recommenders are nothing new. However, the need to operate on big data across clusters of servers rather than on one machine has led to a resurgence of interest in the field.
"The math isn't new, the ideas aren't new, we've had machine learning since the '70s, but now we've got this explosion of data and it turns out that to adapt the algorithms to run across many machines means you have to rewrite a lot of them. We're rewriting them to make them applicable to big data," he told Computing.
This has been the aim of the open-source Apache Mahout project, to which Owen has been a contributor over the years. Myrrix, though, is a complete rewrite of Mahout, designed to be more scalable and more user-friendly, separating the front-end from the plumbing out back.
"If you want to use a relational database on your website you don't have to understand much about how it works. In contrast there's really nothing you can do on Hadoop that's nicely packaged. That's what we're trying to do."
In pursuit of this goal of commercialising machine learning at scale, Owen has been focusing on the relationship between model building (crunching the data to discover relationships between certain products and certain types of customer, for example), and model serving – applying the model to the website front-end.
These elements work to different timescales. Model building can happen overnight: this is the sort of batch job that Hadoop was designed for, sifting through the data in search of hidden correlations. Model serving needs to happen in real time, so that the right information – such as a recommended album – is served to the right visitor to the site. There is also a near-real-time aspect to model serving, by which interactions on the website are fed back into the model between overnight builds.
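The three timescales can be sketched in a few lines of Python. This is only an illustration under assumed names (the class, its methods and the sample data are hypothetical, not Myrrix's API): a batch-built model is swapped in wholesale, requests are answered from memory, and a visitor's rejections are applied immediately while also being logged for the next build.

```python
class ServingLayer:
    """Toy model-serving layer: batch model plus an in-memory feedback overlay."""

    def __init__(self):
        self.model = {}    # visitor -> ranked item list from the last batch build
        self.hidden = {}   # visitor -> items rejected since, kept in memory
        self.events = []   # interaction log to feed into the next batch build

    def load_model(self, model):
        """Overnight step: replace the whole model with a freshly built one."""
        self.model = model

    def dislike(self, visitor, item):
        """Near-real-time step: take effect now, without rebuilding the model."""
        self.hidden.setdefault(visitor, set()).add(item)
        self.events.append((visitor, item, "dislike"))

    def recommend(self, visitor, top_n=3):
        """Real-time step: answer from memory, filtering out rejected items."""
        hidden = self.hidden.get(visitor, set())
        ranked = self.model.get(visitor, [])
        return [item for item in ranked if item not in hidden][:top_n]


# Hypothetical usage.
layer = ServingLayer()
layer.load_model({"ann": ["album_x", "album_y", "album_z"]})
layer.dislike("ann", "album_y")
after_feedback = layer.recommend("ann")

# A later batch build may rank the rejected album again; the overlay still hides it.
layer.load_model({"ann": ["album_y", "album_w"]})
after_rebuild = layer.recommend("ann")
```

Keeping the overlay separate from the model is a design choice made for this sketch: it lets the expensive build stay a batch job while the site still responds to each click.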
"When I say 'I don't like that album', Amazon is not going to rebuild their whole universe of machine-learning models in response, but they'll record in memory that I shouldn't be shown the album again, not now and not ever. That's the near-real time part," said Owen.
Owen professes himself impressed by the way that companies like Amazon manage their recommendations. The back-end technology they use remains a closely guarded secret (unlike Myrrix, which is open-source) but, he says, the real skill lies in the way the models are applied to the website.
"The back-end models are the well-understood part. Getting the front-end right is the hard part," he said.
"The model building phase is becoming easier and easier. The model-serving phase at scale – connecting the model pipeline, having the data flow in and update the model and serve a new model – no one's really building tools for that."
"The interesting part to me is all the near-real time stuff. The way [Amazon] can feel so responsive and yet still be building models later based on all that data."
The focus on usability is partly what recommended Owen's start-up to Cloudera. Like other big data players, Cloudera is seeking to hide the complexity of the Hadoop ecosystem under the bonnet, so that in products like Impala, querying data becomes a more SQL-like process that returns results in real time.
"We're increasingly used to real-time responses. If a customer comes into my website I want to target products at them right now, not tomorrow," said Owen.
"Impala moves Hadoop into something more like an application. It lets you do a real-time query on data in Hadoop. Myrrix feels a bit like Impala but for machine learning in the sense that it is a little bit built on CDH [Cloudera Hadoop Distribution] and a bit on its own infrastructure. I think Cloudera is looking beyond just building stuff on Hadoop and more towards the other side of the equation: model serving in real-time. That matched what I was wanting to do."