What's new in Spark and machine learning?

Spark's original creator Matei Zaharia talks about ML Flow and Project Hydrogen, two recent projects designed to make machine learning easier

At the Spark + AI Summit Europe in London last week, we caught up with Apache Spark's original creator Matei Zaharia to talk about recent developments.

Zaharia created the cluster computing and streaming analytics framework in 2009. Adopted by the Apache Software Foundation, it became one of the fastest-growing open source projects ever. He is now Chief Technologist at Databricks, the company set up to commercialise Spark.

Zaharia and Databricks co-founder Andy Konwinski spoke about two open source projects, both still in relatively early stages of development, that aim to democratise and standardise the process of pushing machine learning (ML) models into production: Project Hydrogen and ML Flow.

Currently, said Zaharia, machine learning is largely the preserve of the "one per cent" of organisations: those with the skills and resources to take advantage of the new technologies. These organisations have built their own internal AI platforms for deploying machine learning models, giving them a considerable competitive advantage.

"The organisations who are using machine learning very heavily are the large web companies or other tech companies like Uber," he said, drawing an analogy with early users of big data technologies such as Hadoop a decade ago, in which a few companies (largely the same ones) were able to get ahead by attracting expertise that at the time was in very short supply.

"There are a lot of things involved in machine learning: you have to continuously rebuild a model, you have to test it, you have to deploy it to be able to do A/B testing and all of that, and without systems to manage this workflow it's a lot of manual work. There's lots of handing off code and models between teams, which can be very slow. So these companies have built internal platforms such as FB Learn at Facebook, Michelangelo at Uber and TFX at Google. And these end up supporting dozens of applications."

ML Flow

One of the upcoming developments, ML Flow, is designed to standardise the process of getting ML models into production. Because developing successful models requires dozens of iterations, perhaps using different datasets, altered parameters and alternative software libraries, keeping track of the process is difficult. What's more, developing machine learning models is usually a multi-departmental endeavour, with data engineers looking after the pipelines and data scientists developing the models themselves.

"There is no established best practice engineering class for machine learning and there are many pitfalls and places for it to go wrong," Zaharia said.

The aim of ML Flow, which brings together various existing projects including Docker containers, Git, and various software libraries and APIs, is to introduce standards and - in effect - "containerise" machine learning models, data and workflows so that they can be more easily moved from place to place and reused without the data engineers having to repeatedly tinker with their environments.

ML Flow provides a structured logging API to record parameters, metrics, configurations, models and code versions, together with an interactive UI that allows the results of each run to be visualised graphically, annotated and stored in one place.
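To make the idea of structured run logging concrete, here is a minimal sketch using only the Python standard library. It is not ML Flow's implementation: the `RunLogger` class, the `best_run` helper, the `abc123` code version and the metric values are all hypothetical placeholders chosen for illustration. ML Flow's own Python package offers analogous calls such as `mlflow.log_param` and `mlflow.log_metric`, but the sketch below only imitates the general pattern of recording each run's parameters, metrics and code version in one place so that runs can be compared afterwards.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


class RunLogger:
    """Hypothetical stand-in for a tracking API; not ML Flow's own code."""

    def __init__(self, store_dir, code_version):
        self.store_dir = Path(store_dir)
        self.record = {
            "run_id": uuid.uuid4().hex,
            "started_at": time.time(),
            "code_version": code_version,
            "params": {},
            "metrics": {},
        }

    def log_param(self, key, value):
        # Parameters are fixed inputs to a run, stored once per key.
        self.record["params"][key] = value

    def log_metric(self, key, value):
        # Metrics may be reported repeatedly, so keep the full history.
        self.record["metrics"].setdefault(key, []).append(float(value))

    def finish(self):
        # Write one JSON file per run so runs can be listed and compared later.
        self.store_dir.mkdir(parents=True, exist_ok=True)
        path = self.store_dir / f"{self.record['run_id']}.json"
        path.write_text(json.dumps(self.record, indent=2))
        return path


def best_run(store_dir, metric):
    """Return the stored run whose last value for `metric` is lowest."""
    runs = [json.loads(p.read_text()) for p in Path(store_dir).glob("*.json")]
    runs = [r for r in runs if metric in r["metrics"]]
    return min(runs, key=lambda r: r["metrics"][metric][-1])


# Placeholder numbers standing in for real training runs.
store = tempfile.mkdtemp()
for alpha, rmse in [(0.1, 0.82), (0.5, 0.74), (1.0, 0.79)]:
    run = RunLogger(store, code_version="abc123")
    run.log_param("alpha", alpha)
    run.log_metric("rmse", rmse)
    run.finish()

winner = best_run(store, "rmse")
print(winner["params"]["alpha"])
```

Keeping one self-describing record per run is what makes later comparison cheap: choosing between dozens of iterations becomes a query over stored records rather than a search through notebooks and emails.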

Now at version 0.7.0, this is still an alpha-stage project, but it has achieved an impressive rate of take-up since its release back in July, with newly announced support from RStudio, the IDE for the R statistical programming language. Python, Java and Scala are also supported via APIs, there are storage backends for most of the popular cloud platforms, and the project now has 48 contributors.

"It took Spark two years to reach this stage," said Zaharia.

While it is still a bit "rough around the edges", Databricks has been road testing ML Flow among its customer base and the majority have expressed an interest, said Konwinski. "We've been putting a lot of effort into these projects to create open source versions of things like Michelangelo and we are seeing a ton of traction there, and we're multiplying that with deep integration into the Databricks platform."

ML Flow will be available on the hosted Databricks unified analytics platform sometime next year.

Project Hydrogen

The other recent innovation is called Project Hydrogen, a new SPIP (Spark Project Improvement Proposal) aimed at better integrating Spark with some of the major deep learning and machine learning frameworks such as PyTorch and TensorFlow.

"Project Hydrogen is a bridge to join these two things together," said Zaharia, adding that it will be integrated into Spark 2.4, currently in the final stages of refinement.

"The idea is you'll be able to easily combine the work you do in Spark for data preparation and analysis with calling into these machine learning frameworks, so an efficient way to combine both types of computation."

Nvidia is among the companies that have expressed interest in Project Hydrogen, he added, because of the potential for more efficient ML modelling using GPUs.
