Spotify: Hadoop is hot (but Hive is not)

Being based on Hadoop from the beginning has made a big difference to our ability to innovate, says Spotify's Josh Baer

Spotify is a company that does things differently. Josh Baer, Hadoop product manager at the music streaming service, puts this down, in part at least, to its early adoption of big data technologies, which allowed it to grow rapidly without having to worry about legacy infrastructure.

"For most companies that have legacy systems their biggest challenge is how to get data into Hadoop. We are very fortunate that we had Hadoop from the very beginning. We grew up on Hadoop," he says. "We were creating the concept of a data lake before anyone started calling it that."

Not having the millstone of legacy integration to deal with has left Spotify freer to innovate products, ideas and methodologies. For example, the firm has helped to advance the process of agile development at scale through its way of organising developers and managers in squads, tribes and guilds.

"At the top level you have tribes, they organise round a mission. For example, I'm part of the infrastructure and operations tribe," Baer explains.

Within each tribe, he continues, there are two other structures: chapters, which are made up of people who share a discipline and so speak a common language, such as front-end engineers; and squads, which are more multidisciplinary, comprising, for example, front-end, back-end and data engineers together with an agile coach and a product owner. Squads are able to work on specific projects autonomously, almost like start-ups within the Spotify whole. There is a squad dedicated exclusively to the search feature, for example.

"I'm the product owner of the Hadoop squad, which has a mission to make sure Hadoop is up and running at all times. We do Hadoop-as-a-service for the rest of Spotify," Baer says.

"Then there are guilds, which are cross-tribe focus areas, so we'll have a data guild that just talks about data, such as getting data from Cassandra to Hadoop or from Hadoop into Elasticsearch."

Moneysupermarket.com CIO Tim Jones recently told Computing that he is an admirer of this way of working. For his part, Baer says he always keeps an eye on how Yahoo, LinkedIn and Netflix manage large user datasets across Hadoop clusters.

"We read what they put out and find they are writing about problems we're just starting to hit," he says. "At the same time we spread information [to other Hadoop users] about the issues we hit and how we solve them."

Spotify's Hadoop cluster has grown to 1,300 nodes, and Baer reckons data volumes are doubling year-on-year. The cluster does not handle the music itself, which is streamed from a separate system; instead it handles data generated by the Spotify application. This might be combined with data from external sources such as Wikipedia before being crunched for machine learning purposes: looking for hidden patterns.

The firm uses this data for real-time analytics to improve its algorithms, to test out new products and also to fix bugs. After initial testing, new features and services are rolled out to a small fraction (0.5 per cent) of the user base, whereupon their success or failure can be quickly ascertained and any bugs or unexpected issues, such as higher latency on a particular platform, dealt with before the feature is trialled with a larger group.
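One common way to pick such a small, stable slice of users is to hash each user ID into a bucket and compare it with the rollout fraction. The sketch below is illustrative only: the article does not say how Spotify selects its 0.5 per cent, and the function name, feature name and user IDs are all hypothetical.

```python
import hashlib

ROLLOUT_FRACTION = 0.005  # the 0.5 per cent figure quoted in the article


def in_rollout(user_id: str, feature: str, fraction: float = ROLLOUT_FRACTION) -> bool:
    """Return True if this user falls inside the rollout slice for a feature.

    Hypothetical helper: hashing the feature name together with the user ID
    gives each feature its own stable, pseudo-random slice of users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


# Made-up user IDs, purely for illustration.
users = [f"user-{i}" for i in range(100_000)]
exposed = [u for u in users if in_rollout(u, "new-search")]
share = len(exposed) / len(users)
print(f"{share:.2%} of users see the feature")
```

Because the assignment depends only on the user ID and the feature name, a given user gets the same answer on every request, and raising the fraction later keeps the original group inside the larger one.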

"We have this concept at Spotify of 'hack weeks', when developers can do something completely different for a week, and we've had some pretty good features come out of that," Baer says. "But some ideas that seemed to be cool in reality don't get used much, so we look at the data and we fail fast. We celebrate the failure."

Baer is enthusiastic about the Hortonworks HDP Hadoop distribution used at Spotify and about his team's relationship with that distributor, but says that one element of the Hadoop ecosystem requires particular attention: Apache Hive, the data warehousing software. Baer says Hive is too slow for Spotify's analytical purposes.

"Our analyst might get asked which city contains the most Justin Bieber fans. They'll write an algorithm that runs over the listening data and that might take 40 minutes. We want something for that analytics use case that's much quicker.

"We want Hive to get a lot better," he says. "But one of the things we really like about Hortonworks is that they're very focused on that. Hortonworks has really taken Hive on and they're making efforts to improve it."
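The analyst question Baer describes, which city has the most fans of a given artist, boils down to grouping listening records by city and counting distinct listeners. The toy sketch below shows that logic in memory; it is not Hive, and the record layout, names and sample data are assumptions made up for illustration rather than anything from Spotify.

```python
from collections import defaultdict

# Hypothetical listening events: (user_id, city, artist).
plays = [
    ("u1", "London", "Justin Bieber"),
    ("u1", "London", "Justin Bieber"),  # repeat play by the same user
    ("u2", "Stockholm", "Justin Bieber"),
    ("u3", "Stockholm", "Justin Bieber"),
    ("u4", "Stockholm", "Another Artist"),
]


def top_city_for_artist(events, artist):
    """Return the city with the most distinct listeners of an artist, or None."""
    listeners_by_city = defaultdict(set)
    for user_id, city, played_artist in events:
        if played_artist == artist:
            # A set counts each user once, however many times they played.
            listeners_by_city[city].add(user_id)
    if not listeners_by_city:
        return None
    return max(listeners_by_city, key=lambda city: len(listeners_by_city[city]))


print(top_city_for_artist(plays, "Justin Bieber"))  # prints "Stockholm"
```

On a cluster the same group-and-count runs over far more records than fit in memory, which is where the query times Baer mentions come from.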