San Francisco-based video services company Ooyala is, in many ways, an ideal big-data case study.
Beginning in 2007 as a public cloud-based start-up, Ooyala expanded into private clouds and its own on-premise data centre and moved away from its relational database roots, all on the back of the then-new distributed computing tools that are now familiar names.
An early adopter of the NoSQL database Apache Cassandra and data crunching platform Hadoop, the firm’s set-up now includes hot new technologies such as Storm, Spark, Shark and Splunk.
Ooyala provides media firms such as Bloomberg, ESPN, Telegraph Media Group and Yahoo! Japan with “actionable analytics”, enabling them to analyse in great detail the way that their video content is being consumed and to optimise its delivery to maximise revenues. The firm’s analytics engine processes over two billion analytics events each day, derived from nearly 200 million viewers worldwide who watch video on an Ooyala-powered player.
“Data is core to everything we do,” Ooyala’s director of platform engineering and operations Peter Bakas told Computing, insisting that the company stores all the data it collects, throwing nothing away.
“We store six years’ worth of data in our Cassandra cluster, more than any company in the world.”
The need to store all of this data, together with the firm’s expansion into new territories and the requirement for high availability to meet the needs of its client base, meant that Ooyala rapidly ran up against the limits of MySQL. So how did the company arrive at its current arrangement?
“In 2007 we developed our first-generation analytics systems based on MySQL, Hadoop, MapReduce and Ruby. But around 2009 we ran into limitations with RDBMS [relational database management systems], specifically around scalability, performance and availability, so we made the decision to migrate to Cassandra as our store,” Bakas explained.
“In 2011 we introduced real-time analytics that was based on leveraging Storm [Twitter’s open source real-time analytics software], with Cassandra as our store, MapReduce, and the Ruby and Scala languages.
“Now in 2013 we are building our next-generation analytics system based on Cassandra, Scala, Spark and Shark. We essentially have in-memory cluster computing.”
In 2009 Cassandra was one of very few NoSQL databases available. Now there are hundreds. If he were to start again, would Bakas make the same choice?
“I think we made the right choice. From a features perspective Cassandra has a lot of developer support and enterprise support. The masterless architecture of Cassandra lends itself well to what we are doing, especially around scalability, performance and availability. Being able to linearly scale globally has been huge for Ooyala and our growth,” he said.
“We are in multiple data centres, in public clouds and in our own private clouds that we run on commodity hardware. Cassandra allows us to expand into new territories seamlessly. Other NoSQL databases have similar features, MongoDB for example, but for us Cassandra fits very well.”
Elsewhere in the stack, Ooyala uses Storm to mine its data and deliver real-time intelligence, such as viewing patterns and personalised content recommendations, to its media company clients. Machine data intelligence software Splunk provides visibility across all operations.
“Storm is what we use to process real-time analytics. Splunk we use for log processing and troubleshooting, to help us manage the systems,” said Bakas.
Ooyala is a contributor to the Apache Spark project, an in-memory alternative to Hadoop MapReduce that plugs into the Hadoop Distributed File System (HDFS) and is quicker and easier to use than MapReduce in many use cases. The firm also uses Shark, a port of Apache Hive that runs on top of Spark and allows SQL-like querying.
All of which puts Ooyala at the cutting edge of big data technology, with the ability to expand rapidly and to provide real-time intelligence based on huge data sets. So what effect does Bakas think this technology will have on the traditional data centre?
“We store data on commodity hardware in our own data centre but we also leverage regional public clouds appropriately to allow us to contain costs and to scale. If you need to add presence in a region you can just add nodes [Ooyala claims to be running more than 200 Cassandra nodes] and you have a linear scalability.”
Cloud and distributed computing are changing the landscape, said Bakas. Enterprises are taking advantage of public cloud services, while SMEs are building their own private clouds, often on-premise.
“In general, most of the new technologies can run on commodity hardware and can scale, so you can run at lower cost. The barrier to entry to creating your own facilities has been lowered,” he said.