Why we use Presto not Hive for interactive Hadoop queries, Shazam data engineer

Dan Osipov, data engineer at music app and media streaming service Shazam, explains why Facebook's Presto is quicker by design

Hadoop is a revolutionary data management platform and one whose time has come, but it is not great for everything. For example the sort of rapid ad hoc querying often required for analytics is not Hadoop's forte, at least not without some sort of add-on.

The sticking point here is Apache Hive, the data warehousing solution that was designed with batch processing of large datasets in mind.

King, producer of the Candy Crush game, ended up using Exasol's analytics database alongside Hive in order to process the sort of queries that its analysts need to make on the fly.

"That quick dive into the data, getting a basic view of what's going on and understanding things before you start doing the more elaborate investigations? That part of the process just isn't in Hadoop," said VP analytics Vince Darley.

Meanwhile, Josh Baer of music streaming service Spotify told Computing that querying user data to find which city has the most Justin Bieber fans might take as much as 40 minutes, should you ever want to do such a thing.

"Hive needs to be a lot faster," he said.

So why is Hive still slow for queries like Spotify's despite all the work being put into it by Hortonworks and other developers on the Apache project? Who better to ask than someone from another music company that is using Hive but is now deploying another solution for ad hoc SQL queries.

Dan Osipov is a data engineer working for Shazam, the company behind the eponymous music recognition app and media streaming service. He has been working with Hadoop for three years.

Computing: So why is Hive slow?

Osipov: Any query written in Hive is translated into a MapReduce-style computation that is submitted to a Hadoop cluster for execution. Non-trivial queries will be translated into multiple MapReduce stages, with intermediate results being written to HDFS.

What do you use instead for such queries?

We use Presto [the distributed SQL query engine for big data developed by Facebook]. I've used it for half a year.

How does the architecture of Presto make it faster than Hive?

Presto translates the query into a pipeline execution rather than a MapReduce workload. It uses memory much more aggressively than Hive, keeping the intermediate data in memory, rather than using disk.

Presto is optimised to work with columnar storage formats like ORC or Parquet, where data is broken up by column, and column values are stored sequentially, rather than standard row storage format.

This means that when a query selects only a few columns, Presto only looks at the files containing the data for those columns, ignoring the rest ... Column metadata is also cached in memory by Presto, so it's able to cut down the files that need to be examined for any particular query very quickly.

For what type of use case is Hive better than Presto and vice versa?

Presto is optimised for interactive queries, where a data analyst is able to quickly parse lots of data to perform an aggregation, or to select a smaller data set. Hive is still useful for batch processing, where the output is too large for the end user, and is meant for further processing. Finally, Hive is more mature, and generally more stable than Presto.

What alternatives to Presto are there?

Cloudera Impala is very similar in its architecture and performance, the differences are mainly operational. Both databases use the Hive Metastore for table schemas - which makes them easy to use alongside of Hive. In fact, utilising Hive is still necessary for some operations, like loading data in.

Apache Drill is a very interesting project, able to interact with various data formats. It's very good for processing arbitrary JSON. SparkSQL is also evolving very fast, and is integrated very well into the Spark ecosystem. Once you have Spark installed, the barrier to trying out SparkSQL is low.

How did you get started with Presto at Shazam?

We used it on a greenfield project where we evaluated Presto against a few other systems - Hive, Spark and Drill - for interactive analytics against raw data, and settled on Presto.