Paypal: Big data is a scam

Payment firm's chief architect claims relational tools can solve almost all data problems

Daniel Austin, Paypal's chief architect, has claimed that big data is a scam "most of the time", stating that relational database tools can almost always solve supposed big data problems.

Austin made his controversial claims at the MySQL Connect stream of this week's Oracle Open World conference in San Francisco.

"Big data is a big scam most of the time," said Austin. "A lot of people think they have a big data problem, but in fact they don't, they just want to find a big data solution because they think it looks good."

Awareness of big data has been growing in recent years among IT professionals, as businesses look to exploit the torrents of unstructured data arising from internet log files, social media, and smart sensor networks, among other sources.

The problem experienced by many businesses is that this unstructured data does not fit easily within traditional relational database management tools, making it hard to capture, store and analyse the information.

Austin gave the audience his take on the situation.

"There's lots of data coming in very quickly, with complex data models, and you need to write [to disc] faster than you read. There is also a fast data problem [as opposed to the volume problem], as seen by firms like Twitter, who need to process millions of queries per second.

"But you don't necessarily need [non-relational database management system] NoSQL. That's just one proposed solution."

Austin said that people look to non-relational tools because they feel that typical relational database management systems (RDBMS) are slow, require complex data management, are costly to hold and maintain, and are slow to change and adapt.

"People want to give up their relational model, I don't know why. I like my relational model. You don't have to give it up to solve your big data problem."

One solution some turn to is open source software framework Hadoop. But Austin said that this doesn't provide the fast processing that users expect.

"Hadoop-based solutions are mostly batch based. Talk to large consumers of Hadoop systems like Netflix, you'll find they're not processing data in real time either, it's mostly batch processing."

Austin added that there are trade-offs and compromises inherent in any solution.

"The real story here is the trade-offs made by designers of different systems, and the main trade-off is between consistency and availability, usually in favour of the latter.

"We're on the upswing of the hype cycle in big data, on the slope of enlightenment. We should see mature models in a year or so for this software. Look at [open source distributed database management system] Cassandra two years ago and today, you'd be amazed at some of the problems that have been solved."
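The consistency-versus-availability trade-off Austin describes can be illustrated with a toy model. The sketch below is not taken from his talk and does not represent any real database; the `TwoNodeStore` class, its `prefer` setting and the `partitioned` flag are all hypothetical names invented for illustration. It assumes two replicas that briefly lose contact with each other: a store that favours consistency refuses writes during the split, while one that favours availability accepts them and lets the replicas disagree.

```python
class TwoNodeStore:
    """Toy two-replica key-value store (hypothetical, for illustration only)."""

    def __init__(self, prefer):
        # prefer is either "consistency" or "availability" (assumed setting)
        if prefer not in ("consistency", "availability"):
            raise ValueError("prefer must be 'consistency' or 'availability'")
        self.prefer = prefer
        self.replicas = [{}, {}]   # one dict per replica
        self.partitioned = False   # True while the replicas cannot talk

    def write(self, node, key, value):
        """Try to write via replica `node`; return True if accepted."""
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for replica in self.replicas:
                replica[key] = value
            return True
        if self.prefer == "consistency":
            # Refuse the write rather than let the replicas diverge.
            return False
        # Favour availability: accept locally, replicas may now disagree.
        self.replicas[node][key] = value
        return True

    def read(self, node, key):
        """Read whatever replica `node` currently holds for `key`."""
        return self.replicas[node].get(key)


if __name__ == "__main__":
    for mode in ("consistency", "availability"):
        store = TwoNodeStore(prefer=mode)
        store.write(0, "balance", 100)
        store.partitioned = True
        accepted = store.write(0, "balance", 90)
        print(mode, accepted, store.read(0, "balance"), store.read(1, "balance"))
```

In this simplified picture the consistency-first store stays correct but turns a customer away, while the availability-first store keeps answering but can return a stale value from the second replica.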

He also explained that neither big data nor NoSQL is a new idea.

"The first and most successful NoSQL system is DNS [Domain Name System] that's been around since 1983. Availability is its number one concern."

The solution that Austin built for Paypal to solve its unstructured data problem uses traditional relational database tools. He referred to it wryly as "YeSQL" in his presentation.

"I built a large big data system with a relational model, and I call it YeSQL. It works really well. My executives don't care about big data, they just want the data to be accurate. They said it can't fail, and it must support transactions. The maximum data volume should be 100TB of fixed data storage, and must scale linearly with costs.

"Data must be available anywhere in the world in under 1,000ms. So I chose a MySQL cluster, a fast recovery, in-memory system and deployed it over Amazon Web Services."
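The requirement that the system "must support transactions" is the kind of guarantee relational tools provide as standard. As a minimal sketch of what that means in practice, the example below uses Python's built-in `sqlite3` module as a stand-in; it is not Paypal's system and does not use MySQL Cluster. The `accounts` table, the `make_demo_db` helper and the `transfer` function are hypothetical names chosen for illustration. The point is that a failed transfer is rolled back in full, so the data is never left half-updated.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)",
            [("alice", 100), ("bob", 20)],
        )
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects an overdraft here, which
            # also undoes the credit made by the statement above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
    except sqlite3.IntegrityError:
        return False
    return True


def balances(conn):
    """Return all balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))


if __name__ == "__main__":
    db = make_demo_db()
    print(transfer(db, "alice", "bob", 30), balances(db))
    print(transfer(db, "alice", "bob", 500), balances(db))
```

The second transfer in the usage lines asks for more than the sender holds, so both updates are discarded and the balances are left as they were after the first transfer.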

He summed up by entreating delegates at the event to only use big data solutions when there is a genuine big data problem.

"Don't just follow the technology fashion. You can achieve high performance and availability without giving up relational models and read consistency, just say YeSQL! Not all big data solutions are created equal, so you have to ask: what trade-offs are most important to you?"