The language of data science

A glossary of terms from the world of data science

Data scientist
Someone who can bridge the raw data and the analysis - and make it accessible.

A data scientist works across IT and business departments. They must be able to take a large data set, model it, and ultimately tell stories from data - usually the hardest part.

Good data scientists do not just address business problems; they pick the right problems, the ones that have the most value to the organisation.

Google's Hal Varian put it this way back in 2009: "A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding. Perhaps the most important skill a data scientist possesses, however, is the ability to explain the significance of data in a way that can be easily understood by others."

"The sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"

In 2011 analyst McKinsey estimated that the US alone will need 190,000 new data scientists by 2018.

MapReduce
A processing model for efficient distributed computing over large data sets.

A problem is split across a large number of computational nodes. Each node produces partial answers (the map step), which are then collected and combined to form the output (the reduce step). Apache Hadoop is an example of a MapReduce implementation.
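To make the idea concrete, here is a minimal single-process sketch of the MapReduce pattern in Python, counting words across chunks of text. It is an illustration only and does not use Hadoop; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and the "nodes" are simulated by processing each chunk in turn.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group the values emitted by all nodes under their key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # In a real cluster each chunk would be handled by a different node;
    # here the "nodes" simply run one after another in a single process.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two chunks standing in for data held on two nodes.
counts = word_count(["big data is big", "data science is fun"])
print(counts)
```

Because each chunk is mapped independently, the map step can in principle run on as many machines as there are chunks; only the grouping and combining stages need to see results from more than one node.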

Hadoop
An open-source software project by Apache designed for the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

Hadoop is scalable, cost-effective and flexible in that it can absorb data from any number of sources. It is also fault tolerant: if a node is lost, it redirects work to another node in the cluster without the need for programmers to write additional code.

Major users include Yahoo, Facebook, Twitter and LinkedIn.

R
R is a programming language and environment used for statistical computing and graphics, and as an interactive tool for exploring data sets.

It provides a wide variety of statistical and graphical techniques including linear and non-linear modelling, statistical tests, time series analysis, classification and clustering.

Percolator
Google's Percolator is a system for incrementally processing updates to large data sets, and is used to create the Google web search index. It cuts latency in page crawling and indexing by a factor of 100.

Because of Percolator, the time taken to index a page is now proportional to the size of the page rather than to the size of the whole existing index.
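The difference between incremental and batch indexing can be sketched with a toy inverted index in Python. This is not how Percolator itself is built; the class and method names (IncrementalIndex, update_page, search) and the example URLs are assumptions for illustration. The point it shows is that re-indexing one page touches only that page's words, however many other pages the index already holds.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated one page at a time."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of URLs containing it
        self.pages = {}                   # URL -> set of words last indexed

    def update_page(self, url, text):
        """Re-index a single page and return how many postings changed.

        The work done depends on the words in this page only, not on the
        number of pages already in the index.
        """
        new_words = set(text.lower().split())
        old_words = self.pages.get(url, set())
        for word in old_words - new_words:
            self.postings[word].discard(url)   # word no longer on the page
        for word in new_words - old_words:
            self.postings[word].add(url)       # word newly on the page
        self.pages[url] = new_words
        return len(old_words ^ new_words)

    def search(self, word):
        """Return the URLs whose latest indexed text contains the word."""
        return sorted(self.postings.get(word.lower(), set()))


# Hypothetical crawl: two pages are indexed, then one of them changes.
index = IncrementalIndex()
index.update_page("a.example", "big data tools")
index.update_page("b.example", "data science")
touched = index.update_page("a.example", "big data platforms")
print(touched, index.search("data"), index.search("tools"))
```

A batch system would instead rebuild the postings for every page whenever any page changed, which is why its cost grows with the size of the whole index.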

NoSQL
NoSQL ("not only" SQL) is a class of database systems designed to handle large volumes of data that may be unstructured or semi-structured. Data is usually spread across several machines, allowing fault tolerance and rapid scalability.

This makes NoSQL ideal for processing big data, but standard RDBMS platforms remain preferable at lower volumes and where ad hoc queries and analysis are needed, for example in BI systems.

Perhaps the best-known examples of NoSQL databases are the open-source MongoDB by 10gen and Apache Cassandra.

Cassandra
Cassandra by Apache is a free, open-source storage system for managing large amounts of structured data. A NoSQL implementation, it uses a schema-optional, column-oriented data model. Unlike with a relational database, you do not need to define all the columns required by your application up front.
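What "schema-optional" means in practice can be sketched in a few lines of Python. This is a toy in-memory model, not the Cassandra API or its storage format; the ColumnFamily class, its methods and the sample rows are all assumptions made for illustration. Each row simply carries whatever named columns it has been given, and new columns can appear later without any table-wide change.

```python
class ColumnFamily:
    """Toy table in which each row holds its own set of named columns."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # row key -> {column name: value}

    def insert(self, row_key, **columns):
        """Add or overwrite columns on one row; no schema is declared."""
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key, column, default=None):
        """Read one column, returning a default if the row lacks it."""
        return self.rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """List the column names that this particular row actually has."""
        return sorted(self.rows.get(row_key, {}))


# Hypothetical data: rows in the same table end up with different columns.
users = ColumnFamily("users")
users.insert("alice", email="alice@example.com")
users.insert("bob", email="bob@example.com", twitter="@bob")
users.insert("alice", city="London")  # a new column, added after the fact

print(users.columns("alice"), users.columns("bob"))
```

In a relational table the city and twitter columns would have to be declared for every row before any data using them could be stored.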

Cassandra is designed to scale to a very large size across many commodity servers, with no single point of failure, so that capacity and performance can grow as servers are added.