IBM announces new Spark-based Data Science Experience

Hosted on Bluemix cloud, the platform offers Spark as a service to data scientists

IBM has launched an Apache Spark-based development environment called Data Science Experience on its Bluemix cloud platform.

Data Science Experience is "an interactive, collaborative, cloud-based environment where data scientists can use multiple tools to activate their insights", according to product manager Armand Ruiz.

Apache Spark speeds up iterative and machine-learning-type applications because it keeps the working data in memory. It is frequently used together with Hadoop, as it is quicker than MapReduce for many jobs, but it can also be used with other storage back-ends.
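
As a rough illustration of why holding data in memory helps iterative jobs, the sketch below is plain Python and does not use the Spark API. The names `Storage`, `refine`, `run_reloading` and `run_cached` are hypothetical, and the read counter merely stands in for expensive trips to disk or another storage back-end.

```python
# Illustrative only: plain Python, not Spark. All names are hypothetical.


class Storage:
    """Stand-in for a storage back-end that counts how often it is read."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def load(self):
        self.reads += 1          # each call models one expensive read
        return list(self.rows)


def refine(estimate, data):
    """One iterative step: move the estimate halfway towards the data mean."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def run_reloading(storage, passes):
    """Go back to storage on every pass."""
    estimate = 0.0
    for _ in range(passes):
        estimate = refine(estimate, storage.load())
    return estimate


def run_cached(storage, passes):
    """Read once, then reuse the in-memory copy on every pass."""
    data = storage.load()
    estimate = 0.0
    for _ in range(passes):
        estimate = refine(estimate, data)
    return estimate
```

Run for ten passes over the same rows, both functions arrive at the same estimate, but the reloading version reads storage ten times while the cached version reads it once. The real savings in Spark depend on the workload and cluster, so this is only a toy model of the idea.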

IBM is a big investor in Spark, ploughing $300m into projects such as the machine learning library SparkML and libraries for the R programming language and SparkSQL querying. The latest offering is understood to be additional to this investment programme.

The idea behind Data Science Experience is to bring together various open source tools in one place so that they can be used interactively and collaboratively. Being cloud-based will make it more accessible to users without an on-site Spark installation.

Among the tools available are the Jupyter data science notebook, a GUI that allows data scientists to write code for Spark in languages such as Scala, R and Python and to share their live efforts with other users. There is also a data visualisation tool called Shiny and tools for preparing, cleaning and loading the data.

The service is now available in limited preview.

IBM's involvement with Spark goes beyond contributing to libraries and services. It has rewritten many of its enterprise applications to incorporate the platform.

By refactoring the IBM DataWorks data refinery and ETL (extract, transform and load) solution to incorporate Spark, the company was able to reduce the number of lines of code by 87 per cent, Anjul Bhambhri, VP product development, big data and analytics platform at IBM, told Computing last year.

Spark is also built into other IBM projects and solutions such as BigInsights for Apache Hadoop, Watson Analytics, SPSS Modeler and IBM Stream Computing.

IBM is not the only tech giant unveiling a new Spark-based service. Microsoft today announced general availability of Apache Spark support for its Azure HDInsight cloud-hosted service for big data analytics, as our sibling site V3 reports.

"Our goal with big data is to make it accessible for everybody. With Spark for HDInsight, we have designed new productivity experiences for the different audiences that use Spark including the data engineer working on ETL jobs, the data scientists who are performing experimentation and the business analysts who are creating dashboards," said Microsoft's senior product marketing manager for big data and data warehousing Oliver Chiu, writing on the Azure blog.