IBM sparks conversations about analytics, processing and the hunt for ET
IBM researchers will present their findings at the Spark Summit in San Francisco
The Spark Summit is the annual event where scientists, analysts, developers and researchers discuss Apache Spark and how it can be applied to big data, machine learning and data science to deliver new insights.
This year, the event runs from the 5th to the 7th June in San Francisco, where IBM scientists will present several talks on the uses of Spark, including storage, parallel processing and even the search for alien life.
Making the most of distributed storage
On the 6th June, IBM's researchers will present ‘Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash’. The talk will focus on Crail, an open-source distributed storage system designed for fast network and storage hardware, and on getting the most out of modern hardware for big data and high-performance analytics.
The challenge Crail addresses is, says IBM, ‘rather simple’: how can you ensure that data is accessed efficiently when it is spread across several storage and networking technologies (flash, DRAM, remote direct memory access, etc.), all operating at different speeds and densities? Crail acts as a blueprint showing how such systems can be integrated for the best performance.
The development team benchmarked Crail with Spark, sorting 12.8 TB of data in 98 seconds (3.13 GB per minute per core), about five times faster than the winner of the 2014 Spark benchmark. All of the team's benchmarking work, which also covers SQL and machine learning, will be presented at the summit.
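To put those figures in context, the short Python sketch below works through the arithmetic. It uses only the numbers quoted above and assumes decimal units (1 TB = 1,000 GB) and that the per-core rate is averaged evenly over every core in the cluster; the implied core count and the implied rate of the 2014 winner are derived estimates, not figures published by IBM.

```python
# Figures quoted in the article. Decimal units (1 TB = 1000 GB) are an assumption.
DATA_TB = 12.8
SORT_SECONDS = 98
RATE_GB_PER_MIN_PER_CORE = 3.13
SPEEDUP_OVER_2014_WINNER = 5  # "about five times faster"


def aggregate_gb_per_min(data_tb: float, seconds: float) -> float:
    """Total sorting throughput of the whole cluster, in GB per minute."""
    return data_tb * 1000 / (seconds / 60)


def implied_cores(data_tb: float, seconds: float, rate_per_core: float) -> float:
    """Core count implied by the quoted per-core rate (an estimate only)."""
    return aggregate_gb_per_min(data_tb, seconds) / rate_per_core


total_rate = aggregate_gb_per_min(DATA_TB, SORT_SECONDS)
cores = implied_cores(DATA_TB, SORT_SECONDS, RATE_GB_PER_MIN_PER_CORE)
baseline_rate = RATE_GB_PER_MIN_PER_CORE / SPEEDUP_OVER_2014_WINNER

print(f"Aggregate throughput: {total_rate:,.0f} GB/min")
print(f"Implied core count:   {cores:,.0f}")
print(f"Implied 2014 rate:    {baseline_rate:.2f} GB/min/core")
```

On these assumptions, the cluster sorted close to 8 TB per minute in aggregate, which would correspond to roughly 2,500 cores.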
The search for life
IBM scientist Gil Vernik will present his talk on the 7th June, describing his work with an IBM team on NASA's SETI project, which runs on the IBM Cloud platform. He will present his ‘Stocator’ technology, which improves the way large data files are stored and analysed. Together with Graham Mackintosh, also of IBM, Vernik will explain how Stocator is being used in a collaborative project between NASA, the SETI Institute, Swinburne University, Stanford University and IBM.
Stocator is an open-source object store connector for Hadoop and Apache Spark. It is designed to optimise the performance of these platforms with object stores.
Perfecting parallel processing
Also on the 7th June, Kazuaki Ishizaki's summit presentation will focus on the machine learning library and its ageing internal APIs, which cannot take advantage of the newer technologies that delivered performance improvements in SQL. Ishizaki has, says IBM, made several improvements to update the code. IBM expects these updates to be particularly welcomed by developers building apps for the IoT, autonomous driving and weather forecasting, which will benefit from improved real-time processing and learning precision.