Big Data & IoT Summit 2017: How CERN will manage hundreds of exabytes of data to support the Future Circular Collider

Few organisations manage and analyse data on the scale that CERN does, Dr Jamie Shiers revealed at today's Computing Big Data and IoT Summit

CERN, the pan-European scientific organisation responsible for the Large Hadron Collider (LHC), was developing systems to support big data analytics while most organisations were still struggling with 500-page reports printed out from business intelligence applications.

Dr Jamie Shiers, data preservation project leader at CERN, revealed that the LHC, planned in the 1990s and only implemented in the 2000s, today collects around 50 petabytes (PB) of data every year, ten times the 5PB that the organisation collected, managed and analysed in the 1990s.

Shiers was talking at today's Computing Big Data and IoT Summit 2017 in London.

"We don't even bother talking about anything below 100TB" - Dr Jamie Spiers, CERN

The next iteration of the LHC, the Future Circular Collider (FCC), will produce hundreds of exabytes (EB) every year.

CERN now has around 200PB of data in total, and that volume is expected to rise to 10EB by 2035, which will mark the end of LHC data collection.

Clearly, today's big data will look like "peanuts" compared with the volumes the organisation will be producing, managing and analysing in future. "We don't even bother talking about anything below 100TB," he said. Over time, the cost of storing data tends towards zero as storage becomes ever cheaper, he added.

The information produced by CERN's experiments is important, but it is not the only data the organisation deals with: there is also software, documentation, environment and knowledge. Collective knowledge - i.e. that held by everyone working at CERN - is (not surprisingly) particularly hard to capture, said Shiers.

CERN uses a 'grid' to store all of the data it collects: the raw data is held in robotic tape libraries and distributed over private optical links worldwide. This approach was chosen over the cloud out of necessity: cloud computing simply was not available when the majority of the R&D - and funding - for the LHC was done, in the late 1990s and early 2000s.

CERN has faced several challenges with its computing 'grid', the Worldwide LHC Computing Grid (WLCG). It is as much about people and collaboration as technology, and getting people on the other side of the world to provide a 24/7 service for a machine that they have never seen, for no clear reason, was a real challenge, said Shiers.

"Most [HEP data] is thrown away before it is recorded; if it were all kept, the organisation would have to store around 50PB - per second" - Dr Jamie Shiers, CERN

The organisation runs regional workshops and daily operations calls to keep people motivated and the technology running. However, something in the grid is always broken, admitted Shiers. But the day the Higgs boson was discovered was the first time that the role of computing, through the WLCG, was acknowledged as necessary at CERN.

Data preservation in HEP

Dr Jamie Shiers is in charge of data preservation at CERN

The data from the world's particle accelerators and colliders (HEP data) is both costly and time-consuming to produce. However, it holds significant scientific potential and high value for educational outreach, and many of the collected samples are unique. It is therefore essential both to preserve the data and to retain the full capability to reproduce past analyses and to perform new ones.

Despite the importance of HEP data, most of it is thrown away before it is ever recorded; if it were all kept, the organisation would have to store around 50PB - per second. The sensors used to collect this data are stable over long periods of time - years, in many cases - and, unlike sensors and chips in other industries, they do not follow Moore's Law and double in capacity or performance every couple of years.
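To make that figure concrete, the discarding happens in the experiments' trigger systems, which decide in real time which collision events are interesting enough to write to storage. The following is a minimal sketch of that idea in Python, not any real CERN trigger: the event structure, the "score" and the threshold are invented purely for illustration.

import random

# Hypothetical illustration of trigger-style filtering: the detector produces
# far more events than could ever be stored, so only the small fraction that
# passes a selection condition is recorded. All values here are invented.

RECORD_THRESHOLD = 95.0  # arbitrary "interestingness" score out of 100

def detector_events(n):
    """Simulate a stream of n collision events, each with a random score."""
    for i in range(n):
        yield {"id": i, "score": random.uniform(0.0, 100.0)}

def trigger(event):
    """Keep only events whose score exceeds the threshold."""
    return event["score"] >= RECORD_THRESHOLD

if __name__ == "__main__":
    produced = 1_000_000
    recorded = sum(1 for e in detector_events(produced) if trigger(e))
    print(f"produced {produced} events, recorded {recorded} "
          f"({100 * recorded / produced:.2f}% kept)")

Run on a million simulated events, a filter like this keeps only a few per cent; the real systems are vastly more sophisticated, but the principle - select before you store - is the same.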

CERN's projects often last for decades, and so it is important to keep the collected data usable for at least this length of time. Bit preservation is very important; data is migrated every two-to-three years and accessed - even if it is not being used - at least once a year to ensure that it isn't corrupt.
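One simple way to picture the "access it at least once a year" part of bit preservation is periodic checksum verification: record a checksum when a file is archived, then re-read and re-hash it on a schedule and flag any mismatch for recovery from another copy. The sketch below is a generic Python illustration of that idea, not CERN's actual tooling; the archive path and manifest format are assumptions made for the example.

import hashlib
import json
from pathlib import Path

# Minimal sketch of checksum-based bit preservation: store a manifest of
# SHA-256 digests when files are archived, then periodically re-read every
# file and compare against the manifest. Paths and layout are hypothetical.

MANIFEST = Path("archive_manifest.json")  # hypothetical manifest location

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(archive_dir: Path) -> None:
    """Record a checksum for every file at archiving time."""
    manifest = {str(p): sha256_of(p) for p in archive_dir.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify_manifest() -> list:
    """Re-read every archived file (e.g. once a year) and report corrupt or missing ones."""
    manifest = json.loads(MANIFEST.read_text())
    bad = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.exists() or sha256_of(path) != expected:
            bad.append(name)  # candidate for restore from a replica elsewhere
    return bad

if __name__ == "__main__":
    build_manifest(Path("archive"))                   # run once when data is written
    print("corrupt or missing:", verify_manifest())   # run on the verification schedule

Migration to new media every few years then becomes a matter of copying the files and re-verifying the checksums on the new copy before the old one is retired.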

As a result, Shiers takes many vendors' claims - such as tape manufacturers' thirty-year guarantees - with a pinch of salt.

Data continues to be used long after it has been collected, which provides a clear business case for its preservation. Shiers said that keeping data can deliver as much as 10 per cent of the scientific output for less than one-millionth of the original cost, and even more - as much as a 40 per cent return on investment - once potential spin-offs, such as advances in superconductivity and distributed computing that it may help generate, are included.

While the hardware costs can be high initially, they tend towards zero over time. CERN is now producing much more with the same budget and fewer staff than it did in the LEP days.

In closing, an audience member asked how industry can get involved with CERN's projects. To some extent, said Shiers, the industry is already there.

For example, there was a collaboration with Oracle some years ago examining how its software could be changed to make it more usable for CERN. This resulted in several additions to Oracle's technology, including support for floating-point numbers.
