How cloud-based supercomputers are helping scientists to collaborate in the quest to cure cancer

Researchers at eMedLab explain how they're deploying big data and machine learning in the fight against cancer

Despite the great advances being made in modern medicine, a cure for cancer unfortunately remains elusive.

The disease is one of the most common causes of death in the Western world, and a string of recent high-profile celebrity deaths from the condition serves as a reminder that cancer continues to blight humanity.

But scientists and researchers around the globe are working hard to improve cancer treatment using a wide variety of techniques. One of those many projects is eMedLab, a partnership of leading bioinformatics research and academic institutions, which are working together to use supercomputing, big data analytics and machine learning to help find a cure for cancer and other diseases.

The partnership includes seven bioinformatics organisations - University College London, Queen Mary University of London, London School of Hygiene & Tropical Medicine, the Francis Crick Institute, the Wellcome Trust Sanger Institute, the EMBL European Bioinformatics Institute and King's College London - and its ultimate goal is to aid hundreds of researchers studying cancers, cardiovascular and rare diseases.

"EMedLab is a £9m project grant awarded by the Medical Research Council in response to the need to develop infrastructure and expertise, and big data and data science in the clinical domain," says Nick Luscombe (pictured below), professor of bioinformatics at UCL and senior group leader at the Francis Crick Institute.

As scientific lead for the project, Luscombe tells Computing that his role is about "bringing together scientists and clinicians in order to be able to use the infrastructure and to analyse the relevant data".

The very nature of clinical research means it generates "huge" amounts of data, ranging from patient records to genomics datasets. However, prior to the eMedLab project "we lacked the ability to bring in computer scientists to harness various analytical methods in order to look at this data", says Luscombe.

One of the key hurdles to collaborative research is the size of the data. "Because of the data being so big, it's challenging to transfer it from one place to another," along with "the added difficulties of privacy", he says. That's why, he adds, the eMedLab project was set up.

"EMedLab is a response to that, bringing together seven partners to all work on the same platform, share datasets and also bring together different expertise," he says.

Bruno Silva, high performance computing lead at the Crick Institute, runs the service operations of eMedLab, a role that makes him responsible for "coordinating a team of technical staff that maintain, operate and deliver cloud services for the research project", he tells Computing.

The project was built in a "very collaborative way across all the institutions", he says, in order to ensure that eMedLab could cater for the individual needs of each partner.

"The decision was to have a data-sharing infrastructure that could provide data to computing resources at a very high rate and be processed quickly, while at the same time provide as much flexibility as possible, owing to the diverse nature of the research and the different institutions," says Silva.

"The local support teams at each institution needed the ability to use their own methods and tools," he adds.

The procurement process set out key requirements for the project, such as the latest high-performance computing (HPC) systems and the ability to leverage cloud, so that "the potential from HPC could be met without being constrained by a single software stack in a typical HPC system", says Silva.

EMedLab eventually opted for a system from HPC, data management and data storage provider OCF that includes technology from IBM, Red Hat and Mellanox. "The guarantee of support throughout the lifetime of the project was key," Silva says.

The project is unique in that, unlike other supercomputers, it will use cloud technology and enable researchers to access 6,000 cores and six petabytes of storage, no matter which of the eMedLab partners they're working from.

"We're looking at a performance of 100 gigabytes per second throughout the whole cluster. A cluster of 6,000 cores, each compute node has 24 cores each and 512 gigs of RAM, so half a terabyte and a powerful machine," says Silva, who describes the system as a "unique set-up".

The key benefit, Luscombe says, is that the HPC cluster will enable users from all of the institutions taking part to examine medical data and try to find patterns that could lead to cures.

"The really key thing here is there are seven partners, looking at different types of data - patient records, genomic data and clinical imaging data - and trying to bring these data types together to answer clinically relevant questions that have impact on patient treatment," he says.

So why did eMedLab choose OCF?

"One reason is the ability to handle large amounts of data, so the storage aspect is crucial. The second is the ability to have this data stored securely, but then also have a common resource that's accessible by potentially a large number of people across different institutions and to have the appropriate access rights to this in a decentralised manner," Luscombe says.

Flexibility was another crucial factor, he adds.

"Different projects will have different requirements so you want to build tools and resources which are needed for different types of work and you wouldn't necessarily want them to have the same set-up," he says.

Jacky Pallas, director of research platforms at UCL, describes how this can benefit researchers across all of eMedLab.

"If we're working on a platform where we can hold information from thousands of patients - their DNA sequence data, their cancer imaging data and their clinical data - and bring those different sets together and use machine learning and other techniques to interrogate it, then the data all needs to be in one place so that different people working on different types of cancer can understand commonalities of different cancer types and take that forward into therapy," Pallas says.

And machine learning is "absolutely" useful to the clinical research, Luscombe tells Computing, describing how it can enable a lot more data to be crunched.

"When you look at it from the molecular perspective in terms how these diseases occur and the measurements you make, the data types you gather in order to examine these diseases, there are a lot of underlying commonalities," he explains.

"So one of the aspects of this is to take advantage of the commonalities underlying the different data types to try and break across the different silos of these different diseases. Of course, machine learning is one of the key methods different users are hoping to apply in looking at these."

It's the ability of machine learning algorithms to automatically incorporate new data which has massive potential in clinical research, he says.

"Traditional methods have used various regression models, which are successful, but they're limited because they're fixed; you've got to fix what predictor variables to put in and you can't incorporate new data types very easily," Luscombe says.

"So you can easily imagine that being complemented with data mining and machine learning to learn new variables automatically and then also to incorporate unconventional data types like genomics data and aspects of what mutations you have and predict different outcomes," he continues, adding "machine learning has an important part to play" in diagnoses and treatment of diseases.

Therefore, in addition to researchers and medical experts being a key part of the eMedLab project, "we're collaborating closely with computer scientists", says Luscombe.

"We're collaborating with the new Alan Turing Institute, next to the Crick Institute in the British Library," he continues, adding that partnerships between different institutions and experts will be key in finding cures for cancer and other diseases.

"These methods can't be developed in isolation, they've got to be done in close partnership with the clinicians. Data can't just be viewed as numbers, so the partnership is critical," Luscombe says.

"What we're aiming for is for this project to act as a bridge between different data types, a bridge between these different partner institutions. It's a bridge between different expertise, so we really view eMedlab as a hub," he concludes.