The Natural History Museum is digitising the skeletons in its closet

How a decades-old mosquito corpse helped to understand the spread of Zika

The Natural History Museum's Collection deserves the capital letter. At a staggering 80 million-plus items, it reportedly holds more than any similar museum in the world except the Smithsonian. Even speed-reading the list of artefacts would, it is estimated, take decades.

Nobody can possibly be acquainted with every item in the Collection, but as the head of the Museum's Digital Collections Programme (DCP), Helen Hardy is better positioned than most.

Helen Hardy - image credit: G66

"Our collection has a huge scientific audience and scientific significance: it's one of the largest collections in the world...and the most broad in terms of time and geography [Editor's note: the Smithsonian is focused on North America]. And it's the only time-series data that we have of our environment and our world that goes back 200 or 300 years... You can get a lot of observational data in more recent times, but probably reliably only over the last few decades.

"The difference between collections data and observations data is with collections data you can always go back to the original object and check that the person who observed it in the first place was right, and you can change your mind."

She adds, "I think for many museums, when they're digitising it's about the concept of a digital museum, and perhaps their public audience. That is part of it for us, but the main driver was that research audience, and the fact that if we release the data en masse, that you can do new kinds of research with it, and it's kind of a new paradigm. So it's not just that people can look at things online and figure out whether they need to visit us, or you do some of that science without visiting - it's that they can do new things."

Having the ability to assess specimens digitally - without risking them in transport or observation - is one of the key drivers behind the work of the DCP. The Programme has a simple - but not easy - goal: it needs to digitise the Collection.

A digitiser working on the Collection (© Trustees of the Natural History Museum, London)

Since 2014, the DCP has worked to photograph, scan and otherwise create digital records of specimens. After five years of work, the project is only about five per cent complete, and although work is speeding up, the Collection also continues to grow:

"I think people think of a museum as somehow a fixed entity, but actually, we're still collecting in the field ourselves; we still purchase very exceptional specimens occasionally, and we still receive specimens from collectors all over the world, on a fairly regular basis. So it's growing different amounts every year, but tens of thousands, maybe hundreds of thousands of items a year."

Hardy gives an example: a paper published in 2017 used mosquito range data from the NHM's collection, and others, plus information about population density, to model the global risk of exposure to the Zika virus. In her words, "There's some really cool science being done."

Insects like mosquitoes make up a significant proportion of both the Collection itself (more than 30 million specimens) and its digitised segment. The smallest item digitised to date is the microscopic fairyfly, while the largest is the 6 m blue whale skull that hangs in the Museum's central Hintze Hall.

Standardising uniqueness

With such variety in the artefacts, the DCP team must work hard to ease the digitisation process. The biggest challenge is in trying to apply industrial-scale processes to completely non-standard items, which requires some out-of-the-box thinking. For example, one of Hardy's colleagues, Steen Dupont, builds prototypes out of technical Lego, and the workspace features light boxes built using old Museum promotional boards.

"[Lego is] quite a clever, low-cost way to try things out and to build a little framework where you can put a specimen in the middle, and turn it round; to build supports for cameras or mirrors... It's just a really convenient way to experiment, and of course if it doesn't work, you can take it apart and use it for something else, you haven't invested a lot of time and money in something that doesn't work. So it's been really helpful for us."

Once the team captures the images or scans, there is still more data to extract. For example, they are working with barcodes to automatically record information from the (often handwritten) labels, without manual transcription.

A recent acquisition of wasps, part of the Cooper Collection, that have now been digitised (© Trustees of the Natural History Museum, London)

Another project is the Specimen Data Refinery, which puts images through semantic segmentation to pick out different parts, such as the specimen, the labels and the attributes of those labels (a red border, for example, denotes a type specimen: the one from which a species was described). The specimens then go through colour analysis or automated measurement, while the labels are subject to optical character recognition (OCR). The system will "massively" increase the ability to extract scientifically useful information from digitised artefacts, says Hardy. She adds:

"[T]he automation...will be really brilliant when we can get it working. It's partly using mainstream solutions. So OCR, for example: there are some quite good mainstream solutions now, but often it requires some kind of tailoring or extra R&D. For some of the people we've worked with, it's been a good opportunity for them to push the boundaries of what their programmes are doing, because we obviously have many, many, many languages, including a lot of Latin, on our labels. We have handwriting of all kinds from all eras; we have, you know, German Gothic or whatever. We have a huge range of different date formats and different geographical entities to deal with. So...lots of places have complex data, but ours can give them a run for their money, for sure."
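The routing Hardy describes for the Specimen Data Refinery can be pictured with a small sketch. This is an illustration only, not the Museum's code: the `Region` class, its field names and the step names are all hypothetical, chosen to mirror the description above (specimens go to colour analysis or measurement, labels go to OCR, and a red border marks a type specimen).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """One segmented part of a specimen image (hypothetical structure)."""
    kind: str                     # assumed class names: "specimen" or "label"
    has_red_border: bool = False  # only meaningful for labels


def plan_steps(region: Region) -> List[str]:
    """Return the processing steps a segmented region would be sent to."""
    if region.kind == "specimen":
        # Specimens go through colour analysis or automated measurement.
        return ["colour_analysis", "automated_measurement"]
    if region.kind == "label":
        # Labels are read by optical character recognition.
        steps = ["ocr"]
        if region.has_red_border:
            # A red border denotes a type specimen.
            steps.append("flag_type_specimen")
        return steps
    # Unknown region kinds are left untouched in this sketch.
    return []
```

In this sketch, a red-bordered label would be sent to OCR and then flagged as a type specimen, while a plain label would only be sent to OCR.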

Connecting with open data

Another benefit of going digital is the ability to connect more easily with other institutions' data. Museum collections across Europe are now part of the ESFRI (European Strategy Forum on Research Infrastructures) roadmap, through the DiSSCo (Distributed System of Scientific Collections) project, which recognises them as a distributed research infrastructure, similar to CERN or the large telescope arrays.

As part of this move towards working together, much of the Museum's Collections data is now open by default. In-house developers built its data portal using the open-source system CKAN; the code is on GitHub; the Museum uses Creative Commons licences (CC0 and CC BY); and the data is accessible through APIs. "Unfortunately," Hardy adds, "it's not terrifically accessible to humans unless they know Latin taxonomy, because that's still so much the organising principle."
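For readers curious what access through APIs looks like in practice, here is a minimal sketch of a query against a CKAN portal, using CKAN's standard `datastore_search` action. The portal address and the resource identifier are assumptions made for illustration (the identifier is a placeholder, not a real dataset), and the search term reflects Hardy's point that Latin taxonomy is the organising principle.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PORTAL = "https://data.nhm.ac.uk"      # assumed portal address
RESOURCE_ID = "RESOURCE-ID-GOES-HERE"  # placeholder, not a real dataset id


def search_url(query, limit=5, portal=PORTAL, resource_id=RESOURCE_ID):
    """Build a CKAN datastore_search URL for a free-text query."""
    params = urlencode({"resource_id": resource_id, "q": query, "limit": limit})
    return f"{portal}/api/3/action/datastore_search?{params}"


def records_from(response):
    """Pull the list of records out of a decoded CKAN JSON response."""
    if not response.get("success"):
        raise ValueError("CKAN reported a failed request")
    return response["result"]["records"]


def search(query, limit=5):
    """Fetch matching records over the network (needs a real resource id)."""
    with urlopen(search_url(query, limit)) as reply:
        return records_from(json.load(reply))
```

With a real resource identifier filled in, `search("Culicidae")` would ask the portal for records mentioning the mosquito family by its Latin name.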

A butterfly (Poecilmitis mithras) during digitisation (© Trustees of the Natural History Museum, London)

In the three years that she has been at the Museum, Hardy has taken the Digital Collections Programme from a situation of having no permanent digitisers to employing five, plus one senior, as part of her team of 10 - a good thing, as the demand for their time continues to grow.

"We've got to the stage now where people want their collection digitised, which is great, because that wasn't true...when I arrived about three years ago. I said I wanted to get to the point where we had a queue, which we definitely have now. But [the challenge is] explaining to people that because every collection is different, every workflow is at least a bit different, and we're trying to have a balanced portfolio.

"We'll have things that we know how to do - give or take, we have to adjust the workflows a bit each time. We broadly know how to digitise microscope slides, pressed plants on paper [herbarium sheets] and pinned insects, for example. We know we can get through hundreds of those a day; maybe thousands, depending how many digitisers we put on it. And we want to do those things because we need to be cracking through the numbers."

Although there are no plans to digitise the entire Collection ("some of the [specimens] look like boxes of dust"), there is always more data to be captured. Hardy is well aware of the scale of the challenge, admitting that it will take "decades" to complete. However, she isn't discouraged: even the small proportion that has already been digitised - five per cent of the 80 million items - is doing "loads of measurable good," she says with a smile.