Royal Botanic Garden Edinburgh digitises data on three million plant and herb specimens

By Martin Courtney

11 Feb 2011

Transforming information dating back to the 17th century into an online format ripe for the internet age is no easy task, but the Royal Botanic Garden Edinburgh (RBGE) is using optical character recognition (OCR) software to document flora first collected as far back as 1690.

Dr Elspeth Haston, assistant curator for digitisation at the Royal Botanic Garden Edinburgh, last year began the gargantuan task of collating information from the labels attached to almost three million specimens, which the RBGE will make freely available online for researchers, academics and botany enthusiasts as of May 2011.

Prior to introducing the digitisation process in 2010, the RBGE used a flatbed scanner turned upside down to capture documents, but has now switched to a 56-megapixel digital camera that copies both images and text into tiff files that can then be fed through ABBYY's Recognition Server OCR application.

The software's makers claim it can accurately read a variety of font types and sizes in 47 different languages, though the proof of the pudding is often in the eating as far as OCR is concerned.

"Some of the fonts can be difficult – they are sometimes quite faint, even though they are typewritten. There might be a weak 'e' for example that can sometimes be tricky, but we have lots of accents and other signs from Turkish entries, and it can cope with some of those," said Haston.

The details recorded on the original labels usually include the names of the plant and its collector, the date and location of collection (often with latitude and longitude included), as well as text describing the plant's appearance and a barcode the RBGE uses for its own classification and cross-referencing purposes.
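As a rough illustration of how fields like these might be pulled out of OCR text, here is a minimal Python sketch. The barcode pattern (an "E" followed by eight digits), the degrees-and-minutes coordinate format and the names `LabelRecord` and `parse_label` are assumptions made for the example, not details confirmed by the RBGE.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelRecord:
    """Hypothetical container for a few fields read from one label."""
    barcode: Optional[str]
    latitude: Optional[float]
    longitude: Optional[float]
    raw_text: str


# Assumed formats, for illustration only.
BARCODE_RE = re.compile(r"\bE\d{8}\b")
COORD_RE = re.compile(r"(\d{1,3})°\s*(\d{1,2})'\s*([NSEW])")


def to_decimal(degrees: str, minutes: str, hemisphere: str) -> float:
    """Convert degrees and minutes to signed decimal degrees."""
    value = int(degrees) + int(minutes) / 60
    return -value if hemisphere in "SW" else value


def parse_label(text: str) -> LabelRecord:
    """Extract the barcode and the first latitude/longitude pair, if present."""
    barcode = BARCODE_RE.search(text)
    latitude = longitude = None
    for degrees, minutes, hemisphere in COORD_RE.findall(text):
        value = to_decimal(degrees, minutes, hemisphere)
        if hemisphere in "NS" and latitude is None:
            latitude = value
        elif hemisphere in "EW" and longitude is None:
            longitude = value
    return LabelRecord(
        barcode=barcode.group(0) if barcode else None,
        latitude=latitude,
        longitude=longitude,
        raw_text=text,
    )
```

Fields the patterns cannot find are left as `None` rather than guessed, which suits labels that are only partly typewritten.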

The ABBYY software scans the information contained in each tiff file and converts the text it finds into both a searchable PDF image and a plain text file before automatically entering the data into a MySQL database. It also keeps a workflow record in a document management system. 
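The last step of that workflow, loading recognised text into a database so it can be searched, could look roughly like the Python sketch below. It is a sketch under stated assumptions: SQLite from the Python standard library stands in for MySQL, the OCR step is assumed to have already written one plain text file per image, and the table name `specimen_text` and the functions `load_ocr_text` and `search` are invented for the example. It does not show ABBYY's own interface.

```python
import sqlite3
from pathlib import Path


def load_ocr_text(conn: sqlite3.Connection, text_dir: str) -> int:
    """Load every .txt file in text_dir into the database; return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specimen_text ("
        "image_name TEXT PRIMARY KEY, ocr_text TEXT NOT NULL)"
    )
    count = 0
    for path in sorted(Path(text_dir).glob("*.txt")):
        # The file stem (e.g. the barcode) is assumed to identify the image.
        conn.execute(
            "INSERT OR REPLACE INTO specimen_text VALUES (?, ?)",
            (path.stem, path.read_text(encoding="utf-8")),
        )
        count += 1
    conn.commit()
    return count


def search(conn: sqlite3.Connection, term: str) -> list:
    """Return the names of images whose OCR text contains the term."""
    rows = conn.execute(
        "SELECT image_name FROM specimen_text "
        "WHERE ocr_text LIKE ? ORDER BY image_name",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

Using `INSERT OR REPLACE` means a label that is re-scanned simply overwrites its earlier entry, so the load can be re-run safely.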

Even if the OCR software cannot capture all the original handwritten detail, simply being able to recognise the typewritten label for classification purposes is an enormous help.

"We cannot get the OCR to manage the handwriting, but a large percentage of the specimens are recorded on printed or typed labels and the software copes with those very well," said Haston.

"Even the older handwritten stuff often has pre-printed labels that might say flora or something else, and just catching that is very useful."

That's not to say the RBGE has ruled out using handwriting-recognition software in the future to document information recorded by individual collectors, provided the application is 'trained' to identify specific traits in their script.

"From our investigations so far it works better if you have a lot of text done by one person, which is potentially something we could look at, because in some cases we have up to 70,000 artefacts from a single collector," said Haston.

Even as a 'non-IT person', Haston required only half a day's training to get up to speed on the ABBYY software client, which links to the main recognition application running on a VMware virtual server.

A prior upgrade to a 5TB storage area network (SAN) meant no extra data storage capacity was required, even though individual tiff files created by the 56-megapixel digital camera can reach 200MB in size.

The use of OCR software has, in effect, underpinned the entire digitisation process, because manual data input from each label would have taken too much time. It also lets botanists who would previously have had to visit the RBGE itself, or have samples sent by post, access the information they require online.

"Even minimal data entries save us an enormous amount of time, whereas manual data entry, at the speed we would want on a project of this size, is simply not feasible," said Haston.

 
