Transforming information dating back to the 17th century into an online format ripe for the internet age is no easy task, but the Royal Botanic Garden Edinburgh (RBGE) is using optical character recognition (OCR) software to document flora first collected as far back as 1690.
Dr Elspeth Haston, assistant curator for digitisation at the Royal Botanic Garden Edinburgh, last year began the gargantuan task of collating information from the labels attached to almost three million specimens, which the RBGE will make freely available online for researchers, academics and botany enthusiasts as of May 2011.
Prior to introducing the digitisation process in 2010, the RBGE used a flatbed scanner turned upside down to capture documents, but has now switched to a 56-megapixel digital camera that copies both images and text into TIFF files that can then be fed through ABBYY's Recognition Server OCR application.
The software's makers claim it can accurately read a variety of font types and sizes in 47 different languages, though the proof of the pudding is often in the eating as far as OCR is concerned.
"Some of the fonts can be difficult – they are sometimes quite faint, even though they are typewritten. There might be a weak 'e' for example that can sometimes be tricky, but we have lots of accents and other signs from Turkish entries, and it can cope with some of those," said Haston.
The details recorded on the original labels usually include the plant's and collector's names, the date and location of collection (often with latitude and longitude included), a description of the plant's appearance, and a barcode the RBGE uses for its own classification and cross-referencing purposes.
The ABBYY software scans the information contained in each TIFF file and converts the text it finds into both a searchable PDF image and a plain text file before automatically entering the data into a MySQL database. It also keeps a workflow record in a document management system.
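The last step of that workflow, loading recognised text into a database so it can be searched, can be sketched roughly as follows. This is a minimal illustration and not the RBGE's actual setup: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name label_text, its columns and the sample barcodes are all assumptions made for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) label_text table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS label_text ("
        " barcode TEXT PRIMARY KEY,"
        " ocr_text TEXT NOT NULL)"
    )
    return conn


def store_label(conn, barcode, ocr_text):
    """Insert or overwrite the OCR text recorded for one specimen barcode."""
    conn.execute(
        "INSERT OR REPLACE INTO label_text (barcode, ocr_text) VALUES (?, ?)",
        (barcode, ocr_text),
    )
    conn.commit()


def search_labels(conn, term):
    """Return barcodes whose OCR text contains the term.

    SQLite's LIKE is case-insensitive for ASCII letters; '%' and '_' in the
    term are treated as wildcards, which is acceptable for a sketch.
    """
    rows = conn.execute(
        "SELECT barcode FROM label_text WHERE ocr_text LIKE ? ORDER BY barcode",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]
```

Keying the table on the barcode means that re-running recognition on the same specimen replaces the earlier text rather than duplicating it.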
Even if the OCR software cannot capture all the original handwritten detail, simply being able to recognise the typewritten labels for classification purposes is an enormous help.
"We cannot get the OCR to manage the handwriting, but a large percentage of the specimens are recorded on printed or typed labels and the software copes with those very well," said Haston.
"Even the older handwritten stuff often has pre-printed labels that might say flora or something else, and just catching that is very useful."
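Picking out a pre-printed word such as "flora" from otherwise unreadable OCR output could look something like the sketch below. It is an illustration only: the list of terms and the function name are assumptions, not details of the RBGE's system.

```python
import re

# Hypothetical examples of words that might be pre-printed on a label.
PREPRINTED_TERMS = ("flora", "herbarium")


def find_preprinted_terms(ocr_text, terms=PREPRINTED_TERMS):
    """Return the terms that occur as whole words in the OCR text, ignoring case."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, ocr_text, re.IGNORECASE):
            found.append(term)
    return found
```

Matching on whole words avoids false hits such as "floral", while ignoring case copes with labels printed in capitals.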
That's not to say the RBGE has ruled out using handwriting-recognition software in the future to document information recorded by individual collectors, provided the application is 'trained' to identify specific traits in their script.
"From our investigations so far it works better if you have a lot of text done by one person, which is potentially something we could look at, because in some cases we have up to 70,000 artefacts from a single collector," said Haston.
Even as a 'non-IT person', Haston required only half a day's training to get up to speed on the ABBYY software client, which links to the main recognition application running on a VMware virtual server.
A prior upgrade to a 5TB storage area network (SAN) meant no extra data storage capacity was required, even though individual TIFF files created by the 56-megapixel digital camera can reach 200MB in size.
The use of OCR software has, in effect, underpinned the entire digitisation process, because manual data input from each label would have taken too much time. It also means botanists who would previously have had to visit the RBGE itself, or have samples sent by post, can access the information they require online.
"Even minimal data entries save us an enormous amount of time, whereas manual data entry, at the speed we would want on a project of this size, is simply not feasible," said Haston.