Behind the scenes of the British Library digital newspaper project

Brightsolid scans historical newspapers for new website

Online publisher Brightsolid has worked with IBM to digitise four million pages of the British Library's historical newspaper collection for online access as part of a 10-year big data analytics project that could cost Brightsolid millions of pounds.

The British Library selected the online publisher in April 2010 to digitise the newspapers, and Brightsolid launched the British Newspaper Archive website in November 2011. Brightsolid is investing the money because it expects a financial return from payments for access to the digital content.

"We selected Brightsolid as part of an open tendering process because of its track record in conducting similar projects and because it works with enterprise-level workloads regularly," said Nick Townend, digital operations manager at the British Library.

Brightsolid was involved in digitising the 1911 Census for England and Wales.

"Brightsolid also uses OCR (optical character recognition) text recognition software and DocWorks for its workflow engine which we preferred," he continued.

IBM provided the virtualised HS22 Blade platform, which Townend said was chosen for its reliability and energy efficiency.

The initial scanning is taking place in Colindale, London, and the data is transferred to Brightsolid's data centres in Dundee, where the digital archive is hosted.

Brightsolid is funding part of the scanning process, which costs around £1 per page.

"We cover 40 million pages. A £40 million investment would not have been feasible for the British Library; Brightsolid's investment was essential," Townend said.

Brightsolid generates revenue from the pay-per-view and subscription options offered online, such as the 12-month subscription costing £79.95 for unlimited access.

For copyrighted material, Brightsolid gains permission to scan the material from the rights holders, who are in turn remunerated. The British Library receives a royalty on material that is out of copyright.

Townend said that the biggest challenge for Brightsolid was to create a web solution that could cope with peaks in demand from the launch date onwards. The launch on 29 November saw 1.2 million searches across the 4 million pages.
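To put the launch-day figure in perspective, 1.2 million searches spread over a full day works out to roughly 14 searches per second on average, and real traffic is burstier than an average suggests. The Python sketch below runs that arithmetic. It assumes the searches were spread over 24 hours, and the peak-to-average multiplier of 10 is purely illustrative, not a figure from Brightsolid or the British Library. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(total_searches: int, seconds: int = SECONDS_PER_DAY) -> float:
    """Average searches per second over the given period."""
    return total_searches / seconds


def peak_rate(average: float, peak_factor: float) -> float:
    """Scale an average rate by an assumed peak-to-average multiplier."""
    return average * peak_factor


# Launch-day figure quoted in the article.
LAUNCH_DAY_SEARCHES = 1_200_000

# Assumption for illustration only: peaks run at 10x the daily average.
ASSUMED_PEAK_FACTOR = 10

avg = average_rate(LAUNCH_DAY_SEARCHES)
peak = peak_rate(avg, ASSUMED_PEAK_FACTOR)

print(f"average: {avg:.1f} searches/second")       # average: 13.9 searches/second
print(f"assumed peak: {peak:.1f} searches/second")  # assumed peak: 138.9 searches/second
```

A test plan would target the peak figure rather than the average, since the site has to hold up during the busiest minutes.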

Malcolm Dobson, CTO at Brightsolid, explained that to ensure that the website could cope, Brightsolid and the British Library had to agree on the demand expected from the launch.

"Using our knowledge from the Census project, we projected what we thought the peak traffic would be and peak searches per second with the British Library and wrote a test plan.

"We had to decide what the optimal server configuration was going to be and then attempted to deliver those things using a set of tests," he said.

The scanning process began in January 2011, and website testing ran from late summer until the launch in November.

Dobson went on to say that the bandwidth also had to be considered to enable the website to function during peak demand.

"We used the Microsoft Deep Zoom image tiling system, which allows you to look at different resolution levels of an image and at individual tiles or more detailed resolutions as you zoom in.

"So it is much faster and responsive and, ultimately, saves bandwidth, but we needed an efficient tiling system, otherwise image loading times would have been unreasonable."

"We also split infrastructure across two physical sites to ensure that we had excess of 1 gigabyte per second of bandwidth for our site to launch," he said.

Brightsolid said that the second phase of the project has already started and will see a further 36 million pages made available over the next nine years.

Commenting on the British Library's future plans, Townend said: "The library is exploring radio and other digital content for the archive website. We are also looking to archive the entire UK web domain and are running some trials for this."