Backfile to the future

Publishers are increasingly digitising back copies of journals and other print materials that pre-date the internet

Written by Gary Flood

Ever more professional content, such as journals and books, is born digital, with the material being created and assembled electronically. The bits and bytes of the digital version flow straight onto the web, where, increasingly, they can sit alongside their now equally electronic ancestors. But the story doesn’t end there, because publishers are also moving ever closer to digitising their entire back catalogues.

“It is not inevitable that all back catalogues will be digitised,” says Clive Parry, sales and marketing director at business and academic publisher Sage, “but there is a strong desire to have more content available electronically.”

Mark Holland, publishing director of Gale/Cengage Learning, is even more enthusiastic. “Somewhere approaching 100% of all content will be digital within five years,” he predicts.

Holland himself has extensive experience of digitisation, having managed the process on projects such as The Times Literary Supplement Centenary Archive, The Times Digital Archive 1785-1985, and Eighteenth Century Collections Online. More recently he has worked on digitising material for The Economist Historical Archive.

Many digitisation projects have built up an impressive amount of material, the sheer size of which is startling.

In the case of Gale Digital Collections, the programme began in 1999 with The Times Literary Supplement Centenary Archive and subsequently took in historical newspapers and periodicals, printed books, and government and other archives. It claims to cover 550 years of printed and manuscript materials (from 1500 to the present day) in over 100 million online pages. Gale has also recently partnered with the British Library to digitise its newspaper archives from the 18th and 19th centuries.

Physical enormity
The Association for Computing Machinery (ACM) meanwhile offers a digital library that started in 1997 and now includes 900,000 distinct articles, 100,000 books, 50,000 doctoral dissertations and 25,000 technical reports. “The amount of space such publications would take up in the physical world would be quite staggering,” says ACM group publisher Scott Delman.

Elsevier says it was the first scientific, technical and medical publisher to take on digitising its journals back to volume 1 issue 1, a task finally nearing completion eight years on. Most of the digitised journals made available electronically were finished by 2005; Elsevier says it is tracking down the last 0.3% of missing issues.

However big the digitisation project, it typically starts with a definition of what content will be covered and whether the company has a legal right to digitise it – an important consideration for the many publishers that have grown by acquisition. Then there is the considerable issue of sourcing older content, which may have been archived and catalogued inadequately.

Lindi Belfield, senior product manager at Elsevier, says that tackling the legal issues and assembling the content can be a real challenge: “For the backfile project for ScienceDirect we had to use catalogues going back to the 1950s to see what we had published, as well as searching A&I [abstract and indexing] databases to see who owned [the titles] between what years.

“Societies had to be contacted to request the electronic rights in perpetuity of their journal in order to include them in the programme.

“And to get all the copies we could get our hands on, we had to root them out of libraries, warehouses, publishers, meeting rooms, corridors, institutes and companies – a few chief editors’ graves have been dug up too!”

A crucial factor in the digitisation process is making adequate provision to access the content at a later date. How the process is managed is of paramount importance if the archived information is to be of use in the future.

Great expectations
User expectations of digitised material are understandably high and meeting them has been a challenge for some organisations.

“Probably our biggest challenge was that we gradually started getting complaints about poor scans of greyscales and images,” says Belfield. “In the beginning there were hardly any as the initial packages were not image-heavy, but by 2003 we started getting poorly scanned pages.”

Elsevier is now rescanning all the poorly captured images, a process that Belfield says may take another year.

ACM encountered the same problem. “We discovered that photos, halftones and graphics needed to be scanned and digitised at a much higher resolution than plain text,” says Delman. “Nearly 10% of our original archive had to be rescanned to achieve the desired level of quality.”

Getting the quality right on all aspects of digital content is critical. Even just putting material online is far from a simple step. “There is a danger that if users can’t find the content they want through a search engine, they won’t access it at all,” says Parry. He adds that publishers manage the problem through a discoverability group, which researches how users access content to ensure it is as visible as possible.

Once the material is fully searchable and the digitised collection of content made readily available, the question then arises of who will want it. Part of the problem publishers face is that they have tended to make backfiles available as part of wider bundles of content, and not all customers necessarily want that, says Judith Broady-Preston, leader of the CILIP Council and an academic in the department of information studies at Aberystwyth University.

“I find such bundling an expensive means of using scarce resource budgets,” she says. “It is also counter-productive in that I consider I have a reasonably extensive knowledge of journals of relevance to my research and teaching and wish to make personal use of (or direct students towards) specific papers, learning objects or items. I dislike having my choices made for me in this way.

“Having been asked to evaluate such bundles I find that, in the main, there are possibly one or two useful items, with the remainder makeweight padding.”

Broady-Preston accepts there is some value in having all issues of a particular journal or textbook available online, but says: “Surely this can be achieved without bundling the titles you don’t wish to access with those you do.”

Publishers, though, stress they are not forcing digitisation on a reluctant world and that all such efforts have been carefully market-researched.

Belfield says: “Digitisation was a big investment risk as it was by no means certain that customers would buy it or end-users use it.”

Collective procurement
“A lot of libraries have real challenges around physical space limitations for storage, but they also prefer to buy not just individual publications on their own but bundles as consortia,” says Parry.

Sage itself offers an online archive of 400-plus journals, some dating back to 1879, amounting to nearly three million pages. “Users get a lot more content this way,” says Parry, who points out that digitisation doesn’t come for free: “Digitising content is a major investment and the cost implications need to be carefully weighed up against likely return.”

Holland agrees: “We haven’t just flung this stuff into the market blind.”

So are users starting to really appreciate the benefits that digitisation can offer?

“In the beginning a lot of librarians were dubious,” admits Belfield. “They claimed their users only wanted about 10 years of archives – quite logical as many libraries could not afford the shelf space for more, and back-years were packed away in cellars. Users got used to only taking what was available in many cases.

“But then when you can offer an original article from Fleming or Einstein, the science is still good and citable.”

Delman says: “Users have become more dependent on the internet for doing research and accessing technical information. Improvements in search technology and online distribution platforms have also had an impact. The demand has created an incentive for publishers to digitise their archives and develop a business strategy around selling those archives or including them as part of their overall value propositions.”

Belfield believes it is the completeness aspect of digitisation that will start to win people over. “Librarians are collectors and often want everything, preferably online,” she says. “End-users also want to know that they have got it all, especially when researching a new field.”

And don’t forget that the real power of digitisation metadata lies in the opportunity it creates for cross-database searching.

Holland says: “We are currently developing a platform to enable searching across many different kinds of content: reference works, e-texts, historical materials of all kinds. This will get rid of all the barriers which have existed historically between individual content silos.”

The reason for the initiative, Holland says, is that “effective online publishing must always present functionality that is specific to the type of material being offered. For example, newspaper archives allow researchers to distinguish their searches between specific segments of the content: editorial, letters, special supplements, advertising, and so on.”

Outsell vice president Dan Pollock says “next-generation search” is necessary because “there’s now too much information out there, and to differentiate themselves publishers will have to go beyond basic Google-style searching and make it easier to discover what you are looking for.

“That means being able not just to search on words but also the ideas behind them. If you type in ‘MI’, do you mean myocardial infarction or the state of Michigan? We are not there yet, of course.”

Whether next-generation search can cope with such demands is something that remains to be seen.

“The world would be very convenient if all the information we needed was available electronically,” says Nick Fowler, director of strategy at Elsevier. “We could slice and dice and integrate what we needed and link it to other systems very easily.”

Fraught forecasts
As physicist Niels Bohr once joked, prediction is very difficult, especially if it’s about the future. There is also a whole set of legal and copyright issues, as well as cost and complexity problems, around the large-scale harvesting of print information. The day when the whole world’s knowledge is available via a browser tool such as Google’s Book Search may still be a long way off. Few of the suppliers IWR spoke to have made any kind of serious inroads into the digitisation of their book repertoire, so an online bank of all knowledge remains firmly in the realm of fiction.

But Broady-Preston suspects that future features may fail to live up to the hype.

“I wait to be convinced,” she says. “It is tempting to think that digitisation is a mechanism for ensuring a market for less successful publications. But this may be too cynical a view.”

Parry believes that no-one has really quite worked out what will happen next. “We are at the start of things like Web 2.0 and the semantic web and there are lots of experiments going on,” he says. “What is certain is that there is lots of discussion and interest in the market here and that digital content is key to what will happen next.”

It is no longer a question of why digitise the back catalogue; that process is well under way. The real question is this: what will we do with all that mass of knowledge?

It’s an exciting and challenging question for information professionals and researchers alike.

From scanning to metadata
Actually implementing a digitisation project involves a number of steps.

Once the content has been assembled, a technology partner is usually chosen to capture the material. Scanning the hard-copy text can be a laborious, time-intensive task that needs specialist equipment.

Best practice is to draw up a specific project plan to ensure that, in the words of Elsevier’s Belfield, “expectations are known, feasible and reasonable” on both the client (publisher) side and the digitisation partner’s side. Normal project disciplines also apply.

A successful digitisation exercise should result in a backfile collection that can then be marketed as a product.

Once the backfile is pulled together it is customarily sold as a one-off purchase, so customers do not pay annual or additional maintenance fees. Outright purchase might not suit everyone; a corporate with less interest in building up archives than a government or academic institution may prefer a subscription model to get specific backfile data when needed.

Even after such a painstaking process, a publisher may not be completely out of the woods. What makes a digitised source much more useful is if it is captured in a way that permits searching and classification.

Accurate metadata collection and compilation during the process is vital, so that an article from 1899 can easily be linked to one from 1999, but it can also be a complex and laborious task. Outsell’s Dan Pollock says that if the publisher doesn’t do this effectively, then the whole digitisation exercise will have been largely redundant. “You can’t run effective search with just images,” he says. “The text must be extricated too.”

In the case of Gale/Cengage’s digitisation work, original documents are scanned directly or from existing film. If the quality of any printed materials is good enough to deliver useful fully searchable text, multiple optical character recognition engines are run across the documents, says Holland, a process that will get most of the content into machine-readable form.
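How the outputs of several OCR engines are reconciled is not described here, but one simple possibility is to let the engines vote word by word. The Python sketch below is an illustration only, not Gale/Cengage's method: the function name vote_tokens and the sample readings are made up, and it assumes every engine returns the same number of words, which a real pipeline would have to ensure with an alignment step first.

```python
from collections import Counter


def vote_tokens(engine_outputs):
    """Merge several OCR readings of one line by per-word majority vote.

    Hypothetical helper: assumes each reading has the same number of
    whitespace-separated words, so words can be compared position by position.
    """
    token_rows = [text.split() for text in engine_outputs]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("readings must contain the same number of words")

    merged = []
    for readings in zip(*token_rows):
        # Keep the most frequent reading; on a tie the earliest engine wins.
        winner, _count = Counter(readings).most_common(1)[0]
        merged.append(winner)
    return " ".join(merged)


# Three made-up readings of the same printed line, each with one typical OCR slip.
readings = [
    "The Tirnes reported",
    "The Times reported",
    "The Times rep0rted",
]
print(vote_tokens(readings))
```

Voting of this kind only helps where the engines make different mistakes; a word that every engine misreads in the same way still needs manual correction or rescanning.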

But that still isn’t enough. Additional metadata is then required to identify each item individually and put it in the right place in the embedded data structure. XML is a key technology here.
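To make the idea concrete, the sketch below uses Python's standard xml.etree.ElementTree module to wrap OCR text in a small XML record and then search within one segment of a newspaper archive. The element and attribute names (archive, item, section, issueDate, text) and the sample items are invented for illustration; they are not the schema used by any of the publishers mentioned.

```python
import xml.etree.ElementTree as ET


def build_record(item_id, title, issue_date, section, ocr_text):
    """Wrap one scanned item in a hypothetical XML metadata record."""
    item = ET.Element("item", attrib={"id": item_id, "section": section})
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "issueDate").text = issue_date
    ET.SubElement(item, "text").text = ocr_text
    return item


def find_in_section(archive, section, word):
    """Return ids of items in the given section whose text contains word."""
    hits = []
    # ElementTree's limited XPath support allows filtering on an attribute value.
    for item in archive.findall(f"item[@section='{section}']"):
        text = item.findtext("text", default="")
        if word.lower() in text.lower():
            hits.append(item.get("id"))
    return hits


# Made-up sample items standing in for captured pages.
archive = ET.Element("archive")
archive.append(build_record(
    "1899-0001", "Letters to the editor", "1899-03-18", "letters",
    "Sir, the new railway timetable has caused much confusion.",
))
archive.append(build_record(
    "1899-0002", "Railway shares", "1899-03-18", "advertising",
    "Railway shares offered to the public at favourable terms.",
))

print(ET.tostring(archive, encoding="unicode"))
print(find_in_section(archive, "letters", "railway"))
```

Because the section is recorded as metadata rather than left buried in the page image, the same word can be found in the letters pages without also returning every advertisement that mentions it.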
