Ever more professional content, such as journals and books, is born digital, with the material being created and assembled electronically. The bits and bytes of the digital version flow straight onto the web, where, increasingly, they can sit alongside their now equally electronic ancestors. But the story doesn’t end there, because publishers are also moving ever closer to digitising their entire back catalogues.
“It is not inevitable that all back catalogues will be digitised,” says Clive Parry, sales and marketing director at business and academic publisher Sage, “but there is a strong desire to have more content available electronically.”
Mark Holland, publishing director of Gale/Cengage Learning, is even more enthusiastic. “Somewhere approaching 100% of all content will be digital within five years,” he predicts.
Holland himself has extensive experience of digitisation, having managed the process on projects such as The Times Literary Supplement Centenary Archive, The Times Digital Archive 1785-1985, and Eighteenth Century Collections Online. More recently he has worked on digitising material for The Economist Historical Archive.
Many digitisation projects have built up an impressive amount of material, the sheer size of which is startling.
In the case of Gale Digital Collections, the programme began in 1999 with The Times Literary Supplement Centenary Archive and subsequently took in historical newspapers and periodicals, printed books, and government and other archives. It claims to cover 550 years of printed and manuscript materials (from 1500 to the present day) in over 100 million online pages. Gale has also recently partnered with the British Library to digitise its newspaper archives from the 18th and 19th centuries.
Physical enormity
The Association for Computing Machinery (ACM) meanwhile offers a digital library
that started in 1997 and now includes 900,000 distinct articles, 100,000 books,
50,000 doctoral dissertations and 25,000 technical reports. “The amount of space
such publications would take up in the physical world would be quite
staggering,” says ACM group publisher Scott Delman.
Elsevier says it was the first scientific, technical and medical publisher to take on digitising its journals back to volume 1 issue 1, a task finally nearing completion eight years on. Most of the digitised journals made available electronically were finished by 2005; Elsevier says it is tracking down the last 0.3% of missing issues.
However big the digitisation project, it typically starts with a definition of what content will be covered and whether the company has a legal right to digitise it – an important consideration for the many publishers that have grown by acquisition. Then there is the considerable issue of sourcing older content, which may have been archived and catalogued inadequately.
Lindi Belfield, senior product manager at Elsevier, says that tackling the legal issues and assembling the content can be a real challenge: “For the backfile project for ScienceDirect we had to use catalogues going back to the 1950s to see what we had published, as well as searching A&I [abstract and indexing] databases to see who owned [the titles] between what years.
“Societies had to be contacted to request the electronic rights in perpetuity of their journal in order to include them in the programme.
“And to get all the copies we could get our hands on, we had to root them out of libraries, warehouses, publishers, meeting rooms, corridors, institutes and companies – a few chief editors’ graves have been dug up too!”
A crucial factor in the digitisation process is making adequate provision to access the content at a later date. How the process is managed is of paramount importance if the archived information is to be of use in the future.
Great expectations
User expectations of digitised material are understandably high and meeting them
has been a challenge for some organisations.
“Probably our biggest challenge was that we gradually started getting complaints about poor scans of greyscales and images,” says Belfield. “In the beginning there were hardly any as the initial packages were not image-heavy, but by 2003 we started getting poorly scanned pages.”
Elsevier is now rescanning all the poorly captured images, a process that Belfield says may take another year.
ACM encountered the same problem. “We discovered that photos, halftones and graphics needed to be scanned and digitised at a much higher resolution than plain text,” says Delman. “Nearly 10% of our original archive had to be rescanned to achieve the desired level of quality.”
Getting the quality right on all aspects of digital content is critical. Even just putting material online is far from a simple step. “There is a danger that if users can’t find the content they want through a search engine, they won’t access it at all,” says Parry. He adds that publishers manage the problem through a discoverability group, which researches how users access content to ensure it is as visible as possible.
Once the material is fully searchable and the digitised collection of content made readily available, the question then arises of who will want it. Part of the problem publishers face is that they have tended to make backfiles available as part of wider bundles of content, and not all customers necessarily want that, says Judith Broady-Preston, leader of the CILIP Council and an academic in the department of information studies at Aberystwyth University.
“I find such bundling an expensive means of using scarce resource budgets,” she says. “It is also counter-productive in that I consider I have a reasonably extensive knowledge of journals of relevance to my research and teaching and wish to make personal use of (or direct students towards) specific papers, learning objects or items. I dislike having my choices made for me in this way.
“Having been asked to evaluate such bundles I find that, in the main, there are possibly one or two useful items, with the remainder makeweight padding.”
Broady-Preston accepts there is some value in having all issues of a particular journal or textbook available online, but says: “Surely this can be achieved without bundling the titles you don’t wish to access with those you do.”
Publishers, though, stress they are not forcing digitisation on a reluctant world and that all such efforts have been carefully market-researched.
Belfield says: “Digitisation was a big investment risk as it was by no means certain that customers would buy it or end-users use it.”
Collective procurement
“A lot of libraries have real challenges around physical space limitations for
storage, but they also prefer to buy not just individual publications on their
own but bundles as consortia,” says Parry.
Sage itself offers an online archive of 400-plus journals, some dating back to 1879, amounting to nearly three million pages. “Users get a lot more content this way,” says Parry, who points out that digitisation doesn’t come for free: “Digitising content is a major investment and the cost implications need to be carefully weighed up against likely return.”
Holland agrees: “We haven’t just flung this stuff into the market blind.”
So are users starting to really appreciate the benefits that digitisation can offer?
“In the beginning a lot of librarians were dubious,” admits Belfield. “They claimed their users only wanted about 10 years of archives – quite logical as many libraries could not afford the shelf space for more, and back-years were packed away in cellars. Users got used to only taking what was available in many cases.
“But then when you can offer an original article from Fleming or Einstein, the science is still good and citable.”
Delman says: “Users have become more dependent on the internet for doing research and accessing technical information. Improvements in search technology and online distribution platforms have also had an impact. The demand has created an incentive for publishers to digitise their archives and develop a business strategy around selling those archives or including them as part of their overall value propositions.”
Belfield believes it is the completeness aspect of digitisation that will start to win people over. “Librarians are collectors and often want everything, preferably online,” she says. “End-users also want to know that they have got it all, especially when researching a new field.”
And don’t forget that the real power of digitisation metadata is the opportunity it creates for cross-database searching.
Holland says: “We are currently developing a platform to enable searching across many different kinds of content: reference works, e-texts, historical materials of all kinds. This will get rid of all the barriers which have existed historically between individual content silos.”
The reason for the initiative, Holland says, is that “effective online publishing must always present functionality that is specific to the type of material being offered. For example, newspaper archives allow researchers to distinguish their searches between specific segments of the content: editorial, letters, special supplements, advertising, and so on.”
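As a rough illustration of the segment-specific searching Holland describes, the sketch below filters a tiny in-memory archive by content segment. It is a minimal example under stated assumptions, not Gale's platform: the `Item` class, the `search` function, the segment labels and the sample records are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One archived item. The field names are hypothetical."""
    title: str
    segment: str  # e.g. "editorial", "letters", "advertising"
    text: str


def search(items, term, segments=None):
    """Return items whose text contains `term`, optionally limited to some segments."""
    term = term.lower()
    wanted = None if segments is None else {s.lower() for s in segments}
    return [
        item
        for item in items
        if term in item.text.lower()
        and (wanted is None or item.segment.lower() in wanted)
    ]


# Invented sample records, purely for demonstration.
archive = [
    Item("Leader on railways", "editorial", "The railway bill deserves support."),
    Item("To the editor", "letters", "Sir, the railway timetable is a disgrace."),
    Item("Notice", "advertising", "Railway shares for sale."),
]

letters_only = search(archive, "railway", segments=["letters"])
print([item.title for item in letters_only])
```

A plain keyword search would return all three records; restricting the search to the letters segment narrows it to one. That is only possible if each item was tagged with its segment when it was captured.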
Outsell vice president Dan Pollock says “next-generation search” is necessary because “there’s now too much information out there, and to differentiate themselves publishers will have to go beyond basic Google-style searching and make it easier to discover what you are looking for.
“That means being able not just to search on words but also the ideas behind them. If you type in ‘MI’, do you mean myocardial infarction or the state of Michigan? We are not there yet, of course.”
Whether next-generation search can cope with such demands remains to be seen.
“The world would be very convenient if all the information we needed was available electronically,” says Nick Fowler, director of strategy at Elsevier. “We could slice and dice and integrate what we needed and link it to other systems very easily.”
Fraught forecasts
As physicist Niels Bohr once joked, prediction is very difficult, especially if
it’s about the future. There is also a whole set of legal and copyright issues,
as well as cost and complexity problems, around the large-scale harvesting of
print information. The day when the whole world’s knowledge is available via a
browser tool such as Google’s Book Search may still be a long way off. Few of
the suppliers IWR spoke to have made any kind of serious inroads into the
digitisation of their book repertoire, so an online bank of all knowledge
remains firmly in the realm of fiction.
But Broady-Preston suspects that future features may fail to live up to the hype.
“I wait to be convinced,” she says. “It is tempting to think that digitisation is a mechanism for ensuring a market for less successful publications. But this may be too cynical a view.”
Parry believes that no-one has really quite worked out what will happen next. “We are at the start of things like Web 2.0 and the semantic web and there are lots of experiments going on,” he says. “What is certain is that there is lots of discussion and interest in the market here and that digital content is key to what will happen next.”
It is no longer a question of why digitise the back catalogue; that process is well under way. The real question is this: what will we do with all that mass of knowledge?
It’s an exciting and challenging question for information professionals and researchers alike.
From scanning to metadata
Actually implementing a digitisation project involves a number of steps.
Once the content has been assembled, a technology partner is usually chosen to capture the material. Scanning hard-copy text can be a laborious, time-intensive task that needs specialist equipment.
Best practice is to draw up a specific project plan to ensure that, in the words of Elsevier’s Belfield, “expectations are known, feasible and reasonable” on both the client (publisher) side and the digitisation partner’s side. Normal project disciplines also apply.
A successful digitisation exercise should result in a backfile collection that can then be marketed as a product.
Once the backfile is pulled together it is customarily sold as a one-off purchase, so customers do not pay annual or additional maintenance fees. Outright purchase might not suit everyone; a corporate with less interest in building up archives than a government or academic institution may prefer a subscription model to get specific backfile data when needed.
Even after such a painstaking process, a publisher may not be completely out of the woods. What makes a digitised source much more useful is if it is captured in a way that permits searching and classification.
Accurate metadata collection and compilation during the process is vital, so that an article from 1899 can easily be linked to one from 1999, but it can also be a complex and laborious task. Outsell’s Dan Pollock says that if the publisher doesn’t do this effectively, then the whole digitisation exercise will have been largely redundant. “You can’t run effective search with just images,” he says. “The text must be extricated too.”
In the case of Gale/Cengage’s digitisation work, original documents are scanned directly or from existing film. If the quality of any printed materials is good enough to deliver useful fully searchable text, multiple optical character recognition engines are run across the documents, says Holland, a process that will get most of the content into machine-readable form.
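The article does not say how the output of the different recognition engines is reconciled. One common idea is a simple majority vote, sketched below. This is an assumption for illustration, not a description of Gale's pipeline: the `vote_tokens` function is hypothetical, the engine outputs are invented strings, and the sketch assumes the outputs already line up word for word, which real OCR output often will not.

```python
from collections import Counter


def vote_tokens(engine_outputs):
    """Pick, position by position, the token that most engines agree on.

    Assumes every engine returned the same number of whitespace-separated
    tokens; real OCR output would need an alignment step first.
    """
    token_lists = [output.split() for output in engine_outputs]
    lengths = {len(tokens) for tokens in token_lists}
    if len(lengths) != 1:
        raise ValueError("engine outputs are not aligned token for token")

    merged = []
    for candidates in zip(*token_lists):
        # most_common(1) yields the most frequent token; on a tie the
        # token seen first (the earliest engine) wins.
        winner, _count = Counter(candidates).most_common(1)[0]
        merged.append(winner)
    return " ".join(merged)


# Invented outputs from three hypothetical engines reading the same line.
outputs = [
    "The Tirnes Literary Supplement",
    "The Times Literary Supplernent",
    "The Times Literary Supplement",
]
print(vote_tokens(outputs))
```

Each engine here misreads a different word, so the vote recovers a clean line that no single faulty engine produced on its own.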
But that still isn’t enough. Additional metadata is then required to identify each item individually and put it in the right place in the embedded data structure. XML is a key technology here.
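To give a sense of what such item-level metadata might look like, the sketch below builds a small XML record with Python's standard library and reads it back. The element names (`item`, `title`, `year`, `segment`, `pageImage`), the identifier and the file path are assumptions made up for this example; the article does not describe the schema any publisher actually uses.

```python
import xml.etree.ElementTree as ET


def build_record(item_id, title, year, segment, page_image):
    """Build one metadata record. The element names are hypothetical."""
    record = ET.Element("item", attrib={"id": item_id})
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "year").text = str(year)
    ET.SubElement(record, "segment").text = segment
    ET.SubElement(record, "pageImage").text = page_image
    return record


# Invented values, purely for demonstration.
record = build_record(
    "example-0001",
    "Letters to the editor",
    1899,
    "letters",
    "scans/example-0001.tif",
)

# Serialise to text, then parse it back as a search system might.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

parsed = ET.fromstring(xml_text)
print(parsed.get("id"), parsed.findtext("year"))
```

Because each record carries its own identifier, date and segment alongside a pointer to the scanned page, a search system can place the item in the wider collection without having to interpret the image itself.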









