Wednesday, August 29, 2007

The Quality of Google Book Search


Paul Duguid wrote an interesting article about Google Book Search in which he analyzed the quality of the indexed editions and of the search results by searching for Laurence Sterne's "Tristram Shandy", an 18th-century novel. Mr. Duguid noticed that the Harvard edition of the book had many quality problems and that some of the text wasn't scanned properly. Google Book Search also doesn't distinguish between the volumes of a book, so it's hard to tell that the Stanford edition is actually only the second volume of the novel.
Google may or may not be sucking the air out of other digitization projects, but like Project Gutenberg before, it is certainly sucking better-forgotten versions of classic texts from justified oblivion and presenting them as the first choice to readers. (...) The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google's technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don't submit equally to a standard shelf, a standard scanner, or a standard ontology.

Patrick Leary, the author of the article Googling the Victorians (PDF), offered a pragmatic response, as seen on O'Reilly Radar:
Mass digitization is all about trade-offs. All mass digitizing programs compromise textual accuracy and bibliographical meta-data so that they can afford to include many more texts at a reasonable cost in money and time. All texts in mass digitization collections are corrupt to some degree. Everything else being equal, the more limited the number of texts included in a digital collection, the more care can be lavished on each text. Assessing the balance of value involved in this trade-off, I think, is one of the main places where we part company. You conclude, on the basis of your inspection of these two volumes, that the corruption of texts like Tristram Shandy makes Google Books a "highly problematic" way of getting at the meanings of the books it includes. By contrast, while acknowledging how unfortunate are some of the problems you mention, I believe that the sheer scale of the project and the power of its search function together far outweigh these "problematic" elements.

When scanning and indexing millions of books, it's difficult to assess the quality of each edition. Google Book Search's main goal is to help you discover books you can later borrow or buy. But Google could add an option to rate the quality of each digitized book, or build algorithms that detect scanning flaws and differences between editions, so that the next time you search for Tristram Shandy all the editions are clustered and the best one comes up first.
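To make the idea concrete, here is a minimal sketch of that kind of clustering and ranking. It is not how Google Book Search actually works; the fields (OCR confidence per scan, user ratings of scan quality) and the scoring weights are assumptions for illustration only.

```python
# Toy sketch: group scanned copies of the same work and pick the
# highest-quality scan in each group. The quality signals used here
# (average OCR confidence, reader ratings) are hypothetical.

from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ScannedEdition:
    title: str                # title as extracted from the scan's metadata
    author: str
    volume: Optional[int]     # volume number, if the metadata records it
    ocr_confidence: float     # 0..1, average per-page OCR confidence (assumed)
    user_rating: float        # 0..5, average reader rating of the scan (assumed)


def work_key(edition: ScannedEdition) -> Tuple[str, str]:
    """Cluster editions of the same work by normalized title and author.
    (A real system would need far more robust matching than this.)"""
    return (edition.title.strip().lower(), edition.author.strip().lower())


def quality_score(edition: ScannedEdition) -> float:
    """Blend OCR confidence and reader ratings; the weights are arbitrary."""
    return 0.7 * edition.ocr_confidence + 0.3 * (edition.user_rating / 5.0)


def best_editions(
    editions: List[ScannedEdition],
) -> Dict[Tuple[str, str], ScannedEdition]:
    """Return the best-scoring scan for each clustered work."""
    clusters: Dict[Tuple[str, str], List[ScannedEdition]] = defaultdict(list)
    for edition in editions:
        clusters[work_key(edition)].append(edition)
    return {key: max(group, key=quality_score) for key, group in clusters.items()}


if __name__ == "__main__":
    scans = [
        ScannedEdition("Tristram Shandy", "Laurence Sterne", 2, 0.82, 3.1),
        ScannedEdition("Tristram Shandy", "Laurence Sterne", 1, 0.64, 2.0),
    ]
    for key, best in best_editions(scans).items():
        print(key, "-> best scan: volume", best.volume,
              "score", round(quality_score(best), 2))
```

Even a rough signal like this would let the search results show one entry per work, with the cleanest available scan first and the other editions a click away.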
