Some libraries close books to Google, Microsoft

Some libraries are choosing to pay to have their content digitized by the Open Content Alliance rather than having it scanned for free by Google or Microsoft, which refuse to allow access to the materials by rival search engines.

The Boston Library Consortium (BLC) is teaming with the Open Content Alliance (OCA) to build a library of digital materials that will be freely available via the Internet.

The BLC is composed of 19 academic and research libraries in Massachusetts, Connecticut, New Hampshire and Rhode Island. The consortium is digitizing all its content published before 1923. Content published before that date is considered in the public domain and not subject to copyright laws.

The cost for digitizing is $0.10 per page, and the BLC is funding the effort at a cost of $845,000 over two years. The work is also being supplemented by the OCA, which received a $2 million grant from the Alfred P. Sloan Foundation. Part of that grant will be used to digitize the John Adams Collection at the Boston Public Library, a member of the consortium.

The OCA was developed by the Internet Archive and search company Yahoo in early 2005 as a way to preserve a variety of content, such as digitized collections and multimedia. Yahoo doesn’t have a stand-alone book-search service.

At issue is access to the digitized material. Search companies such as Google and Microsoft will scan the books for free, but want to restrict access for competitive reasons. The consortium wants its books to be available to anyone and through any search engine.

BLC Executive Director Barbara Preece said her organization selected the OCA because it kept the content search-engine neutral.

The OCA allows “you to hold onto your content and do whatever you want to do to your content, and it can be searched by any search engine whatsoever,” Preece said. “OCA was the best way for us to go to keep our content open. Google pretty much decides who you can share your content with. With OCA, it doesn’t matter what search engine you use to search the material. Google and Microsoft are interested in search, and the OCA is more interested in content and helping libraries handle their content the way they want to.”

Google spokesman Gabriel Stricker said the company designed its Book Search to promote the sharing and use of the content the company is digitizing, where appropriate. He said for books in the public domain, Google provides full access to the material, including the ability to read a book in its entirety, download a PDF to a computer and print a work for free. He said there are restrictions for books still under copyright to ensure that copyright holders are protected.

“The libraries we work with receive copies of all the digital files that they can use to serve their students, faculties and partners,” Stricker said in an e-mail. He added that libraries are also free to work with other organizations to digitize their content. Stricker did not directly respond to concerns that Google refuses to allow the material it digitizes to be available through other search engines.

Jay Girotto, group program manager for Microsoft’s Live Book Search, said his company has been involved with the OCA since October 2005.

“Microsoft put in much more than $2 million to fund the creation of a mass digitization program that could actually work,” Girotto said. “We digitized about 100,000 books under the OCA principles, and we were hoping there would be other significant financial contributors.” However, that didn’t happen, he said.

“We saw many people in the library community willing to adopt Google’s more restrictive stance around book search and sign up with Google, and we were faced with a decision about what to do,” he said. “We were essentially providing most of the capital that was building out [the program], but there were really no restrictions on Google taking the output of the process — the image file, the [optical character recognition] file and the metadata — and simply having the same use to it that Microsoft had.”

Girotto said Microsoft last November decided to put one restriction on the use of the material it was digitizing, which was that the material couldn’t be used by its commercial competitors, including Google, Yahoo and Ask.com. But Microsoft still doesn’t restrict distribution of copies of the books it digitizes for academic use among institutions, he said, although Google maintains this restriction.

Although Microsoft continues to work with the Internet Archive, it is technically not part of the OCA because it does not operate under the OCA principle that requires material to be offered to the public without any restrictions.

Brewster Kahle, founder and digital librarian of the Internet Archive, said he has trouble with Google’s and Microsoft’s positions.

“Google is trying to build a collection themselves, and it’s trying to be the point of access for the library in the next generation,” Kahle said. “We have no problem with that in general, but we would just like lots of libraries in the future, not just one. Google doesn’t want anyone to have a search engine other than theirs, so the material they’ve digitized they’ve put restrictions on so they’re the only search engine going forward, and we don’t think that’s right. We think not only should there be many search engines, but many libraries and many archives as well.”

Regarding Microsoft, Kahle is a little more forgiving.

“Microsoft is paying for the digitization of books as well, and they allow any research and education use,” he said. “They just stipulate that the materials should not be put in other commercial services. They’re reacting to Google by tightening down somewhat. Google says nothing is allowed, period. But Microsoft says everything is allowed except commercial services, except for basically Google, so it’s a matter of degree.”

The Internet Archive, which is doing all the digitizing work, is still working on books funded by Microsoft. When Microsoft said commercial services couldn’t index all the material, it was, strictly speaking, no longer adhering to OCA principles, Kahle said.

“So there’s Google, which is Draconian and very, very powerful, and there’s Microsoft, which is ‘minorly’ restrictive,” Kahle said. “This is about locking down our libraries, and the biggie is Google because they’re scanning millions of books and they’ve got these contracts with great libraries that are really kind of problematic — at least if any of us want to live in an open world, otherwise we could live in the land of Orwell.”

Some libraries, including the New York Public Library, have decided to partner with Google to make a collection of their books in the public domain available online. Other libraries that are also using Google include those at Michigan, Stanford, Harvard and Oxford universities. The Library of Congress is also working with Google on a pilot program to digitize some of its books.

The Biodiversity Heritage Library, however, is another group of libraries that has decided to pass on Google’s offer to digitize its material and, like the BLC, has decided to partner with the OCA.

“We talked to Google and Microsoft, but we will be digitizing our collection of biodiversity literature and making them freely available on the Net,” said Tom Garnett, director of the Biodiversity Heritage Library, a consortium of 10 major natural history museum libraries, botanical libraries and research institutions. “It’s not a question of good versus bad, or better than worse. It’s what is the nature and aim of your project.”

Garnett said what Google and Microsoft are doing will benefit everyone, but his biodiversity literature project is driven heavily by the scientific research community. Not only do users have to be able to read the content, he said, but researchers also have to be able to manipulate it using software.

“We want people to be able to download portions of our literature to our own servers and do software algorithms to operate on it to link it or combine with other biological information,” he said. “While Google thought that was a good idea, they said it didn’t fit in with their business model. They said it wasn’t an option — maybe because there would be a series of exceptions that they’d have to do for our stuff versus someone else’s. Our stuff really needs to be available with as few hurdles as possible.”
