Friday, March 27, 2009

Hosted by Google, but Not Open to Search Engines

Like many other sites, Google uses robots.txt files to prevent search engines from indexing some of the content on google.com. In most cases, Google excludes search results pages and other automatically generated pages, which would pollute indexes.
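To make the robots.txt mechanism concrete, here is a minimal sketch in Python that uses the standard library's urllib.robotparser. The rules, the crawler name "ExampleBot" and the two test URLs are assumptions chosen for illustration; they are not copied from Google's actual robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real google.com robots.txt.
RULES = """\
User-agent: *
Disallow: /search
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = build_parser(RULES)

# Assumed example URLs: a search results page and an ordinary static page.
results_page = "http://www.google.com/search?q=robots.txt"
static_page = "http://www.google.com/intl/en/about.html"

# A well-behaved crawler asks before fetching each URL.
print(parser.can_fetch("ExampleBot", results_page))  # path starts with /search
print(parser.can_fetch("ExampleBot", static_page))   # no rule matches this path
```

Under these assumed rules, the results page is refused and the static page is allowed. Keep in mind that robots.txt is only a request: it works because crawlers choose to honor it.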


But sometimes Google excludes useful content, either directly using robots.txt files or using addresses that are difficult to index:

* published documents, spreadsheets and presentations from Google Docs - I suspect that the main reason why search engines aren't allowed to index Google Docs pages is that many documents would become public if search engines indexed invitation URLs.

* public pages for Google Reader's shared items - most of the content on these pages is copied from other pages, although Google Notebooks, by contrast, can be indexed by search engines.

* the albums and photos hosted by Picasa Web Albums (the photos are indexed by Google Image Search, while the albums are included in Google's main search results). Picasa Web's front end uses AJAX, and URLs like http://picasaweb.google.com/guedin/AdriChezLesKiwisToutesLesPhotos12#5312778271091234418 can't be indexed by search engines, which usually discard the fragment (the part after the # sign).

* the questions and answers from Google Moderator, another AJAX app that uses addresses like http://moderator.appspot.com/#15/e=cc&t=6. The application powers a new section of the White House's website called "Open for Questions", which also can't be indexed by search engines.

* the LIFE photo archive, which is only available in Google Image Search. "It's disappointing that Google gets exclusive access to index these images and every other search engine is out of luck. Exclusivity like this doesn't seem in line with Google's philosophy," says Andy Baio.

* the books scanned by Google that are available in Google Book Search (they're included in Google's main search results, as part of Universal Search)

* the patents from the United States Patent and Trademark Office that are available in Google Patent Search

* the charts generated using Google Chart API

* the captions from videos hosted by YouTube and Google Video (they're indexed by YouTube and Google Video)
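The fragment problem mentioned above for Picasa Web Albums and Google Moderator can be sketched in a few lines of Python. The helper name crawlable_url is hypothetical, and the sketch assumes a crawler that simply drops everything after the # sign, which is what the standard library's urllib.parse.urldefrag does; real crawlers may behave differently.

```python
from urllib.parse import urldefrag


def crawlable_url(url: str) -> str:
    """Return the URL as a fragment-dropping crawler would request it.

    Hypothetical helper: the fragment never reaches the server, so a
    crawler that ignores it only sees the part before the # sign.
    """
    return urldefrag(url).url


# The two AJAX-style addresses quoted in the list above.
photo = ("http://picasaweb.google.com/guedin/"
         "AdriChezLesKiwisToutesLesPhotos12#5312778271091234418")
question = "http://moderator.appspot.com/#15/e=cc&t=6"

for url in (photo, question):
    stripped, fragment = urldefrag(url)
    print("requested:", stripped)
    print("dropped:  ", fragment)
```

In this sketch the photo address collapses to the address of its album, and the Moderator address collapses to the application's home page, so the individual photo or question has no address of its own that such a crawler could store.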
