Tuesday, July 17, 2007

Finding Related Web Pages

Google is the only major search engine that offers a "similar pages" feature, but not too many people use it. Launched in September 1999 as GoogleScout (scout=explore, investigate), the feature shows around 30 web pages related to a search result.

For example, to find sites related to Google Reader, you can click the "similar pages" link placed after the snippet, and you'll discover other feed readers, Google Reader's blog, information about feeds, and blog platforms.


The related pages are generated by analyzing the link structure of the web. A patent from 2000 explains how this feature works: "a first set of hyperlinked documents that have a forward link to the selected hyperlinked document is provided. Additionally, a second set of hyperlinked documents that are pointed to by the forward links in the hyperlinked documents in the first set is provided. A value is assigned to each forward link in each of the hyperlinked documents in the first set, with the value being reduced for a forward link if there are multiple hyperlinked documents from the same host as the hyperlinked document that includes the forward link. A score is generated for each hyperlinked document in the second set according to the values of the forward links pointing to the hyperlinked document. Accordingly, a list of related hyperlinked documents is generated from the second set according to the score of the hyperlinked documents."
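
The steps in the patent excerpt can be sketched in a few lines of Python. This is only an illustration, not Google's actual implementation: the function name `related_pages`, the `links` dictionary, and the example URLs are all made up, and the patent only says a link's value is "reduced" when several linking pages share a host, so the 1/n weighting used here is an assumption.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def related_pages(selected, links, top_n=30):
    """Rank pages that are co-linked with `selected`.

    `links` maps each page URL to the list of URLs it links to.
    Returns a list of (url, score) pairs, best first.
    """
    # First set: pages that have a forward link to the selected page.
    first_set = [page for page, targets in links.items()
                 if selected in targets and page != selected]

    # Count first-set pages per host. Dividing each link's value by this
    # count is an assumed way of "reducing" the value for same-host pages.
    pages_per_host = Counter(urlparse(page).netloc for page in first_set)

    scores = defaultdict(float)
    for page in first_set:
        link_value = 1.0 / pages_per_host[urlparse(page).netloc]
        # Second set: every page the first-set pages point to.
        for target in set(links[page]):
            if target != selected:
                scores[target] += link_value

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical link graph: two pages on the same host link to "other",
# two pages on different hosts link to "rival".
reader = "http://reader.example/"
links = {
    "http://a.example/1": [reader, "http://other.example/"],
    "http://a.example/2": [reader, "http://other.example/"],
    "http://b.example/": [reader, "http://rival.example/"],
    "http://c.example/": [reader, "http://rival.example/"],
}
for url, score in related_pages(reader, links):
    print(url, score)
# http://rival.example/ 2.0
# http://other.example/ 1.0
```

Both candidate pages receive two links, but the two links to `other.example` come from the same host, so together they count as a single vote.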

Basically, you're expecting that many sites that link to Google Reader will also link to its competitors and to related information. This is very similar to Amazon's recommendations: "customers who bought this item also bought".

How to use this feature?

Unfortunately, Google's implementation has a major flaw: because many pages link to popular sites like Blogger, Flickr, and StatCounter, you'll sometimes find these sites in the list of related links even if they're completely unrelated. Greg Linden calls this "the Harry Potter problem" when talking about Amazon's recommendation system. "The first version of similarities was quite popular. But it had a problem, the Harry Potter problem. Oh, yes, Harry Potter. Harry Potter is a runaway bestseller. Kids buy it. Adults buy it. Everyone buys it. So, take a book, any book. If you look at all the customers who bought that book, then look at what other books they bought, rest assured, most of them have bought Harry Potter."
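
A toy example makes the problem easy to see. The sketch below simply counts which items appear together with a given item; the function name `also_bought`, the baskets, and every title except Harry Potter are invented for illustration, and this is not how Amazon or Google actually compute their lists.

```python
from collections import Counter


def also_bought(item, baskets):
    """Count the items that appear in the same basket as `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(other for other in set(basket) if other != item)
    return counts.most_common()


# Hypothetical purchase histories: almost everyone owns the bestseller.
baskets = [
    {"Feed Readers 101", "RSS Cookbook", "Harry Potter"},
    {"Feed Readers 101", "Harry Potter"},
    {"Feed Readers 101", "RSS Cookbook", "Harry Potter"},
    {"Gardening Basics", "Harry Potter"},
]
print(also_bought("Feed Readers 101", baskets))
# [('Harry Potter', 3), ('RSS Cookbook', 2)]
```

The genuinely related book loses to the bestseller just because the bestseller is in every basket, which is the same reason Blogger or Flickr can show up as "similar" to an unrelated page.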

So even if GoogleScout doesn't work well all the time, it's a great tool for research and serendipitous discoveries (add a bookmarklet to your browser to use this feature for any site you visit). Another way to find related pages is to search for a site in Google Directory and to click on its category. Similicio.us uses the bookmarks from del.icio.us to complete this sentence: "people who bookmarked this site also bookmarked...", while the untrustworthy Alexa fills in the blanks for "people who visit this site also visit...". Google also uses similar ideas to provide recommendations based on your search history.
