Monday, October 22, 2007

Remove Spam from Google Blog Search

Even though Google Blog Search doesn't have many interesting features, I still use it more often than Technorati because it's faster, it's not down for hours at a time, it's much more comprehensive, and it has features not available in any other major blog search engine. I still use Technorati for finding backlinks, because Google does a poor job in this area (compare Technorati with Google Blog Search). Unfortunately, Google Blog Search indexes a lot of spam posts that steal content and republish it to make money.

Google has two features that reduce the number of splogs (spam blogs) in search results. As in web search, there's a duplicate filter that removes some of the posts that are almost identical. But it doesn't exclude all of them, and it doesn't catch posts that duplicate articles from news sites like Business Week.


The second feature is the option to sort results by relevancy, which is enabled by default. It may seem counterintuitive to sort blog search results by relevancy and not chronologically, but that's a great way to filter splogs or at least move them to the bottom of Google's search results. Google uses a lot of signals to rank blog posts, including PageRank, the number of feed subscriptions and the amount of duplicate content. The downside is that results sorted by relevancy mix recent and old posts, which isn't always what you want. A better approach is to keep the relevancy sort and restrict the results to a recent period of time in the sidebar (the last day or the last hour, depending on the volume of posts).


If you see a "References" link after the snippet, that's an indication that Google found (a significant number of) backlinks, so the result should be a little more reliable.

Many spam blogs use Google Alerts to pollute the web and make money, so you could also add [-"google alert"] to your query (a search for "google alert" returns more than 200,000 results). A lot of spam blogs are hosted by Google's Blog*Spot, so removing the posts from blogspot.com could increase the quality of your results, but it would also remove non-spammy blogs like this one or Google's official blogs. I also noticed that many spam blogs use the .info TLD. A recent study showed that, when searching for commercial keywords, 75% of the results from blogspot.com and 68% of the results from .info sites are spam.
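
If you run the same kind of search often, the exclusion terms can be appended programmatically. Here's a minimal Python sketch; the function name add_exclusions and the sample query "digital cameras" are made up for illustration, and the only operators it relies on are the -"phrase" and -site: forms mentioned in this post.

```python
from urllib.parse import quote_plus


def add_exclusions(query, phrases=(), sites=()):
    """Append -"phrase" and -site:domain terms to a search query string.

    This is a hypothetical helper for illustration, not part of any Google API.
    """
    parts = [query.strip()]
    # Exclude exact phrases, e.g. -"google alert"
    parts += [f'-"{phrase}"' for phrase in phrases]
    # Exclude whole domains or TLDs, e.g. -site:blogspot.com or -site:.info
    parts += [f"-site:{site}" for site in sites]
    return " ".join(parts)


query = add_exclusions(
    "digital cameras",                 # sample query, chosen arbitrarily
    phrases=["google alert"],
    sites=["blogspot.com", ".info"],
)
print(query)              # the text you would type into the search box
print(quote_plus(query))  # the same text, URL-encoded for a query parameter
```

Keeping the phrase and site lists as arguments makes it easy to drop an exclusion when it turns out to hide results you care about.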

It's also a great idea to restrict the results to English (or another language) in "Advanced blog search".

So here's a summary:

1. sort the results by relevancy
2. restrict the results to a recent period (last day)
3. restrict the results to English (or another language)
4. if you really have to sort the results by date, remove the posts that follow a spammy pattern (for example, add -"google alert" -site:blogspot.com -site:.info to your query), but make sure you don't remove important results
5. give more weight to the posts that show a "References" link
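
If you already have a list of result URLs (from a feed, for example), the spammy-pattern filter from step 4 can also be applied on your side. The sketch below is only an illustration under that assumption: the host names are invented, and the keep list is one simple way to make sure you don't remove important results.

```python
from urllib.parse import urlparse

# Host patterns this post identifies as frequently spammy.
SPAMMY_HOST_SUFFIXES = (".blogspot.com", ".info")


def host_of(url):
    """Return the lowercase host name of a URL, or an empty string."""
    return (urlparse(url).hostname or "").lower()


def looks_spammy(url, suffixes=SPAMMY_HOST_SUFFIXES):
    """True if the URL's host equals or ends with one of the suffixes."""
    host = host_of(url)
    return any(host == s.lstrip(".") or host.endswith(s) for s in suffixes)


def filter_results(urls, keep=()):
    """Drop spammy-looking URLs, but always keep hosts listed in `keep`."""
    kept_hosts = {h.lower() for h in keep}
    return [u for u in urls if host_of(u) in kept_hosts or not looks_spammy(u)]


# Hypothetical result URLs, for illustration only.
results = [
    "http://good-blog.blogspot.com/2007/10/post.html",
    "http://random-splog.blogspot.com/cheap-stuff.html",
    "http://news.example.org/story",
    "http://deals.example.info/buy-now",
]
for url in filter_results(results, keep=["good-blog.blogspot.com"]):
    print(url)
```

The filter is deliberately crude: it judges a post only by where it's hosted, so the keep list matters as much as the patterns.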

Google should do a better job of detecting spam in Blog Search results and of identifying results from sites that happen to have feeds but aren't blogs. It should also make it more difficult for spammers to use services like Blogger or Google Alerts to pollute the search results.
