Laura Diane Hamilton

Technical Product Manager at Groupon


How Google Uses Machine Learning to Detect Spam Blogs (Maybe)

As long as people use search engines to navigate the internet, there will continue to be fly-by-night webspammers who use spam blogs (or "splogs"), link farms, and stolen or machine-generated content to drive artificial and illegitimate web traffic.

And as long as there are fly-by-night webspammers, Google will be trying to enhance their algorithms to detect, penalize, and ban the webspammers.

So there's essentially an arms race between Google and the webspammers, with evolving technology on both sides.

Disclaimer: I do not work for Google, I have never worked for Google, and I have no inside knowledge (or, indeed, any actual knowledge) of how their anti-webspam algorithm actually works. Rather, in this post I am speculating about some things Google might be doing in their algorithm. I am using only publicly available information (academic research articles and empirical analysis of search results).

Much of my speculation about Google anti-webspam is based on Detecting Spam Blogs: A Machine Learning Approach by Pranam Kolari, Akshay Java, Tim Finin, Tim Oates, and Anupam Joshi.

What is the incentive for spam blogs? How do the spammers profit?

It looks more or less like this:

  1. Webspammers programmatically create fake blogs, using either gibberish (machine-generated) content or content they have scraped (stolen) from legitimate sites. Often they will look for what people are frequently searching for, then make that the page title, and fill in the post with garbage content.
  2. Webspammers will start to get rankings for their splogs.
  3. Webspammers will then link from the splogs to link farms, then to affiliated sites
  4. The affiliate sites will get the link juice and PageRank from these ill-gotten links, and will come up in search queries more often — resulting in more traffic and more customers or advertising sales.

Here is an example of a bunch of splogs in search results. As a human it is ridiculously easy to tell that these are spam. To be fair to Google, though, I needed to use a pretty specialized query and go fairly deep into the search results in order to get some examples of splogs. So, that indicates to me that their anti-blogspam algorithm is pretty good.

The question of blogspam is interesting, because blogs have a number of features that make them relatively more susceptible to spam:

  • Blog searches rely heavily on recency — people really want the freshest content — which means that authority and links (traditional indicators of site quality) are relatively less important.
  • There are blog services that are free, quick to set up, and in some cases accessible programmatically (via APIs).
  • Blogs are often covered by separate, dedicated searches (e.g., Google's blog search).

My guess is that Google uses Support Vector Machine (SVM) models, which take the blog content and other features as input and predict whether a blog is legitimate or spammy. (An SVM is a type of machine-learning algorithm.)
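To make the SVM idea a bit more concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss, in plain Python. This is only an illustration of the technique, not a description of what Google or the paper's authors actually run. The three features, the toy numbers, and the function names (score, train_linear_svm, predict) are all my own assumptions.

```python
# Toy linear SVM, trained by sub-gradient descent on the hinge loss.
# Every feature value below is made up for illustration only.


def score(weights, bias, x):
    """Signed distance-like score: positive means 'looks like a splog'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_linear_svm(rows, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights and bias; labels are +1 (splog) or -1 (real blog)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            margin = y * score(weights, bias, x)
            # The L2 penalty shrinks the weights a little on every step.
            weights = [w * (1 - lr * reg) for w in weights]
            # Hinge loss: only examples inside the margin push the weights.
            if margin < 1:
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(weights, bias, x):
    return 1 if score(weights, bias, x) >= 0 else -1


# Hypothetical features per blog:
# [first-person word rate, comments enabled (0/1), share of links to .info]
rows = [
    [0.8, 1, 0.0], [0.6, 1, 0.1], [0.7, 0, 0.0],  # real blogs
    [0.0, 0, 0.9], [0.1, 0, 0.7], [0.0, 1, 0.8],  # splogs
]
labels = [-1, -1, -1, 1, 1, 1]

weights, bias = train_linear_svm(rows, labels)
print([predict(weights, bias, x) for x in rows])
```

A real system would presumably use a mature SVM library and far more features and training data; the point here is only that the model ends up as a weighted sum of features compared against a threshold.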

Here are some blog features that they are likely to be using:

  1. Usage frequency of various words. For example, as the authors of the paper note, "blogs often contain content that expresses personal opinions, so words like 'I', 'We', 'my,' and 'what' are common in authentic blog posts." This type of first-person language is rarely found in spam blog posts, or, for that matter, in most non-blog web content. The authors used a standard machine-learning technique called "bag of words" (commonly used in email spam filters, for example).
  2. Whether comments are turned on or off. According to the researchers, comments are typically turned off in splogs. Which makes sense; webspammers don't want to give away any of their precious link juice to commenters! (I don't expect they have very many commenters, though.)
  3. Real bloggers carefully put each post into a category (or give it tags), whereas sploggers tend to carelessly dump their posts into the "uncategorized" bucket.
  4. Sploggers are more likely to include high-paying advertising terms such as "new york."
  5. Real blogs tend to have comments, even if just one or two, whereas splogs tend not to have any.
  6. Real blogs include key first-person phrases such as "I have," and "to my," whereas machine-generated content typically has none.
  7. Real blogs often contain anchor text such as "comment" and "flickr"; real bloggers link to their vacation photos, for example.
  8. Real blogs link to legitimate sites such as Twitter, Facebook, or Wikipedia, whereas splogs are more likely to link to sketchy .info URLs. (These are likely to be link farms or similar.)
  9. Intuitively, it makes sense that real blogs mostly link to other real blogs (and rarely to splogs), whereas splogs link to other splogs. The authors of the paper say that this sort of network analysis didn't really improve the predictive power of the model beyond the content-related features, but Google has a lot more data than they do, and it's not at all far-fetched that Google has a more sophisticated way of analyzing the link graph to determine whether a blog is a splog.
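Several of the features in this list are simple enough to compute directly. Below is a rough sketch of what that feature extraction might look like, using only the Python standard library. The post dictionary layout (text, links, comments_enabled, comment_count, categories), the FIRST_PERSON word list, and the extract_features function are all hypothetical names I made up for this example; they are not taken from the paper or from anything Google has published.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical list of first-person words to look for.
FIRST_PERSON = {"i", "we", "my", "our", "me"}


def extract_features(post):
    """Turn one blog post (a plain dict) into a few numeric features."""
    # Bag of words: lowercase the text and count each word.
    words = re.findall(r"[a-z']+", post["text"].lower())
    counts = Counter(words)
    total_words = max(len(words), 1)  # avoid dividing by zero

    # Which hosts does the post link to, and how many end in .info?
    hosts = [urlparse(url).hostname or "" for url in post["links"]]
    info_links = sum(host.endswith(".info") for host in hosts)

    return {
        "first_person_rate": sum(counts[w] for w in FIRST_PERSON) / total_words,
        "comments_enabled": int(post["comments_enabled"]),
        "comment_count": post["comment_count"],
        "uncategorized": int(not post["categories"]),
        "info_link_share": info_links / max(len(hosts), 1),
    }


# Two made-up posts for illustration.
real_post = {
    "text": "I have posted my vacation photos",
    "links": ["https://www.flickr.com/photos/example"],
    "comments_enabled": True,
    "comment_count": 2,
    "categories": ["travel"],
}
splog_post = {
    "text": "cheap hotels new york cheap flights",
    "links": ["http://example.info/a", "http://deals.example.info/b"],
    "comments_enabled": False,
    "comment_count": 0,
    "categories": [],
}

print(extract_features(real_post))
print(extract_features(splog_post))
```

Feature dictionaries like these could then be flattened into vectors and handed to a classifier such as an SVM. The link-graph feature from the last item would need data about other blogs as well, so it is left out of this sketch.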

