Archive

Posts Tagged ‘black hat’

Search Engine Series: Indications of Web Spam

May 4th, 2009

A patent application from Microsoft looks at content generated to spam search engines. Here’s the problem, as noted in the patent filing:

In the best case, search engine optimizers help web site designers generate content that is well-structured, topical, and rich in relevant keywords or query terms. Unfortunately, some search engine optimizers go well beyond producing relevant pages: they try to boost the ratings of a web site by loading pages with a wide variety of popular query terms, whether relevant or not. In fact, some SEOs go one step further: Instead of manually creating pages that include unrelated but popular query terms, they machine-generate many such pages, each of which contains some monetizable keywords (i.e., keywords that have a high advertising value, such as the name of a pharmaceutical, credit cards, mortgages, etc.). Many small endorsements from these machine-generated pages result in a sizable page rank for the target page. In a further escalation, SEOs have started to set up DNS servers that will resolve any host name within their domain, and typically map it to a single IP address.

Most if not all of the SEO-generated pages exist solely to mislead a search engine into directing traffic towards the “optimized” site; in other words, the SEO-generated pages are intended only for the search engine, and are completely useless to human visitors.

I recognized this quote, which is taken from an interesting research paper from Microsoft, Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages. If you are interested in how search engines are attempting to fight web spam, it’s a “must read” paper.

 

It appears that this patent is an attempt to take some of the research reported in that paper and define a way to use it in a process that can help the search engine fight web spam. But it isn’t a rehashing of that paper, and it covers some new territory. Definitely worth a look, especially if you are concerned that your pages may be mistaken for spam by the search engines.

Using content analysis to detect spam web pages
Inventors: Marc Alexander Najork, Dennis Craig Fetterly, Mark Steven Manasse, and Alexandros Ntoulas
Assigned to Microsoft
US Patent Application 20060184500
Published August 17, 2006
Filed: February 11, 2005

Abstract

Evaluating content includes receiving content, analyzing the content for web spam using a content-based identification technique, and classifying the content according to the analysis. An index of analyzed contents may be created. A system for evaluating content includes a storage device configured to store data and a processor configured to analyze content using content-based identification techniques to determine whether web spam is present.

The patent describes some measures that the authors may be looking at when viewing the content of a page to determine whether or not the page is intended only to spam a search engine. The authors note that other steps and other metrics may also be involved.

Classification of Content

Metrics about pages are collected and fed into a classifier program which uses weighted scores to distinguish good pages from bad ones. The classifier program starts with an initial data set, called the training set, which is divided into positive and negative examples. The classifier looks at all of the features of the positive and negative examples in combination, in an attempt to separate the positive examples (non-spam) from the negative examples (spam).

Using a classifier like this may mean that once the dividing line is drawn, additional data may be looked at to see if it can be used to distinguish good pages from bad ones. We know from the “Spam, Damn Spam, and Statistics” paper that Microsoft is also looking at other features of pages and sites.

According to the patent filing, some classes of spam web pages can be detected by analyzing the content of the page and looking for “unusual” properties, such as:

  • The page contains unusually many words,
  • The page contains unusually many words within a title HTML element (<title>here!</title>),
  • The ratio of HTML markup to visible text is low,
  • The page contains an unusually large number of very long or very short words,
  • The page contains repetitive content,
  • The page contains unusually few common words (“stop words”), or
  • The page contains a larger-than-expected number of popular n-grams (sequences of n words)

These metrics or filters can be input into a classifier for deciding whether or not a page is spam or determining the likelihood or probability that the page is spam, by comparing the outputs of one or more of the metrics, alone or in combination, to one or more thresholds.
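As a rough illustration of comparing metrics to thresholds, here is a minimal Python sketch. It is not the patent's classifier: the function names, the three metrics chosen, the threshold values, and the "at least two flags" rule are all assumptions made up for this example. A real classifier would learn its weights from a training set.

```python
import re

WORD_RE = re.compile(r"[a-z']+")

def page_metrics(title, visible_text):
    """Compute three simple content metrics for one page."""
    words = WORD_RE.findall(visible_text.lower())
    title_words = WORD_RE.findall(title.lower())
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "word_count": len(words),
        "title_word_count": len(title_words),
        "avg_word_length": avg_len,
    }

# Hypothetical thresholds, invented for illustration only.
THRESHOLDS = {
    "word_count": 2000,
    "title_word_count": 20,
    "avg_word_length": 8.0,
}

def flagged_metrics(metrics, thresholds=THRESHOLDS):
    """Return the names of the metrics that exceed their threshold."""
    return [name for name, limit in thresholds.items() if metrics[name] > limit]

def looks_like_spam(metrics, min_flags=2):
    """Treat a page as likely spam when enough metrics are flagged together."""
    return len(flagged_metrics(metrics)) >= min_flags
```

Requiring several flags at once mirrors the point that no single metric should decide the outcome on its own.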

The patent mentions an example reference book which describes the existing body of work in machine learning: Pattern Classification (my link doesn’t go to the book itself, but rather to a page from one of the authors, which has a great series of PowerPoint slides about material in the book).

Identifying Web Spam on the Fly

The patent describes methods for finding spam pages during web crawls and/or evaluating content on the fly.

Here’s a summary of the process for identifying spam through content, on the fly, from the patent application:

  1. Search engine receives user input to begin a particular query,
  2. Search engine performs the query,
  3. Search engine receives the query results,
  4. Search engine (or processor or classifier, for example) evaluates the results using various metrics,
  5. After evaluation, the search engine analyzes the evaluations to determine what contents are likely web spam,
  6. From that analysis, the search engine may identify web pages as web spam and may record or store the contents in an index for future queries,
  7. Query results are then output to the searcher,
  8. Detected web spam could be excluded from a search engine index, given a low search ranking, or treated in a manner so that user queries are not affected or populated with web spam, which could lead to more relevant search results, or at least the omission of some irrelevant results.
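The steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: `filter_results`, the `score` callback, the `spam_index` set, and the 0.5 threshold are hypothetical names and values, not anything specified in the patent application.

```python
def filter_results(results, score, threshold=0.5, spam_index=None):
    """Evaluate query results and drop the ones that look like web spam.

    results    -- iterable of (url, content) pairs returned by the query
    score      -- callable mapping content to a spam likelihood in [0, 1]
    spam_index -- set of URLs already identified as spam by earlier queries
    """
    if spam_index is None:
        spam_index = set()
    kept = []
    for url, content in results:
        # Steps 4-5: evaluate each result, reusing earlier decisions if any.
        if url in spam_index or score(content) >= threshold:
            # Step 6: remember the page so future queries can skip it.
            spam_index.add(url)
            continue
        kept.append((url, content))
    # Step 7: only the surviving results are output to the searcher.
    return kept, spam_index
```

Dropping flagged pages is just one of the treatments listed in step 8. Assigning them a low ranking instead would only change what happens inside the `if` branch.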

Indications of Web Spam?

The list above of some “unusual properties” that may be looked for is examined in greater detail within the patent application. The following are paraphrases of some of those and some additional metrics. I’d recommend looking at the patent for their more detailed treatment of these. Keep in mind that many are just one factor to be looked at in conjunction with the others before a determination is made that a page is intended to spam a search engine.

1. As the number of words on a page increases, the probability of spam being present on that page increases.

2. As the number of words in the title of a web page increases, the probability of web spam being present dramatically increases.

3. As the visible content of the page increases, the probability of web spam being present increases to a point and then decreases dramatically.

4. As the fraction of anchor words increases (as a percentage of all the words on a page), the probability of web spam increases.

5. Web spam is more likely to occur in web pages having very long or very short words, so an average word length metric can be used to identify spam pages.

6. As the zipRatio of a page increases beyond a threshold, the probability of web spam being present on a web page increases dramatically. A zip ratio is calculated by dividing the size (in bytes) of uncompressed visible text (such as text other than HTML markup) by the size (in bytes) of compressed visible text.

7. As the percentage (and distribution) of stop words (the most commonly used words in a search engine corpus) used on a page decreases, the probability of web spam increases.

For example, the 100 most common words in a very large corpus representative of the English language is determined, e.g., by examining all the English web pages downloaded by the crawler (the same applies to other languages as well). It is then determined what fraction of the words on a single web page is drawn from the 100 most frequent words in the entire corpus. For example, words like “the”, “a”, “from”, etc. are among the 100 most frequent English words. If a web page had no occurrences of any of these words, but 100 occurrences of “echidna” (a spiny anteater and a rare word), it is determined that the page has 0% overlap with the top-100 words.

8. Pages are also reviewed for the existence of commonly occurring sequences of consecutive words (n-grams), their position within a document, and commonly occurring words that may appear after those sequences. Probabilities of those are calculated from documents on the web, and thresholds are defined which could be used to determine whether or not a page should be identified as web spam.
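Items 6 through 8 lend themselves to a short sketch. The Python below is my own simplified reading, not code from the patent: the function names are invented, zlib stands in for whatever compressor the authors use, and the n-gram function only measures how many of a page's n-grams are among the most popular ones in a corpus, rather than the fuller probability model the patent describes.

```python
import re
import zlib
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return WORD_RE.findall(text.lower())

def zip_ratio(visible_text):
    """Uncompressed size in bytes divided by compressed size in bytes."""
    raw = visible_text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))

def common_word_fraction(page_text, common_words):
    """Fraction of the page's words drawn from the corpus's most common words."""
    words = tokenize(page_text)
    if not words:
        return 0.0
    return sum(w in common_words for w in words) / len(words)

def ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def popular_ngram_fraction(page_text, corpus_ngram_counts, n=3, top_k=1000):
    """Fraction of the page's n-grams that are among the corpus's top_k n-grams."""
    popular = {gram for gram, _ in corpus_ngram_counts.most_common(top_k)}
    grams = ngrams(tokenize(page_text), n)
    if not grams:
        return 0.0
    return sum(gram in popular for gram in grams) / len(grams)
```

With a common-word set of "the", "a", and "from", a page consisting of 100 occurrences of "echidna" gets a `common_word_fraction` of 0.0, matching the 0% overlap in the example above. Heavily repeated text pushes `zip_ratio` up because it compresses so well.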

What is Black Hat SEO?

April 30th, 2009

This is a question that has crossed all of our minds at one point or another. Stumble into a Black Hat discussion on a general internet marketing forum and you are likely left with the impression that we are evil scum of the Earth that is out to destroy the very fabric of society while murdering baby kittens. 

The reality, however, is far from that. The majority of Black Hat SEO employs the EXACT same strategies and techniques used by any good White Hat marketer, but with a twist. We’re lazy... So lazy, in fact, that we like everything to be completely automated. Why go out and hand-pick backlinks when you can write a script to do it for you? Why spend weeks perfecting a single page of ad copy when you can write a program to create thousands of variations in a few minutes?

It is inevitable that someone brings up cookie stuffing or cross-site scripting. Sure, there are black hats that choose to partake in such activities, but that doesn’t make them black hat activities. You will find that the majority of us still have a solid moral compass and don’t cross certain lines.

I suppose my main point here is to point out how silly the distinction between black and white really is. It isn’t as clear cut as the titles imply. Hopefully that leaves you with a little bit to think about.

The Great Duplicate Content Myth

August 5th, 2008

Yesterday we discussed the HOW portion of detecting duplicate content. Today I want to get into the actual process itself.

A widespread theory in the SEO world states that duplicate content not only carries a heavy penalty, but in fact can and will lead to a domain being banned or deindexed. Today I am going to discuss why I believe that this is not only unfounded, but perhaps completely untrue.

Let’s start with some facts and figures. I’ve had the pleasure of reading dozens of research papers from MSN, Yahoo, Google, and other leading members of the academic and professional search arena. From these papers it’s easy to determine that duplicate content detection is entirely possible in theory and at least partly in practice, but I believe the “practice” portion is where almost everyone may be wrong.

So what would it take for the big G to pull off duplicate content testing in the real world? Well, let’s start by looking at the numbers. Let’s assume it’s still 2004 and Google still has “only” 8 billion pages in their index. Estimates show that they have several PETABYTES of data across their datacenters. So I’m Joe Webmaster and I put up a page about sprinklers. Does anyone here really believe that Google or anyone else on this planet actually has enough computer processing power to take my single page about sprinklers, shingle it, and compare it to their other 7,999,999,999 pages of content, each of which needs to be shingled as well? Shingling, as we discussed yesterday, is the process by which search engines distinguish unique content from duplicate content. Of course, you do have the problem of it being a very intensive calculation, because you’re not comparing A to B; you’re comparing every document against all other documents. I think they call this an O(n²) problem, and it happens to be a very expensive process in terms of CPU time. Unless a page is flagged to begin with, it would be cost- and time-prohibitive to carry out such an expensive calculation on every page in their data set.
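To put a number on that, here is a minimal Python sketch of word-level shingling and of the naive all-pairs comparison count. The function names, the shingle width of 4, and the use of set overlap (Jaccard resemblance) as the similarity measure are assumptions for illustration; I am not claiming this is how any search engine actually implements it.

```python
def shingles(text, w=4):
    """Set of all runs of w consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 0))}

def resemblance(a, b, w=4):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def pairwise_comparisons(n):
    """Number of distinct document pairs when every document meets every other."""
    return n * (n - 1) // 2

# With the 8 billion page figure used above:
print(pairwise_comparisons(8_000_000_000))  # 31999999996000000000
```

That is roughly 3.2 × 10^19 pair comparisons for the naive every-document-against-every-other approach, before counting the cost of shingling each page.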

So if this is the case, what is duplicate content used for? What is the scope of the data Google is looking for? I believe they check for duplicate content on a PER DOMAIN BASIS, meaning they take a single domain, check the content, and run comparisons to give the overall domain a content quality or duplicate content quality score. Let’s see why that makes sense on several levels. First, it’s within the ability of their crawler to do such a thing from a CPU processing power perspective. Second, it makes sense that they would factor this into the overall quality score for a domain.

Now the evidence:

1) A year ago I put up a 100 percent clone of Wikipedia. I used the Wikipedia template, I copied the data from their database, etc. This new domain was 100 percent identical to wikipedia.org.

The result? I rank well for thousands of terms, the domain has almost 1 million pages indexed in Google, and it receives 3-5K uniques per day. So much for a duplicate content penalty. Of course the content is highly unique from page to page on the domain, but it isn’t unique when the scope is expanded to include the entire internet.

2) PublicBlend.com - By definition all social media sites contain 100 percent duplicate content that would never pass a shingling algorithm. All of our stories come directly from other web pages. In fact they are direct copies of articles from all over the internet.

The result? PublicBlend.com has been steadily growing in search engine traffic every month and now receives over 3,000 uniques a day from Google alone. (We recently changed the domain name, so the indexing has started over.)

3) News sites, not just social media, but regular news media as well. Reuters is the source for 90 percent of the news on the net. Everyone duplicates their stories word for word yet they all rank well for the resulting stories.

I hope the above sparks some debate and discussion on the topic of duplicate content. It may also raise some other interesting questions:

From a white hat perspective, what happens when 50 spam sites scrape your feed? Will your content get penalized, or will the spam sites get penalized? How would a search engine determine who wrote the article first? Would they simply rely on domain trust? If so, that opens the door to all sorts of gaming options using old trusted domains.