
Duplicate Content Dissected

August 4th, 2008

I’ve read seemingly hundreds of forum posts discussing duplicate content, none of which gave the full picture, leaving me with more questions than answers. I decided to spend some time doing research to find out exactly what goes on behind the scenes. Here is what I have discovered.

Most people assume that duplicate content is evaluated at the page level, when in fact it is far more complex than that. Simply saying that “by changing 25 percent of the text on a page it is no longer duplicate content” is not accurate. Let’s examine why that is.

To gain some understanding we need to take a look at the k-shingle algorithm that may or may not be in use by the major search engines (my money is that it is in use). I’ve seen the following used as an example, so let’s use it here as well.

Let’s suppose that you have a page that contains the following text:

The swift brown fox jumped over the lazy dog.

Before we get to this point, the search engine has already stripped all tags and HTML from the page, leaving just this plain text behind for us to look at.

The shingling algorithm essentially finds word groups within a body of text in order to determine the uniqueness of the text. The first thing the search engines do is strip out all stop words like “and”, “the”, “of”, and “to”. They also strip out all fill words, leaving only the action words, which are considered the core of the content. Once this is done, the following four-word “shingles” are created from the above text. (I’m going to include the stop words for simplicity.)

The swift brown fox
swift brown fox jumped
brown fox jumped over
fox jumped over the
jumped over the lazy
over the lazy dog
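
The list above can be reproduced with a few lines of Python. This is only a minimal sketch of word-level shingling: the function name `shingles`, the window size of four, and the lowercasing and punctuation stripping are my own assumptions for illustration, not anything the search engines have confirmed. Like the list above, it keeps the stop words.

```python
def shingles(text, k=4):
    """Return the k-word shingles of text, in order of appearance."""
    # Assumption: lowercase and strip simple punctuation so that
    # "dog." and "dog" are treated as the same word.
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    # Slide a window of k words across the text, one word at a time.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


page = "The swift brown fox jumped over the lazy dog."
for shingle in shingles(page):
    print(shingle)
```

A nine-word sentence yields six four-word shingles, which matches the list above. Text shorter than the window produces no shingles at all.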

These are essentially like unique fingerprints that identify this block of text. The search engine can now compare this “fingerprint” to other pages in an attempt to find duplicate content. As duplicates are found, a “duplicate content” score is assigned to the page. If too many “fingerprints” match other documents, the score becomes high enough that the search engines flag the page as duplicate content, sending it to supplemental hell or, worse, deleting it from their index completely.

Now suppose a second page contains the following text:

My old lady swears that she saw the lazy dog jump over the swift brown fox.

The above gives us the following shingles:

my old lady swears
old lady swears that
lady swears that she
swears that she saw
that she saw the
she saw the lazy
saw the lazy dog
the lazy dog jump
lazy dog jump over
dog jump over the
jump over the swift
over the swift brown
the swift brown fox

Comparing these two sets of shingles we can see that only one matches (“the swift brown fox”). Thus it is unlikely that these two documents are duplicates of one another. No one but Google knows what percentage of shingles must match for two documents to be considered duplicates, but some thorough testing would sure narrow it down ;).
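
To put a number on that comparison, here is a rough sketch that counts shared shingles and divides by the total number of distinct shingles across both documents (the Jaccard similarity of the two sets). That particular formula is my assumption. How the search engines actually compute their score, and where they draw the line, is not public. The names `shingle_set` and `resemblance` are made up for this example.

```python
def shingle_set(text, k=4):
    """Return the set of k-word shingles in text."""
    # Assumption: lowercase and strip simple punctuation before windowing.
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Shared shingles divided by all distinct shingles (Jaccard similarity)."""
    sa, sb = shingle_set(a, k), shingle_set(b, k)
    if not sa and not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)


doc_a = "The swift brown fox jumped over the lazy dog."
doc_b = ("My old lady swears that she saw the lazy dog "
         "jump over the swift brown fox.")

shared = shingle_set(doc_a) & shingle_set(doc_b)
print(sorted(shared))                        # ['the swift brown fox']
print(round(resemblance(doc_a, doc_b), 3))   # 0.056
```

One shared shingle out of 18 distinct ones is a very low score, which fits the conclusion that these two pages would not be treated as duplicates.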

So what can we take away from the above examples? First and foremost, we quickly begin to realize that detecting duplicate content is far more involved than saying “document A and document B are 50 percent similar”. Second, we can see that people adding “stop words” and “filler words” to avoid duplicate content are largely wasting their time. It’s the “action” words that should be the focus. Changing action words without altering the meaning of a body of text may very well be enough to get past these algorithms. Then again, there may be other mechanisms at work that we can’t yet see, rendering that impossible as well. I suggest experimenting and finding what works for you in your situation.
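
If you want a starting point for that kind of experimenting, the sketch below shows why padding a page with filler changes nothing once stop words are stripped, while swapping action words does. Everything in it is hypothetical: the `STOP_WORDS` list is a tiny stand-in I made up (the real lists are unknown), and the three-word window is chosen only because the filtered sentences are short.

```python
# Hypothetical stop/filler word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "very", "really"}


def content_shingles(text, k=3):
    """Return the set of k-word shingles after dropping stop words."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


original = "The swift brown fox jumped over the lazy dog."
padded = "The very swift brown fox really jumped over the lazy dog."
reworded = "The quick brown fox leaped over the lazy dog."

# Filler words vanish before shingling, so the fingerprints are identical.
print(content_shingles(original) == content_shingles(padded))    # True

# Changing two action words leaves only one shingle in common.
print(content_shingles(original) & content_shingles(reworded))   # {'over lazy dog'}
```

With this toy word list, the padded version is indistinguishable from the original, while the reworded version shares only one of its five shingles with it.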