Archive

Posts Tagged ‘seo’

Do Patents Point to SEO Gold?

June 9th, 2009

Do patents, white papers, and other publications authored by search engine employees provide clear guidance on how to optimize Web pages? A panel of experts debated the issue at a recent Search Engine Strategies conference.

Software engineers and other staff at the commercial web search engines publish academic papers and apply for patents, which may or may not give proof about how search engines find and rank web pages. Dissecting whether understanding patents helps search optimizers were panelists Jon Glick, Senior Director of Product Search and Comparison Shopping at Become.com; Rand Fishkin, CEO of SEOmoz.org; and Bill Slawski, President of SEO by the Sea, Inc.

Search engine patents: proof or no proof

Many search engine optimizers regularly monitor search engine patent applications and use this documentation as proof that their methodologies help web pages rank. However, patent applications often offer limited, and even misleading, information.

“What search engines put into patents is often more like brainstorming,” said Glick. “It’s every approach that they can think of versus what they are actually doing, or even have a technology to do.”

Search engine staff file patents with the idea that they might use certain features in the future yet prevent their competitors from utilizing the same features. “Search engine staff know that their patent applications will be read by competitors and SEOs,” he continued. “You don’t actually have to use the features in the patent to be granted a patent, nor does anyone have to disclose all features in a patent application.”

For example, personal data is not likely to be used as part of a search engine algorithm. Many people might use the same computer (such as computers in libraries, universities, and Internet cafes); therefore, the personal data is often inaccurate and does little to enhance the search experience. Nonetheless, this information can be a part of the patent application.

“People should realize that looking at patents and white papers might describe things that never happen,” echoed Slawski.

However, some items in a patent application can be useful, such as the frequency of change of links, or evaluation of out-links. Web site owners do not have control over how other sites link to their site, but they do have complete control over their content and the sites they choose to link to. “Traditionally, search engines have ranked a web site based on who links to the site, not who the site links to,” said Glick. “Both Google and Yahoo! use out-links for spam evaluation, and next-generation algorithms are using them.”

According to Fishkin, search engines recognize manipulative link-building techniques by looking at links and link flow.

“Manipulative links are built for search engines, not (human) users,” said Fishkin. “They are built automatically rather than by hand. They are not an editorial vote for the quality of a page and are influenced by financial or less ‘legitimate’ incentives.”

Search engines use algorithmic techniques for identifying and combating manipulative links. Some of these techniques might include:

* Spotting link networks
* Similarity identification
* Trends and search data evaluation
* Web analytics and user surfing data

“If you’re concerned about privacy,” added Fishkin, “you have to question where the data from Google Analytics goes.”

Even expert SEOs can be easily confused by patent information. For example, some SEOs believe that having an RSS feed will automatically give a site a boost in rankings. In practice, a site with an RSS feed might simply be crawled more frequently, because the site is likely to have fresh content. “The rate of change in content mostly impacts crawl frequency, not ranking,” said Glick.

Evolution of search engine algorithms

Search engine algorithms are constantly evolving. “Search algorithms are getting better and better at understanding what the content on pages actually means,” Glick stated. “A few years ago they were just blindly indexing the words on a web page, but now they are beginning to understand what some of those words mean and what the page represents (store, news article, etc.). For example, (650) 555-1212 is a phone number.”

Slawski sees search engine algorithms evolving in stages. Stage 1 was a “one size fits all” approach, which, as Glick mentioned, was not very effective.

Stage 2 algorithms developed through understanding users. “Search engines are looking more into search query data, which involves analyzing search queries, collecting searcher information, and matching searcher intentions,” Slawski said. “With Stage 3, search engines are taking a step forward, not only looking at interactions but at people themselves.”

Should SEOs regularly monitor patent applications, white papers, and other publications that are authored by search engine software engineers and scientists? Absolutely. Search engines constantly try to improve the search experience, and information provided in these documents can help web site owners improve the search experience on their own sites. However, realize that patent information might not always offer the solid “proof” of an algorithm that one might believe.

Link Exchange: A Case Study Part 2

May 8th, 2009

In case you missed it, this is a multi-part series about Link Exchange. You can get caught up on the details and setup here: Link Exchange

For the rest of you, let’s dive into the numbers. This is now the 4th day of use. I am still running without a link throttle and averaging 16,000-18,000 link views per hour. Let’s take a look at the screenshot:

[Screenshot: letraffic2]

I have of course removed the link URLs and the keywords I’m targeting. We can see a tremendous number of Google link views, which really seems to be translating into increased indexing and higher rankings. The domain is now pretty much 100 percent indexed, and as the following image will show, the traffic has really increased. Nearly all of my target keywords now rank on the first page of Google, with several in the number 1 position.

Now, we all know Google is important, but it is hardly the only game in town. Let’s take a look at Yahoo and MSN/Live. Yahoo at the start of this was showing 0 pages indexed. We simply didn’t show up. As of today, we have 71 pages indexed and show 506 incoming links. Quite impressive, and even more impressive is the fact that we now rank in Yahoo for two of our most important target keyword phrases. In fact, today we received our first human traffic from Yahoo for our main target phrase. I expected Yahoo to take several weeks, but the sheer linking power of LE seems to be reducing that expected lag.

As for MSN, it was the same starting scenario, with 0 pages indexed just last week. Today, we see 29 indexed pages. Definitely an impressive start for only being 4 days in.

Next we will take a look at the most important data. The analytics:

[Image: traffic2]

As we can see from the graph, it’s quite the increase over our starting point. Recall that we were averaging 0-2 searches per day. Yesterday we had 98 incoming hits from search engines, and in fact we had several new members register as a result. 

That’s about all for today. Expect another update on Monday as we follow the progress :)

Does Domain Age Influence Ranking?

May 7th, 2009

The order in which pages appear in a search engine’s results may be influenced by the number of pages that link to a given page, and by the rankings of those linking pages.

When a site is linked to by a popular and trusted domain, that link might provide more value (and a higher ranking) than a link from a site that is less popular and trusted.

Ages of Linking Domains

A new patent application from Microsoft adds another twist, by also ranking domains based upon the ages of domains which link to those domains.

Why?

The cost of purchasing a domain has decreased significantly in recent years, and some domain registrars have offered free domain registrations for trial periods of thirty to sixty days.

A spammer might take advantage of an offer like that to build something known as a link farm, which is a spam technique in which spammers “purchase or otherwise obtain a large number of sites and interlink the sites together to increase the sites’ rankings by artificially increasing the number of contributing domains for some or all of the sites.”

The Microsoft patent application is:

Ranking Domains Using Domain Maturity
Invented by Janine Crumb, Krishna C Gade, Rangan Majumder, Vishnu Challam
Assigned to Microsoft
US Patent Application 20080086467
Published April 10, 2008
Filed October 10, 2006

Abstract

Ranking domains for search engines is provided herein. To rank a domain, contributing domains associated with the domain are identified. Additionally, the maturity of each of the contributing domains is determined.

A rank for the domain is then determined based at least in part on the maturity of each of the contributing domains. The domain rankings may then be used to order results for search queries.

This patent application assumes that newer domains have a “higher likelihood of being spam and/or being a part of a web farm that attempts to artificially inflate domain rankings for domains in the web farm.”

When the ages of the linking domains are taken into account in determining a domain’s rank, domains which have links from older domains “may be ranked higher than spam domains and/or less relevant domains.”

Maturity and Immaturity of Contributing Domains

A search engine may access and/or update domain information by communicating with the web servers that those domains are hosted upon. That information can include domain registration date, domain expiration date, domain swapping date(s), and a set of linked domains.

The maturity of a contributing domain may be based upon when that domain was registered or was first discovered by a search engine (if the domain information doesn’t provide a registration date).

Maturity may mean labeling a domain as mature or immature. For example, contributing domains registered more than a year ago could be considered mature domains.

Ranking based upon the ages of contributing domains could involve looking at:

1) Mature Domains only — A domain’s rank might be calculated based in part on only mature contributing domains that are associated with the domain.

2) Mature and Immature Domains — rankings might be influenced by both mature and immature domains, but the value of the rank for the immature domains might be based upon the ranks of the mature domains linking to those immature domains.

While some new domains can be spam, not all are. New domains that are popular, provide value, and gain links from older domains could be allowed to pass along the rankings from the mature domains associated with those new domains.

3) Instead of distinguishing between domains linking to a domain as either mature or immature, the age of a contributing (linking) domain might be used to provide a percentage of ranking to a domain:

For example, in an embodiment, domains that have been registered for more than ten years may contribute 100% of their accumulated ranks to a target domain’s rank;

domains that have been registered from six to ten years may contribute 75% of their accumulated ranks to a target domain’s rank;

domains that have been registered from three to six years may contribute 50% of their accumulated ranks to a target domain’s rank;

domains that have been registered for one to three years may contribute 25% of their accumulated ranks to a target domain’s rank; and

domains that have been registered for less than one year may only contribute 10% of their accumulated ranks.
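The example tiers above can be expressed as a small lookup. What follows is only a sketch under stated assumptions, not the method the patent application actually specifies: the class and function names are hypothetical, the weighted contributions are simply summed, and because the quoted ranges overlap at six and three years, a domain sitting exactly on a boundary is assigned to the higher tier.

```python
from dataclasses import dataclass


@dataclass
class ContributingDomain:
    # Hypothetical fields for illustration; none of these names come from the patent.
    name: str
    age_years: float          # time since registration (or first discovery)
    accumulated_rank: float   # rank the linking domain has accumulated


def maturity_weight(age_years: float) -> float:
    """Fraction of a linking domain's rank passed on, using the example tiers."""
    if age_years > 10:
        return 1.00
    if age_years >= 6:    # assumption: boundary ages fall into the higher tier
        return 0.75
    if age_years >= 3:
        return 0.50
    if age_years >= 1:
        return 0.25
    return 0.10


def maturity_weighted_rank(contributors: list[ContributingDomain]) -> float:
    """Assumption: a target's rank is the sum of the weighted contributions."""
    return sum(d.accumulated_rank * maturity_weight(d.age_years) for d in contributors)


links = [
    ContributingDomain("old.example", 12, 8.0),
    ContributingDomain("new.example", 0.5, 8.0),
]
print(maturity_weighted_rank(links))
```

In this toy case two linking domains with the same accumulated rank contribute very differently: the twelve-year-old domain passes on all of its rank, the six-month-old domain only a tenth.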

Resetting Maturity for Expired or Swapped Domains

The maturity of a domain might be reset if the domain expires or if the domain is swapped.

It’s possible for spammers to buy a block of expired domains, as well as new domains, to form a Web Farm. When a search engine resets the maturity of a domain, spammers don’t benefit from purchasing or swapping an older domain.
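One way to picture the reset is to count maturity from the most recent of the registration, lapse, or swap events. This is a minimal sketch, assuming that an expiry or swap restarts the clock on the date it happened; the function and parameter names are hypothetical, and the one-year threshold is borrowed from the "mature domain" example earlier in the post.

```python
from datetime import date
from typing import Optional


def effective_maturity_start(
    registered: date,
    lapsed_on: Optional[date] = None,    # hypothetical: date a past registration expired
    swapped_on: Optional[date] = None,   # hypothetical: date the domain changed hands
) -> date:
    """Return the date maturity is counted from.

    Assumption: any expiry or swap event resets maturity to that event's date.
    """
    events = [d for d in (registered, lapsed_on, swapped_on) if d is not None]
    return max(events)


def is_mature(start: date, today: date, threshold_days: int = 365) -> bool:
    """Treat a domain as mature once more than a year has passed since start."""
    return (today - start).days > threshold_days


today = date(2009, 5, 7)
swapped = effective_maturity_start(date(2000, 1, 1), swapped_on=date(2008, 12, 1))
untouched = effective_maturity_start(date(2000, 1, 1))
print(is_mature(swapped, today), is_mature(untouched, today))
```

Under these assumptions, a domain registered in 2000 but swapped a few months ago is treated like a new domain, while the same domain left with its original owner still counts as mature.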

Conclusion

A process like this might make it look like new domains are being penalized by search engines because they are new (what some might call a “sandbox” effect).

If a process like this were in place, it might cause new domains that aren’t linked to by older domains to not rank highly, at least until they get some links from older domains.

The Great Duplicate Content Myth

August 5th, 2008

Yesterday we discussed the HOW portion of detecting duplicate content. Today I want to get into the actual process itself.

A widespread theory in the SEO world states that duplicate content not only carries a heavy penalty, but in fact can and will lead to a domain being banned or deindexed. Today I am going to discuss why I believe that this is not only unfounded, but perhaps completely untrue.

Let’s start with some facts and figures. I’ve had the pleasure of reading dozens of research papers from MSN, Yahoo, Google, and other leading members of the academic and professional search arena. From these papers it’s easy to determine that duplicate content detection is entirely possible in theory and at least partly in practice, but I believe the “practice” portion is where almost everyone may be wrong.

So what would it take for the big G to pull off duplicate content testing in the real world? Well, let’s start by looking at the numbers. Let’s assume it’s still 2004 and Google still has “only” 8 billion pages in its index. Estimates show that they have several PETABYTES of data across their datacenters. So I’m Joe Webmaster and I put up a page about sprinklers. Does anyone here really believe that Google or anyone else on this planet actually has enough computer processing power to take my single page about sprinklers, shingle it, and compare it to their other 7,999,999,999 pages of content, each of which needs to be shingled as well?

Shingling, as we discussed yesterday, is the process by which search engines distinguish unique content from duplicate content. Of course, you do have the problem of it being a very intensive calculation, because you’re not comparing A->B; you’re comparing every document against all other documents. I think they call this an O(n^2) problem, and it happens to be a very expensive process in terms of CPU time. Unless a page is flagged to begin with, it would be cost and time prohibitive to carry out such an expensive calculation on every page in their data set.
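To make the cost argument concrete, here is a rough sketch of word-level shingling and a naive all-pairs comparison. It rests on assumptions rather than on anything a search engine has published: the shingle size of three words is arbitrary, Jaccard similarity is assumed as the way two shingle sets are compared, and the function names and sample texts are made up for illustration.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared shingles divided by all distinct shingles in either set."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_comparisons(n: int) -> int:
    """Distinct document pairs among n documents: n * (n - 1) / 2."""
    return n * (n - 1) // 2


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the sleepy dog",
    "c": "lawn sprinklers need regular seasonal maintenance",
}
sets = {name: shingles(text) for name, text in docs.items()}

# Naive approach: compare every document against every other document.
for x, y in combinations(docs, 2):
    print(x, y, round(jaccard(sets[x], sets[y]), 2))

# The same all-pairs approach applied to an 8-billion-page index.
print(pairwise_comparisons(8_000_000_000))
```

The near-duplicate pair scores about 0.56 and the unrelated page scores 0, but the final line is the point: comparing every page against every other page in an 8-billion-page index works out to roughly 3.2 x 10^19 pairs.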

So if this is the case, what is duplicate content used for? What is the scope of the data Google is looking for? I believe they check for duplicate content on a PER DOMAIN BASIS, meaning they take a single domain, check the content and run comparisons to give the overall domain a content quality or duplicate content quality score. Let’s see why that makes sense on several levels. First, it’s within the ability of their crawler to do such a thing from a CPU processing power perspective. It also makes sense that they would factor this into the overall quality score for a domain.

Now the evidence:

1) A year ago I put up a 100 percent clone of Wikipedia. I used the Wikipedia template, I copied the data from their database, etc. This new domain was 100 percent identical to wikipedia.com.

The result? I rank well for thousands of terms, the domain has almost 1 million pages indexed in Google, and it receives 3-5K uniques per day. So much for a duplicate content penalty. Of course the content is highly unique from page to page on the domain, but it isn’t unique when the scope is expanded to include the entire internet.

2) PublicBlend.com - By definition all social media sites contain 100 percent duplicate content that would never pass a shingling algorithm. All of our stories come directly from other web pages. In fact they are direct copies of articles from all over the internet.

The result? PublicBlend.com has been steadily growing in search engine traffic every month and now receives over 3,000 uniques a day from Google alone. (We recently changed the domain name, so the indexing has started over.)

3) News sites: not just social media, but regular news media as well. Reuters is the source for 90 percent of the news on the net. Everyone duplicates their stories word for word, yet they all rank well for the resulting stories.

I hope the above sparks some debate and discussion on the topic of duplicate content. It may also raise some other interesting questions:

From a white hat perspective, what happens when 50 spam sites scrape your feed? Will your content get penalized or will the spam sites get penalized? How would a search engine determine who wrote the article first? Would they simply rely on domain trust? If so, that opens the door to all sorts of gaming options using old trusted domains.

Welcome to BlackHat360.com

August 2nd, 2008

BlackHat360 is a site dedicated to all things BlackHat. For those joining us who don’t know, blackhat is a type of SEO, or Search Engine Optimization, that is often misunderstood. We are here to dispel some of the myths surrounding the technique by educating people on the various methods and practices commonly used. We’re just starting out, so bear with us while we bring you new information and tools. Be sure to stop by our forums for in-depth discussions and related information. BlackHat360 Forums