Archive for the ‘Search Engine Series’ Category

Do Patents Point to SEO Gold?

June 9th, 2009

Do patents, white papers, and other publications authored by search engine employees provide clear guidance on how to optimize Web pages? A panel of experts debated the issue at a recent Search Engine Strategies conference.

Software engineers and other staff at the commercial web search engines publish academic papers and apply for patents, which may or may not give proof about how search engines find and rank web pages. Dissecting whether understanding patents helps search optimizers were panelists Jon Glick, Senior Director of Product Search and Comparison Shopping at Become.com; Rand Fishkin, CEO of SEOmoz.org; and Bill Slawski, President of SEO by the Sea, Inc.

Search engine patents: proof or no proof

Many search engine optimizers regularly monitor search engine patent applications and use this documentation as proof that their methodologies help web pages rank. However, patent applications often offer limited, and even misleading, information.

“What search engines put into patents is often more like brainstorming,” said Glick. “It’s every approach that they can think of versus what they are actually doing, or even have a technology to do.”

Search engine staff file patents with the idea that they might use certain features in the future while preventing their competitors from using the same features. “Search engine staff know that their patent applications will be read by competitors and SEOs,” he continued. “You don’t actually have to use the features in the patent to be granted a patent, nor does anyone have to disclose all features in a patent application.”

For example, personal data is not likely to be used as part of a search engine algorithm. Many people might use the same computer (such as computers in libraries, universities, and Internet cafes); therefore, the personal data is often inaccurate and does little to enhance the search experience. Nonetheless, this information can be a part of the patent application.

“People should realize that looking at patents and white papers might describe things that never happen,” echoed Slawski.

However, some items in a patent application can be useful, such as the frequency of change of links, or evaluation of out-links. Web site owners do not have control over how other sites link to their site, but they do have complete control over their content and the sites they choose to link to. “Traditionally, search engines have ranked a web site based on who links to the site, not who the site links to,” said Glick. “Both Google and Yahoo! use out-links for spam evaluation, and next-generation algorithms are using them.”

According to Fishkin, search engines recognize manipulative link-building techniques by looking at links and link flow.

“Manipulative links are built for search engines, not (human) users,” said Fishkin. “They are built automatically rather than by hand. They are not an editorial vote for the quality of a page and are influenced by financial or less ‘legitimate’ incentives.”

Search engines use algorithmic techniques for identifying and combating manipulative links. Some of these techniques might include:

* Spotting link networks
* Similarity identification
* Trends and search data evaluation
* Web analytics and user surfing data

“If you’re concerned about privacy,” added Fishkin, “you have to question where the data from Google Analytics goes.”

Even expert SEOs can easily be confused by patent information. For example, some SEOs believe that having an RSS feed will automatically give a site a boost in rankings. However, if a site has an RSS feed, it might simply be crawled more frequently because the site is likely to have fresh content. “The rate of change in content mostly impacts crawl frequency, not ranking,” said Glick.

Evolution of search engine algorithms

Search engine algorithms are constantly evolving. “Search algorithms are getting better and better at understanding what the content on pages actually means,” Glick stated. “A few years ago they were just blindly indexing the words on a web page, but now they are beginning to understand what some of those words mean and what the page represents (store, news article, etc.). For example, (650) 555-1212 is a phone number.”

Slawski sees search engine algorithms evolving in stages. Stage 1 was a “one size fits all approach,” which, as Glick mentioned, was not very effective.

Stage 2 algorithms developed through understanding users. “Search engines are looking more into search query data, which involves analyzing search queries, collecting searcher information, and matching searcher intentions,” Slawski said. “With Stage 3, search engines are taking a step forward, not only looking at interactions but at people themselves.”

Should SEOs regularly monitor patent applications, white papers, and other publications that are authored by search engine software engineers and scientists? Absolutely. Search engines constantly try to improve the search experience, and information provided in these documents can help web site owners improve the search experience on their own sites. However, realize that patent information might not always offer the solid “proof” of an algorithm that one might believe.

Microsoft launching new search engine Bing (logo leaked)

May 26th, 2009

Within the next few days, Microsoft is expected to unveil its latest attempt at trying to be a player in the world of web search. After failing to get live.com any traction against Google, it will apparently launch a new engine called “Bing,” the project formerly known by its working title “Kumo.” This should be unveiled at the D conference, which starts today in Carlsbad, CA, but it looks like Microsoft may be giving us a peek at the logo a tad early.

While it appears that Microsoft may have already taken it down, I visited bing.com in my browser about 10 minutes ago and sure enough saw the favicon you see above. It’s a lowercase “b” with a yellow/orange dot in the middle. It would appear that this will be at least a part of the Bing logo. The light blue and yellow/orange color combination matches that of Kumo. I find that combination to be quite ugly, sort of like the Cleveland Cavaliers basketball uniforms (below) from the 1990s, but hey, that’s just personal taste. All that really matters is how the search engine actually performs.

This favicon, which again, may only be a part of the logo, also looks a lot like the logo for Blinkx, the video search engine. That features a red lowercase “b” with an eye in the middle.

Microsoft is spending some $80 to $100 million on a marketing campaign for Bing, according to Ad Age. That’s huge by any standard, but especially when you consider that Google only spent $25 million on all of its marketing last year. I don’t know what Microsoft plans to spend all that money on, but I get the sneaking suspicion that Bing Crosby will be involved in some way or another.

What Is Google Squared? It Is How Google Will Crush Wolfram Alpha (Video)

May 12th, 2009

One of the next frontiers of search is taking all of the unstructured data spread helter-skelter across the Web and treating it like it is sitting in a nice, structured database. It is easier to get answers out of a database where everything is neatly labeled, stamped, and categorized. As the sheer volume of stuff on the Web keeps growing, keyword search keeps getting closer to its breaking point. Adding structure to the Web is one way to make sense of all that data, and Google is starting to tackle the problem with a Google Labs project called Google Squared, which Marissa Mayer mentioned earlier today at the company’s Searchology briefing.

Google Squared extracts data from Web pages and presents them in search results as squares in an online spreadsheet. Michael was at the event and got a personal demo (see video below). From Michael’s Searchology notes:

Google Squared is launching later this month in labs. Google Squared returns search results in a spreadsheet format. It structures the unstructured data on web pages. So a search for Small Dogs returns results with names, description, size, weight, origin, etc., in columns and rows.

Google is looking for data structures on the web that imply facts, and then grabbing it for Squared results. “It takes an incredible amount of compute power to create one of those squares,” she says.

This type of technology has obvious applications for many types of targeted searches, including product search, health search, scientific searches, you name it. There are dozens of semantic search startups trying to impose structure on the Web to perform similar tricks. Another high-profile search startup which is launching on Monday, Wolfram Alpha, takes a slightly different approach in that it simply ingests massive amounts of information into its own databases where it can query it to its heart’s delight. Already there is a bit of a rivalry between Google and Wolfram because getting back structured results is a major new direction for search.

Wolfram does a pretty good job parsing the information in its own databases, but those databases will never match what is available on the Web. Wolfram’s databases currently store only 10 terabytes of information, a tiny fraction of what is on the Web. (I will be posting my impressions of Wolfram’s search engine soon). Google Squared is an early attempt to take the messy data which exists on the Web and place it into simple tables. It is still very experimental and isn’t always on target, but you can see where this is going. Turning the Web into a giant database will crush any attempt to segregate the “best” information into a separate database so that it can be processed and searched more deeply.

In the video demo below, a search for “camera” sorts the results into columns by image, description, manufacturer, resolution, etc. You can refine results by clicking on a particular column such as manufacturer. A search for “rollercoasters” sorts results by name, image, description, height, length, and number of inversions. But sometimes it gets confused. A search for “spaceships” turns up a Corvette and a missile carrier. It is going to be a while before this makes it out of Google Labs.

Does Domain Age Influence Ranking?

May 7th, 2009

The order that pages appear in the results of a search at a search engine may be influenced by the number of pages that link to that page, and by rankings of the pages that link to that page.

When a site is linked to by a popular and trusted domain, that link might provide more value (and a higher ranking) than a link from a site that is less popular and trusted.

Ages of Linking Domains

A new patent application from Microsoft adds another twist, by also ranking domains based upon the ages of domains which link to those domains.

Why?

The cost of purchasing a domain has decreased significantly in recent years, and some domain registrars have offered free domain registrations for thirty- to sixty-day trial periods.

A spammer might take advantage of an offer like that to build something known as a link farm, which is a spam technique in which spammers “purchase or otherwise obtain a large number of sites and interlink the sites together to increase the sites’ rankings by artificially increasing the number of contributing domains for some or all of the sites.”

The Microsoft patent application is:

Ranking Domains Using Domain Maturity
Invented by Janine Crumb, Krishna C Gade, Rangan Majumder, Vishnu Challam
Assigned to Microsoft
US Patent Application 20080086467
Published April 10, 2008
Filed October 10, 2006

Abstract

Ranking domains for search engines is provided herein. To rank a domain, contributing domains associated with the domain are identified. Additionally, the maturity of each of the contributing domains is determined.

A rank for the domain is then determined based at least in part on the maturity of each of the contributing domains. The domain rankings may then be used to order results for search queries.

This patent application assumes that newer domains have a “higher likelihood of being spam and/or being a part of a web farm that attempts to artificially inflate domain rankings for domains in the web farm.”

By looking at the age of domains that link to those newer domains when determining a rank for a domain, domains which have links from older domains “may be ranked higher than spam domains and/or less relevant domains.”

Maturity and Immaturity of Contributing Domains

A search engine may communicate with the web servers that domains are hosted upon to access and/or update domain information, such as domain registration date, domain expiration date, domain swapping date(s), and a set of linked domains.

The maturity of a contributing domain may be based upon when that domain was registered or was first discovered by a search engine (if the domain information doesn’t provide a registration date).

Maturity may mean labeling a domain as mature or immature. For example, contributing domains registered more than a year ago could be considered mature domains.

Ranking based upon the ages of contributing domains could involve looking at:

1) Mature Domains only — A domain’s rank might be calculated based in part on only mature contributing domains that are associated with the domain.

2) Mature and Immature Domains — rankings might be influenced by both mature and immature domains, but the value of the rank for the immature domains might be based upon the ranks of the mature domains linking to those immature domains.

While some new domains can be spam, not all are. New domains that are popular, provide value, and gain links from older domains could be allowed to pass along the rankings from the mature domains associated with those new domains.

3) Instead of distinguishing between linking domains as either mature or immature, the age of a contributing (linking) domain might be used to set the percentage of its ranking that it passes to a domain:

For example, in an embodiment, domains that have been registered for more than ten years may contribute 100% of their accumulated ranks to a target domain’s rank;

domains that have been registered from six to ten years may contribute 75% of their accumulated ranks to a target domain’s rank;

domains that have been registered from three to six years may contribute 50% of their accumulated ranks to a target domain’s rank;

domains that have been registered for one to three years may contribute 25% of their accumulated ranks to a target domain’s rank; and

domains that have been registered for less than one year may only contribute 10% of their accumulated ranks.
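
The tiers in this example can be sketched in a few lines of Python. This is only an illustration of the percentages listed above, not Microsoft's implementation: the function names and the input format are made up for the example, and the handling of the exact boundary years (which the example leaves ambiguous) is an assumption.

```python
def maturity_weight(years_registered):
    """Fraction of a contributing domain's rank passed on, per the example tiers.

    Boundary handling (for instance, exactly ten years) is an assumption;
    the example tiers do not say which side a boundary falls on.
    """
    if years_registered > 10:
        return 1.00
    if years_registered >= 6:
        return 0.75
    if years_registered >= 3:
        return 0.50
    if years_registered >= 1:
        return 0.25
    return 0.10


def age_weighted_rank(contributors):
    """Sum each contributing domain's rank scaled by its maturity weight.

    `contributors` is a hypothetical list of
    (accumulated_rank, years_registered) pairs.
    """
    return sum(rank * maturity_weight(years) for rank, years in contributors)


# Three linking domains with the same accumulated rank but different ages.
links = [(1.0, 12), (1.0, 4), (1.0, 0.5)]
print(age_weighted_rank(links))  # 1.0 + 0.5 + 0.1
```

Under these assumptions, a link from a twelve-year-old domain counts ten times as much as an otherwise identical link from a domain registered six months ago.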

Resetting Maturity for Expired or Swapped Domains

The maturity of a domain might be reset if the domain expires or if the domain is swapped.

It’s possible for spammers to buy a block of domains that have expired as well as new domains to form a web farm. When a search engine resets the maturity of a domain, spammers don’t benefit from purchasing or swapping an older domain.

Conclusion

A process like this might make it look like new domains are being penalized by search engines because they are new (what some might call a “sandbox” effect).

If a process like this were in place, it might cause new domains that aren’t linked to by older domains to not rank highly, at least until they get some links from older domains.

Do Search Engines Love Blogs?

May 5th, 2009

Microsoft Explores an Algorithm to Increase PageRank for Pages Linked to by Blogs.

In the new patent document, they ask if the rankings of web pages in search results would be improved by providing a slight increase in the PageRank of pages linked to by blogs. They tell us that:

This idea is based on the assumption (or hope) that blogs are still mostly human-authored, and that links from blogs generally represent sincere endorsements on the part of the authors.

The December post explored how a search engine might be able to identify blog pages and distinguish them from non-blog pages, and told us that:

Search engines are increasingly implementing features that restrict the results for queries to be from blog pages.

But limiting the number of blogs that show up in search results doesn’t necessarily mean that a search engine doesn’t like blogs. It may mean that search engines would prefer to show a diversified set of search results, including blog pages and other results.

Ranking Algorithms

Search engines often look at a couple of different kinds of ranking factors when determining the order that search results are shown to searchers.

Query-Independent and Query-Dependent

One way to classify ranking algorithms is query-dependent (or dynamic) or query-independent (or static).

Query-dependent ranking algorithms rely upon the query terms someone uses to rank pages, while query-independent look at other factors such as how important they may believe a page to be based upon things such as whether or not important pages link to that page (an example of a query-independent ranking algorithm would be PageRank).

Query-independent ranking algorithms assign a quality score to each document on the web, and can be run ahead of time. Query-dependent ranking algorithms depend upon the query used, and have to be run when a user submits a query.

Content, Usage, and Link Based Ranking Algorithms

It’s also possible to classify ranking algorithms as content-based, usage-based, and link-based.

Content-based ranking algorithms - use the words in a document to rank the document among other documents. For instance, a higher score might be assigned to a document that contains the query terms at the beginning of a document, in a prominent font, or in a certain kind of HTML element.

Usage-based ranking algorithms - may assign a score based on estimates of how often documents are viewed, drawn from web proxy logs or from click-throughs on search engine results pages.

Link-based ranking algorithms - look at the hyperlinks between web pages to rank those pages, assigning a score to a page based upon the links pointing to it, with each link treated as an endorsement of the page.

PageRank - an example of a query-independent link-based ranking algorithm.

The PageRank formula is often explained as follows. Consider a web surfer who is performing a random walk on the web. At every step along the walk, the surfer moves from one web page to another, using the following algorithm.

With some probability d, the surfer selects a web page uniformly at random and jumps to it; otherwise, the surfer selects one of the outgoing hyperlinks in the current page uniformly at random and follows it. Because of this metaphor, the number d is sometimes called the “jump probability,” namely the probability that the surfer will jump to a completely random page.

If the web surfer jumps with probability d and there are |V| web pages, the probability of jumping to a particular page is d/|V|. Since any page can be reached by jumping, every page is guaranteed a score of at least d/|V|. The PageRank of a particular web page is then the fraction of time that the random surfer will spend at that page.
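
The random-surfer description above can be turned into a short power-iteration sketch. This is a minimal illustration, not the code any search engine runs: the function name, the tiny example graph, the default jump probability of 0.15, and the treatment of pages without out-links (the surfer always jumps from them) are all assumptions.

```python
def pagerank(links, d=0.15, iterations=100):
    """Estimate the fraction of time a random surfer spends on each page.

    `links` maps every page to the list of pages it links to; every link
    target must also appear as a key. `d` is the jump probability.
    Pages with no out-links are treated as always jumping (an assumption).
    """
    pages = list(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page is guaranteed the d/|V| share that comes from jumps.
        nxt = {p: d / n for p in pages}
        for page, outs in links.items():
            if outs:
                # Follow one of the outgoing links uniformly at random.
                share = (1 - d) * score[page] / len(outs)
                for target in outs:
                    nxt[target] += share
            else:
                # Dead end: spread the remaining probability uniformly.
                for p in pages:
                    nxt[p] += (1 - d) * score[page] / n
        score = nxt
    return score


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph nothing links to page "d", so its score settles at exactly the d/|V| floor described above, while "c", which collects links from three pages, ends up on top.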

But what if that surfer started favoring pages that were linked to by blogs a little more?

Splitting PageRank

One of the problems behind using PageRank is that some commercial web sites try to inflate PageRank by creating links that point to a page solely for the purpose of endorsing that page, artificially increasing the value of the page.

This patent filing describes in some detail how a portion of PageRank from a page might be split (or distributed) equally amongst the links found on the pages of a site, and how the distribution of PageRank could be slightly altered to favor (or show a bias towards) pages that are linked to by blogs.

If blogs are, as the authors note in the patent, “still mostly human authored, and generally represent sincere endorsements of their authors,” then this bias might help counteract the artificial inflation of PageRank scores by people who would create links pointing to pages solely for the purpose of artificially increasing the PageRank of pages.

The patent filing is:

Ranking Method using Hyperlinks in Blogs
Inventors: Steve Chien and Dennis Fetterly
Assigned to Microsoft
US Patent Application 20080243812
Published October 2, 2008
Filed March 30, 2007

Abstract

A method for static ranking of web documents is disclosed. Search engines are typically configured such that search results having a higher PageRank® score are listed first. A modified scoring technique is provided whereby the score includes a reset vector that is biased toward web pages linked to blogs. This requires identifying web pages as either blogs or non-blogs.
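
One way to read the “reset vector” in the abstract: when the random surfer jumps, the landing page is no longer chosen uniformly, and pages that a blog links to are slightly more likely to be picked. The sketch below illustrates that reading only. The filing's actual formula is not reproduced here, and the `bias` multiplier, the function name, and the four-page graph are hypothetical.

```python
def blog_biased_pagerank(links, blog_pages, d=0.15, bias=2.0, iterations=100):
    """PageRank in which random jumps favor pages that a blog links to.

    `links` maps every page to the pages it links to; in this simplified
    sketch every page must have at least one out-link. `blog_pages` is the
    set of pages already classified as blogs. `bias` is a hypothetical
    multiplier, not a value from the patent filing.
    """
    pages = list(links)
    endorsed = {t for b in blog_pages for t in links.get(b, [])}
    weight = {p: (bias if p in endorsed else 1.0) for p in pages}
    total = sum(weight.values())
    reset = {p: weight[p] / total for p in pages}  # jump distribution, sums to 1

    score = dict(reset)
    for _ in range(iterations):
        # Jumps now land according to the biased reset vector.
        nxt = {p: d * reset[p] for p in pages}
        for page, outs in links.items():
            share = (1 - d) * score[page] / len(outs)
            for target in outs:
                nxt[target] += share
        score = nxt
    return score


web = {"blog": ["x"], "shop": ["y"], "x": ["blog"], "y": ["shop"]}
plain = blog_biased_pagerank(web, blog_pages=set())
biased = blog_biased_pagerank(web, blog_pages={"blog"})
print(round(plain["x"], 3), round(biased["x"], 3))
```

With no blogs identified, the reset vector is uniform and this reduces to ordinary PageRank; once "blog" is marked as a blog, the page it links to gains score at the expense of the pages that no blog endorses.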

Identifying Blogs

Some of the kinds of things that a search engine crawling program might look at when deciding whether a page is from a blog might include:

  1. Whether a page is hosted in a known blog hosting DNS domain such as blogspot.com or wordpress.com
  2. What features are contained in the non-markup words and phrases in the page
  3. What the targets of outgoing links might be in the page, and
  4. Whether the string “blog” occurs in the URL
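
The two URL-based signals in this list (items 1 and 4) are simple enough to sketch. The snippet below is an illustration only: the host list is a hypothetical, incomplete stand-in, the function name is made up, and the content-based and link-based signals (items 2 and 3) are left out because they need the page itself.

```python
from urllib.parse import urlparse

# Hypothetical, incomplete list of blog hosting domains.
BLOG_HOSTS = ("blogspot.com", "wordpress.com")


def blog_signals(url):
    """Return the URL-only blog signals for one page."""
    host = (urlparse(url).hostname or "").lower()
    return {
        # Signal 1: page is hosted under a known blog hosting domain.
        "known_blog_host": any(
            host == h or host.endswith("." + h) for h in BLOG_HOSTS
        ),
        # Signal 4: the string "blog" occurs anywhere in the URL.
        "blog_in_url": "blog" in url.lower(),
    }


print(blog_signals("http://example.wordpress.com/2009/05/post"))
print(blog_signals("http://www.example.com/blog/post"))
print(blog_signals("http://www.example.com/products"))
```

Neither signal is conclusive on its own, which is presumably why the list combines them with evidence from the page's text and its outgoing links.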

Experimenting with a Bias Towards Pages Linked to by Blogs

The authors of this patent performed experiments where they downloaded over 472 million pages, and found links to an additional 6 billion pages within those pages.

They reranked the PageRank of these pages using a bias towards pages that they identified were linked to by blogs, with a preference towards using blog pages that had higher PageRanks, which they tell us tend to be “frequently updated, more informational rather than personal, and free of spam.”

They also tell us that some other characteristics of blogs may prove useful in refining this technique, such as looking at the number of subscribers to a particular blog, and associating a higher endorsement value to blogs with greater numbers of subscribers.

Conclusion

Is sending more PageRank to pages that are linked to by blogs something that will increase the relevance and importance of pages that show up in search results? Are links to pages from blogs still actual endorsements from the authors of those blogs?

Do search engines love blogs?

Search Engine Series: Indications of Web Spam

May 4th, 2009

A patent application from Microsoft looks at content generated to spam search engines. Here’s the problem, as noted in the patent filing:

In the best case, search engine optimizers help web site designers generate content that is well-structured, topical, and rich in relevant keywords or query terms. Unfortunately, some search engine optimizers go well beyond producing relevant pages: they try to boost the ratings of a web site by loading pages with a wide variety of popular query terms, whether relevant or not. In fact, some SEOs go one step further: Instead of manually creating pages that include unrelated but popular query terms, they machine-generate many such pages, each of which contains some monetizable keywords (i.e., keywords that have a high advertising value, such as the name of a pharmaceutical, credit cards, mortgages, etc.). Many small endorsements from these machine-generated pages result in a sizable page rank for the target page. In a further escalation, SEOs have started to set up DNS servers that will resolve any host name within their domain, and typically map it to a single IP address.

Most if not all of the SEO-generated pages exist solely to mislead a search engine into directing traffic towards the “optimized” site; in other words, the SEO-generated pages are intended only for the search engine, and are completely useless to human visitors.

I recognized this quote, which is taken from an interesting research paper from Microsoft, Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages. If you are interested in how search engines are attempting to fight web spam, it’s a “must read” paper.

It appears that this patent is an attempt to take some of the research reported in that paper and define a way to use it in a process that can help the search engine fight web spam. But it isn’t a rehashing of that paper, and it covers some new territory. It is definitely worth a look, especially if you are concerned that your pages may be mistaken for spam by the search engines.

Using content analysis to detect spam web pages
Inventors: Marc Alexander Najork, Dennis Craig Fetterly, Mark Steven Manasse, and Alexandros Ntoulas
Assigned to Microsoft
US Patent Application 20060184500
Published August 17, 2006
Filed: February 11, 2005

Abstract

Evaluating content includes receiving content, analyzing the content for web spam using a content-based identification technique, and classifying the content according to the analysis. An index of analyzed contents may be created. A system for evaluating content includes a storage device configured to store data and a processor configured to analyze content using content-based identification techniques to determine whether web spam is present.

The patent describes some measures that the authors may be looking at when viewing the content of a page to determine whether or not the page is intended only to spam a search engine. The authors note that other steps and other metrics may also be involved.

Classification of Content

Metrics about pages are collected and fed into a classifier program which uses weighted scores to distinguish good pages from bad ones. The classifier program starts with an initial data set, called the training set, which is divided into positive and negative examples. The classifier looks at all of the features of the positive and negative examples in combination, in an attempt to separate the positive examples (non-spam) from the negative examples (spam).

Using a classifier like this may mean that once the dividing line is made, additional data may be looked at to see if it can be used to distinguish good pages from bad ones. We know from the “Spam, Damn Spam, and Statistics” paper that Microsoft is also looking at other features of pages and sites.

According to the patent filing, some classes of spam web pages can be detected by analyzing the content of the page and looking for “unusual” properties, such as:

  • The page contains unusually many words,
  • The page contains unusually many words within a title HTML element (<title>here!</title>)
  • The ratio of HTML markup to visible text is low,
  • The page contains an unusually large number of very long or very short words,
  • The page contains repetitive content,
  • The page contains unusually few common words (“stop words”), or
  • The page contains a larger-than-expected number of popular n-grams (sequences of n words)

These metrics or filters can be input into a classifier for deciding whether or not a page is spam or determining the likelihood or probability that the page is spam, by comparing the outputs of one or more of the metrics, alone or in combination, to one or more thresholds.
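
The idea of comparing metrics to thresholds can be sketched as follows. This is a deliberately crude stand-in for a trained classifier: it computes three of the measures listed above and flags any that exceed a cutoff. The function names are made up, and the threshold numbers are hypothetical values chosen for illustration, not figures from the filing.

```python
def content_metrics(title, visible_text):
    """Compute a few of the content measures for one page."""
    words = visible_text.split()
    return {
        "word_count": len(words),
        "title_word_count": len(title.split()),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


# Hypothetical thresholds for illustration only; a real classifier would
# learn its decision boundary from a labeled training set.
THRESHOLDS = {"word_count": 2000, "title_word_count": 20, "avg_word_length": 8.0}


def spam_flags(metrics, thresholds=THRESHOLDS):
    """Names of the metrics whose values exceed their thresholds."""
    return [name for name, limit in thresholds.items() if metrics[name] > limit]


page = content_metrics(
    title="cheap pills " * 15,  # a 30-word title
    visible_text="buy cheap pills online today",
)
print(page)
print(spam_flags(page))
```

A single raised flag would not settle the matter; as the filing says, the outputs can be considered alone or in combination before a page is judged to be spam.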

The patent mentions an example reference book which describes the existing body of work in machine learning: Pattern Classification (my link doesn’t go to the book itself, but rather to a page from one of the authors, which has a great series of powerpoint slides about material in the book).

Identifying Web Spam on the Fly

The patent describes methods for finding spam pages during web crawls and/or evaluating content on the fly.

Here’s a summary of the process for identifying spam through content, on the fly, from the patent application:

  1. Search engine receives user input to begin a particular query,
  2. Search engine performs the query,
  3. Search engine receives the query results,
  4. Search engine (or processor or classifier, for example) evaluates the results using various metrics,
  5. After evaluation, the search engine analyzes the evaluations to determine what contents are likely web spam,
  6. From that analysis, the search engine may identify web pages as web spam and may record or store the contents in an index for future queries,
  7. Query results are then output to the searcher,
  8. Detected web spam could be excluded from a search engine index, given a low search ranking, or treated in a manner so that user queries are not affected or populated with web spam, which could lead to more relevant search results, or at least the omission of some irrelevant results.

Indications of Web Spam?

The list above of some “unusual properties” that may be looked for is examined in greater detail within the patent application. The following are paraphrases of some of those and some additional metrics. I’d recommend looking at the patent for their more detailed treatment of these. Keep in mind that many are just one factor to be looked at in conjunction with the others before a determination is made that a page is intended to spam a search engine.

1. As the number of words on a page increases, the probability of spam being present on that page increases.

2. As the number of words in the title of a web page increases, the probability of web spam being present dramatically increases.

3. As the visible content of the page increases, the probability of web spam being present increases to a point and then decreases dramatically.

4. As the fraction of anchor words increases (as a percentage of all the words on a page), the probability of web spam increases.

5. Web spam is more likely to occur in web pages having very long or very short words, so an average word length metric can be used to identify spam pages.

6. As the zipRatio of a page increases beyond a threshold, the probability of web spam being present on a web page increases dramatically. A zip ratio is calculated by dividing the size (in bytes) of uncompressed visible text (such as text other than HTML markup) by the size (in bytes) of compressed visible text.
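
The zip ratio is easy to compute with a standard compression library. The sketch below uses Python's zlib (DEFLATE) as the compressor, which is an assumption; the description above does not depend on a particular compression algorithm, and the function name and sample strings are made up.

```python
import zlib


def zip_ratio(visible_text):
    """Uncompressed size in bytes divided by compressed size in bytes."""
    raw = visible_text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


repetitive = "cheap mortgage rates " * 200
varied = "Search engines crawl pages, index their words, and rank them for each query."
print(round(zip_ratio(repetitive), 1), round(zip_ratio(varied), 1))
```

Highly repetitive text compresses very well, so a page that repeats the same phrases over and over produces a much larger ratio than ordinary prose of any length.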

7. As the percentage (and distribution) of stop words (the most commonly used words in a search engine corpus) used on a page decreases, the probability of web spam increases.

For example, the 100 most common words in a very large corpus representative of the English language is determined, e.g., by examining all the English web pages downloaded by the crawler (the same applies to other languages as well). It is then determined what fraction of the words on a single web page is drawn from the 100 most frequent words in the entire corpus. For example, words like “the”, “a”, “from”, etc. are among the 100 most frequent English words. If a web page had no occurrences of any of these words, but 100 occurrences of “echidna” (a spiny anteater and a rare word), it is determined that the page has 0% overlap with the top-100 words.
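
The overlap calculation in this example can be sketched directly. The function names are made up, tokenizing by whitespace is a simplification, and the short `common` set below is a toy stand-in for the real list of the 100 most frequent words.

```python
from collections import Counter


def top_words(corpus_texts, k=100):
    """The k most frequent words across a corpus of page texts."""
    counts = Counter(w for text in corpus_texts for w in text.lower().split())
    return {word for word, _ in counts.most_common(k)}


def stop_word_fraction(page_text, common_words):
    """Fraction of the page's words drawn from the common-word set."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    return sum(w in common_words for w in words) / len(words)


# Toy stand-in for "the 100 most frequent English words".
common = {"the", "a", "from", "of", "and", "to", "in"}
print(stop_word_fraction("the echidna walked from a log to the river", common))
print(stop_word_fraction("echidna " * 100, common))
```

The second page reproduces the echidna case from the text: none of its words are in the common set, so its overlap is zero, while the ordinary sentence draws more than half of its words from that set.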

8. Pages are also reviewed for the existence of commonly occurring sequences of consecutive words (n-grams), their position within a document, and commonly occurring words that may appear after those sequences. Probabilities of those are calculated from documents on the web, and thresholds are defined which could be used to determine whether or not a page should be identified as web spam.