Link Exchange: A Case Study Part 2

May 8th, 2009

In case you missed it, this is a multi-part series about Link Exchange. You can get up to speed on the details and setup here: Link Exchange

For the rest of you, let’s dive into the numbers. This is now the 4th day of use. I am still running without a link throttle and averaging 16,000-18,000 link views per hour. Let’s take a look at the screenshot:

[Image: letraffic2]

I have of course removed the link URLs and the keywords I’m targeting. We can see a tremendous number of Google link views, which really seems to be translating into increased indexing and higher rankings. The domain is now pretty much 100 percent indexed, and as the following image will show, the traffic has really increased. Rankings for nearly all of my target keywords are now showing up on the first page of Google, with several in the number 1 position.

Now, we all know Google is important, but it is hardly the only game in town. Let’s take a look at Yahoo and MSN/Live. Yahoo at the start of this was showing 0 pages indexed. We simply didn’t show up. As of today, we have 71 pages indexed and show 506 incoming links. Quite impressive, and even more impressive is the fact that we now rank in Yahoo for two of our most important target keyword phrases. In fact, today we received our first human traffic from Yahoo for our main target phrase. I expected Yahoo to take several weeks, but the sheer linking power of LE seems to be reducing that expected lag. As for MSN, it was the same starting scenario, with 0 pages indexed just last week. Today, we see 29 indexed pages. Definitely an impressive start for only being 4 days in.

Next we will take a look at the most important data. The analytics:

[Image: traffic2]

As we can see from the graph, it’s quite the increase over our starting point. Recall that we were averaging 0-2 searches per day. Yesterday we had 98 incoming hits from search engines, and in fact we had several new members register as a result. 

That’s about all for today. Expect another update on Monday as we follow the progress :)

Does Domain Age Influence Ranking?

May 7th, 2009

The order that pages appear in the results of a search at a search engine may be influenced by the number of pages that link to that page, and by rankings of the pages that link to that page.

When a site is linked to by a popular and trusted domain, that link might provide more value (and a higher ranking) than a link from a site that is less popular and trusted.

Ages of Linking Domains

A new patent application from Microsoft adds another twist, by also ranking domains based upon the ages of domains which link to those domains.

Why?

The cost of purchasing a domain has decreased significantly in recent years, and some domain registrars have offered free domain registrations for up to thirty to sixty day trial periods.

A spammer might take advantage of an offer like that to build something known as a link farm, which is a spam technique in which spammers “purchase or otherwise obtain a large number of sites and interlink the sites together to increase the sites’ rankings by artificially increasing the number of contributing domains for some or all of the sites.”

The Microsoft patent application is:

Ranking Domains Using Domain Maturity
Invented by Janine Crumb, Krishna C Gade, Rangan Majumder, Vishnu Challam
Assigned to Microsoft
US Patent Application 20080086467
Published April 10, 2008
Filed October 10, 2006

Abstract

Ranking domains for search engines is provided herein. To rank a domain, contributing domains associated with the domain are identified. Additionally, the maturity of each of the contributing domains is determined.

A rank for the domain is then determined based at least in part on the maturity of each of the contributing domains. The domain rankings may then be used to order results for search queries.

This patent application assumes that newer domains have a “higher likelihood of being spam and/or being a part of a web farm that attempts to artificially inflate domain rankings for domains in the web farm.”

By looking at the age of domains that link to those newer domains when determining a rank for a domain, domains which have links from older domains “may be ranked higher than spam domains and/or less relevant domains.”

Maturity and Immaturity of Contributing Domains

A search engine may access and/or update domain information, such as domain registration date, domain expiration date, domain swapping date(s), and a set of linked domains, by communicating with the web servers on which those domains are hosted.

The maturity of a contributing domain may be based upon when that domain was registered or was first discovered by a search engine (if the domain information doesn’t provide a registration date).

Maturity may mean labeling a domain as either mature or immature. For example, contributing domains registered more than a year ago could be considered mature domains.

Ranking based upon the age of contributing domains could involve looking at:

1) Mature Domains only — A domain’s rank might be calculated based in part on only mature contributing domains that are associated with the domain.

2) Mature and Immature Domains — rankings might be influenced by both mature and immature domains, but the value of the rank for the immature domains might be based upon the ranks of the mature domains linking to those immature domains.

While some new domains can be spam, not all are. New domains that are popular, provide value, and gain links from older domains could be allowed to pass along the rankings from the mature domains associated with those new domains.

3) Instead of distinguishing between domains linking to a domain as either mature or immature, the age of each contributing (linking) domain might be used to determine what percentage of its ranking it passes along to a domain:

For example, in an embodiment, domains that have been registered for more than ten years may contribute 100% of their accumulated ranks to a target domain’s rank;

domains that have been registered from six to ten years may contribute 75% of their accumulated ranks to a target domain’s rank;

domains that have been registered from three to six years may contribute 50% of their accumulated ranks to a target domain’s rank;

domains that have been registered for one to three years may contribute 25% of their accumulated ranks to a target domain’s rank; and

domains that have been registered for less than one year may only contribute 10% of their accumulated ranks.
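The tiers above can be sketched in a few lines of Python. This is only an illustration of the quoted embodiment, not anything taken from the patent itself: the function names are made up, and how an age that falls exactly on a boundary (say, exactly six years) gets bucketed is my own assumption.

```python
from datetime import date


def contribution_fraction(age_years: float) -> float:
    """Share of its accumulated rank that a linking domain passes on.

    Percentages follow the embodiment quoted above; the handling of exact
    boundary ages is an assumption.
    """
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    if age_years > 10:
        return 1.00
    if age_years >= 6:
        return 0.75
    if age_years >= 3:
        return 0.50
    if age_years >= 1:
        return 0.25
    return 0.10


def age_in_years(registered: date, today: date) -> float:
    """Approximate domain age in years."""
    return (today - registered).days / 365.25


def weighted_rank(contributors, today: date) -> float:
    """Sum the age-discounted ranks of all linking domains.

    contributors: iterable of (accumulated_rank, registration_date) pairs.
    """
    return sum(
        rank * contribution_fraction(age_in_years(registered, today))
        for rank, registered in contributors
    )
```

With two linking domains of equal rank, one registered in 1995 and one registered a few months ago, the older one would contribute ten times as much under this scheme.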

Resetting Maturity for Expired or Swapped Domains

The maturity of a domain might be reset if the domain expires or if the domain is swapped.

It’s possible for spammers to buy a block of domains that have expired as well as new domains to form a Web Farm. By a search engine resetting the maturity of a domain, spammers don’t benefit from the purchase or swapping of an older domain.
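A rough sketch of what such a reset could look like, assuming the search engine keeps a registration date plus any known expiration and swap dates for a domain. The function names and the one-year cutoff (borrowed from the "more than a year" example earlier) are illustrative assumptions.

```python
from datetime import date
from typing import Iterable


def maturity_start(
    registered: date,
    expirations: Iterable[date] = (),
    swaps: Iterable[date] = (),
) -> date:
    """Date from which maturity is counted.

    Any expiration or swap resets the clock, so the latest event wins.
    """
    return max([registered, *expirations, *swaps])


def is_mature(start: date, today: date, min_days: int = 365) -> bool:
    """Treat a domain as mature once more than min_days have passed (assumed cutoff)."""
    return (today - start).days > min_days
```

Under this sketch, a domain first registered in 2000 but swapped in January 2009 would be treated as a few months old, not nine years old.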

Conclusion

The effect of a process like this might make it look like new domains are being penalized by search engines because they are new (what someone might perhaps call something like a “sandbox” effect).

If a process like this were in place, it might cause new domains that aren’t linked to by older domains to not rank highly, at least until they get some links from older domains.

Link Exchange: A case study

May 6th, 2009

I’m starting a new series here with some case studies related to software I use on a regular basis. While the majority will be Black Hat software, I’m starting off with one that can be used by both Black Hat site builders and White Hat site owners. The software in question is Link Exchange, which is part of the Simplified Search Engine Suite. As of today I am told that there are about 200 servers participating in the Link Exchange and close to 100 million links.

With that out of the way, let’s get started with a little bit of background and the setup.

I’m going to be using the software on this blog. I figure this is a fantastic starting place because the site is fairly new, and I have done absolutely nothing for link building or SEO. So, to get going I did some basic keyword research, checked out the competition for my main phrases, and ended up with a list of 70 or so keywords as a starting place. Now, before we jump into the rest, let’s take a look at the search engine traffic for the month of April for this blog. I’m going to focus only on search engine traffic rather than overall traffic so referring sites, subscribers, and direct traffic don’t taint the stats and throw the graphs off.

[Image: April Traffic - Blackhat360 Black hat seo blog]

As you can clearly see, we’re starting off with a clean slate. The search engine traffic for this place is basically zero. I checked the SERPs for all 72 of my targeted keywords/key phrases and we didn’t rank in the top 100 for any of them. I’m also keeping tabs on the total number of indexed pages for the domain, which as of the start date was 316 pages.

So, let’s fire up Link Exchange and see what we can do. The software is web based, so you do need your own dedicated server to run it. I happen to have several, including the one that hosts this blog, so I installed it here. The system is points based, so you need to display links somewhere in order to receive points, which in turn can be spent on incoming links. The cool part is that the sites that display the links and the ones you receive links with are completely independent. In my case I’m using some forums I run to gain link points, then spending those points on this blog. Make sense?

The interface is very easy to use, as you will be able to see in my screen captures. I started by logging in and clicking the import links button. As you can see in the picture, you are presented with an easy to use form where you simply insert your anchor text and URL. There is also an option to flag the link as a white hat link or a black hat link. Even though the blog is about black hat, the blog itself is white hat in nature, so I selected that as the option. The developers run through the links every few hours to make sure no one is trying to sneak a black hat link into the white hat links and so forth. You also have the option of uploading a CSV with all of your links. This comes in handy when you have a large number of links to submit at one time. I didn’t, so the quick import works fine.

[Image: quick import - Link Exchange]

Now that we have some links in the system, we need to configure our sharing options. You can turn linking on and off at any time through the interface, but the real power comes from the advanced options. Here you choose which search engines are actually allowed to view the links. Did I mention that these links are cloaked and only viewable by search engines? Very handy for keeping people from being able to simply use the system to find other users of the system. In my case I want all search engines to view the links, so I leave the boxes unchecked.

Next up are the link throttle options. These allow you to determine how often your pages or domains receive links. This can be especially handy for black hat sites where slower link building may be more desirable. In my case I’m going to go with unlimited links for now to see what this system can actually deliver volume wise.

The last set of options are all related to the anchor text. This one is interesting. You can actually have the system vary your anchor text based on several criteria so your incoming links are more varied and natural looking. Say you have a 5 word phrase you are targeting, for example. This allows you to randomly drop words from the beginning of the phrase, the end of the phrase, or both. You can also manually set how often this should take place based on a percentage of the overall link views.

[Image: Link Settings - Link Exchange]
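To make the anchor text option concrete, here is a small Python sketch of the general idea. I have no view into how Link Exchange actually implements it, so everything here is an assumption: the function name, the `drop_rate` parameter standing in for the percentage setting, and the choice to drop at most one word from each side.

```python
import random


def vary_anchor(phrase, drop_rate=0.3, rng=None):
    """Return the phrase, sometimes with a word dropped from the start, end, or both.

    drop_rate is the assumed fraction of link views that get a varied anchor.
    """
    if rng is None:
        rng = random.Random()
    words = phrase.split()
    # Leave single-word phrases alone, and leave most views untouched.
    if len(words) < 2 or rng.random() >= drop_rate:
        return phrase
    mode = rng.choice(["start", "end", "both"])
    start, end = 0, len(words)
    if mode in ("start", "both"):
        start += 1
    # Never trim the phrase down to nothing.
    if mode in ("end", "both") and end - start > 1:
        end -= 1
    return " ".join(words[start:end])
```

Called repeatedly on a 5 word phrase, this returns the full phrase most of the time and a 3 or 4 word slice of it the rest of the time.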

I took care of all of this Monday morning. Now, nearly 48 hours later, how has it done? I must say I didn’t expect results this quickly. 48 hours in, my indexed pages have risen from 316 to 415 as of this writing. I also noticed that all of my pages have been recrawled and recached, which is nice because I made some changes to the page titles a couple weeks ago for better SEO. So, that right there accounts for a 31 percent increase in the number of indexed pages. More important than that, however, are the results of the keyword ranking. I am now in the top 100 for nearly all of my keywords, with the majority on the first or second page of Google. My main key phrase that I’m targeting is now actually sitting at number 1. It was number 5 yesterday, so it moved up another 4 spots. Now, due to an error on my part, I don’t have full analytics traffic data for yesterday, so the following image accounts for yesterday evening and part of today. Remember that just a couple days ago, this site was hovering around 0-2 search engine hits a day.

[Image: may traffic 1 Link Exchange]

Not bad, 15 hits in less than 24 hours. That’s quite the increase over the past days. It’s exciting to see results so quickly. The coming days/weeks should be quite interesting.

Earlier I talked about the link throttle a bit. I wanted to see what sort of volume the system could deliver. Let’s take a look at that graph in the built-in Link Exchange Analytics system:

[Image: leanalytics1]

I blanked out the keywords I’m targeting. I have to keep some info to myself ;). As we can see, Google saw my links 166,000 times in a 24 hour period. That’s very impressive and shows that the system can clearly deliver on volume if the link throttle is turned off. The graph shows the number of link views per hour. In the keyword table we can see which anchor text was shown and how often, and we can also see for which domains the majority of the links were shown.

That’s about it for today. Follow me again tomorrow for a followup as I continue to use the system.

Do Search Engines Love Blogs?

May 5th, 2009

Microsoft Explores an Algorithm to Increase PageRank for Pages Linked to by Blogs.

In the new patent document, they ask if the rankings of web pages in search results would be improved by providing a slight increase in the PageRank of pages linked to by blogs. They tell us that:

This idea is based on the assumption (or hope) that blogs are still mostly human-authored, and that links from blogs generally represent sincere endorsements on the part of the authors.

 

The December post explored how a search engine might be able to identify blog pages and distinguish them from non-blog pages, and told us that:

Search engines are increasingly implementing features that restrict the results for queries to be from blog pages.

But limiting the number of blogs that show up in search results doesn’t necessarily mean that a search engine doesn’t like blogs. It may mean that search engines would prefer to show a diversified set of search results, including blog pages and other results.

Ranking Algorithms

Search engines often look at a couple of different kinds of ranking factors when determining the order in which search results are shown to searchers.

Query-Independent and Query-Dependent

One way to classify ranking algorithms is query-dependent (or dynamic) or query-independent (or static).

Query-dependent ranking algorithms rely upon the query terms someone uses to rank pages, while query-independent look at other factors such as how important they may believe a page to be based upon things such as whether or not important pages link to that page (an example of a query-independent ranking algorithm would be PageRank).

Query-independent ranking algorithms assign a quality score to each document on the web, and can be run ahead of time. Query-dependent ranking algorithms depend upon the query used, and have to be run when a user submits a query.

Content, Usage, and Link Based Ranking Algorithms

It’s also possible to classify ranking algorithms as content-based, usage-based, and link-based.

Content-based ranking algorithms - use the words in a document to rank the document among other documents. For instance, a higher score might be assigned to a document that contains the query terms at the beginning of a document, in a prominent font, or in a certain kind of HTML element.

Usage-based ranking algorithms - may assign a score based on estimates of how often documents are viewed, drawn from web proxy logs or from click-throughs on search engine results pages.

Link-based ranking algorithms - look at the hyperlinks between web pages to rank those pages, assigning a score to pages based upon the links pointing to them, with each link treated as an endorsement of the page.

PageRank - an example of a query-independent link-based ranking algorithm.

The PageRank formula is often explained as follows. Consider a web surfer who is performing a random walk on the web. At every step along the walk, the surfer moves from one web page to another, using the following algorithm.

With some probability d, the surfer selects a web page uniformly at random and jumps to it; otherwise, the surfer selects one of the outgoing hyperlinks in the current page uniformly at random and follows it. Because of this metaphor, the number d is sometimes called the “jump probability,” namely the probability that the surfer will jump to a completely random page.

If the web surfer jumps with probability d and there are |V| web pages, the probability of jumping to a particular page is d/|V|. Since any page can be reached by jumping, every page is guaranteed a score of at least d/|V|. The PageRank of a particular web page is then the fraction of time that the random surfer will spend at that page.
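The random surfer description translates fairly directly into code. The following is a minimal sketch in plain Python, not the formulation from the patent: the function name and the toy link graph are made up, and letting the surfer jump uniformly when a page has no outgoing links is a common convention that I am assuming here.

```python
def pagerank(links, d=0.15, iterations=50):
    """Estimate PageRank by repeatedly stepping the random surfer model.

    links: dict mapping each page to the list of pages it links to.
    d: the jump probability described above.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives d/|V| from random jumps.
        new_rank = {p: d / n for p in pages}
        for page in pages:
            outgoing = links.get(page, [])
            if outgoing:
                # Otherwise the surfer follows one outgoing link uniformly at random.
                share = (1.0 - d) * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumption: from a page with no links, the surfer jumps uniformly.
                share = (1.0 - d) * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "c" is linked to by both "a" and "b".
example = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

The scores sum to 1, no page falls below d/|V|, and in the toy graph "c" ends up ahead of "b" because it collects links from two pages instead of one.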

But what if that surfer started favoring pages that were linked to by blogs a little more?

Splitting PageRank

One of the problems behind using PageRank is that some commercial web sites try to inflate PageRank by creating links that point to a page solely for the purpose of endorsing that page, artificially increasing the value of the page.

This patent filing describes in some detail how a portion of PageRank from a page might be split (or distributed) equally amongst the links found on the pages of a site, and how the distribution of PageRank could be slightly altered to favor (or show a bias towards) pages that are linked to by blogs.

If blogs are, as the authors note in the patent, “still mostly human authored, and generally represent sincere endorsements of their authors,” then this bias might help counteract the artificial inflation of PageRank scores by people who would create links pointing to pages solely for the purpose of artificially increasing the PageRank of pages.

The patent filing is:

Ranking Method using Hyperlinks in Blogs
Inventors: Steve Chien and Dennis Fetterly
Assigned to Microsoft
US Patent Application 20080243812
Published October 2, 2008
Filed March 30, 2007

Abstract

A method for static ranking of web documents is disclosed. Search engines are typically configured such that search results having a higher PageRank® score are listed first. A modified scoring technique is provided whereby the score includes a reset vector that is biased toward web pages linked to blogs. This requires identifying web pages as either blogs or non-blogs.
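One way to picture a biased reset vector: instead of jumping to every page with equal probability, the random surfer jumps a little more often to pages that blogs link to. The sketch below is my own illustration of that idea, not the patent's method. The `boost` factor, the function names, and the toy graph are all assumptions.

```python
def blog_biased_reset(links, blogs, boost=2.0):
    """Build jump weights that favor pages linked to by known blogs.

    boost is a hypothetical weight; the abstract gives no number.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    blog_linked = {t for blog in blogs for t in links.get(blog, [])}
    return {p: (boost if p in blog_linked else 1.0) for p in pages}


def biased_pagerank(links, reset, d=0.15, iterations=50):
    """PageRank where random jumps follow the reset weights instead of being uniform."""
    pages = sorted(
        set(links) | {t for targets in links.values() for t in targets} | set(reset)
    )
    total = sum(reset.get(p, 0.0) for p in pages)
    if total <= 0:
        raise ValueError("reset weights must sum to a positive number")
    jump = {p: reset.get(p, 0.0) / total for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: d * jump[p] for p in pages}
        for page in pages:
            follow = (1.0 - d) * rank[page]
            outgoing = links.get(page, [])
            if outgoing:
                for target in outgoing:
                    new_rank[target] += follow / len(outgoing)
            else:
                # Assumption: pages with no links hand their score to the reset vector.
                for target in pages:
                    new_rank[target] += follow * jump[target]
        rank = new_rank
    return rank


links = {"blog": ["x"], "x": ["blog"], "shop": ["y"], "y": ["shop"]}
biased = biased_pagerank(links, blog_biased_reset(links, blogs={"blog"}))
```

In the toy graph, "x" and "y" are in identical positions except that "x" is linked to by a blog, so with a uniform reset they tie and with the biased reset "x" pulls ahead.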

Identifying Blogs

Some of the kinds of things that a search engine crawling program might look at when deciding whether a page is from a blog might include:

  1. Whether a page is hosted in a known blog hosting DNS domain such as blogspot or wordpress.com
  2. What features are contained in the non-HTML markup words and phrases contained in the page
  3. What the targets of outgoing links might be in the page, and
  4. Whether the string “blog” occurs in the URL
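The first and last signals in that list are simple enough to sketch with the Python standard library. This is only an illustration: the host list is a two-entry stand-in for whatever list a search engine would actually maintain, the function name is made up, and the content-based and link-target signals are left out.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny list of blog hosting domains.
KNOWN_BLOG_HOSTS = ("blogspot.com", "wordpress.com")


def blog_signals(url):
    """Return two URL-based hints that a page may belong to a blog."""
    host = (urlparse(url).hostname or "").lower()
    return {
        # Signal 1: hosted on (or under) a known blog hosting domain.
        "known_blog_host": any(
            host == h or host.endswith("." + h) for h in KNOWN_BLOG_HOSTS
        ),
        # Signal 4: the string "blog" appears anywhere in the URL.
        "blog_in_url": "blog" in url.lower(),
    }
```

A crawler would presumably combine hints like these with the content and link signals before labeling a page, since any single one of them is easy to trip by accident.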

Experimenting with a Bias Towards Pages Linked to by Blogs

The authors of this patent performed experiments where they downloaded over 472 million pages, and found links to an additional 6 billion pages within those pages.

They reranked the PageRank of these pages using a bias towards pages that they identified were linked to by blogs, with a preference towards using blog pages that had higher PageRanks, which they tell us tend to be “frequently updated, more informational rather than personal, and free of spam.”

They also tell us that some other characteristics of blogs may prove useful in refining this technique, such as looking at the number of subscribers to a particular blog, and associating a higher endorsement value to blogs with greater numbers of subscribers.

Conclusion

Is sending more PageRank to pages that are linked to by blogs something that will increase the relevance and importance of pages that show up in search results? Are links to pages from blogs still actual endorsements from the authors of those blogs?

Do search engines love blogs?

Search Engine Series: Indications of Web Spam

May 4th, 2009

A patent application from Microsoft looks at content generated to spam search engines. Here’s the problem, as noted in the patent filing:

In the best case, search engine optimizers help web site designers generate content that is well-structured, topical, and rich in relevant keywords or query terms. Unfortunately, some search engine optimizers go well beyond producing relevant pages: they try to boost the ratings of a web site by loading pages with a wide variety of popular query terms, whether relevant or not. In fact, some SEOs go one step further: Instead of manually creating pages that include unrelated but popular query terms, they machine-generate many such pages, each of which contains some monetizable keywords (i.e., keywords that have a high advertising value, such as the name of a pharmaceutical, credit cards, mortgages, etc.). Many small endorsements from these machine-generated pages result in a sizable page rank for the target page. In a further escalation, SEOs have started to set up DNS servers that will resolve any host name within their domain, and typically map it to a single IP address.

Most if not all of the SEO-generated pages exist solely to mislead a search engine into directing traffic towards the “optimized” site; in other words, the SEO-generated pages are intended only for the search engine, and are completely useless to human visitors.

I recognized this quote, which is taken from an interesting research paper from Microsoft, Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages. If you are interested in how search engines are attempting to fight web spam, it’s a “must read” paper.

 

It appears that this patent is an attempt to take some of the research reported upon in that paper, and define a way to use it in a process that can help the search engine fight web spam. But, it isn’t a rehashing of that paper, and it covers some new territory. Definitely worth a look, especially if you are concerned that your pages may be mistaken for spam by the search engines.

Using content analysis to detect spam web pages
Inventors: Marc Alexander Najork, Dennis Craig Fetterly, Mark Steven Manasse, and Alexandros Ntoulas
Assigned to Microsoft
US Patent Application 20060184500
Published August 17, 2006
Filed: February 11, 2005

Abstract

Evaluating content includes receiving content, analyzing the content for web spam using a content-based identification technique, and classifying the content according to the analysis. An index of analyzed contents may be created. A system for evaluating content includes a storage device configured to store data and a processor configured to analyze content using content-based identification techniques to determine whether web spam is present.

The patent describes some measures that the authors may be looking at when viewing the content of a page to determine whether or not the page is intended only to spam a search engine. The authors note that other steps and other metrics may also be involved.

Classification of Content

Metrics about pages are collected and fed into a classifier program which uses weighted scores to distinguish good pages from bad ones. The classifier program starts with an initial data set, called the training set, which is divided into positive and negative examples. The classifier looks at all of the features of the positive and negative examples in combination, in an attempt to separate the positive examples (non-spam) from the negative examples (spam).

Using a classifier like this may mean that once the dividing line is made, additional data may be looked at to see if it can be used to distinguish good pages from bad ones. We know from the “Spam, Damn Spam, and Statistics” paper that Microsoft is also looking at other features of pages and sites.

According to the patent filing, some classes of spam web pages can be detected by analyzing the content of the page and looking for “unusual” properties, such as:

  • The page contains unusually many words,
  • The page contains unusually many words within a title HTML element (<title>here!</title>)
  • The ratio of HTML markup to visible text is low,
  • The page contains an unusually large number of very long or very short words,
  • The page contains repetitive content,
  • The page contains unusually few common words (“stop words”), or
  • The page contains a larger-than-expected number of popular n-grams (sequences of n words)

These metrics or filters can be input into a classifier for deciding whether or not a page is spam or determining the likelihood or probability that the page is spam, by comparing the outputs of one or more of the metrics, alone or in combination, to one or more thresholds.
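As a rough illustration of how a few of these metrics might be collected and compared against thresholds, here is a sketch using only the Python standard library. None of the names or threshold values come from the patent; they are placeholders, and a real classifier would learn its cutoffs from a training set.

```python
import re
from html.parser import HTMLParser


class _PageText(HTMLParser):
    """Collect title words and visible body words from an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.skip_depth = 0  # inside <script> or <style>
        self.title_words = []
        self.visible_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        words = re.findall(r"[A-Za-z0-9']+", data)
        if self.in_title:
            self.title_words.extend(words)
        elif not self.skip_depth:
            self.visible_words.extend(words)


def page_metrics(html):
    """Compute a few of the content metrics listed above."""
    parser = _PageText()
    parser.feed(html)
    parser.close()
    words = parser.visible_words
    chars = sum(len(w) for w in words)
    return {
        "word_count": len(words),
        "title_word_count": len(parser.title_words),
        "avg_word_length": chars / len(words) if words else 0.0,
    }


def exceeded(metrics, thresholds):
    """Names of metrics above their (hypothetical) thresholds."""
    return [name for name, limit in thresholds.items() if metrics[name] > limit]
```

For a page whose title holds four keywords and whose body says "Hello world", `exceeded(page_metrics(html), {"title_word_count": 3, "word_count": 1000})` would flag only the title metric.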

The patent mentions an example reference book which describes the existing body of work in machine learning: Pattern Classification (my link doesn’t go to the book itself, but rather to a page from one of the authors, which has a great series of powerpoint slides about material in the book).

Identifying Web Spam on the Fly

The patent describes methods for finding spam pages during web crawls and/or evaluating content on the fly.

Here’s a summary of the process for identifying spam through content, on the fly, from the patent application:

  1. Search engine receives user input to begin a particular query,
  2. Search engine performs the query,
  3. Search engine receives the query results,
  4. Search engine (or processor or classifier, for example) evaluates the results using various metrics,
  5. After evaluation the search engine analyzes the evaluations to determine what contents are likely web spam.
  6. From that analysis, the search engine may identify web pages as web spam and may record or store the contents in an index for future queries,
  7. Query results are then output to the searcher.
  8. Detected web spam could be excluded from a search engine index, given a low search ranking, or treated in a manner so that user queries are not affected or populated with web spam, which could lead to more relevant search results, or at least the omission of some irrelevant results.
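The steps above boil down to a small filtering loop. The sketch below is a simplification under several assumptions: `search` and `spam_score` are stand-ins supplied by the caller, the 0.5 threshold is arbitrary, and flagged results are simply dropped, although the patent also mentions demoting them instead.

```python
def answer_query(query, search, spam_score, spam_index, threshold=0.5):
    """Run a query, flag likely spam among the results, and return the rest.

    search: callable returning a ranked list of (url, content) pairs.
    spam_score: callable mapping page content to a score in [0, 1].
    spam_index: set of URLs already identified as spam; updated in place.
    """
    results = search(query)  # steps 1-3: receive and perform the query
    clean = []
    for url, content in results:
        # Steps 4-5: evaluate each result, reusing earlier verdicts where known.
        if url in spam_index or spam_score(content) >= threshold:
            spam_index.add(url)  # step 6: record it for future queries
        else:
            clean.append(url)
    return clean  # step 7: output to the searcher
```

Because the spam index persists between calls, a page flagged once is skipped on later queries without being scored again.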

Indications of Web Spam?

The list above of some “unusual properties” that may be looked for is examined in greater detail within the patent application. The following are paraphrases of some of those and some additional metrics. I’d recommend looking at the patent for their more detailed treatment of these. Keep in mind that many are just one factor to be looked at in conjunction with the others before a determination is made that a page is intended to spam a search engine.

1. As the number of words on a page increases, the probability of spam being present on that page increases.

2. As the number of words in the title of a web page increases, the probability of web spam being present dramatically increases.

3. As the visible content of the page increases, the probability of web spam being present increases to a point and then decreases dramatically.

4. As the fraction of anchor words increases (as a percentage of all the words on a page), the probability of web spam increases.

5. Web spam is more likely to occur in web pages having very long or very short words, so an average word length metric can be used to identify spam pages.

6. As the zipRatio of a page increases beyond a threshold, the probability of web spam being present on a web page increases dramatically. A zip ratio is calculated by dividing the size (in bytes) of uncompressed visible text (such as text other than HTML markup) by the size (in bytes) of compressed visible text.

7. As the percentage (and distribution) of stop words (the most commonly used words in a search engine corpus) used on a page decreases, the probability of web spam increases.

For example, the 100 most common words in a very large corpus representative of the English language is determined, e.g., by examining all the English web pages downloaded by the crawler (the same applies to other languages as well). It is then determined what fraction of the words on a single web page is drawn from the 100 most frequent words in the entire corpus. For example, words like “the”, “a”, “from”, etc. are among the 100 most frequent English words. If a web page had no occurrences of any of these words, but 100 occurrences of “echidna” (a spiny anteater and a rare word), it is determined that the page has 0% overlap with the top-100 words.

8. Pages are also reviewed for the existence of commonly occurring sequences of consecutive words (n-grams), their position within a document, and commonly occurring words that may appear after those sequences. Probabilities of those are calculated from documents on the web, and thresholds are defined which could be used to determine whether or not a page should be identified as web spam.
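Two of the metrics above, the zip ratio (item 6) and the stop word overlap (item 7), are easy to try out. Here is a minimal Python sketch. Using `zlib` as the compressor is my assumption, since the summary above does not say which compression scheme is used, and the three-word "common words" set stands in for the real top-100 list.

```python
import re
import zlib


def zip_ratio(visible_text):
    """Uncompressed size divided by compressed size of the visible text."""
    raw = visible_text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def stop_word_fraction(visible_text, common_words):
    """Fraction of words on the page drawn from the most common corpus words."""
    words = re.findall(r"[a-z']+", visible_text.lower())
    if not words:
        return 0.0
    return sum(word in common_words for word in words) / len(words)


COMMON = {"the", "a", "from"}  # stand-in for the 100 most frequent words
repetitive = "echidna " * 100
normal = "The quick brown fox jumps over the lazy dog"
```

The repetitive page compresses far better than the ordinary sentence, so its zip ratio is much higher, and just as in the patent's echidna example, its overlap with the common words is 0%.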

What is Black Hat SEO?

April 30th, 2009

This is a question that has crossed all of our minds at one point or another. Stumble into a Black Hat discussion on a general internet marketing forum and you are likely left with the impression that we are evil scum of the Earth, out to destroy the very fabric of society while murdering baby kittens.

The reality however, is far from that. The majority of Black Hat SEO employs the EXACT same strategies and techniques used by any good White Hat marketer, but with a twist. We’re lazy….. So lazy in fact that we like everything to be completely automated. Why go out and hand pick backlinks when you can write a script to do it for you? Why spend weeks perfecting a single page of ad copy when you can write a program to create thousands of variations in a few minutes? 

It is inevitable that someone brings up cookie stuffing or cross site scripting. Sure, there are black hats that choose to partake in such activities, but that doesn’t make them black hat activities. You will find that the majority of us still have a solid moral compass and don’t cross certain lines.

I suppose my main point here is to point out how silly the distinction between black and white really is. It isn’t as clear cut as the titles imply. Hopefully that leaves you with a little bit to think about.

Scraping with PHP and DOM

April 24th, 2009

I have added a new article outlining the basic process of content scraping using PHP and DOM. You can read the full article and view the sample code here:

Scraping with PHP and DOM


How does a search engine decide which duplicate to show in search results?

April 24th, 2009

Let’s start with a question we have all thought about at one point or another, a question that our past two days’ articles have been leading up to.

“How does a search engine decide which duplicate to show in search results, and which ones not to show?”

How do they choose? Pagerank? First one published? Shortest url? Article with the most links?

It doesn’t seem to be any one signal. It’s not pagerank alone, or distance from root directory. It’s probably not the first one published, because many sites are dynamic, and the time stamp on the original may be later than on the copy, and the first copy spidered might be the one the search engines think is the oldest. It doesn’t appear to be perceived authority. It could have something to do with the number and quality of inbound and outbound links from a page. It could be a mix of all of those things and others.

So what is it then? Let’s dive into some research papers and find out!

Collapsing Equivalent Results

Thanks, Microsoft.

A new patent application published by Microsoft discusses some of the signals that may be used to determine which results to show, and which to filter, at least possibly in Windows Live Search.

It may not include all of the signals being looked at - some of those might be trade secrets.

The practices at Google and Yahoo and Ask.com may be different.

But, all of the major search engines are striving to create good user experiences for people who search using their services. And all of them want to avoid duplicate results filling up the early spots on search result pages. The patent application does provide some insight into what search engines consider in choosing which pages to show, and which to hide.

I was surprised by a couple of the factors, and by the appearance of something I believe I’ve seen Matt Cutts refer to as “Pretty URLs.”

System and method for optimizing search results through equivalent results collapsing
Invented by Brett D. Brewer
Assigned to Microsoft
US Patent Application 20060248066
Published November 2, 2006
Filed: April 28, 2005

Abstract

A system and method are provided for optimizing a set of search results typically produced in response to a query. The method may include detecting whether two or more results access equivalent content and selecting a single user-preferred result from the two or more results that access equivalent content. The method may additionally include creating a set of search results for display to a user, the set of search results including the single user-preferred result and excluding any other result that accesses the equivalent content. The system may include a duplication detection mechanism for detecting any results that access equivalent content and a user-preferred result selection mechanism for selecting one of the results that accesses the equivalent content as a user-preferred result.

The Duplicate Content Problem

1. A search engine finds documents that match queries and assigns them scores to determine the order in which they should be displayed.

2. Pages that may be very relevant as results may also be duplicates, or near duplicates, of each other.

3. Example: www.ymca.net and www.ymca.net/index.jsp lead to the same content with the first URL redirecting to the second one. And, www.ymca.com and www.ymca.com/index.jsp could be mirrors of www.ymca.net.

4. A search engine might include all four results in the top ten results of a search for the query “ymca”.

5. This is a bad user experience, because it keeps the searcher from seeing other results that might also be relevant, on the first page of results.

Choosing One Result

The system described would include:

* A crawler that visits web pages, and indexes and stores results in an index/storage system.

* Ranking components that may rank located results in response to searchers’ queries.

* Results storage components which may have a cache for recently stored results and an index system for storage of additional results.

* A duplication detection mechanism which would detect results having duplicate content. A technique for detecting duplicates referenced in the patent application involves using “shingleprints” as described in another Microsoft U.S. patent application, Method for duplicate detection and suppression.

* A result selection module decides which result to display to searchers, regardless of whether shingleprints or other methods are used to determine which are duplicates.
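The patent application doesn’t spell out how “shingleprints” are computed, so here is only a rough sketch of the general idea behind shingle-based duplicate detection: break each document into overlapping word windows (shingles) and compare the two sets. The function names, the window size of 3, and the 0.8 threshold are my own assumptions, not anything taken from the Microsoft filing.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(doc_a, doc_b, w=3, threshold=0.8):
    """Treat two documents as near duplicates when their shingle sets overlap enough.

    The threshold is an assumed value chosen only for illustration.
    """
    return jaccard(shingles(doc_a, w), shingles(doc_b, w)) >= threshold
```

Two copies of the same page score 1.0, while unrelated pages share almost no shingles and score close to 0. A production system would hash and sample the shingles rather than compare full sets.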

Result Selection Module

Some parts which may be included in the result selection module:

  • A query independent ranking component (something like pagerank, a page quality score, or some combination of such measures),
  • A result analysis component,
  • A navigation model selection mechanism,
  • A click through rate determination component,
  • A user-preferred result selection mechanism, and
  • Result storage.

Upon finding that results are duplicates, or very near duplicates, those results would be placed in Result Storage, but the search engine would not display them all.

The Result Selection Module would determine (through the result analysis component) which was the “user preferred selection” (via the user-preferred result selection mechanism) to show in response to the query.

A different URL might be chosen as the URL that the search engine actually uses to navigate to the page (chosen via the navigation model selection mechanism).

Some Factors the Results Analysis Component Might Consider

* Extension - .com might be a better choice than .net - it “appeals” to users because they understand it

* Shorter URLs - In the YMCA example above, the user-preferred version of the URL may be www.ymca.com both because “.com” is more common than “.net” and because the www.ymca.com URL is shorter than the two “index.jsp” results.

* The Navigational Model Selection might choose a different URL - while the searcher is shown www.ymca.com, the link might actually go to www.ymca.com/index.jsp, which is selected by the navigation model selection mechanism and is stored in the result storage area, in order to save the user a redirect. Eliminating redirects leads to the fastest result.

* The URL might contain keywords that appear in the query. In that case, the URL acts as a document summary. So, www.sfgiants.com might be a better choice than www.mlb.com/sf/id1223/xyx.com when the query is “sf giants”.

* Searcher Location or language - A different duplicate might be chosen based upon where the person searching is from. So a London-based searcher might see www.example.co.uk where a New York searcher would get www.example.com

* Popularity - how well the page is linked to by other sites, which might be determined by the query independent ranking component.

* Click through rates might be tested, and the version of the URL with the highest rate may be identified by the click through rate determination component, acting upon the assumption that high click-through rates indicate that users find the result satisfactory.

* Fewest redirects - as determined by the navigation model.

The user-preferred result selection mechanism uses input from the query independent ranking component, the result analysis component, and the click through determination component to select a user-preferred result. (That sounds much better than the technical term I’ve seen Matt Cutts use regarding displayed URLs in results in the context of redirects - the “prettiest URL.”)
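To make the display URL versus navigation URL idea a bit more concrete, here is a toy sketch using the YMCA example. Everything about the scoring is an assumption on my part: the patent application lists the factors but gives no weights, so the numbers below (a bonus for .com, a small penalty per character, a bonus per query word in the URL) are invented for illustration, as are the redirect counts.

```python
from urllib.parse import urlparse


def display_score(url, query):
    """Heuristic 'prettiness' score for showing a URL to the searcher.

    All weights are made up for this example.
    """
    host = urlparse("http://" + url).netloc.lower()
    score = 0.0
    if host.endswith(".com"):
        score += 1.0                      # familiar extension
    score -= 0.01 * len(url)              # shorter URLs score higher
    # Reward query words that appear in the URL.
    score += 0.5 * sum(word in url.lower() for word in query.lower().split())
    return score


def pick_display_url(candidates, query):
    """Choose the URL to show in the results page."""
    return max(candidates, key=lambda url: display_score(url, query))


def pick_navigation_url(redirects, query):
    """Choose the URL to link to: fewest redirects first, then best display score."""
    return min(redirects, key=lambda url: (redirects[url], -display_score(url, query)))


# Assumed redirect counts: the bare hostnames redirect once to index.jsp.
redirects = {
    "www.ymca.net": 1,
    "www.ymca.net/index.jsp": 0,
    "www.ymca.com": 1,
    "www.ymca.com/index.jsp": 0,
}
shown = pick_display_url(redirects, "ymca")
linked = pick_navigation_url(redirects, "ymca")
```

With these assumed weights the searcher sees www.ymca.com while the link goes straight to www.ymca.com/index.jsp, which matches the behavior described above. Signals such as click through rates and searcher location are left out of the sketch.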

Conclusion

So, something like pagerank does matter when it comes to filtering equivalent results, as do searcher location, clickthrough rates, number of redirects, words used in URLs, length of URL, choice of tld, and possibly other signals.

The other interesting thing here is that a search engine may display one URL for searchers, and use a different one for navigation - Pretty URLs for people, and more direct URLs to navigate to the page.

Search Engine 101

February 25th, 2009

Search engines match queries against an index that they create. The index consists of the words in each document, plus pointers to their locations within the documents. This is called an inverted file. A search engine or IR system comprises four essential modules:

  • A document processor
  • A query processor
  • A search and matching function
  • A ranking capability

While users focus on “search,” the search and matching function is only one of the four modules. Each of these four modules may cause the expected or unexpected results that consumers get when they use a search engine.

Document Processor
The document processor prepares, processes, and inputs the documents, pages, or sites that users search against. The document processor performs some or all of the following steps:

  • Normalizes the document stream to a predefined format.
  • Breaks the document stream into desired retrievable units.
  • Isolates and metatags subdocument pieces.
  • Identifies potential indexable elements in documents.
  • Deletes stop words.
  • Stems terms.
  • Extracts index entries.
  • Computes weights.
  • Creates and updates the main inverted file against which the search engine searches in order to match queries to documents.

Steps 1-3: Preprocessing. While essential and potentially important in affecting the outcome of a search, these first three steps simply standardize the multiple formats encountered when deriving documents from various providers or handling various Web sites. The steps serve to merge all the data into a single consistent data structure that all the downstream processes can handle. The need for a well-formed, consistent format is of relative importance in direct proportion to the sophistication of later steps of document processing. Step two is important because the pointers stored in the inverted file will enable a system to retrieve various sized units — either site, page, document, section, paragraph, or sentence.

Step 4: Identify elements to index. Identifying potential indexable elements in documents dramatically affects the nature and quality of the document representation that the engine will search against. In designing the system, we must define the word “term.” Is it the alpha-numeric characters between blank spaces or punctuation? If so, what about non-compositional phrases (phrases in which the separate words do not convey the meaning of the phrase, like “skunk works” or “hot dog”), multi-word proper names, or inter-word symbols such as hyphens or apostrophes that can denote the difference between “small business men” and “small-business men”? Each search engine depends on a set of rules that its document processor must execute to determine what action is to be taken by the “tokenizer,” i.e. the software used to define a term suitable for indexing.
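As a rough illustration of how much the tokenizer’s rules matter, here is a minimal sketch of one possible rule set: a term is a run of letters or digits, and hyphens or apostrophes are kept when they sit between two such runs. The regular expression and the function name are my own choices; real engines use far more elaborate rules.

```python
import re

# A term is a run of alphanumerics, optionally joined by internal hyphens or apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase terms, keeping intra-word hyphens and apostrophes."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


plain = tokenize("small business men")       # three separate terms
hyphenated = tokenize("small-business men")  # the hyphen keeps two words together
```

Under this rule “small business men” yields three terms while “small-business men” yields two, so the two phrases end up with different index entries.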

Step 5: Deleting stop words. This step helps save system resources by eliminating from further processing, as well as potential matching, those terms that have little value in finding useful documents in response to a customer’s query. This step used to matter much more than it does now that memory has become so much cheaper and systems so much faster, but since stop words may comprise up to 40 percent of text words in a document, it still has some significance. A stop word list typically consists of those word classes known to convey little substantive meaning, such as articles (a, the), conjunctions (and, but), interjections (oh, ah), prepositions (in, over), pronouns (he, it), and forms of the “to be” verb (is, are). To delete stop words, an algorithm compares index term candidates in the documents against a stop word list and eliminates certain terms from inclusion in the index for searching.
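Stop word deletion is simple enough to sketch in a few lines. The list below contains only the example words mentioned above and is nowhere near a complete stop list; the function names are mine.

```python
# Tiny illustrative stop list built from the examples in the text.
STOP_WORDS = frozenset({"a", "the", "and", "but", "oh", "in", "over", "he", "it", "is", "are"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token not in stop_words]


def stop_word_ratio(tokens, stop_words=STOP_WORDS):
    """Fraction of tokens that are stop words (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(token in stop_words for token in tokens) / len(tokens)


tokens = ["the", "cat", "is", "in", "the", "hat"]
kept = remove_stop_words(tokens)
ratio = stop_word_ratio(tokens)
```

In this small sentence four of the six tokens are stop words, which shows why dropping them can shrink the index noticeably.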

Step 6: Term Stemming. Stemming removes word suffixes, perhaps recursively in layer after layer of processing. The process has two goals. In terms of efficiency, stemming reduces the number of unique words in the index, which in turn reduces the storage space required for the index and speeds up the search process. In terms of effectiveness, stemming improves recall by reducing all forms of the word to a base or stemmed form. For example, if a user asks for analyze, they may also want documents which contain analysis, analyzing, analyzer, analyzes, and analyzed. Therefore, the document processor stems document terms to analy- so that documents which include various forms of analy- will have equal likelihood of being retrieved; this would not occur if the engine only indexed variant forms separately and required the user to enter all. Of course, stemming does have a downside. It may negatively affect precision in that all forms of a stem will match, when, in fact, a successful query for the user would have come from matching only the word form actually used in the query.

Systems may implement either a strong stemming algorithm or a weak stemming algorithm. A strong stemming algorithm will strip off both inflectional suffixes (-s, -es, -ed) and derivational suffixes (-able, -aciousness, -ability), while a weak stemming algorithm will strip off only the inflectional suffixes (-s, -es, -ed).
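The difference between weak and strong stemming can be sketched with plain suffix stripping. This is a deliberately naive illustration using only the suffixes listed above; it is not the Porter stemmer or any other published algorithm, and the minimum stem length of 3 is an assumption to avoid mangling very short words.

```python
INFLECTIONAL = ("es", "ed", "s")
DERIVATIONAL = ("aciousness", "ability", "able")


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, if at least min_stem characters remain."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def weak_stem(word):
    """Weak stemming: strip inflectional suffixes only."""
    return strip_suffix(word, INFLECTIONAL)


def strong_stem(word):
    """Strong stemming: strip inflectional, then derivational suffixes."""
    return strip_suffix(weak_stem(word), DERIVATIONAL)
```

Here weak_stem leaves “readable” alone while strong_stem reduces it to “read”, and both “analyzes” and “analyzed” collapse to the same stem. Naive stripping also makes mistakes (it would turn “class” into “clas”), which is exactly the precision risk described in Step 6.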

Step 7: Extract index entries. Having completed steps 1 through 6, the document processor extracts the remaining entries from the original document. For example, the following paragraph shows the full text sent to a search engine for processing:

Milosevic’s comments, carried by the official news agency Tanjug, cast doubt over the governments at the talks, which the international community has called to try to prevent an all-out war in the Serbian province. “President Milosevic said it was well known that Serbia and Yugoslavia were firmly committed to resolving problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with the participation of the representatives of all ethnic communities,” Tanjug said. Milosevic was speaking during a meeting with British Foreign Secretary Robin Cook, who delivered an ultimatum to attend negotiations in a week’s time on an autonomy proposal for Kosovo with ethnic Albanian leaders from the province. Cook earlier told a conference that Milosevic had agreed to study the proposal.

Steps 1 to 6 reduce this text for searching to the following:

Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna commun call try prevent all-out war Serb province President Milosevic said well known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia peace Serbia particip representa ethnic commun Tanjug said Milosevic speak meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week time autonomy propos Kosovo ethnic Alban lead province Cook earl told conference Milosevic agree study propos.

The output of step 7 is then inserted and stored in an inverted file that lists the index entries and an indication of their position and frequency of occurrence. The specific nature of the index entries, however, will vary based on the decision in Step 4 concerning what constitutes an “indexable term.” More sophisticated document processors will have phrase recognizers, as well as Named Entity recognizers and Categorizers, to ensure index entries such as Milosevic are tagged as a Person and entries such as Yugoslavia and Serbia as Countries.

Step 8: Term weight assignment. Weights are assigned to terms in the index file. The simplest of search engines just assign a binary weight: 1 for presence and 0 for absence. The more sophisticated the search engine, the more complex the weighting scheme. Measuring the frequency of occurrence of a term in the document creates more sophisticated weighting, with length-normalization of frequencies still more sophisticated. Extensive experience in information retrieval research over many years has clearly demonstrated that the optimal weighting comes from use of “tf/idf.” This algorithm measures the frequency of occurrence of each term within a document. Then it compares that frequency against the frequency of occurrence in the entire database.

Not all terms are good “discriminators” — that is, all terms do not single out one document from another very well. A simple example would be the word “the.” This word appears in too many documents to help distinguish one from another. A less obvious example would be the word “antibiotic.” In a sports database when we compare each document to the database as a whole, the term “antibiotic” would probably be a good discriminator among documents, and therefore would be assigned a high weight. Conversely, in a database devoted to health or medicine, “antibiotic” would probably be a poor discriminator, since it occurs very often. The TF/IDF weighting scheme assigns higher weights to those terms that really distinguish one document from the others.
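Here is a minimal sketch of one common tf/idf variant: term frequency is the term’s share of the document’s tokens, and inverse document frequency is the natural log of (number of documents / number of documents containing the term). Search engines use many variations of this formula, so treat the exact form as an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))      # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# A tiny made-up "sports database" of three documents.
docs = [
    ["the", "striker", "took", "antibiotic"],
    ["the", "striker", "scored"],
    ["the", "goalkeeper", "saved"],
]
weights = tf_idf(docs)
```

In this toy collection “the” appears in every document and gets a weight of zero, while “antibiotic” appears in only one and outweighs “striker”, which is the discriminator effect described above.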

Step 9: Create index. The index or inverted file is the internal data structure that stores the index information and that will be searched for each query. Inverted files range from a simple listing of every alpha-numeric sequence in a set of documents/pages being indexed along with the overall identifying numbers of the documents in which the sequence occurs, to a more linguistically complex list of entries, the tf/idf weights, and pointers to where inside each document the term occurs. The more complete the information in the index, the better the search results.
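A bare-bones inverted file with position pointers might look like the following sketch. The structure (term, then document id, then list of positions) is one reasonable layout that I picked for illustration; real indexes are compressed and stored on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: list of tokens}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: ["serbia", "talk", "kosovo"],
    2: ["kosovo", "talk", "talk"],
}
index = build_inverted_index(docs)
```

Looking up index["talk"] returns every document containing the term along with where it occurs, so term frequency and position are both available at query time.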

Query Processor
Query processing has seven possible steps, though a system can cut these steps short and proceed to match the query to the inverted file at any of a number of places during the processing. Document processing shares many steps with query processing. More steps and more documents make the process more expensive in terms of computational resources and responsiveness. However, the more processing a system performs, and so the longer the wait for results, the higher the quality of those results tends to be. Thus, search system designers must choose what is most important to their users: time or quality. Publicly available search engines usually choose time over very high quality, because they have too many documents to search against.

The steps in query processing are as follows (with the option to stop processing and start matching indicated as “Matcher”):

  • Tokenize query terms.
  • Recognize query terms vs. special operators.
  ----> Matcher

  • Delete stop words.
  • Stem words.
  • Create query representation.
  ----> Matcher

  • Expand query terms.
  • Compute weights.
  ----> Matcher

Step 1: Tokenizing. As soon as a user inputs a query, the search engine — whether a keyword-based system or a full natural language processing (NLP) system — must tokenize the query stream, i.e., break it down into understandable segments. Usually a token is defined as an alpha-numeric string that occurs between white space and/or punctuation.

Step 2: Parsing. Since users may employ special operators in their query, including Boolean, adjacency, or proximity operators, the system needs to parse the query first into query terms and operators. These operators may occur in the form of reserved punctuation (e.g., quotation marks) or reserved terms in specialized format (e.g., AND, OR). In the case of an NLP system, the query processor will recognize the operators implicitly in the language used no matter how the operators might be expressed (e.g., prepositions, conjunctions, ordering).

At this point, a search engine may take the list of query terms and search them against the inverted file. In fact, this is the point at which the majority of publicly available search engines perform the search.
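The tokenizing and parsing steps above can be sketched as a small function that separates quoted phrases, reserved operator words, and ordinary terms. The operator set and the tuple labels are assumptions for this example; each engine defines its own syntax.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}
# Either a double-quoted phrase or a run of non-space characters.
QUERY_TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(query):
    """Split a query into (kind, value) pairs: PHRASE, OP, or TERM."""
    parsed = []
    for phrase, word in QUERY_TOKEN_RE.findall(query):
        if phrase:
            parsed.append(("PHRASE", phrase.lower()))
        elif word in OPERATORS:
            parsed.append(("OP", word))       # operators must be upper case
        else:
            parsed.append(("TERM", word.lower()))
    return parsed


parsed = parse_query('"sf giants" AND tickets NOT parking')
```

The result keeps the phrase together and marks AND and NOT as operators, which is all the matcher needs if the engine stops processing at this point.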

Steps 3 and 4: Stop list and stemming. Some search engines will go further and stop-list and stem the query, similar to the processes described above in the Document Processor section. The stop list might also contain words from commonly occurring querying phrases, such as, “I’d like information about.” However, since most publicly available search engines encourage very short queries, as evidenced in the size of query window provided, the engines may drop these two steps.

Step 5: Creating the query. How each particular search engine creates a query representation depends on how the system does its matching. If a statistically based matcher is used, then the query must match the statistical representations of the documents in the system. Good statistical queries should contain many synonyms and other terms in order to create a full representation. If a Boolean matcher is utilized, then the system must create logical sets of the terms connected by AND, OR, or NOT.

An NLP system will recognize single terms, phrases, and Named Entities. If it uses any Boolean logic, it will also recognize the logical operators from Step 2 and create a representation containing logical sets of the terms to be AND’d, OR’d, or NOT’d.

At this point, a search engine may take the query representation and perform the search against the inverted file. More advanced search engines may take two further steps.
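For a Boolean matcher, the “logical sets” mentioned in Step 5 map directly onto set operations over the inverted file. The sketch below assumes a simplified index that maps each term to the set of document ids containing it; the helper names are mine.

```python
def postings(index, term):
    """Set of document ids containing term (empty if the term is not indexed)."""
    return set(index.get(term, ()))


def match_and(index, terms):
    """Documents containing every term."""
    sets = [postings(index, term) for term in terms]
    return set.intersection(*sets) if sets else set()


def match_or(index, terms):
    """Documents containing at least one term."""
    return set().union(*(postings(index, term) for term in terms))


def match_not(candidates, index, term):
    """Remove documents containing term from an existing candidate set."""
    return candidates - postings(index, term)


index = {"kosovo": {1, 2, 3}, "talk": {1, 3}, "war": {3}}
both = match_and(index, ["kosovo", "talk"])
no_war = match_not(both, index, "war")
```

So “kosovo AND talk” yields documents 1 and 3, and adding “NOT war” leaves only document 1.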

Step 6: Query expansion. Since users of search engines usually include only a single statement of their information needs in a query, it becomes highly probable that the information they need may be expressed using synonyms, rather than the exact query terms, in the documents which the search engine searches against. Therefore, more sophisticated systems may expand the query into all possible synonymous terms and perhaps even broader and narrower terms.

This process approaches what search intermediaries did for end users in the earlier days of commercial search systems. Back then, intermediaries might have used the same controlled vocabulary or thesaurus used by the indexers who assigned subject descriptors to documents. Today, resources such as WordNet are generally available, or specialized expansion facilities may take the initial query and enlarge it by adding associated vocabulary.
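Query expansion can be sketched with a simple synonym lookup. The small dictionary below is a hand-made stand-in for a real resource such as WordNet or a controlled vocabulary; its entries are invented for the example.

```python
# Hypothetical synonym table standing in for a real thesaurus.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "buy": ["purchase"],
}


def expand_query(terms, synonyms=SYNONYMS):
    """Return the original terms plus their synonyms, in order and without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *synonyms.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_query(["buy", "car", "insurance"])
```

A three-word query becomes six terms here, which improves the odds of matching documents that use “automobile” or “purchase” instead of the searcher’s own wording.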

Step 7: Query term weighting (assuming more than one query term). The final step in query processing involves computing weights for the terms in the query. Sometimes the user controls this step by indicating either how much to weight each term or simply which term or concept in the query matters most and must appear in each retrieved document to ensure relevance.

Leaving the weighting up to the user is not common, because research has shown that users are not particularly good at determining the relative importance of terms in their queries. They can’t make this determination for several reasons. First, they don’t know what else exists in the database, and document terms are weighted by being compared to the database as a whole. Second, most users seek information about an unfamiliar subject, so they may not know the correct terminology.

Few search engines implement system-based query weighting, but some do an implicit weighting by treating the first term(s) in a query as having higher significance. The engines use this information to provide a list of documents/pages to the user.

After this final step, the expanded, weighted query is searched against the inverted file of documents.

Search and Matching Function
How systems carry out their search and matching functions differs according to which theoretical model of information retrieval underlies the system’s design philosophy. Since making the distinctions between these models goes far beyond the goals of this article, we will only make some broad generalizations in the following description of the search and matching function. Those interested in further detail should turn to R. Baeza-Yates and B. Ribeiro-Neto’s excellent textbook on IR (Modern Information Retrieval, Addison-Wesley, 1999).

Searching the inverted file for documents meeting the query requirements, referred to simply as “matching,” is typically a standard binary search, no matter whether the search ends after the first two, five, or all seven steps of query processing. While the computational processing required for simple, unweighted, non-Boolean query matching is far simpler than when the model is an NLP-based query within a weighted, Boolean model, it also follows that the simpler the document representation, the query representation, and the matching algorithm, the less relevant the results, except for very simple queries, such as one-word, non-ambiguous queries seeking the most generally known information.

Having determined which subset of documents or pages matches the query requirements to some degree, a similarity score is computed between the query and each document/page based on the scoring algorithm used by the system. Scoring algorithms base their rankings on the presence/absence of query term(s), term frequency, tf/idf, Boolean logic fulfillment, or query term weights. Some search engines use scoring algorithms not based on document contents, but rather, on relations among documents or past retrieval history of documents/pages.

After computing the similarity of each document in the subset of documents, the system presents an ordered list to the user. The sophistication of the ordering of the documents again depends on the model the system uses, as well as the richness of the document and query weighting mechanisms. For example, search engines that only require the presence of any alpha-numeric string from the query occurring anywhere, in any order, in a document would produce a very different ranking than one by a search engine that performed linguistically correct phrasing for both document and query representation and that utilized the proven tf/idf weighting scheme.
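One simple way to produce such an ordered list is to sum, for each document, the product of each query term’s weight and that term’s weight in the document. This is only a sketch of a weighted-sum scorer with made-up weights; the scoring functions used by real engines are not public.

```python
def rank(doc_weights, query_weights):
    """Score documents against a weighted query and return them best first.

    doc_weights: {doc_id: {term: weight}}; query_weights: {term: weight}.
    Documents that match no query term are left out.
    """
    scores = {}
    for doc_id, weights in doc_weights.items():
        score = sum(q_weight * weights.get(term, 0.0)
                    for term, q_weight in query_weights.items())
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented tf/idf-style weights for three documents.
doc_weights = {
    "a": {"kosovo": 0.5, "talk": 0.1},
    "b": {"kosovo": 0.2},
    "c": {"sport": 0.9},
}
results = rank(doc_weights, {"kosovo": 1.0, "talk": 2.0})
```

Document “a” comes out on top because it matches both query terms, “b” follows, and “c” is dropped because it matches nothing.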

However the search engine determines rank, the ranked results list goes to the user, who can then simply click and follow the system’s internal pointers to the selected document/page.

More sophisticated systems will go even further at this stage and allow the user to provide some relevance feedback or to modify their query based on the results they have seen. If either of these is available, the system will then adjust its query representation to reflect this value-added feedback and re-run the search with the improved query to produce either a new set of documents or a simple re-ranking of documents from the initial search.

What Document Features Make a Good Match to a Query
We have discussed how search engines work, but what features of a document make for a good match to a query? Let’s look at the key features and consider some pros and cons of their utility in helping to retrieve a good representation of documents/pages.

• Term frequency: How frequently a query term appears in a document is one of the most obvious ways of determining a document’s relevance to a query. While most often true, several situations can undermine this premise. First, many words have multiple meanings — they are polysemous. Think of words like “pool” or “fire.” Many of the non-relevant documents presented to users result from matching the right word, but with the wrong meaning.

Also, in a collection of documents in a particular domain, such as education, common query terms such as “education” or “teaching” are so common and occur so frequently that an engine’s ability to distinguish the relevant from the non-relevant in a collection declines sharply. Search engines that don’t use a tf/idf weighting algorithm do not appropriately down-weight the overly frequent terms, nor are higher weights assigned to appropriate distinguishing (and less frequently-occurring) terms, e.g., “early-childhood.”

• Location of terms: Many search engines give preference to words found in the title or lead paragraph or in the metadata of a document. Some studies show that the location in which a term occurs in a document or on a page indicates its significance to the document. Terms occurring in the title of a document or page that match a query term are therefore frequently weighted more heavily than terms occurring in the body of the document. Similarly, query terms occurring in section headings or the first paragraph of a document may be more likely to be relevant.

• Link analysis: Web-based search engines have introduced one dramatically different feature for weighting and ranking pages. Link analysis works somewhat like bibliographic citation practices, such as those used by Science Citation Index. Link analysis is based on how well-connected each page is, as defined by Hubs and Authorities, where Hub documents link to large numbers of other pages (out-links), and Authority documents are those referred to by many other pages, or have a high number of “in-links” (J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 668-77).
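The Hubs and Authorities idea can be sketched as a short iterative computation: a page’s authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to, with both normalized each round. This is a simplified version of the procedure Kleinberg describes, run on a made-up four-page link graph; the fixed iteration count is an assumption.

```python
import math


def hits(links, iterations=20):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {target for targets in links.values() for target in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link in.
        auth = {page: sum(hub[src] for src in pages if page in links.get(src, ()))
                for page in pages}
        norm = math.sqrt(sum(value * value for value in auth.values())) or 1.0
        auth = {page: value / norm for page, value in auth.items()}
        # Hub: sum of authority scores of the pages linked out to.
        hub = {page: sum(auth[target] for target in links.get(page, ()))
               for page in pages}
        norm = math.sqrt(sum(value * value for value in hub.values())) or 1.0
        hub = {page: value / norm for page, value in hub.items()}
    return hub, auth


# Invented link graph: "portal" and "blog" link out, "a" and "b" only receive links.
links = {"portal": ["a", "b"], "blog": ["a"], "a": [], "b": []}
hub, auth = hits(links)
```

In this graph “a” ends up as the strongest authority because both other pages link to it, and “portal” is the strongest hub because it links to both authorities.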

• Popularity: Google and several other search engines add popularity to link analysis to help determine the relevance or value of pages. Popularity utilizes data on the frequency with which a page is chosen by all users as a means of predicting relevance. While popularity is a good indicator at times, it assumes that the underlying information need remains the same.

• Date of Publication: Some search engines assume that the more recent the information is, the more likely that it will be useful or relevant to the user. The engines therefore present results beginning with the most recent to the less current.

• Length: While length per se does not necessarily predict relevance, it is a factor when used to compute the relative merit of similar pages. So, in a choice between two documents both containing the same query terms, the document that contains a proportionately higher occurrence of the term relative to the length of the document is assumed more likely to be relevant.

• Proximity of query terms: When the terms in a query occur near to each other within a document, it is more likely that the document is relevant to the query than if the terms occur at greater distance. While some search engines do not recognize phrases per se in queries, some search engines clearly rank documents in results higher if the query terms occur adjacent to one another or in closer proximity, as compared to documents in which the terms occur at a distance.
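A proximity signal is easy to sketch once term positions are available: find the smallest gap between any occurrence of the two query terms and score closer pairs higher. The 1/distance formula below is purely an assumption for illustration, not a known ranking function.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap, in tokens, between any occurrence of two terms."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def proximity_score(tokens, term_a, term_b):
    """Return 1/distance for the closest pair of occurrences, or 0.0 if a term is missing."""
    pos_a = [i for i, token in enumerate(tokens) if token == term_a]
    pos_b = [i for i, token in enumerate(tokens) if token == term_b]
    if not pos_a or not pos_b:
        return 0.0
    # Guard against a zero gap when both arguments are the same term.
    return 1.0 / max(1, min_distance(pos_a, pos_b))


adjacent = proximity_score("sf giants tickets".split(), "sf", "giants")
distant = proximity_score("sf news about the giants".split(), "sf", "giants")
```

The document with the two terms side by side gets the maximum score, while the one with three words between them scores a quarter of that.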

• Proper nouns sometimes have higher weights, since so many searches are performed on people, places, or things. While this may be useful, if the search engine assumes that you are searching for a name instead of the same word as a normal everyday term, then the search results may be peculiarly skewed. Imagine getting information on “Madonna,” the rock star, when you were looking for pictures of madonnas for an art history class.

Summary
The above explanation lays out the range of processing that might occur in a search engine, along with the many options that a search engine provider decides on. The range of options may help clarify users’ frequent surprise at the results their queries return. Up till now, search engine providers have mainly opted for less, versus more, complex processing of documents and queries. The typical search results therefore leave a lot of work to be done by the searcher, who must wend their way through the results, clicking on and exploring a number of documents before finding exactly what they seek. The typical evolution of products and services suggests that this status-quo will not continue. Search engines that go further in the complexity and quality of the processing performed will be rewarded with greater allegiance by searchers, as well as financially rewarding opportunities to serve as the search engine on more organizations’ intranets.

Author: admin Categories: Search Engine Concepts

New Search Engine Series

February 25th, 2009

I’m the type of person that likes to know how things work, and search engines are definitely in that category. I’m going to start a new series here that begins with a brief overview of a basic search engine and then dives into actual search engine research papers. Hopefully you guys will enjoy reading and learning about the topic.

Author: admin Categories: General