Posts Tagged ‘search engines’

Realtime search a BIG problem for Google

May 12th, 2009

Google has just launched a new “search options” feature on its main search page. When you click on “Search options” you can filter your search by different types of results (videos, forums, and reviews), by time (recent, past 24 hours, past week, past year), as well as seeing related searches, a “wonder wheel” view, or a timeline view.

At Google’s Searchology event, which is going on right now, Marissa Mayer listed the following as the hardest unsolved problems in search:

- Finding the most recent information
- Expressing that you want just one type of result
- Assessing which results are best
- Knowing what you’re looking for
- Expressing your searches in keywords

Notice that real-time search is the No. 1 problem. (Twitter and a bunch of startups from OneRiot to Tweetmeme are also working on it, with the latter two launching their own real-time search efforts today.) And it certainly is a problem for Google, even with the new recent results option. Try searching for any of the top trending results on Twitter right now like Miss California (vs. Twitter search results) or Star Trek (vs. Twitter results), and you don’t even get any Twitter results on Google.

While real-time search is still a big problem, it is not the only problem. Some of the new options address the difficulty of searching back through time. The recent results get as real-time as Google can get, but you can also expand the timeframe. And you can look at an actual timeline of results, which looks for dates within results and then places them chronologically (this is sort of hit or miss: just because a date is mentioned in a text does not mean the entire result is about or from that period of time). Google now also lets you see related searches as an option. And the Wonder Wheel is more of a visual aid to see how different related topics are clustered together. When you click on any spoke of the wheel, that search term moves to the center. We’ve seen many of these techniques in the past, but Google is giving them a higher profile by putting them in its main search page.

What Is Google Squared? It Is How Google Will Crush Wolfram Alpha (Video)

May 12th, 2009

One of the next frontiers of search is taking all of the unstructured data spread helter-skelter across the Web and treating it like it is sitting in a nice, structured database. It is easier to get answers out of a database where everything is neatly labeled, stamped, and categorized. As the sheer volume of stuff on the Web keeps growing, keyword search keeps getting closer to its breaking point. Adding structure to the Web is one way to make sense of all that data, and Google is starting to tackle the problem with a Google Labs project called Google Squared, which Marissa Mayer mentioned earlier today at the company’s Searchology briefing.

Google Squared extracts data from Web pages and presents them in search results as squares in an online spreadsheet. Michael was at the event and got a personal demo (see video below). From Michael’s Searchology notes:

Google Squared is launching later this month in labs. Google Squared returns search results in a spreadsheet format. It structures the unstructured data on web pages. So a search for Small Dogs returns results with names, description, size, weight, origin, etc., in columns and rows.

Google is looking for data structures on the web that imply facts, and then grabbing it for Squared results. “It takes an incredible amount of compute power to create one of those squares,” she says.

This type of technology has obvious applications for many types of targeted searches, including product search, health search, scientific searches, you name it. There are dozens of semantic search startups trying to impose structure on the Web to perform similar tricks. Another high-profile search startup which is launching on Monday, Wolfram Alpha, takes a slightly different approach in that it simply ingests massive amounts of information into its own databases where it can query it to its heart’s delight. Already there is a bit of a rivalry between Google and Wolfram because getting back structured results is a major new direction for search.

Wolfram does a pretty good job parsing the information in its own databases, but those databases will never match what is available on the Web. Wolfram’s databases currently store only 10 terabytes of information, a tiny fraction of what is on the Web. (I will be posting my impressions of Wolfram’s search engine soon). Google Squared is an early attempt to take the messy data which exists on the Web and place it into simple tables. It is still very experimental and isn’t always on target, but you can see where this is going. Turning the Web into a giant database will crush any attempt to segregate the “best” information into a separate database so that it can be processed and searched more deeply.

In the video demo below, a search for “camera” sorts the results in different columns by image, description, manufacturer, resolution, etc. You can refine results by clicking on a particular column such as manufacturer. A search for “rollercoasters” sorts results by name, image, description, height, length, and number of inversions. But sometimes it gets confused. A search for “spaceships” turns up a Corvette and a missile carrier. It is going to be a while before this makes it out of Google Labs.

Do Search Engines Love Blogs?

May 5th, 2009

Microsoft Explores an Algorithm to Increase PageRank for Pages Linked to by Blogs.

In the new patent document, they ask if the rankings of web pages in search results would be improved by providing a slight increase in the PageRank of pages linked to by blogs. They tell us that:

This idea is based on the assumption (or hope) that blogs are still mostly human-authored, and that links from blogs generally represent sincere endorsements on the part of the authors.

 

The December post explored how a search engine might be able to identify blog pages and distinguish them from non-blog pages, and told us that:

Search engines are increasingly implementing features that restrict the results for queries to be from blog pages.

But limiting the number of blogs that show up in search results doesn’t necessarily mean that a search engine doesn’t like blogs. It may mean that search engines would prefer to show a diversified set of search results, including blog pages and other results.

Ranking Algorithms

Search engines often look at a couple of different kinds of ranking factors when determining the order in which search results are shown to searchers.

Query-Independent and Query-Dependent

One way to classify ranking algorithms is as query-dependent (or dynamic) or query-independent (or static).

Query-dependent ranking algorithms rely upon the query terms someone uses to rank pages. Query-independent ranking algorithms look at other factors, such as how important a page is believed to be based upon whether or not important pages link to it. PageRank is an example of a query-independent ranking algorithm.

Query-independent ranking algorithms assign a quality score to each document on the web, and can be run ahead of time. Query-dependent ranking algorithms depend upon the query used, and have to be run when a user submits a query.

Content, Usage, and Link Based Ranking Algorithms

It’s also possible to classify ranking algorithms as content-based, usage-based, and link-based.

Content-based ranking algorithms - use the words in a document to rank the document among other documents. For instance, a higher score might be assigned to a document that contains the query terms at the beginning of a document, in a prominent font, or in a certain kind of HTML element.

Usage-based ranking algorithms - may assign a score based on estimates of how often documents are viewed, drawn from web proxy logs or from click-throughs on search engine results pages.

Link-based ranking algorithms - look at the hyperlinks between web pages to rank those pages, assigning a score to a page based upon the links pointing to it, with each link treated as an endorsement of the page.

PageRank - an example of a query-independent link-based ranking algorithm.

The PageRank formula is often explained as follows. Consider a web surfer who is performing a random walk on the web. At every step along the walk, the surfer moves from one web page to another, using the following algorithm.

With some probability d, the surfer selects a web page uniformly at random and jumps to it; otherwise, the surfer selects one of the outgoing hyperlinks in the current page uniformly at random and follows it. Because of this metaphor, the number d is sometimes called the “jump probability,” namely the probability that the surfer will jump to a completely random page.

If the web surfer jumps with probability d and there are |V| web pages, the probability of jumping to a particular page is d/|V|. Since any page can be reached by jumping, every page is guaranteed a score of at least d/|V|. The PageRank of a particular web page is then the fraction of time that the random surfer will spend at that page.
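The random-surfer description above can be sketched in a few lines of Python. This is only an illustration, not any search engine's actual implementation: the function name `pagerank`, the `links` dictionary format, the default `d=0.15`, and the choice to let a page with no outgoing links jump uniformly at random are all assumptions made for the example. Note that `d` here is the jump probability, as in the text.

```python
def pagerank(links, d=0.15, iterations=100):
    """Estimate PageRank by repeatedly applying the random-surfer step.

    links: dict mapping each page to the list of pages it links to.
    d:     jump probability (chance of jumping to a uniformly random page).
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Random jumps alone guarantee every page a score of d/|V|.
        new_rank = {p: d / n for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                # Otherwise the surfer follows one outgoing link uniformly at random.
                share = (1 - d) * rank[p] / len(out)
                for q in out:
                    new_rank[q] += share
            else:
                # Assumption: a page with no outgoing links jumps uniformly.
                share = (1 - d) * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


example_links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
example_ranks = pagerank(example_links)
```

The scores returned sum to 1, so each one can be read as the fraction of time the surfer spends on that page.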

But what if that surfer started favoring pages that were linked to by blogs a little more?

Splitting PageRank

One of the problems behind using PageRank is that some commercial web sites try to inflate PageRank by creating links that point to a page solely for the purpose of endorsing that page, artificially increasing the value of the page.

This patent filing describes in some detail how a portion of PageRank from a page might be split (or distributed) equally amongst the links found on the pages of a site, and how the distribution of PageRank could be slightly altered to favor (or show a bias towards) pages that are linked to by blogs.

If blogs are, as the authors note in the patent, “still mostly human authored, and generally represent sincere endorsements of their authors,” then this bias might help counteract the artificial inflation of PageRank scores by people who would create links pointing to pages solely for the purpose of artificially increasing the PageRank of pages.

The patent filing is:

Ranking Method using Hyperlinks in Blogs
Inventors: Steve Chien and Dennis Fetterly
Assigned to Microsoft
US Patent Application 20080243812
Published October 2, 2008
Filed March 30, 2007

Abstract

A method for static ranking of web documents is disclosed. Search engines are typically configured such that search results having a higher PageRank® score are listed first. A modified scoring technique is provided whereby the score includes a reset vector that is biased toward web pages linked to blogs. This requires identifying web pages as either blogs or non-blogs.
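One way to picture a biased reset vector is to change where the random surfer lands when jumping: instead of picking every page with equal probability, the jump favors pages that blogs link to. The sketch below is a guess at the general shape of such a scheme, not the method claimed in the patent. The function name `biased_pagerank`, the `bias` multiplier, and the way the weights are normalized are all hypothetical.

```python
def biased_pagerank(links, blog_pages, d=0.15, bias=2.0, iterations=100):
    """PageRank whose random jump favors pages linked to by blogs.

    links:      dict mapping each page to the list of pages it links to.
    blog_pages: set of pages already identified as blogs.
    bias:       hypothetical weight multiplier for blog-linked pages.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})

    # Pages that receive at least one link from a blog page.
    blog_linked = {t for src in blog_pages for t in links.get(src, [])}

    # Reset vector: jump probabilities, normalized to sum to 1.
    weights = {p: (bias if p in blog_linked else 1.0) for p in pages}
    total = sum(weights.values())
    reset = {p: w / total for p, w in weights.items()}

    rank = dict(reset)
    for _ in range(iterations):
        # A jump now lands on page p with probability reset[p], not 1/|V|.
        new_rank = {p: d * reset[p] for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                share = (1 - d) * rank[p] / len(out)
                for q in out:
                    new_rank[q] += share
            else:
                # Assumption: pages without outgoing links jump via the reset vector.
                for q in pages:
                    new_rank[q] += (1 - d) * rank[p] * reset[q]
        rank = new_rank
    return rank
```

With `bias=1.0`, or with an empty `blog_pages` set, every page gets the same jump probability and this reduces to ordinary PageRank.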

Identifying Blogs

Some of the kinds of things that a search engine crawling program might look at when deciding whether a page is from a blog might include:

  1. Whether a page is hosted in a known blog hosting DNS domain such as blogspot or wordpress.com
  2. What features are found in the words and phrases of the page that are not HTML markup
  3. What the targets of outgoing links might be in the page, and
  4. Whether the string “blog” occurs in the URL
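Two of the signals in the list, the hosting domain and the string “blog” in the URL, can be checked from the URL alone. The following is a rough sketch of that idea only. The function name `looks_like_blog` and the `KNOWN_BLOG_HOSTS` tuple are made up for illustration, and a real classifier would presumably also weigh the page text and outgoing links (items 2 and 3) rather than treat either URL signal as decisive.

```python
from urllib.parse import urlparse

# Illustrative list only; a real system would use a much larger one.
KNOWN_BLOG_HOSTS = ("blogspot.com", "wordpress.com")


def looks_like_blog(url):
    """Return True if the URL alone suggests the page belongs to a blog."""
    host = (urlparse(url).hostname or "").lower()

    # Signal 1: the page is hosted on a known blog-hosting domain.
    on_blog_host = any(
        host == known or host.endswith("." + known) for known in KNOWN_BLOG_HOSTS
    )

    # Signal 4: the string "blog" occurs anywhere in the URL.
    has_blog_in_url = "blog" in url.lower()

    return on_blog_host or has_blog_in_url
```

Treating either signal as sufficient will produce false positives, which is one reason to combine several signals before labeling a page.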

Experimenting with a Bias Towards Pages Linked to by Blogs

The authors of this patent performed experiments where they downloaded over 472 million pages, and found links to an additional 6 billion pages within those pages.

They reranked the PageRank of these pages using a bias towards pages that they identified were linked to by blogs, with a preference towards using blog pages that had higher PageRanks, which they tell us tend to be “frequently updated, more informational rather than personal, and free of spam.”

They also tell us that some other characteristics of blogs may prove useful in refining this technique, such as looking at the number of subscribers to a particular blog, and associating a higher endorsement value to blogs with greater numbers of subscribers.

Conclusion

Is sending more PageRank to pages that are linked to by blogs something that will increase the relevance and importance of pages that show up in search results? Are links to pages from blogs still actual endorsements from the authors of those blogs?

Do search engines love blogs?

Duplicate Content Dissected

August 4th, 2008

I’ve read seemingly hundreds of forum posts discussing duplicate content, none of which gave the full picture, leaving me with more questions than answers. I decided to spend some time doing research to find out exactly what goes on behind the scenes. Here is what I have discovered.

Most people are under the assumption that duplicate content is looked at on the page level when in fact it is far more complex than that. Simply saying that “by changing 25 percent of the text on a page it is no longer duplicate content” is not a true or accurate statement. Let’s examine why that is.

To gain some understanding we need to take a look at the k-shingle algorithm that may or may not be in use by the major search engines (my money is that it is in use). I’ve seen the following used as an example so let’s use it here as well.

Let’s suppose that you have a page that contains the following text:

The swift brown fox jumped over the lazy dog.

Before we get to this point the search engine has already stripped all tags and HTML from the page, leaving just this plain text behind for us to take a look at.

The shingling algorithm essentially finds word groups within a body of text in order to determine the uniqueness of the text. The first thing it does is strip out all stop words like and, the, of, to. It also strips out all filler words, leaving us only with action words, which are considered the core of the content. Once this is done the following “shingles” are created from the above text. (I’m going to include the stop words for simplicity.)

The swift brown fox
swift brown fox jumped
brown fox jumped over
fox jumped over the
jumped over the lazy
over the lazy dog
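The list above can be reproduced with a short function. This is a minimal sketch, assuming four-word shingles as in the example, lowercased text, and stop words left in for simplicity; the function name `shingles` and the word-splitting rule are my own choices, not anything a search engine has documented.

```python
import re


def shingles(text, k=4):
    """Return every run of k consecutive words in text, in order."""
    # Lowercase and keep only word characters, so "dog." becomes "dog".
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


first_page = "The swift brown fox jumped over the lazy dog."
first_shingles = shingles(first_page)
```

A sentence of nine words yields six four-word shingles, matching the six lines listed above.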

These are essentially like unique fingerprints that identify this block of text. The search engine can now compare this “fingerprint” to other pages in an attempt to find duplicate content. As duplicates are found a “duplicate content” score is assigned to the page. If too many “fingerprints” match other documents, the score becomes high enough that the search engines flag the page as duplicate content, sending it to supplemental hell or, worse, deleting it from their index completely.

Now suppose a second page contains the following text:

My old lady swears that she saw the lazy dog jump over the swift brown fox.

The above gives us the following shingles.

my old lady swears
old lady swears that
lady swears that she
swears that she saw
that she saw the
she saw the lazy
saw the lazy dog
the lazy dog jump
lazy dog jump over
dog jump over the
jump over the swift
over the swift brown
the swift brown fox

Comparing these two sets of shingles we can see that only one matches (“the swift brown fox”). Thus it is unlikely that these two documents are duplicates of one another. No one but Google knows what the percentage match must be for these two documents to be considered duplicates, but some thorough testing would sure narrow it down ;).
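The comparison can be sketched as a set intersection. The score used here, shared shingles divided by all distinct shingles across both documents (often called Jaccard similarity), is one common way to measure overlap and is an assumption on my part. Nothing in the post establishes which measure or threshold any search engine actually uses, and the function names are made up for the example.

```python
import re


def shingle_set(text, k=4):
    """Return the set of k-word shingles in text (lowercased, stop words kept)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(text_a, text_b, k=4):
    """Shared shingles divided by all distinct shingles in both texts."""
    a, b = shingle_set(text_a, k), shingle_set(text_b, k)
    if not a and not b:
        return 0.0  # Both texts are shorter than k words; nothing to compare.
    return len(a & b) / len(a | b)


page_one = "The swift brown fox jumped over the lazy dog."
page_two = "My old lady swears that she saw the lazy dog jump over the swift brown fox."

shared = shingle_set(page_one) & shingle_set(page_two)
score = resemblance(page_one, page_two)
```

For these two sentences `shared` holds the single shingle "the swift brown fox", out of 18 distinct shingles in total, so the score is 1/18, or roughly 5.6 percent.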

So what can we take away from the above examples? First and foremost, we quickly begin to realize that duplicate content is far more complicated than saying “document A and document B are 50 percent similar”. Second, we can see that people adding “stop words” and “filler words” to avoid duplicate content are largely wasting their time. It’s the “action” words that should be the focus. Changing action words without altering the meaning of a body of text may very well be enough to get past these algorithms. Then again, there may be other mechanisms at work that we can’t yet see, rendering that impossible as well. I suggest experimenting and finding what works for you in your situation.