PageRank, Google’s Claim to Fame

The secret to Google’s uncanny ability to return the most relevant results to any search query is an algorithm called PageRank. Developed by Larry Page and Sergey Brin (the founders of Google, duh), PageRank is supposedly a pun base upon the last name of one of its founding fathers. Regardless, the algorithm does what its name implies, ranks pages by relevancy.

PageRank is essentially an evolved form of a few lesser “tricks” that combine to achieve one desired function.

The Hyperlink Trick

Hyperlink Trick

“The basis of the hyperlink trick. Six web pages are shown, each represented by a box. Two of the pages are scrambled egg recipes, and the other four are pages that have hyperlinks to these recipes. The hyperlink trick ranks Bert’s page above Ernie’s, because Bert has three incoming links and Ernie only has one.”

The Authority Trick:

Authority Trick

“The basis for the authority trick. Four web pages are shown: two scrambled egg recipes and two pages that link to the recipes. One of the links is from the author of this book (who is not a famous chef) and one is from the home page of the famous chef Alice Waters. The authority trick ranks Bert’s page above Ernie’s, because Bert’s incoming link has greater “authority” than Ernie’s.”

Well that’s all fine and dandy, but how does my computer know that Bert is any greater an authority than Ernie? They are, after all, both puppets…

The Hyperlink Authority “Trick”:

This particular “Trick” isn’t technically a trick at all, but rather an improved version of both. In order to determine the result of highest authority, each web page is assigned an initial score of 1. If a page has hyperlinks pointing to it, their respective scores are combined to produce an overall authority value for that page.

Authority + Hyperlink

“A simple calculation of ‘authority scores’  for the two scrambled egg recipes. The authority scores are shown in circles.”

The Hyperlink + Authority Problem:

Guess what? Another problem. You see, the previous algorithm appears to work without any need for the computer to know the contents of a web page, but there exists a particular phenomenon called a “cycle”, in which one returns to their starting point after consecutively clicking a series of links.

Cycle

“An example of a cycle of hyperlinks. Pages A, B, and E form a cycle because you can start at A, click through to B, then E, and then return to your starting point at A.”

How do we fix it?

The Random Surfer Trick

In this instance, a “random surfer” visits an entirely random web page anywhere on the internet. He scans the page for hyperlinks and selects one at random. He repeats the process on the resulting page, and so on, and so forth.

Random Surfer

“The random surfer model. Pages visited by the surfer are darkly shaded, and the dashed arrows represent random restarts. The trail starts at page A and follows randomly selected hyperlinks interrupted by two random restarts [dashed lines].”

In the random surfer model their exists one opportunity for failure. There is a small percent chance (ex: 15%) that the user becomes bored of a particular page, and restarts on another completely random webpage. Illustrated below, the computer counts each time a particular website is visited in a group of 16. In both cases, page D accrues the highest number/percent of hits.

PageRank

“Random surfer simulations. Top: Number of visits to each page in a 1000-visit simulation. Bottom: Percentage of visits to each page in a simulation of one million visits.”

BTWPRchecker allows you to check your site’s PageRank.

How do all of these seemingly different algorithms come together to produce Google’s PageRank? 

The values produced by the random surfer trick are exactly what the hyperlink and authority tricks require. Page had the most hits in both cases. Why? Page D had the highest number of incoming links (hyperlink trick) and the highest number of incoming popular links (authority trick), thus the surfer found himself returning to D more than any other page (because there are fewer links leading to them).

Random Surfer

“Surfer authority scores for the scrambled egg example. Bert and Ernie each have exactly one incoming link conferring authority on their pages, but Bert’s page will be ranked higher in a web search query for “scrambled eggs.”

Summary:

“…the random surfer model simultaneously incorporates both the hyperlink trick and authority trick. In other words, the quality and quantity of incoming links at each page are all taken into account. Page B [above] demonstrates this: it receives its relatively high score (10%) due to three incoming links from pages with moderate scores, ranging from 4% to 7%.”

There you have it. The Random Surfer Trick returns the most popular page whether a cycle exists or not.

If you’re still confused, or simply want to read more about Google’sPageRank, check out Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today’s Computers, by John MacCormick. It provides a more in-depth look at this, and eight other extraordinary algorithms behind every piece of technology we use today. The paperback is going for about $8 used on Amazon.

9 Algorithms

Excerpts from: MacCormick, John. “PageRank: The Technology That Launched Google.” Nine Algorithms That Changed The Future: The Ingenious Ideas That Drive Today’s Computers. Princeton: Princeton UP, 2012. 39+. Print.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s