  Wednesday, January 07, 2009



The Google Pagerank Algorithm and How It Works
Ian Rogers, IPR Computing Ltd.

ian@iprcom.com

Introduction
PageRank is a topic much discussed by Search Engine Optimisation (SEO) experts. At the heart of PageRank is a mathematical formula that seems scary to look at but is actually fairly simple to understand.

Despite this many people seem to get it wrong! In particular “Chris Ridings of www.searchenginesystems.net” has written a paper entitled “PageRank Explained: Everything you’ve always wanted to know about PageRank”, pointed to by many people, that contains a fundamental mistake early on in the explanation! Unfortunately this means some of the recommendations in the paper are not quite accurate.

By showing code to correctly calculate real PageRank I hope to achieve several things in this response:

Clearly explain how PageRank is calculated.
Go through every example in Chris’ paper, and add some more of my own, showing the correct PageRank for each diagram. By showing the code used to calculate each diagram I’ve opened myself up to peer review - mostly in an effort to make sure the examples are correct, but also because the code can help explain the PageRank calculations.
Describe some principles and observations on website design based on these correctly calculated examples.

Any good web designer should take the time to fully understand how PageRank really works - if you don’t then your site’s layout could be seriously hurting your Google listings!

[Note: I have nothing in particular against Chris. If I find any other papers on the subject I’ll try to comment evenly]

How is PageRank Used?
PageRank is one of the methods Google uses to determine a page’s relevance or importance. It is only one part of the story when it comes to the Google listing, but the other aspects are discussed elsewhere (and are ever changing) and PageRank is interesting enough to deserve a paper of its own.

PageRank is also displayed on the toolbar of your browser if you’ve installed the Google toolbar (http://toolbar.google.com/). But the Toolbar PageRank only goes from 0 – 10 and seems to be something like a logarithmic scale:
Toolbar PageRank (log base 10) : Real PageRank
0 : 0 - 10
1 : 100 - 1,000
2 : 1,000 - 10,000
3 : 10,000 - 100,000
4 : and so on...

We can’t know the exact details of the scale because, as we’ll see later, the maximum PR of all pages on the web changes every month when Google does its re-indexing! If we presume the scale is logarithmic (although there is only anecdotal evidence for this at the time of writing) then Google could simply give the highest actual PR page a toolbar PR of 10 and scale the rest appropriately.

Also the toolbar sometimes guesses! The toolbar often shows me a Toolbar PR for pages that I’ve only just uploaded and that cannot possibly be in the index yet!

What seems to be happening is that the toolbar looks at the URL of the page the browser is displaying and strips off everything down to the last “/” (i.e. it goes to the “parent” page in URL terms). If Google has a Toolbar PR for that parent then it subtracts 1 and shows that as the Toolbar PR for this page. If there’s no PR for the parent it goes to the parent’s parent, but subtracting 2, and so on all the way up to the root of your site. If it can’t find a Toolbar PR to display in this way, that is if it doesn’t find a page with a real calculated PR, then the bar is greyed out.
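The guessing behaviour described above can be sketched in a few lines of Python. This is only an illustration of the observed behaviour, not Google's actual code: the function name guess_toolbar_pr, the known_pr dictionary of already-indexed URLs, and the choice to stop the guess at 0 rather than go negative are all my own assumptions.

```python
from urllib.parse import urlsplit


def guess_toolbar_pr(url, known_pr):
    """Guess a Toolbar PR for url by walking up to its URL "parents".

    known_pr is a hypothetical dict mapping URLs to Toolbar PR values
    that have really been calculated. Returns None when no ancestor has
    a value (the toolbar would be greyed out).
    """
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}/"
    current = root[:-1] + (parts.path or "/")
    levels_up = 0
    while True:
        if current in known_pr:
            # Subtract 1 per level climbed; flooring at 0 is an assumption.
            return max(known_pr[current] - levels_up, 0)
        if current == root:
            return None  # reached the site root without finding a PR
        # Strip everything after the last "/" to get the parent URL.
        trimmed = current.rstrip("/")
        current = trimmed[: trimmed.rfind("/") + 1]
        levels_up += 1


known = {"http://example.com/a/": 5}
print(guess_toolbar_pr("http://example.com/a/b/page.html", known))  # 3
print(guess_toolbar_pr("http://example.com/other.html", known))     # None
```

In the first call the parent (http://example.com/a/b/) has no value, so the sketch climbs to the parent's parent and subtracts 2 from its Toolbar PR of 5.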

Note that if the Toolbar is guessing in this way, the Actual PR of the page is 0 - though its PR will be calculated shortly after the Google spider first sees it.

PageRank says nothing about the content or size of a page, the language it’s written in, or the text used in the anchor of a link!

Definitions
I’ve started to use some technical terms and shorthand in this paper. Now’s as good a time as any to define all the terms I’ll use:

PR: Shorthand for PageRank: the actual, real, page rank for each page as calculated by Google. As we’ll see later this can range from 0.15 to billions.

Toolbar PR: The PageRank displayed in the Google toolbar in your browser. This ranges from 0 to 10.

Backlink: If page A links out to page B, then page B is said to have a “backlink” from page A.

That’s enough of that, let’s get back to the meat…

So what is PageRank?
In short PageRank is a “vote”, by all the other pages on the Web, about how important a page is. A link to a page counts as a vote of support. If there’s no link there’s no support (but it’s an abstention from voting rather than a vote against the page).

Quoting from the original Google paper, PageRank is defined like this:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.

But that’s not too helpful, so let’s break it down into sections.

PR(Tn) - Each page has a notion of its own self-importance. That’s “PR(T1)” for the first page in the web all the way up to “PR(Tn)” for the last page.
C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links. The count, or number, of outgoing links for page 1 is “C(T1)”, “C(Tn)” for page n, and so on for all pages.
PR(Tn)/C(Tn) - So if our page (page A) has a backlink from page “n”, the share of the vote page A will get is “PR(Tn)/C(Tn)”.
d(... - All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is “damped down” by multiplying it by 0.85 (the factor “d”).
(1 - d) - The (1 – d) bit at the beginning is a bit of probability math magic so the “sum of all web pages' PageRanks will be one”: it adds in the bit lost by the d(.... It also means that if a page has no links to it (no backlinks) even then it will still get a small PR of 0.15 (i.e. 1 – 0.85). (Aside: the Google paper says “the sum of all pages” but they mean “the normalised sum”, otherwise known as “the average” to you and me.)
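To make the breakdown concrete, here is a minimal Python sketch of one application of the formula to a single page. The function name page_rank_once and the example numbers are my own and are not taken from the Google paper; each backlink is given as a pair of PR(T) and C(T).

```python
def page_rank_once(backlinks, d=0.85):
    """Apply PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) once.

    backlinks is a list of (PR(T), C(T)) pairs, one per page linking to A.
    """
    # Each linking page passes on an equal share of its PR per outgoing link.
    vote = sum(pr / count for pr, count in backlinks)
    # Damp the total vote by d and add the (1 - d) base amount.
    return (1 - d) + d * vote


# A page with no backlinks still gets 1 - 0.85.
print(round(page_rank_once([]), 4))                    # 0.15
# Backlinks from two pages with PR 1.0, having 2 and 4 outgoing links.
print(round(page_rank_once([(1.0, 2), (1.0, 4)]), 4))  # 0.7875
```

In the second case the shares are 1.0/2 + 1.0/4 = 0.75, so the result is 0.15 + 0.85 * 0.75 = 0.7875.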

How is PageRank Calculated?
This is where it gets tricky. The PR of each page depends on the PR of the pages pointing to it. But we won’t know what PR those pages have until the pages pointing to them have their PR calculated and so on… And when you consider that page links can form circles it seems impossible to do this calculation!

But actually it’s not that bad. Remember this bit of the Google paper:

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.

What that means to us is that we can just go ahead and calculate a page’s PR without knowing the final value of the PR of the other pages. That seems strange but, basically, each time we run the calculation we’re getting a closer estimate of the final value. So all we need to do is remember each value we calculate and repeat the calculations lots of times until the numbers stop changing much.
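The repeat-until-stable idea can be sketched as follows. This is a minimal illustration under my own assumptions rather than Google's implementation: the links dictionary format, the starting guess of 0.0, the tolerance and the iteration cap are all hypothetical choices, and each link is assumed to be listed only once per page.

```python
def pagerank(links, d=0.85, start=0.0, tol=1e-6, max_iter=1000):
    """Iterate the PageRank formula until the values stop changing much.

    links maps each page to the list of pages it links out to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pr = {page: start for page in pages}  # any starting guess will do
    for _ in range(max_iter):
        new_pr = {}
        for page in pages:
            # Add up the share of the vote from every page linking here.
            vote = sum(pr[src] / len(outs)
                       for src, outs in links.items() if page in outs)
            new_pr[page] = (1 - d) + d * vote
        biggest_change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr  # remember each value for the next pass
        if biggest_change < tol:
            break
    return pr


# Two pages that link only to each other.
result = pagerank({"A": ["B"], "B": ["A"]})
print({page: round(value, 3) for page, value in sorted(result.items())})
# {'A': 1.0, 'B': 1.0}
```

Starting from a guess of 0, each pass moves both pages closer to 1.0, which is also the average PR the formula is meant to preserve when every page has at least one outgoing link.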
