Posted in Lessons Learned, Milestones, Techy on September 28th, 2008 by Andrei Oghina
To mark TasteKid's recent rise to PageRank 5 in the Google PageRank update a couple of days ago, I have decided to write this short article on what I think these PageRank updates really are.
While most of us are familiar with the concept of PageRank, there is a certain degree of uncertainty surrounding the actual meaning of the 0 to 10 value displayed by the Google Toolbar. What does this value reflect? Why doesn’t an update of this value necessarily show up in user traffic? And, probably most important, why is it that in an environment as dynamic as the web, where the importance of many pages (viewed as the quantity and quality of inbound links) can clearly change substantially in short periods of time, Google PageRank updates are performed only once every few months?
To answer these questions we first have to understand what PageRank is and how it is computed. The basic concept is probably best explained by Google itself:
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important”.
This basic concept, translated into a formula, looks like this:
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$
meaning that the PageRank value of a page u depends on the PageRank value of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.
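To make the formula concrete, here is a minimal Python sketch of a single evaluation for one page. The three-page web, the page names and the starting values are all made up for illustration; only the rule itself (sum PR(v)/L(v) over the pages v linking to u) comes from the description above.

```python
def pagerank_of(u, links, pr):
    """Apply the simplified formula once for page u."""
    # B_u: every page v whose outbound links include u.
    backlinks = [v for v, targets in links.items() if u in targets]
    # Each v contributes its own PageRank divided by its number of links L(v).
    return sum(pr[v] / len(links[v]) for v in backlinks)


# Hypothetical web: each key links to the pages in its list.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
# Assumed current PageRank values (illustrative, not real data).
pr = {"A": 0.5, "B": 0.25, "C": 0.25}

pr_c = pagerank_of("C", links, pr)  # A contributes 0.5 / 2, B contributes 0.25 / 1
print(pr_c)
```

Page C is linked from A (which splits its vote between two links) and from B (which gives its whole vote to C), so both contribute 0.25.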
The formula makes it clear that the PageRank of a particular web page is computed from the PageRank of all the pages that link to it. The PageRank of each of those pages would therefore have to be computed first, but the structure of the web does not allow us to order the pages so that, for every page being evaluated, all the pages that link to it have already been evaluated. In other words, the web has cycles (and it has lots of them).
For that reason, the PageRank computations require several passes, called “iterations”, through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.
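The ordering problem can be shown with a small sketch using Python's standard graphlib module (Python 3.9 or later). The two tiny link graphs are hypothetical; the point is only that a "linkers first" evaluation order exists for the acyclic one and not for the one with a cycle.

```python
from graphlib import TopologicalSorter, CycleError


def evaluation_order(backlinks):
    """Return an order in which every page comes after all pages linking to it,
    or None when the links form a cycle and no such order exists."""
    try:
        return list(TopologicalSorter(backlinks).static_order())
    except CycleError:
        return None


# Each page maps to the set of pages linking to it (hypothetical data).
with_cycle = {"A": {"C"}, "B": {"A"}, "C": {"A", "B"}}        # A -> B -> C -> A
without_cycle = {"A": set(), "B": {"A"}, "C": {"A", "B"}}

print(evaluation_order(with_cycle))     # no valid order
print(evaluation_order(without_cycle))  # A, then B, then C
```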
These steps can be viewed like this:
Step 0: all pages have equal PageRank.
Step 1: every page gets its PageRank computed. Note that at this point, all inbound links have the same quality, because all pages still have the same PageRank determined at step 0.
Step 2: every page gets its PageRank computed again. This time, the inbound links have different qualities, the ones determined at step 1.
…
These steps succeed each other indefinitely, thus creating a better and better approximation of the real (actually, theoretical) PageRank value.
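The steps above can be sketched in a few lines of Python. This is only an illustration on a made-up three-page web, using the simplified formula without any damping, and it assumes every page has at least one outbound link.

```python
def iterate_pagerank(links, steps):
    """Repeat the simplified update `steps` times, starting from equal PageRank."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}   # step 0: all pages equal
    history = [pr]
    for _ in range(steps):
        new_pr = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                # v passes PR(v) / L(v) to every page it links to
                new_pr[u] += pr[v] / len(targets)
        pr = new_pr
        history.append(pr)
    return history


# Hypothetical web: each key links to the pages in its list.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
history = iterate_pagerank(links, 50)

print(history[1])    # after step 1 all votes still carry the same weight
print(history[-1])   # after many steps the values have settled
```

On this small graph the values settle near 0.4 for A and C and 0.2 for B, and further steps barely change them, which is the "better and better approximation" described above.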
There is also a damping factor involved that prevents the inflation of PageRank. An interesting (and beautifully simple) fact is that, despite the billions of pages on the web and the rises and falls of thousands of websites each day, “the sum of all PageRanks is 1” [The Anatomy of a Large-Scale Hypertextual Web Search Engine, Brin, S.; Page, L.].
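Here is a small sketch of how a damping factor can enter the computation while the total stays at 1. It assumes the normalized form of the update, PR(u) = (1 - d)/N + d * sum of PR(v)/L(v), with d = 0.85 as an assumed value, and it assumes every page has at least one outbound link. The four-page web is made up.

```python
def damped_pagerank(links, d=0.85, steps=50):
    """Iterate the normalized, damped update on a small link graph."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(steps):
        # Every page starts each round with the same small base share.
        new_pr = {p: (1.0 - d) / n for p in pages}
        for v, targets in links.items():
            for u in targets:
                # Only the fraction d of each vote is passed along.
                new_pr[u] += d * pr[v] / len(targets)
        pr = new_pr
    return pr


# Hypothetical web; D links out but nobody links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pr = damped_pagerank(links)
total = sum(pr.values())

print(pr)
print(total)
```

Page D has no inbound links, so it keeps only the base share (1 - d)/N, while the total over all four pages remains 1 after every round.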
Considering these facts, I’ll go back to my initial questions and try to answer them based on my limited knowledge regarding this process.
Why is it that PageRank updates are performed only once every few months?
First of all, I would like to emphasize my belief that, even though PageRank updates are published every few months, they are actually happening all the time. Google constantly crawls the web and, while doing so, also extracts links and computes PageRank using the basic idea of the algorithm described above. This process takes time, though. While I’m sure Google puts a lot of effort into computing relevant PageRanks, a lot of websites, sometimes situated in the more suburban areas of the Internet, are prone to sudden apparent shifts in PageRank caused by spikes of popularity, spamdexing or other events. Regardless of the computing power available, a certain amount of time is necessary to damp out the effects of such events and establish a more lasting, and thus more relevant, measure of the importance the web gives to that particular page.
So what I’m saying is that I don’t think the published PageRanks are a snapshot of the actual instant PageRanks of all websites. To keep the abnormalities that happen all the time in the more juvenile areas of the web from interfering, as far as possible, with the relevance of the published values, I think Google decided that it has to perform an evaluation over a longer period of time and then publish values that in some way reflect the average behavior of those websites in the given window.
To keep the provided values consistent, this analysis has to be performed on the same data set. Even though this data set (Internet states) extends over a longer period of time, it has to be the same for all the websites involved, and this is why I think PageRank updates are performed the way they are: once every few months, for all websites at once.
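To illustrate the idea (and only the idea, since this is my guess and not a documented Google procedure), here is a toy Python sketch comparing a snapshot taken during a popularity spike with a summary over a longer window. The daily values are invented, and the mean and median are just two simple stand-ins for whatever measurement might really be used.

```python
from statistics import mean, median

# Invented daily "instant" PageRank values for one page,
# with a short popularity spike on the fifth and sixth day.
instant = [0.010, 0.011, 0.010, 0.012, 0.050, 0.048, 0.011, 0.010, 0.011, 0.010]


def windowed_summary(samples, window):
    """Summarize the last `window` samples with a plain mean and a median."""
    recent = samples[-window:]
    return mean(recent), median(recent)


snapshot_at_spike = instant[4]                    # what a snapshot on the spike day would show
avg, mid = windowed_summary(instant, window=10)   # what a ten-day window would show

print(snapshot_at_spike, avg, mid)
```

The snapshot reports 0.050, while the windowed mean is about 0.018 and the median 0.011, so the spike has far less influence on a value computed over the whole window.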
What does the toolbar PageRank value reflect?
As I was saying, in theory, the sum of all PageRanks is 1. That means that the real PageRank value of a specific web page is most often incredibly small, and I would suspect that in practice a total much greater than 1 is used in order to save all those exponent bits of the floating point representation of real numbers (it is probably possible to determine and adjust this number automatically to an optimal value). Regardless of the actual scale used for real PageRank, these values are then rescaled to a range between 0 and 10 using what is thought to be a logarithmic scale. This logarithmic scale basically means that it takes many more incoming links to get from PageRank 4 to PageRank 5 than from PageRank 3 to PageRank 4 (where 5, 4 and 3 are values displayed by the toolbar).
Why the logarithmic scale? Well, just think that without it, in order to have a PageRank of 1, your site would need 1/10 of the PageRank of google.com (google.com has a PageRank of 10). This would mean that the majority of websites would have a displayed PageRank of 0, which would make displaying the PageRank useless.
So, in my opinion, the toolbar PageRank reflects the logarithmically scaled average (actually, measurement techniques more complex than a simple average are probably used) of the instant PageRank values over a given period of time in the past.
Why doesn’t an update of this value necessarily show up in user traffic?
Although the SERP performance of a web page is obviously driven by PageRank, there is no reason for the publication of that page’s historical PageRank behavior over the last period to influence this performance. The Google PageRank updates are just passive reports, and if a site receives an increase in PageRank in such an update, it has most probably already felt that increase gradually in SERP terms, and thus in traffic.
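A logarithmic toolbar scale could look something like the following Python sketch. The base of 10 and the minimum value of 1e-10 are pure assumptions chosen for readability (the real scale is not public), and both function names are hypothetical.

```python
def threshold(k, min_pr=1e-10, base=10.0):
    """Smallest real PageRank that would be shown as toolbar value k
    under the assumed scale (each step multiplies the requirement by `base`)."""
    return min_pr * base ** k


def toolbar_value(real_pr, min_pr=1e-10, base=10.0):
    """Map a real PageRank to 0..10 by counting how many thresholds it reaches."""
    k = 0
    while k < 10 and real_pr >= threshold(k + 1, min_pr, base):
        k += 1
    return k


gap_3_to_4 = threshold(4) - threshold(3)   # extra real PageRank needed to go from 3 to 4
gap_4_to_5 = threshold(5) - threshold(4)   # extra real PageRank needed to go from 4 to 5

print(gap_4_to_5 / gap_3_to_4)
print(toolbar_value(threshold(5)), toolbar_value(threshold(5) * 0.99))
```

Under these assumptions the step from 4 to 5 requires ten times as much additional real PageRank as the step from 3 to 4, which is the behavior described above.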
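The effect of a plain linear scale is easy to see in a short sketch. The real PageRank values below are invented, with the first one standing in for the top page; `linear_toolbar` is a hypothetical helper, not anything Google has published.

```python
def linear_toolbar(real_pr, max_pr):
    """Toolbar value under a linear scale: each step is one tenth of the top value."""
    return min(10, int(10 * (real_pr / max_pr)))


max_pr = 1e-2                                   # assumed real PageRank of the top page
sample = [1e-2, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]   # invented values spread over many magnitudes
values = [linear_toolbar(pr, max_pr) for pr in sample]

print(values)
```

Only the top page gets a non-zero value. Every other page in the sample, even the one with a hundredth of the top page's PageRank, is displayed as 0.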


