Archive for September, 2008

Google PageRank Updates

As a way of marking the recent increase of TasteKid to PageRank 5 at the last Google PageRank update a couple of days ago, I have decided to write this short article about what I think these PageRank updates really are.

While most of us are familiar with the concept of PageRank, there is a certain degree of uncertainty surrounding the actual meaning of the 0 to 10 value displayed by the Google Toolbar. What does this value reflect? Why doesn’t an update of this value necessarily show up in user traffic? And, probably most important, why is it that in an environment as dynamic as the web, where the importance of many pages (viewed as the quantity and the quality of inbound links) can clearly change substantially in short periods of time, the Google PageRank updates are performed only once every few months?

To answer these questions, we first have to understand what PageRank is and how it is computed. The basic concept is probably best explained by Google itself:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important”.

This basic concept, translated into a formula, looks like this:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

meaning that the PageRank value for a page u depends on the PageRank value of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.

The way this formula works makes it clear that the PageRank of a particular web page is computed using the PageRank of all the pages that link towards that specific page. Now, the PageRank of each of those pages has to be already computed in order to perform such a task, but the structure of the web doesn’t provide the possibility to establish a sequence of pages in such a way that, for every evaluated page, all the pages that link towards it have already been evaluated before. In other words, the web has cycles (and it has lots of them).

For that reason, PageRank computations require several passes, called “iterations”, through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.

These steps can be viewed like this:

Step 0: all pages have equal PageRank.

Step 1: every page gets its PageRank computed. Note that at this point, all inbound links have the same quality, because all pages still have the same PageRank determined at step 0.

Step 2: every page gets its PageRank computed again. This time, the inbound links have different qualities, the ones determined at step 1.

These steps follow one another indefinitely, each pass producing a better approximation of the real (actually, theoretical) PageRank value.
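The steps above can be sketched in a few lines of Python. This is only a minimal illustration on a made-up four-page web (the page names and the `links` dictionary are assumptions for the example), using the simplified formula without any damping; it is certainly not how Google implements the computation at scale.

```python
# Hypothetical toy web: each page maps to the list of pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def iterate(ranks, links):
    """One pass: every page splits its current rank equally among its outbound links."""
    new_ranks = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks

# Step 0: all pages have equal PageRank.
ranks = {page: 1.0 / len(links) for page in links}
history = [ranks]

# Steps 1, 2, ...: recompute every page from the values of the previous step.
for step in range(1, 51):
    ranks = iterate(ranks, links)
    history.append(ranks)

print("after step 1: ", history[1])
print("after step 50:", ranks)
```

In this toy graph the values stop changing noticeably after a few dozen passes, which is the practical reason the process can be cut off at some point instead of running forever.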

There is also a damping factor involved that prevents the inflation of PageRank. An interesting (and beautifully simple) fact is that, despite the billions of pages on the web and all the rises and falls of thousands of websites each day, “the sum of all PageRanks is 1” [The Anatomy of a Large-Scale Hypertextual Web Search Engine, Brin, S.; Page, L.].
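To make the damping factor and the “sum is 1” property concrete, here is a hedged sketch. It assumes the normalized form of the formula that is commonly used in descriptions of the algorithm, PR(u) = (1 - d)/N + d · Σ PR(v)/L(v), with d = 0.85 (the value Brin and Page say they usually use). The toy graph, the function name `pagerank`, and the rule that a page without outbound links spreads its rank over all pages are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank with damping on a small link graph (illustrative only)."""
    pages = list(links)
    n = len(pages)
    # Start from equal ranks that sum to 1.
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Assumed convention: rank held by pages with no outbound links
        # is redistributed evenly over all pages.
        dangling = sum(ranks[page] for page in pages if not links[page])
        new_ranks = {
            page: (1.0 - damping) / n + damping * dangling / n
            for page in pages
        }
        for page in pages:
            for target in links[page]:
                new_ranks[target] += damping * ranks[page] / len(links[page])
        ranks = new_ranks
    return ranks

# Hypothetical toy web; page "D" has no outbound and no inbound links.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
result = pagerank(toy_web)
total = sum(result.values())
print(result, total)
```

With this normalization every pass hands out exactly as much rank as it collects, so the total stays at 1 no matter how the links are arranged, and even a page nobody links to keeps a small non-zero value.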

Considering these facts, I’ll go back to my initial questions and try to answer them based on my limited knowledge regarding this process.

Why is it that PageRank updates are performed only once every few months?

First of all, I would like to emphasize my belief that, even though PageRank updates are published every few months, they are actually happening all the time. Google constantly crawls the web and, while doing so, also extracts links and computes PageRank using the basic idea of the algorithm described above. This process takes time, though. While I’m sure Google puts a lot of effort into computing relevant PageRanks, a lot of web sites, sometimes situated in the more suburban areas of the Internet, are prone to sudden apparent shifts in PageRank caused by spikes of popularity, spamdexing or other events. Regardless of the computing power available, a certain amount of time is necessary in order to dampen the effects of such events and establish a more timeless, and thus more relevant, measure of the importance the web gives to one of its members, that is, that particular web page.

So what I’m saying is that I don’t think the published PageRanks are a snapshot of the actual instant PageRanks of all web sites. In order to keep the abnormalities that happen all the time in the more juvenile areas of the web from interfering with the relevancy of the published values, I think Google decided that it has to perform an evaluation over a longer period of time and then publish values that reflect, in a way, the average behavior of those websites in the given window.

In order to keep the provided values consistent, this analysis has to be performed on the same data set. Even though this data set (Internet states) extends over a longer period of time, it has to be the same for all the web sites involved, and this is why I think that PageRank updates are performed the way they are, once every few months, for all the websites at once.

What does the toolbar PageRank value reflect?

As I was saying, in theory, the sum of all PageRanks is 1. That means that the value of real PageRank for a specific web page is most often incredibly small, and I would suspect that in practice a much greater value than 1 is used in order to save all those exponent bits of the floating point representation of real numbers (it is probably possible to automatically determine and adjust this number to an optimal value). Regardless of the actual scale used for real PageRank, these values are then rescaled using (what is thought to be) a logarithmic scale, between 0 and 10. This logarithmic scale basically means that it requires many more incoming links to get from PageRank 4 to PageRank 5 than to get from PageRank 3 to PageRank 4 (where 5, 4 and 3 are values displayed by the toolbar).

Why the logarithmic scale? Well, just think that without it, in order to have a PageRank of 1, your site would need 1/10 of the PageRank of google.com (google.com has a PageRank of 10). This would mean that the majority of web sites would have a displayed PageRank of 0, which would make displaying the PageRank useless.
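The difference between the two scales can be illustrated with a small sketch. Everything in it is an assumption made for the sake of the example: the function names, the logarithm base of 6, and the idea of measuring each page against the top-ranked page. The real mapping used by the toolbar has not been published.

```python
import math

def linear_toolbar(raw, top):
    """Linear scale: a displayed value of 1 needs one tenth of the top page's rank."""
    return int(10 * raw / top)

def log_toolbar(raw, top, base=6.0):
    """Logarithmic scale: each step down corresponds to `base` times less raw rank.

    The base is a guess for illustration only.
    """
    if raw <= 0:
        return 0
    value = 10 + math.log(raw / top, base)
    # Clamp to the 0..10 range shown by the toolbar.
    return max(0, min(10, int(math.floor(value))))

top_rank = 1.0      # raw rank of the most important page (arbitrary unit)
small_site = 0.001  # a site with one thousandth of that rank

print(linear_toolbar(small_site, top_rank))  # displayed value on a linear scale
print(log_toolbar(small_site, top_rank))     # displayed value on a logarithmic scale
```

Under these assumptions the small site shows 0 on the linear scale but a mid-range value on the logarithmic one, which is exactly why a linear toolbar would be uninformative for almost everybody.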

So, in my opinion, the toolbar PageRank reflects the logarithmically scaled average (actually, measurement techniques more complex than a simple average are probably used) of the instant PageRank values over a given period of time in the past.

Why doesn’t an update of this value necessarily show up in user traffic?

Although the SERP performance of a web page is obviously driven by PageRank, there is no reason why publishing the historical PageRank behavior of that page over the last period of time should influence this performance. The Google PageRank updates are just passive reports, and, if a site receives an increase in PageRank at such an update, it has most probably already gradually felt that increase in SERP terms and thus in traffic.

Major Change in Taste Kid’s Results

Sixteen-year-old Ian wrote to Emmy a few hours ago:

What happened with the recommendations? They used to be so much better. Whenever I would look for Forrest Gump recommendations my 4 other favorite movies came up (The Shawshank Redemption, Gladiator, Braveheart, The Green Mile) on the list, proving that the system you were using before was working since those are my 5 favorite movies. Now only The Green Mile came up and movies like The Terminal and Cast Away are up top. Did you change how you did this? If so, the old way worked much better.

I would like to answer this feedback publicly and explain a little of what is happening.

Dear Ian, first of all I would like to thank you for using this service and, moreover, for providing me with this feedback. Yes, you are right, a major change happened a couple of days ago. You see, Taste Kid’s main goal is to be a discovery engine, to help people explore their taste by finding out about new bands, artists, movies and books. Many people who use Emmy (including myself) felt that the suggestions were becoming more and more oriented towards popular stuff. Your personal example is a great illustration: you were searching for “Forrest Gump”, and you were given recommendations like The Shawshank Redemption, Gladiator, Braveheart and The Green Mile. Although these are all great movies, it is very unlikely that you haven’t already seen them. I was myself searching for some of my favorite bands or movies, and even though the recommendations were good, I was rarely discovering something new.

To give an example, check the results for Metallica:

http://www.tastekid.com/ask?q=metallica (new/current way)

http://www.tastekid.com/ask?q=metallica&old=1 (old way)

As you can see, using the old approach, the second recommendation for “Metallica” was “Nirvana”. Now, I’m sure people trying to discover new bands somewhat similar to “Metallica” aren’t looking for “Nirvana”; whether they like this band or not, they have most certainly already heard of it.

Given all this, I have decided to make a change. I have changed the formula that determines the relevancy of each result in a way that encourages less popular items to achieve good scores. This is a big gamble for Taste Kid. Up until now, people found it hard to disagree with the results, but, at the same time, they were rarely discovering something new. Now, by promoting less popular items, there is a much bigger chance to screw up and to get reactions like yours. But I think it’s worth it. Since I made this change, I have personally discovered several interesting bands and movies that I hadn’t heard of before. I’m sure there are even better ways of computing this relevancy, but I feel that the new formula is a step forward.

So have some faith in it and play with it a little. While the new results have a bigger chance of containing things that you don’t like, at the same time there is a bigger chance of finding a few things that you will like and haven’t heard of before. And this is ultimately Emmy’s goal :)

Tooltip Update, YouTube API

I’ve just uploaded a few changes regarding the tooltip that appears when you hover over the “?” icons near Emmy’s results.

First of all, the script behind this tooltip now makes use of the YouTube Data API Protocol for retrieving relevant videos for each band or movie. Yes, I admit, I should have done this in the first place instead of parsing the HTML of the search results page. Besides being more elegant and definitely faster, this new approach solves a problem that I had been facing for quite a while: retrieving only embeddable movies. Some of the movies that Emmy was showing, if played, were displaying the “This movie is no longer available” message. I hated this message, and I’m sure some of the users found it very annoying, too. This was due to the fact that the YouTube user who uploaded that movie specified that the movie is not embeddable (cannot be embedded on other sites). Using the “format” custom parameter when performing the search with the Data API enabled me to search only for movies that CAN be embedded (this option isn’t available for searches performed on the YouTube site).

Another thing is that the tooltip no longer disappears when you move the mouse out of it, as it was designed to do until now. You have to click on the “Close [X]” link in order to close it. This way, a user can watch and listen to a music video/movie trailer while doing other things with his/her mouse, instead of having to keep it in the tooltip area.

I hope you will find these updates helpful :)

P.S.: I would also recommend SimplePie for parsing Atom feeds. Although I’ve run into a couple of problems, in the end it worked like a charm.

The Search Race

Taste Kid has been included in The Search Race. Every single pick is highly appreciated :)

Google’s Perception

In terms of the usual website content, the kind that Google appreciates, Taste Kid is a disaster. Not only does it have tens of thousands of pages that all look alike, but the content of these pages is nothing more than a list of internal links (the suggested items). I can’t blame Google if it finds that suspicious, as I’m sure that its bots find it hard to determine the value of these pages. One of my biggest fears was that Google would permanently consider Taste Kid a sort of link farm, trying to gain PageRank by having lots of pages that link randomly to each other (yes, I do think that a page never has a PageRank value of 0, and, to a certain extent, having many pages that link to one another will increase your overall PageRank, but that’s another discussion).

Luckily for me, Google hasn’t been that drastic. Despite the lack of classic original content, it constantly crawls and indexes Taste Kid’s pages. I suppose, after all, that the very enumeration of resources (bands, movies, books), which is unique for every page, can be seen as a type of original content, and I’m glad Google perceives it that way. I just hope it won’t change its opinion one day.