Hacker News new | past | comments | ask | show | jobs | submit login
The Google PageRank Algorithm in 126 Lines of Python (kraeutler.net)
50 points by hhm on Jan 30, 2008 | hide | past | favorite | 8 comments



"I was surprised by the simplicity of the math underlying the google PageRank algorithm, and the ease with which it seemed to be efficiently implementable"

It's actually not all that easy to implement this stuff efficiently when n grows to any sort of useful size. E.g. the article uses an n x n adjacency matrix, so if we take n = 1 million pages, we need 4TB just to represent it, assuming no overhead. (Adjacency lists help here.)

See http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en....

(I hacked together a script to compute this on MediaWiki wikis a while ago, by examining the DB directly -- http://collison.ie/code/pagerank.rb)


As the author of the above article, I'd like to point out that the adjacency matrix is extremely sparse. If you use a sparse representation of the link matrix (as I have done -- your claim is incorrect), its space usage will scale as O(N), and not as O(N^2) as you suggest.

By that of course I don't mean to say you can use this script to build your own google.... but the problems lie elsewhere.


On of our professors used PageRank as an example to introduce matrices. He explained the algorithm in class. (Not this particular python implemention though.)


The original algorithm, I presume... I would think they're up to millions of lines of python/c++ by now.


Next: "Linux in 62 lines of code!!!"


Bah, PG could write that in Lisp on the back of a napkin.


With arc, he'd need just a postage stamp.

http://reddit.com/r/programming/info/6710p/comments/c0312wv


A bit more for the unicode support, though.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: