How Scribd runs 150,000,000 polygon intersections a day (scribd.com)
138 points by matthiaskramm on June 25, 2010 | 35 comments



I think a lot of people (me included) just thought of Scribd as a YouTube of documents -- taking advantage of unlicensed material to juice up PageRank, and then somehow converting that into a revenue stream. It also seemed kind of annoying to launch a Flash player just to view a PDF in a now double-scrolling window. However, reading this technical post reminds me of the Y Combinator interviewer who asked, "Where's the rocket science?" Clearly, we're seeing some smart developers tackle a tough problem for a broad audience. I think they have a bright future!


I'd like Scribd a lot more if they'd make downloading the document easy. My general process of opening a scribd link is:

1. Oh, an interesting looking article! Hopefully it will be in HTML or PDF so I can read it!

2. (browser tab freezes) Oh shit, it says "scribd.com" in the URL bar.

3. Interface finally loads; the fonts are broken, scrolling doesn't work, and it's only barely readable.

4. I begin the frantic search for the "download" button. On most sites this is easy to spot and use, but on scribd it seems to move around from day to day. Sometimes it's big and green, others it's small, white, and hidden somewhere on the page.

5. Scribd demands I log in to download; it doesn't support OpenID, and I can never remember which throwaway account/password I used, so I just register again.

6. Finally, it lets me download and read the document.

----

How about this instead? Scribd should offer a "direct link" to the PDF, and then it would be easy for users to submit these links to external websites (HN, Reddit, etc).


Or, people could just put PDF files on the web, and search engines could index them.


Not everybody has personal hosting; sites like YouTube or Scribd are nice because they'll worry about the costs of DNS, bandwidth, storage, backups, etc without any charge.


Megaupload and most similar services are easier to use for file downloading than Scribd. Which does look like a benchmark of some kind...


I disagree; Megaupload, Rapidshare, Mediafire, etc are basically black holes in the Internet. Files go in, but what comes back out is not usable.

First, most of them have wait timers. After opening their page, the user must wait 60-120 seconds before downloading.

Second, they're plastered with ads. And not Google text ads, but awful '90s-style popups / popunders / flashing seizure GIFs. Ad blockers are useless, because the sites are written to prevent the download link from working when ads aren't visible.

Third, most (all?) of them don't handle Unicode correctly. Upload a file in anything except US ASCII, and you'll be lucky if even the file extension comes across intact. For example, Преве́д.zip could come out "Ð�Ñ�евеÌ�д.zip" or "Преве́" or nearly anything else.

Lastly, sites like Rapidshare are known for imposing download limits, such as "2 files per hour" or "max 50 Kb/s". This is hugely annoying when trying to download a half-dozen 30MB files.
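
A minimal sketch of the Unicode failure described above, assuming the host stores the UTF-8 bytes of the file name but later decodes them with a single-byte encoding (Latin-1 here, chosen only for illustration; the actual behavior of any particular site is not known):

```python
# Illustration only: a UTF-8 file name misread as a single-byte encoding.
name = "Преве́д.zip"

# What the uploader's browser sends: the name as UTF-8 bytes.
utf8_bytes = name.encode("utf-8")

# A server that assumes Latin-1 turns every byte into its own character.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible only while no byte has been dropped or replaced.
recovered = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(recovered == name)
```

Once a site replaces the bytes it cannot display, as in the garbled name quoted above, the round trip no longer works and the original name is lost.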


Depends on the site, but taking MU and going point by point:

- first - wait timers: Megaupload has a timer which takes ~1min. to count down - that's quicker for me than dealing with scribd's registration.

- second - ads: I haven't noticed any; I use adblock and have never seen an ad on Megaupload. Not sure what the current status of Scribd is, but disabling scripts brought out a lot of spam words years ago (maybe they don't use them anymore), so the annoyance is pretty much the same (or isn't there, depending on your view).

- third - unicode: We're talking about PDFs. They have their own tags, for example for the title, which makes the file name pretty much irrelevant. Unless you're downloading a book with an "Author - Title.pdf" name, you're most likely going to run into "some_serial_number", "ModelNumberDescription_code" for manuals, "thesis.pdf" or some other internal convention. Actually, zipping the file lets you see the original name.

- fourth - limits: Megaupload has none that I know of and I get >200kB/s most of the time.

So it might be very subjective, but even considering all the crap download pages give us, I'd take a Megaupload-ed .pdf over Scribd any time.

+ After writing this, I got back to scribd and tried to download something - it's MUCH better than it was before. I only had to scroll through 2 pages and find a small link at the bottom and it opened a new window. Right now, for me it's only as bad as MU.


I don't really see how Scribd will be able to maintain any appeal if Chrome polishes up their in-browser PDF rendering, which basically loads instantaneously and seems to render things fine. (The HTML transmogrification is cool, but if PDFs are fast, why do I need it?)


My point was that I don't know whether they have any legitimate appeal now, but they seem to be smart, and they're building an IP portfolio in an important space, so they might be able to switch gears or sell their IP. Heck, they might see the writing on the wall, and maybe that's why they're writing these technical blogs: to gain exposure for their IP and technology.


And Safari does this already.


JavaScript/HTML seems to have caught up to Display PostScript at last. Nice.


ok that's crazy. Is scribd becoming the new pdf (I mean that in a good way)?


The only thing I could applaud more would be a new (and simpler) cross platform document format (for asset encapsulation and offline viewing) that could be transformed to and from HTML on a whim. Scribd <-> My Screen <-> My Printer. To death with PDF.

Come to think of it HTML5/CSS3 + an open archive/compression format would suit this purpose nicely.


What is your problem with PDF?

Almost any browser other than lynx has a built-in PDF reader, and PDF can be created by just about anything.

Finally, PDF is a format that allows you to ensure that no matter where it is printed, it is exactly as you wanted it to be.


The problem with pdf is that everything about it is clunky and slow. I didn't decide to groan whenever I see that something I want is trapped inside a pdf; years of annoyance just built that up as a reflex.

Edit: actually, there is an exception, and you mentioned it: printing a pdf, once you have it open, is almost always a good experience. Too bad it's the thing I want to do least often.


You've just described ePub.


I'm looking for visual and layout consistency across all mediums (screen, print, etc) and in my experience ePub isn't suited for that. My apology for the lack of clarity.


A web service replacing a document format?


Probably not until it doesn't look terrible (on Windows 7 with Chrome).


Well, that's about as many as a modern 3D game engine executes in one second, so, meh.


I wasn't going to say anything, but you're being upmodded like crazy, so...

It's a complete apples to oranges comparison. Scribd is performing quite complex logic with an emphasis on correctness not speed. I feel like you're dismissing an interesting article for the sake of an amusing one liner, and that's pretty much the antithesis of HN.


> I feel like you're dismissing an interesting article for the sake of an amusing one liner, and that's pretty much the antithesis of HN.

It is a nicely-written and illustrated article, but it describes a problem that hasn't been considered 'interesting' for decades and can, in any case, be tackled with a caching scheme. Why would they need to run the same intersection logic over and over, when there's only a finite number of glyphs to render in any given document?


> but it describes a problem that hasn't been considered 'interesting' for decades

Actually, polygon operations on grids are still an active research area, with the latest papers on the topic less than two weeks old: http://www.sci.utah.edu/socg2010agenda.html

> Why would they need to run the same intersection logic over and over [...]?

We usually don't. It basically depends on the context in which the glyphs appear on the page. For a standard, say, LaTeX document consisting of mostly text and without weird graphic operations taking place around or on top of the text, a given glyph is just processed once.


Absolutely. Modern graphics card hardware can easily process a few hundred million triangles per second. In all fairness, however, that has nothing to do with polygon intersection. Drawing a triangle on the screen with a z-buffer check is something quite different from actually computing an intersection polygon from two (multiply connected) input polygons.
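
To make the difference concrete, here is a sketch of computing an intersection polygon in the easiest possible case: Sutherland-Hodgman clipping against a convex, counter-clockwise clip polygon. It is a deliberately simpler technique than what the article needs (it does not handle concave or multiply connected inputs), and the function names are chosen for illustration only:

```python
def clip_convex(subject, clip):
    """Clip `subject` against the convex, counter-clockwise polygon `clip`."""

    def inside(p, a, b):
        # True if p lies on or to the left of the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Point where the line through p, q crosses the line through a, b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        denom = dx1 * dy2 - dy1 * dx2
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        input_list, output = output, []
        if not input_list:
            break  # nothing left to clip
        prev = input_list[-1]
        for cur in input_list:
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    output.append(intersect(prev, cur, a, b))
                output.append(cur)
            elif inside(prev, a, b):
                output.append(intersect(prev, cur, a, b))
            prev = cur
    return output


def area(poly):
    """Shoelace formula for the area of a simple polygon."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2


square_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
square_b = [(1, 1), (3, 1), (3, 3), (1, 3)]
overlap = clip_convex(square_a, square_b)
print(overlap, area(overlap))
```

Unlike a z-buffer test, the result here is a new list of vertices, and the general case has to track many more edge cases than this convex version does.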


CamperBob's flippant comment, however, got me thinking as to whether GPUs could be used to speed up this process, even if the result were a bit "fuzzy." Anyone with GPU chops have any opinions on this?


Some modern GPUs support double precision floats, so accuracy would not be an issue. GPUs are practically built for this kind of computation. However, there are probably two things holding them back:

1) GPU development requires specific developer experience. Making performant GPU code is an even nastier and less intuitive problem than making performant CPU code. A naive implementation would be shockingly inefficient.

2) Leasing GPU hardware in a datacenter is very rare. You'd have to do it at the office, or build your own servers and install them yourself at a colo. Lots of time and effort.

Even if the GPU solution was 10x faster (it could be much more, but it depends on how much the CPU, disk, network is a bottleneck), if you're talking about reducing $50k in computer time to $5k in computer time, it's almost certainly not worth it. If you're talking about $2m to $200k, that's a completely different matter.


Today, relying on a GPU solution translates to vendor lock-in: it makes you much more dependent on a particular set of hardware and even a particular OS/driver stack.


I have had a small amount of success implementing GPGPU programs using webGL. If you code your solution to webGL you are able to offload OS/driver stack concerns to the web browser implementers.


This is very interesting. Do you have links to any examples?

This could be very useful for running certain algorithms in web apps...


Here are two examples I've done. The first is matrix multiplication on the GPU. The second is a cellular automaton like simulation (the falling sand game popular a few years ago). Each incremental game state is calculated in a shader program and then used as input for the next increment.

Below I've linked github urls and two blog posts that were done about them. Unfortunately right now webGL is kind of a moving target and I doubt these examples still work out of the box. However, if you are interested in GPGPU on webGL I would encourage you to get involved as I think there are some tweaks to the standard that would make GPGPU life easier.

Matrix Multiply http://learningwebgl.com/blog/?p=1828 http://github.com/bunions1/matrixMultiplyGpu

Falling Sand http://learningwebgl.com/blog/?p=1471 http://github.com/bunions1/fallingsand-webgl
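
The "each state is the input for the next increment" pattern can be sketched on the CPU as follows. This is not the linked shader code; it is a minimal Python stand-in with made-up names, in which every output cell is computed only from the previous grid, the way a fragment shader would read the previous state as a texture:

```python
EMPTY, SAND = 0, 1


def next_cell(grid, r, c):
    """New value of cell (r, c), read only from the previous state."""
    here = grid[r][c]
    above = grid[r - 1][c] if r > 0 else EMPTY
    below_free = r + 1 < len(grid) and grid[r + 1][c] == EMPTY
    if here == SAND:
        # A grain leaves this cell only if the cell below is free.
        return EMPTY if below_free else SAND
    # An empty cell is filled by a grain falling in from above.
    return SAND if above == SAND else EMPTY


def step(grid):
    """Build a fresh grid; the input grid is never modified."""
    return [[next_cell(grid, r, c) for c in range(len(grid[0]))]
            for r in range(len(grid))]


state = [[SAND, EMPTY],
         [EMPTY, EMPTY],
         [EMPTY, SAND]]
for _ in range(3):
    state = step(state)  # the output of one pass is the input of the next
print(state)
```

Because no cell writes into its neighbours, all cells of one step can be computed independently, which is what makes this kind of simulation a good fit for a shader.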


That's really not true - we've been doing cloud GPU computing for a few years (all in OpenGL shaders and OpenCL) - we run in Windows, Linux, MacOS and on both NVIDIA and ATI.


This is exactly why I like asking naive questions on Hacker News - thanks!


We use GPUs for all of our computer vision and image processing. 10X is definitely a low-ball for what they're doing here; it would be much faster.

The important bit is at the end where they say "10% of the time is spent on polygon intersection" and the rest on image processing. We do GPU image processing on the cloud, and this is likely where Scribd could obtain a significant speedup: many of our image processing operations run hundreds of times faster than on the CPU.
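
A back-of-the-envelope check of that argument using Amdahl's law. The 10% / 90% split is the figure quoted above; the 10x and 100x factors are assumptions picked for illustration, and the sketch assumes the whole remaining 90% is image processing that can be accelerated:

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: speed up `fraction` of the runtime by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Accelerating only the polygon intersection stage (10% of the time) by 10x.
polygon_only = overall_speedup(0.10, 10)

# Accelerating only the image processing stage (90% of the time) by 100x.
image_only = overall_speedup(0.90, 100)

print(f"polygon stage only: {polygon_only:.2f}x overall")
print(f"image stage only:   {image_only:.2f}x overall")
```

Under these assumptions the first case gives roughly 1.1x overall and the second roughly 9.2x, which is why the image processing stage is the more interesting target.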


Hmmmm.... GPUs in the cloud, anyone?


However, a modern game engine does perform a large number of polygon intersections for the purpose of collision detection. Of course, this is generally very rough for the purpose of performance.



