One trillion edges: graph processing at Facebook scale [pdf]

Smerity · on Aug 13, 2015

If anyone is interested in graph processing at scale, the largest Web Data Commons Hyperlink Graph[1] (created from Common Crawl[2] data) is 3.5 billion web pages and 128 billion hyperlinks. To my knowledge, it's the largest freely available real world based graph dataset.

Numerous graph systems have been made that can scale to this data, frequently on just a single powerful node, including FlashGraph[3] (using SSDs for high), Frank McSherry's single laptop Rust implementation[4], and Dato's GraphLab Create[5].

As Frank McSherry points out at [6], there's a lot of room to improve state of the art in graph computation systems, especially as the current academic evaluations are broken.

(where "broken" means that a single threaded graph system written in Rust on a laptop can beat the benchmarks created by a cluster of high performance machines)

[1]: http://webdatacommons.org/hyperlinkgraph/

[2]: http://commoncrawl.org/

[3]: https://github.com/icoming/FlashGraph

[4]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...

[5]: https://twitter.com/CommonCrawl/status/623615774909857792

[6]: http://blog.commoncrawl.org/2015/04/evaluating-graph-computa...

shoo · on Aug 13, 2015

That post from Frank McSherry is excellent. His earlier post [1] is also worth reading.

I was curious to read more about the Hilbert ordering of edges, and the delta-encoding. There is a related article from Yzelman & Bisseling titled "A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve"[2] which discusses Hilbert ordering combined with a delta-encoded version of the usual CSR encoding for sparse matrices [3].

> A Hilbert curve thus can be projected on any [2^m * 2^m] matrix, which in turn can embed the sparse matrix A, thus imposing a 1D ordering on its nonzeroes.

The largest sparse matrix considered for the experimental results from this article was "wikipedia-20070206" with 45,030,389 nonzeros (aka weighted edges). Ignoring the weights, this is about 1/3000th of the size of the 128 billion edge graph that McSherry mentions in his post.

Yzelman & Bisseling also measure the additional preprocessing time required to convert their sparse matrices using the Hilbert-ordered delta-encoded CSR scheme -- this is roughly 30x the time for a single matrix-vector product evaluation [4]. Clearly this approach is only going to pay off if one plans to do a bunch of analysis using the single matrix / graph, without changing the matrix / graph. McSherry mentions a similar issue but points out that he could preprocess the graph using his approach faster than he could download the data, so the preprocessing speed wasn't really an issue.

[1] http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

[2] https://lirias.kuleuven.be/bitstream/123456789/400709/1/yzel...

[3] sparse matrices are a fairly obvious representation of directed graphs with edge weights, I searched for an article about sparse matrices as I am more familiar with linear algebra than graph processing.

[4] edit: perhaps an crude intuitive way to think about "30x matvec products" in the context of graph algorithms would be "about one pagerank", as each pagerank iteration is doing a matvec product and a bit of other stuff, and McSherry talks about 10 or 20 iterations per pagerank evaluation.

Smerity · on Aug 13, 2015

Agreed re: the quality of Frank McSherry's posts - they're all great fun and push the technical envelope!

I'm glad you brought up Hilbert ordering. It's fascinating but I rarely see it. I usually see Z-order curves in their place - simpler and more intuitive but also less effective. Does anyone know of large scale databases with Hilbert ordering out of the box?

For anyone who hasn't seen them before, or why they help in ordering, I highly recommend "Portrait of the Hilbert curve"[1] and "Spatial indexing with Quadtrees and Hilbert Curves"[2]. The tldr is that they enable efficient multi-dimensional indexing. Given a request like "X < 40 and Y > 80", most databases need to run "X < 40" and then filter over the results for "Y > 80" - Hilbert curves allow you to find the query's intersection and only then perform minor filtering.

As far as large scale use of Hilbert ordering, Google use it internally for much of their geographical work. They even open sourced a Java library specialised in multi-dimensional indexing based on Hilbert curves called Uzaygezen[3].

[1]: http://corte.si/%2Fposts/code/hilbert/portrait/index.html

[2]: http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-...

[3]: http://google-opensource.blogspot.com/2008/08/uzaygezen-mult...

jandrewrogers · on Aug 13, 2015

The primary advantage of Hilbert curves relative to Morton (Z) curves is that they are marginally more efficient at linearizing multidimensional point indexes e.g. storing on spinning disk.

However, the reason that Hilbert curves are relatively rare in real systems these days is that (1) the optimality of Hilbert curves is largely based on the assumption of indexing on sequential storage media, and (2) for the many other applications of structures describable by space-filling curves, a Hilbert description is often inferior in practice. For example, it is a suboptimal curve to use if you are implementing spatial analysis operators in a distributed system.

If you were designing a spatial indexing system from scratch today for modern applications, you would use neither Hilbert nor Morton/Z curves but they are adequate shortcuts for sufficiently narrow applications. Substantially more optimal curves also have a much more complex description.

cbsmith · on Aug 13, 2015

What would you go with for spatial indexing today? R*-tree? k-d tree? BVH?

Xyik · on Aug 13, 2015

I thought facebook was located at 1 Hacker Way not Hacker Lane

anotherangrydev · on Aug 13, 2015

>we were able to execute PageRank on over a trillion social connections in less than 3 minutes per iteration with only 200 machines

Colour me skeptical, that's around 28M edges per second per machine.

frankmcsherry · on Aug 13, 2015

By way of comparison, the single-threaded numbers for PageRank on the `twitter_rv` graph on my laptop are 1.5B edges in 5s, which is about 300M edges per second per core.

PageRank isn't doing much other than a load, a few += operations, and a store. The main reason it is slow is because memory is far away, but if you lay out the edges intelligently your rank data is usually in the L3 and then computers go fast.

Smerity · on Aug 13, 2015

Other posters have replied giving a back of the envelope as to feasibility, so from the other end:

The PageRank implementation on Dato's GraphLab Create when run on the Web Data Commons Hyperlink Graph (128 billion edges) does 3 billion edges a second on 16 nodes, which is 187 million edges per second per machine.

Given that communication overhead quickly becomes an issue for most of these systems and their graph is dealing with more edges, 28 million edges per second per machine seems quite reasonable.

[1]: https://twitter.com/CommonCrawl/status/623615774909857792

cbsmith · on Aug 13, 2015

Let's reframe that as ~1700 instructions per edge.

Seem more plausible?

These are 16 core machines with 10GbE links. Each core is processing maybe 2 million edges per second with highly data parallel instructions. Ignoring hyperthreading, if these are 3GHz cores (pretty conservative), that means you are burning ~1500 cycles per edge to process an edge. You've got 2 FP ALU's and 2 integer ALU's in each core, not to mention the AGU. You've got prefetchers and completion units that are designed to handle 4-6 instructions per cycle, and that's not for no reason. You can get a lot of work done in one core with just 1000 cycles, let alone 1500.

So I'm not so sure why that would seem incredible.

MichaelGG · on Aug 13, 2015

Not really on topic, but compare those numbers with the slowness of RAM and cache. A single shared variable that gets updated on one socket then another -- that can cost like 300 cycles, at least on Nehalem.

cbsmith · on Aug 13, 2015

In terms of throughput, the RAM is great. It's the latency thing. Which is why intelligent scheduling with embarrassingly parallel problems like this makes such a huge difference.

throwaway6497 · on Aug 13, 2015

Will appreciate if you can explain how this was reframed to ~1700 instructions/edge?

cbsmith · on Aug 13, 2015

At ~1500 clock cycles, even a relatively conservative 1.2 instructions per clock per core is enough.

anotherangrydev · on Aug 13, 2015

Their test machines have 16-cores. They're running 2.9Gops/core. Plausible, yeah. I'm not used to see real things achieve almost 100% efficiency, sounds really good.

cbsmith · on Aug 13, 2015

To provide perhaps more perspective, here's a paper using 2004 technology: http://www.leonidzhukov.net/papers/ParallelPageRank-2004.pdf

With 70 machines & 140 processors, they did pagerank over 6.6 billion edges in 35 minutes. Using 285% the machines, you'd expect them to be able to do it in 12.25 minutes. Moore's Law should put each machine at 160x more powerful, and with 300x the edges they are finishing in 19 minutes. So we've improved by 2x over an 11 year-old software stack that we've long since abandoned for better approaches, using a hardware stack that is much more optimized for SIMD computations than the 11 year-old hardware.

anotherangrydev · on Aug 13, 2015

I'm reading that one, thanks.

I really dont know the PageRank algorithm, but I was really impressed by their results, a trillion of "something" in 3 minutes. I thought that just reading things should take you a lot more than that but apparently not.

Really nice stuff.

cbsmith · on Aug 13, 2015

Data parallel problems are essentially the use case that should allow you to get to 100% efficiency. The hardware/software just aren't going to get it easier than that. In this case, I'd speculate that you'd not even need 50% efficiency to get this result.

logicrime · on Aug 13, 2015

I kinda hate these papers that just humble-brag clusterized setups without providing any abstract insights. This doesn't bring me any closer to understanding graph data any better, but I'm now ready to begin installation of a multi-million-dollar cluster of machines and storage.

The bit about k-means was interesting, but the rest was an irrelevant bore.

Hytosys · on Aug 13, 2015

Imagine being one of the five authors of this paper, browsing this comments section. It's natural to distance yourself from people when you're behind a keyboard, but for fuck's sake — these are your peers. If you're going to be critical, do it with attention and care.

lcswi · on Aug 13, 2015

The people were not critized, the paper was.

logicrime · on Aug 13, 2015

If you think the authors give half a shit what HackerNews think, you've missed the point.

cbsmith · on Aug 13, 2015

The purpose of the paper isn't to help you to understand graph data better. Heck, this is a VLDB presentation.

It is to help you to understand frameworks for working on graph data better.

If you see it as a humble brag that is an irrelevant bore, you aren't the intended audience.

nl · on Aug 13, 2015

Given than this is a paper for VLDB[1] (the conference on Very Large Databases) perhaps there is some small chance that your personal judgement of what is an insight or relevant is... wrong?

Also, your idea that it is a humble-brag to talk about a computer cluster at VLDB seems more indicative of a lack of knowledge on your part rather than a lack of "abstract insights".

Irrelevant bore indeed.

[1] http://www.vldb.org/2015/

chimtim · on Aug 13, 2015

VLDB accepts 150 papers and SIGMOD perhaps another 150. This is just top tier. I am pretty sure science can live without about more than 50% of those papers.

I disagree with the tone of the original comment. But I do not disagree with the sentiment. Just having a large installation does not make it interesting. However, Google's large systems almost always push the boundaries of science -- MapReduce, GFS, Spanner, Distbelief and have been a joy to read.