Directed Edge (wheels' startup) launches public beta, finds related articles in Wikipedia (directededge.com)
36 points by wheels on Aug 13, 2008 | hide | past | favorite | 17 comments



"We’ve got a super-fast in-house graph storage system that makes it possible to do interesting stuff with graphs quickly, notably figure out which pages are related."

I'm working with large graphs and I find that using a relational database as a backend is horribly slow if you want to run somewhat complex graph algorithms. Looks like everyone who does this ends up developing an in-house system. Anyone know if there's a library out there for doing this sort of thing? The order of magnitude I'm talking about is ~10^8 nodes, ~10^9 edges.


I think your best bet is an RDF tool like Jena or Sesame. 10^9 is pushing these engines though. There's an HBase-SPARQL lib, but that's gotta be two years away from stable.

For real world problems, I tend to write custom code that's Java NIO heavy. Try to pack the data as efficiently as possible, and minimize disk seeks.
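The "pack the data efficiently, minimize disk seeks" advice above usually means laying adjacency lists out contiguously. A minimal sketch of that idea, using a compressed-sparse-row (CSR) layout -- the function names here are illustrative, not from any library mentioned in the thread:

```python
# Sketch of a CSR edge layout: each node's neighbors sit contiguously
# in one flat array, so reading an adjacency list is a single scan
# rather than many scattered lookups (or disk seeks, if memory-mapped).
import array

def build_csr(num_nodes, edges):
    """Pack (src, dst) pairs into an offsets array and a targets array."""
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1
    offsets = array.array("q", [0] * (num_nodes + 1))
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + counts[i]
    targets = array.array("q", [0] * len(edges))
    cursor = list(offsets[:num_nodes])
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, node):
    """All out-neighbors of `node`, read as one contiguous slice."""
    return targets[offsets[node]:offsets[node + 1]]
```

At the 10^9-edge scale discussed above you'd memory-map the arrays rather than build them in RAM, but the access pattern is the same.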


The tools that you mention are honestly not just one, but a couple orders of magnitude too slow to do interesting applications on at this scale.


Count me as one of those people more interested in cool graph theory stuff than Wikipedia. What else have you got up your sleeves?

Also, while we're talking about Wikipedia-- any plans to play around with "shortest path" algorithms? For example, how few pages do I have to navigate to get from, say, "Turing machines" to "Miles Davis"?


Wikipedia's just a big data set to practice on. I've been musing on information graphs (and doing talks about such) since 2004 or so.

It also happens that I've done a bit of work writing fast, embedded databases, so when I realized that none of the off-the-shelf graph databases I found were fast enough for our needs, and that mapping graphs to an SQL database was abysmally slow, I wrote a small one. Honestly, I was surprised to discover graph databases whose advertised performance is an order of magnitude slower than ours.

I'd love for our product to be a graph database, but I'm convinced that product would take too long to get to market. I'd probably need a year of concentrating on just that -- without, say, also having to do business stuff -- to get it ready for general-purpose use. It'd also be cooler to geek out on, but I don't know if people would actually pay for it.

So right now we've got a database that's really fast at two things: traversing lots of edges and filtering on tags. That's what we need for finding related stuff.
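The actual Directed Edge engine is proprietary, so this is only a hypothetical sketch of the access pattern described above -- fan out across edges for a couple of hops, keeping only nodes that carry a given tag:

```python
# Illustrative only: traverse lots of edges, filter on tags.
# `graph` maps node -> list of neighbors; `tags` maps node -> set of tags.
def related(graph, tags, start, wanted_tag, depth=2):
    """Collect nodes within `depth` hops of `start` carrying `wanted_tag`."""
    seen = {start}
    frontier = [start]
    hits = []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for nb in graph.get(node, []):
                if nb in seen:
                    continue
                seen.add(nb)
                next_frontier.append(nb)
                if wanted_tag in tags.get(nb, ()):
                    hits.append(nb)
        frontier = next_frontier
    return hits
```

The point of a purpose-built store is making that inner neighbor loop cheap at the scale of hundreds of millions of edges.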

As for shortest path stuff ... well, it's never had much more than novelty value for me. I don't see much of a business case for implementing it, and well, we're a business.


I agree that the shortest path stuff is purely a novelty-- but I must admit it was the first thing that came to my mind when I tried to connect Wikipedia with graph theory.

Personally, I like playing around with graph theory, but I have yet to come up with a compelling business case-- I wish you luck, and hope you find something cool (and lucrative.) I'd probably be a customer of a fast graph database, but I doubt there are a lot of other graph theory hobbyists out there.


PageRank was a graph algorithm that seems to have done pretty well. ;-)


Touché.


3 clicks: Turing machine -> Assembly language -> 1950s -> Miles Davis

you can find shortest paths here: http://www.netsoc.tcd.ie/~mu/wiki/
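The shortest-path idea discussed above is plain breadth-first search over the link graph. A minimal sketch, with the thread's own Turing-machine-to-Miles-Davis chain as a toy stand-in for Wikipedia:

```python
# BFS over a page-link graph; returns the shortest chain of pages
# from start to goal, or None if no path exists.
from collections import deque

def shortest_path(links, start, goal):
    parent = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            # Walk the parent pointers back to reconstruct the chain.
            path = []
            while page is not None:
                path.append(page)
                page = parent[page]
            return path[::-1]
        for nxt in links.get(page, []):
            if nxt not in parent:
                parent[nxt] = page
                queue.append(nxt)
    return None
```

On the real Wikipedia graph the interesting engineering is fitting the link structure in memory, not the algorithm itself.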


It's a nice feature for browsing Wikipedia aimlessly. I'd be more interested in an article about this: "We’ve got a super-fast in-house graph storage system that makes it possible to do interesting stuff with graphs quickly, notably figure out which pages are related." Sounds cool-- how does it work?


You have a hugely improved layout and page design compared to the original Wikipedia, kudos.

And the related links feature makes ad-hoc browsing of Wikipedia entries quite useful, again kudos.


Hmmm... simple stuff turned up mostly topical results, so I tried Pierre Herme, a famous French pastry chef (at least if you follow French pastry). The top result was a stub article for a French immunologist; others included a musician and a TV newscaster.


There are almost no links going in or out of his page, and since our algorithm is mostly, as our name implies, graph-based, that means there's not much we can do there currently. His is the rare case of a reasonably well-written article that doesn't really use wiki markup. I started to say, "We'll be adding the French Wikipedia soon; it should be better there." But then I saw that the English entry is just a translation of the French one, with the same lack of links.

We'll be adding other techniques to our analysis over time, though; this is just our first step...


So... I load up a page, and the benefit is supposed to be that I get 10 of the links from it on the sidebar as "related links".

Why do I care, exactly? What's the value here?


The idea is that it's like Amazon recommending books or Last.fm recommending music (though the technique is quite different). In practice, once you're used to it, it's a really fast way to jump into a topic, since you immediately see the clusters around an article -- i.e. if you don't know anything about literary theory and want to figure out what the important articles and authors in the area are, you can do so quickly. (That's something a friend of mine tried, successfully, when looking for books on the topic.)
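The thread never reveals how Directed Edge actually ranks related pages; a common graph-based stand-in for surfacing the "clusters" described above is link-neighborhood overlap (Jaccard similarity), sketched here purely for illustration:

```python
# Hypothetical sketch, NOT Directed Edge's algorithm: rank articles by
# how much their outgoing-link sets overlap with the query article's.
def jaccard_related(links, page, top_n=3):
    """Return up to top_n pages whose link sets most resemble `page`'s."""
    mine = set(links.get(page, []))
    scores = []
    for other, outlinks in links.items():
        if other == page:
            continue
        theirs = set(outlinks)
        union = mine | theirs
        if union:
            scores.append((len(mine & theirs) / len(union), other))
    scores.sort(reverse=True)
    return [name for _, name in scores[:top_n]]
```

Articles in the same cluster tend to cite the same pages, so overlap scores pull them to the top.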

My question would be: do you not associate what we're doing with recommendations, or do you not see recommendations as valuable?


Cool for all the "knowledge management" aficionados -- there's a whole industry around that. It's also potentially useful for intranets / collaboration systems, if at some point you can offer your API offline, i.e. as a library callable from code without having to depend on your web service. If I can package your algorithms with a product I'm deploying, I'll be happy to pay for it (provided that real customer value is being added, which for now I'm assuming).


Check out www.wikiwarp.com/go

made along with HN user jsomers



