Titan 0.3.0 Released: Geo, full-text, edge indexing on billion edge graphs

espeed · on March 29, 2013

Titan is a new real-time, distributed, transactional graph database that can use either Cassandra or HBase as its distributed data store.

Titan 0.3 was stress-tested with Cassandra at 120 billion edges and is capable of loading 1.2 million edges per second on a 16-machine hi1.4xl cluster (https://twitter.com/aureliusgraphs/status/316255164719828992).
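
For a rough sense of scale, the two figures above can be combined. This is only a back-of-envelope sketch: it assumes the quoted load rate holds for the entire load and is spread evenly across machines, neither of which the announcement claims.

```python
# Back-of-envelope figures derived from the numbers quoted above.
TOTAL_EDGES = 120_000_000_000   # 120 billion edges in the stress test
EDGES_PER_SECOND = 1_200_000    # quoted cluster-wide load rate
MACHINES = 16                   # size of the hi1.4xl cluster

# Assumption: the quoted rate is sustained for the whole load.
load_seconds = TOTAL_EDGES / EDGES_PER_SECOND
load_hours = load_seconds / 3600

# Assumption: the load is spread evenly across all machines.
per_machine_rate = EDGES_PER_SECOND / MACHINES

print(f"{load_seconds:,.0f} s (~{load_hours:.1f} h), "
      f"{per_machine_rate:,.0f} edges/s per machine")
```

Under those assumptions the full graph would load in a bit over a day.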

This release provides a complete performance-driven redesign of many core components, and the primary new feature is advanced indexing.

Here are the new indexing features:

* Geo: Search for elements using shape primitives within a 2D plane.

* Full-text: Search elements for matching string and text properties.

* Numeric range: Search for elements with numeric property values using intervals.

* Edge: Edges can be indexed as well as vertices.
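
To make two of these query styles concrete, here is a toy sketch in plain Python. It is not Titan's API or its index implementation; `RangeIndex` and `within_circle` are hypothetical names, and the element ids stand in for either vertices or edges.

```python
import bisect
import math


class RangeIndex:
    """Toy numeric index: keeps (value, element_id) pairs sorted by value."""

    def __init__(self):
        self._entries = []  # sorted list of (value, element_id)

    def add(self, element_id, value):
        bisect.insort(self._entries, (value, element_id))

    def interval(self, low, high):
        """Return ids of elements whose value lies in the closed interval [low, high]."""
        # (low,) sorts before every (low, element_id), so equal values are included.
        start = bisect.bisect_left(self._entries, (low,))
        ids = []
        for value, element_id in self._entries[start:]:
            if value > high:
                break
            ids.append(element_id)
        return ids


def within_circle(point, center, radius):
    """True if a 2D point lies inside or on a circle, one possible shape primitive."""
    return math.dist(point, center) <= radius


index = RangeIndex()
index.add("vertex-1", 10)
index.add("edge-7", 25)    # edges can be indexed just like vertices
index.add("vertex-2", 40)
print(index.interval(10, 25))
print(within_circle((1.0, 1.0), (0.0, 0.0), 2.0))
```

The point of the sorted structure is that an interval lookup never has to scan elements outside the requested range.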

See http://thinkaurelius.com/news/

erichocean · on March 29, 2013

Hey espeed, how well would the SybilGuard[0] algorithm run on Titan? Right now, we're evaluating Twitter's in-memory graph library (Cassovary[1]), but that obviously requires a big machine, and we already use Cassandra elsewhere, so we would prefer to stay with that.

Up to, say, 5 second response times on large graphs would be acceptable.

[0] http://www.math.cmu.edu/~adf/research/SybilGuard.pdf

[1] https://github.com/twitter/cassovary

espeed · on March 29, 2013

Titan is one piece of the Aurelius Graph Cluster (http://thinkaurelius.com/subscription/).

1. Titan is the OLTP piece, and it's very fast at running local-rank algorithms (http://markorodriguez.com/2011/03/30/global-vs-local-graph-r...).

2. Faunus is an OLAP graph-analytics engine that integrates Titan with Hadoop for global analysis of Titan graphs. Graphs are analyzed using a MapReduce implementation of the Gremlin graph traversal language. General use-cases include computing graph derivations/transformations and global graph statistics. You can then feed the global-algo results back into Titan.

3. Fulgora is an in-memory, compression-based, transaction-less OLAP graph processor capable of storing billions of edges within the memory confines of a single machine. Fulgora is optimized for the execution of massively threaded, global graph algorithms. It will come out later this year, and you can connect it to Faunus or feed it directly from Titan.

If you can construct a local SybilGuard algo, you can run it in Titan and get an immediate response. Otherwise, for global-graph algos, you would feed Faunus from Titan directly and query Faunus' in-memory graph. There are also things in the works that will blur these distinctions. More details to come later this month in a series of blog posts -- stay tuned.
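
A local algorithm in this sense only touches vertices within a few hops of a starting vertex. Below is a minimal sketch in plain Python. It is not Titan's API and not SybilGuard itself; the `local_rank` function and the adjacency-dict graph are assumptions made for illustration.

```python
from collections import defaultdict


def local_rank(adjacency, start, max_hops):
    """Spread one unit of energy outward from `start` for at most `max_hops` steps.

    Only vertices reachable within `max_hops` are ever touched, which is what
    makes the computation local rather than global.
    """
    scores = defaultdict(float)
    frontier = {start: 1.0}
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for vertex, energy in frontier.items():
            neighbors = adjacency.get(vertex, [])
            if not neighbors:
                continue  # dead end: this energy goes no further
            share = energy / len(neighbors)
            for neighbor in neighbors:
                next_frontier[neighbor] += share
                scores[neighbor] += share
        frontier = next_frontier
    return dict(scores)


# Hypothetical toy graph: vertex -> list of outgoing neighbors.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(local_rank(graph, "a", max_hops=2))
```

Because the work is bounded by the neighborhood size rather than the graph size, this style of computation is the kind that suits a real-time (OLTP) query.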

Marko or Matthias will have more insight on how to best run a SybilGuard-type algo. Right now they're about to jump on a flight to Austin for Data Day Texas (http://datadaytexas.com/), but I'm sure they'll respond when they have a free moment.

See Marko's YOW! interview (http://channel9.msdn.com/posts/YOW-2012-Marko-Rodriguez-Grap...) and Matthias' Titan/Cassandra talk (http://www.youtube.com/watch?v=ZkAYA4Kd8JE) for more details on the architecture.

espeed · on March 30, 2013

CORRECTION -- that should say: "Otherwise, for global-graph algos, you would feed Fulgora from Titan directly and query Fulgora's in-memory graph" (sorry if I confused anyone).

mhluongo · on March 29, 2013

Very exciting stuff! Looking forward to seeing you guys at DDT.

erichocean · on March 30, 2013

Thanks espeed, the links were especially helpful.

I'm looking forward to Fulgora, sounds great!

espeed · on March 30, 2013

No problem. I sent a tweet to @erichocean -- is that you? If you want, let's chat about algo options next week when everyone is back in town.

anigbrowl · on March 30, 2013

Excellent documentation - this is what a 'getting started' page should look like. All projects should have this depth of introductory material.

eitland · on March 29, 2013

One of the most interesting parts seems not to be mentioned: the Apache license.

So far it's the only real, all-features-included graph database with a permissive open source license, or am I missing something?

espeed · on March 29, 2013

Yeah, it's Apache 2.0 (https://github.com/thinkaurelius/titan/blob/master/LICENSE.t...), and it's the first native Blueprints implementation so it integrates with the entire TinkerPop stack (http://www.tinkerpop.com/), which is BSD.

xaritas · on March 29, 2013

Well, I am not a Big Data guy, so I don't know if it's "real" enough in terms of capabilities or maturity, but when looking for a graph database to mess around with, I came across OrientDB, which also uses the Apache license. Clearly it's not built for the same use cases or scale as Titan but it seems to have some case studies and commercial support. I haven't played with it yet, so I suppose the project could be an elaborate hoax.

http://www.orientdb.org/

espeed · on March 29, 2013

No, OrientDB is not an "elaborate hoax" :) -- it's very real and developed by Luca Garulli (https://github.com/lvca). OrientDB is one of the primary Blueprints implementations (https://github.com/tinkerpop/blueprints) so it integrates with the TinkerPop stack as well.

iand · on March 30, 2013

Apache Jena is a graph database. It has been an Apache project since 2011 and graduated from the incubator in April 2012.

http://jena.apache.org/

okram · on March 30, 2013

One tangential direction we want to go with Titan is benchmarking it as a distributed RDF store. With edge indexing now in Titan 0.3.0 and Blueprints Sail (GraphSail), Titan can represent an RDF graph and be queried using SPARQL.

   https://github.com/tinkerpop/blueprints/wiki/Sail-Ouplementation

feniv · on March 30, 2013

I'd love to see a more detailed write-up about the performance. I'm working on a natural language parsing problem and have had some success using graphs to perform chunking in the past.

+1 for using Gremlin! Do you know of any Python implementations of it?

espeed · on March 30, 2013

> I'd love to see a more detailed write-up about the performance.

Over the next month there will be a series of blog posts detailing the performance numbers and the key concepts behind Titan's design and architecture.

This blog post from August 2012 (http://thinkaurelius.com/2012/08/06/titan-provides-real-time...) has some details about early tests from Titan 0.1, but notice that today's Titan 0.3 numbers dwarf the 0.1 numbers (the details of how this was accomplished will be explained in the upcoming blog posts).

espeed · on March 30, 2013

> +1 for using Gremlin! Do you know of any Python implementations of it?

Note that Marko is the creator of Gremlin :)

There are Gremlin implementations in various stages of development for almost every major JVM language:

Gremlin-Java (base implementation) - https://github.com/tinkerpop/gremlin/wiki/Using-Gremlin-thro...

Gremlin-Groovy (original) - https://github.com/tinkerpop/gremlin/wiki/Using-Gremlin-thro...

Gremlin-JavaScript - https://github.com/entrendipity/gremlin-js

Gremlin-Clojure - https://github.com/zmaril/ogre

Gremlin-Scala - https://github.com/mpollmeier/gremlin-scala

Within the next year (before the TinkerPop book comes out), it would be cool to have all the languages covered, including Gremlin-Jython and Gremlin-JRuby.

Gremlin-Java and Gremlin-Groovy are maintained by TinkerPop.

Gremlin-JavaScript, Gremlin-Clojure, and Gremlin-Scala are being developed/maintained by community members.

To create a Gremlin implementation, you essentially wrap Gremlin-Java in the target language's idiomatic style.
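
As a rough illustration of that wrapping idea, here is a sketch in plain Python. `JavaStylePipeline` is a hypothetical stand-in and NOT the real Gremlin-Java API; `Pipeline` shows how a thin facade might expose it with keyword arguments and ordinary iteration.

```python
class JavaStylePipeline:
    """Stand-in for a JVM pipeline object; NOT the real Gremlin-Java API."""

    def __init__(self, items):
        self._items = list(items)

    def filterByProperty(self, key, value):  # camelCase, as a Java API would use
        return JavaStylePipeline(i for i in self._items if i.get(key) == value)

    def toList(self):
        return list(self._items)


class Pipeline:
    """Pythonic facade: keyword filters and plain iteration over results."""

    def __init__(self, java_pipeline):
        self._inner = java_pipeline

    def has(self, **properties):
        inner = self._inner
        for key, value in properties.items():
            inner = inner.filterByProperty(key, value)
        return Pipeline(inner)  # return a new wrapper so steps can be chained

    def __iter__(self):
        return iter(self._inner.toList())


# Hypothetical sample data standing in for graph elements.
people = [{"name": "alice", "age": 29}, {"name": "bob", "age": 27}]
for person in Pipeline(JavaStylePipeline(people)).has(age=29):
    print(person["name"])
```

The facade adds no traversal logic of its own; every step delegates to the wrapped object.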

If you are interested in helping develop Gremlin-Jython or Gremlin-JRuby (or any other implementation currently in development), please post to the Gremlin Users Group (https://groups.google.com/forum/?fromgroups=#!forum/gremlin-...).

Right now most people use Gremlin-Groovy (it's the original) regardless of what language they're developing in (think of Gremlin as a domain-specific language like SQL you use in conjunction with your primary language).

For example, Bulbs (http://bulbflow.com) is a Python library I wrote that supports Rexster, Neo4jServer, and Titan. In Bulbs, you edit Gremlin scripts in Groovy text files, and when you create a Python Graph object, Bulbs sources your Groovy files and caches the Gremlin scripts in a library so they are readily available for when you want to execute them on the server.
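
The source-and-cache idea can be sketched roughly like this. This is not Bulbs' actual code; `load_scripts`, the regular expression, and the sample Groovy text are all assumptions for illustration, and the brace matching ignores braces inside string literals.

```python
import re

# Hypothetical Groovy source: each top-level `def` is one named Gremlin script.
GROOVY_SOURCE = """
def friends_of(id) {
  g.v(id).out('knows')
}

def count_vertices() {
  g.V.count()
}
"""

DEF_PATTERN = re.compile(r"^def\s+(\w+)\s*\(([^)]*)\)\s*\{", re.MULTILINE)


def load_scripts(source):
    """Split Groovy text into a {name: (params, body)} library."""
    library = {}
    for match in DEF_PATTERN.finditer(source):
        depth, pos = 1, match.end()
        # Walk forward until the brace opened by this `def` is closed.
        while depth and pos < len(source):
            depth += {"{": 1, "}": -1}.get(source[pos], 0)
            pos += 1
        params = [p.strip() for p in match.group(2).split(",") if p.strip()]
        body = source[match.end():pos - 1].strip()
        library[match.group(1)] = (params, body)
    return library


SCRIPTS = load_scripts(GROOVY_SOURCE)  # parsed once, then looked up by name
params, body = SCRIPTS["friends_of"]
print(params, body)
```

A client would then send the cached script body, along with a dict of parameter values, to the server for execution.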

See http://bulbflow.com/docs/api/bulbs/rexster/gremlin/

vorg · on March 30, 2013

> Gremlin-Clojure, and Gremlin-Scala are being developed/maintained by community members

Would be nice if other Groovy-based products like Gradle enabled a Scala and/or Clojure frontend to appeal to those of us who are fussier about what shell language we use.

richardjordan · on March 29, 2013

Really happy to see this. Testing with Titan at the moment and very happy with it so far.

dubcanada · on March 30, 2013

The coolest thing about this is the getting started narration. I love Greek mythology!

Goranek · on March 29, 2013

Comparison with neo4j?

feniv · on March 30, 2013

I haven't tried Titan, but I've played around with neo4j a little bit in the past and found it a little frustrating to get started with.

The good part is the ACID transactions and the (beautiful) browser-based interface to it. The bad part is that it was far too easy to get the database into a locked state when you perform a bad query (which you tend to do a lot as you're learning).

Graph databases are extremely powerful and will likely change the way you look at structuring a lot of data mining problems. I'd recommend trying one of them out if you ever get the chance!

rpedigoni · on March 29, 2013

On Titan you can choose your storage backend (currently BerkeleyDB, Cassandra, and HBase) according to your needs: https://github.com/thinkaurelius/titan/wiki/Storage-Backend-...

espeed · on March 29, 2013

Titan is distributed (but can be run in single-server mode). Neo4j is master/slave.

btown · on March 29, 2013

Also note that neo4j requires a license to run in high-availability/multi-server mode.

mhluongo · on March 30, 2013

I'd love to see this from someone outside the project. It's pretty clear from blog posts, etc. that Titan was made to handle larger graphs, but what are the tradeoffs, and where are the pain points?

okram · on March 30, 2013

I am not outside the project, but I use Titan/Faunus with clients and here are the pain points I notice.

1. Difficult to deal with clusters: while you can run Titan on a single machine (and that is easy to deal with), handling a multi-machine cluster can be frustrating and requires some DevOps skills. Pull in Faunus/Hadoop and you are in a world unto itself.

2. Lots of options: with Titan there are numerous configurations and you get into this "n choose m" scenario. However, I typically just run Titan/CassandraEmbedded + (now) ElasticSearch. For instance, I gave up on Titan/BerkeleyDB. I decided that Titan/CassandraEmbedded was sufficient for most needs, and that's that.

3. Lots of access points into Titan/TinkerPop: You can run Blueprints Java native, Gremlin, Rexster RexPro, Rexster REST, ... Unfortunately, depending on your situation, one approach is better than another. I typically do Rexster Extensions + Gremlin. In the future though, with RexPro (haven't gotten into it personally), I will probably just use RexPro as I can send Gremlin in and get results out.

4. There are so many bodies of code: Titan is Apache2. This is good. It feeds on a lot of excellent open source projects -- HBase, Cassandra, Lucene, ElasticSearch, Gremlin, Blueprints, Rexster, Frames, ... There is no one source of documentation. I'm either reading TinkerPop documentation or Titan documentation. Then, when there are issues, I'm Googling the Cassandra mailing list. Luckily, I'm an expert in TinkerPop and since Titan is native TinkerPop my knowledge transfers. However, there are lots of peculiarities (Titan typing, configurations, exceptions, cluster setup, etc.) that I have to learn and master.

I am fortunate enough to work on projects that use Titan/TinkerPop and thus, as I see these pain points, I continually work to solve them. Over time, we will see many of these complexities hidden. Right now, Titan is a young project and some good packaging polish will happen in good time.

agilord · on March 30, 2013

Any plans to support other storage backends? PostgreSQL and Riak come to mind.

espeed · on March 30, 2013

Anyone can implement new Titan storage adapters by implementing a few interface classes.

Look at the Titan/BDB storage adapter for a simple example:

https://github.com/thinkaurelius/titan/tree/master/titan-ber...

It implements the KeyValueStore interfaces (there are other interfaces for different types of DBs, such as KeyValueColumnStore, etc):

https://github.com/thinkaurelius/titan/blob/master/titan-cor...
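
To give a feel for what an adapter of this shape involves, here is a rough sketch in Python rather than Java. The method names below are hypothetical and are NOT Titan's actual interface signatures (see the links above for those); the in-memory dict stands in for a real backend.

```python
from abc import ABC, abstractmethod


class KeyValueStoreSketch(ABC):
    """Hypothetical contract; Titan's real Java interfaces differ in detail."""

    @abstractmethod
    def get(self, key: bytes): ...

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def delete(self, key: bytes) -> None: ...

    @abstractmethod
    def scan(self, start: bytes, end: bytes): ...


class InMemoryStore(KeyValueStoreSketch):
    """Simplest possible backend: a dict, sorted on demand for range scans."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        for key in sorted(self._data):
            if start <= key < end:
                yield key, self._data[key]


store = InMemoryStore()
store.put(b"a", b"1")
store.put(b"b", b"2")
store.put(b"c", b"3")
print(list(store.scan(b"a", b"c")))
```

The ordered `scan` is included on the assumption that a graph layer needs range reads over keys; a backend that cannot provide ordering would need a different contract.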

If you are interested in implementing a Titan storage adapter for a new backend datastore and you have questions, you can discuss it in the Aurelius Graphs group:

https://groups.google.com/forum/?fromgroups#!forum/aureliusg...