Hey espeed, how well would the SybilGuard[0] algorithm run on Titan? Right now, we're evaluating Twitter's in-memory graph library (Cassovary[1]), but that obviously requires a big machine, and we already use Cassandra elsewhere, so we would prefer to stay with that.
Up to, say, 5 second response times on large graphs would be acceptable.
2. Faunus is an OLAP graph-analytics engine that integrates Titan with Hadoop for global analysis of Titan graphs. Graphs are analyzed using a MapReduce implementation of the Gremlin graph traversal language. General use-cases include computing graph derivations/transformations and global graph statistics. You can then feed the global-algo results back into Titan.
3. Fulgora is an in-memory, compression-based, transaction-less OLAP graph processor capable of storing billions of edges within the memory confines of a single machine. Fulgora is optimized for the execution of massively threaded, global graph algorithms. It will come out later this year, and you can connect it to Faunus or feed it directly from Titan.
If you can construct a local SybilGuard algo, you can run it in Titan and get an immediate response. Otherwise, for global-graph algos, you would feed Faunus from Titan directly and query Faunus' in-memory graph. There are also things in the works that will blur these distinctions. More details to come later this month in a series of blog posts -- stay tuned.
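To make "local" concrete, here is a rough Python sketch of the random-route idea behind SybilGuard, run over a plain in-memory adjacency dict rather than Titan. It is a simplification under stated assumptions, not the full protocol: each node gets a fixed permutation from incoming edge to outgoing edge, every route has the same fixed length, and a suspect is accepted if any of its routes shares a node with any of the verifier's routes. The function names and the `seed` parameter are made up for illustration.

```python
import random


def routing_table(graph, node, seed=0):
    """Fixed permutation for `node`: incoming neighbor -> outgoing neighbor.

    Seeding on the node name keeps the table stable across calls, which is
    what makes these "routes" rather than plain random walks.
    """
    neighbors = sorted(graph[node])
    shuffled = neighbors[:]
    random.Random(f"{seed}:{node}").shuffle(shuffled)
    return dict(zip(neighbors, shuffled))


def random_route(graph, start, first_hop, length, seed=0):
    """Follow the routing tables for `length` hops, leaving `start` via `first_hop`."""
    route = [start]
    prev, cur = start, first_hop
    for _ in range(length):
        route.append(cur)
        # The edge we arrived on determines the edge we leave on.
        prev, cur = cur, routing_table(graph, cur, seed)[prev]
    return route


def routes_intersect(graph, verifier, suspect, length, seed=0):
    """Simplified acceptance test: do any verifier and suspect routes share a node?"""
    def reachable(node):
        return {
            n
            for hop in graph[node]
            for n in random_route(graph, node, hop, length, seed)
        }
    return bool(reachable(verifier) & reachable(suspect))


# Hypothetical toy graph: an undirected triangle plus a disconnected pair.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y"}, "y": {"x"},
}
print(routes_intersect(graph, "a", "b", length=3))  # same component
print(routes_intersect(graph, "a", "x", length=3))  # different components
```

Each check touches at most (degree x route length) vertices per side, so it is the kind of bounded traversal that could be expressed as a local query instead of a global job.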
Marko or Matthias will have more insight on how to best run a SybilGuard-type algo. Right now they're about to jump on a flight to Austin for Data Day Texas (http://datadaytexas.com/), but I'm sure they'll respond when they have a free moment.
CORRECTION -- that should say: "Otherwise, for global-graph algos, you would feed Fulgora from Titan directly and query Fulgora's in-memory graph" (sorry if I confused anyone).
Well, I am not a Big Data guy, so I don't know if it's "real" enough in terms of capabilities or maturity, but when looking for a graph database to mess around with, I came across OrientDB, which also uses the Apache license. Clearly it's not built for the same use cases or scale as Titan but it seems to have some case studies and commercial support. I haven't played with it yet, so I suppose the project could be an elaborate hoax.
No, OrientDB is not an "elaborate hoax" :) -- it's very real and developed by Luca Garulli (https://github.com/lvca). OrientDB is one of the primary Blueprints implementations (https://github.com/tinkerpop/blueprints) so it integrates with the TinkerPop stack as well.
One tangential direction we want to go with Titan is benchmarking it as a distributed RDF store. With edge indexing now in Titan 0.3.0 and Blueprints Sail (GraphSail), Titan can represent an RDF graph and be queried using SPARQL.
I'd love to see a more detailed write-up about the performance. I'm working on a natural language parsing problem and have had some success using graphs to perform chunking in the past.
+1 for using Gremlin! Do you know of any Python implementations of it?
> I'd love to see a more detailed write-up about the performance.
Over the next month there will be a series of blog posts detailing the performance numbers and the key concepts behind Titan's design and architecture.
This blog post from August 2012 (http://thinkaurelius.com/2012/08/06/titan-provides-real-time...) has some details about early tests from Titan 0.1, but notice that today's Titan 0.3 numbers dwarf the 0.1 numbers (the details of how this was accomplished will be explained in the upcoming blog posts).
Within the next year (before the TinkerPop book comes out), it would be cool to have all the languages covered, including Gremlin-Jython and Gremlin-JRuby.
Gremlin-Java and Gremlin-Groovy are maintained by TinkerPop.
Gremlin-JavaScript, Gremlin-Clojure, and Gremlin-Scala are being developed/maintained by community members.
To create a Gremlin implementation, you essentially wrap Gremlin-Java in the target language's idiomatic style.
Right now most people use Gremlin-Groovy (it's the original) regardless of what language they're developing in (think of Gremlin as a domain-specific language like SQL you use in conjunction with your primary language).
For example, Bulbs (http://bulbflow.com) is a Python library I wrote that supports Rexster, Neo4jServer, and Titan. In Bulbs, you edit Gremlin scripts in Groovy text files, and when you create a Python Graph object, Bulbs sources your Groovy files and caches the Gremlin scripts in a library so they are readily available for when you want to execute them on the server.
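As a rough illustration of the "source the Groovy file, cache the scripts by name" pattern described above, here is a minimal Python sketch. This is not the Bulbs API: the `ScriptLibrary` class, its methods, and the sample Groovy functions are all hypothetical, and the parser assumes every script is a top-level `def` that starts at the beginning of a line.

```python
import re


class ScriptLibrary:
    """Caches Gremlin-Groovy function definitions by name (hypothetical helper)."""

    # Matches a top-level Groovy definition such as "def friends_of(name) {".
    _DEF = re.compile(r"^def\s+(\w+)\s*\(", re.MULTILINE)

    def __init__(self):
        self._scripts = {}

    def load(self, groovy_source):
        """Split the source at each top-level `def` and store the pieces by name."""
        matches = list(self._DEF.finditer(groovy_source))
        for i, match in enumerate(matches):
            end = matches[i + 1].start() if i + 1 < len(matches) else len(groovy_source)
            self._scripts[match.group(1)] = groovy_source[match.start():end].strip()

    def get(self, name):
        """Return the cached script text, ready to send to the server."""
        return self._scripts[name]

    def names(self):
        return sorted(self._scripts)


# Illustrative contents of a Groovy text file (made up for this example).
GROOVY = """
def friends_of(name) {
  g.V('name', name).out('knows').name
}

def count_vertices() {
  g.V.count()
}
"""

library = ScriptLibrary()
library.load(GROOVY)
print(library.names())
print(library.get("count_vertices"))
```

The point of the design is that the Gremlin stays in Groovy files where it can be edited and tested as Gremlin, while the Python side only looks scripts up by name and ships the text (plus parameters) to the server.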
> Gremlin-Clojure, and Gremlin-Scala are being developed/maintained by community members
Would be nice if other Groovy-based products like Gradle enabled a Scala and/or Clojure frontend to appeal to us who are fussier about what shell language we use.
I haven't tried Titan, but I've played around with Neo4j a little bit in the past and found it a little frustrating to get started with.
The good part is the ACID transactions and the (beautiful) browser-based interface to it. The bad part is that it was far too easy to get the database into a locked state by running a bad query (which you tend to do a lot as you're learning).
Graph databases are extremely powerful and will likely change the way you look at structuring a lot of data mining problems. I'd recommend trying one of them out if you ever get the chance!
I'd love to see this from someone outside the project. It's pretty clear from blog posts, etc. that Titan was made to handle larger graphs, but what are the tradeoffs, and where are the pain points?
I am not outside the project, but I use Titan/Faunus with clients and here are the pain points I notice.
1. Difficult to deal with clusters: while you can run Titan on a single machine (and that is easy to deal with), handling a multi-machine cluster can be frustrating and requires some DevOps skills. Pull in Faunus/Hadoop and you are in a world unto itself.
2. Lots of options: with Titan there are numerous configurations and you get into this "n choose m" scenario. However, I typically just run Titan/CassandraEmbedded + (now) ElasticSearch. For instance, I gave up on Titan/BerkeleyDB. I decided that Titan/CassandraEmbedded was sufficient for most needs and that's that.
3. Lots of access points into Titan/TinkerPop: You can run Blueprints Java native, Gremlin, Rexster RexPro, Rexster REST, ... Unfortunately, depending on your situation, one approach is better than another. I typically do Rexster Extensions + Gremlin. In the future though, with RexPro (haven't gotten into it personally), I will probably just use RexPro as I can send Gremlin in and get results out.
4. There are so many bodies of code: Titan is Apache2. This is good. It feeds on a lot of excellent open source projects -- HBase, Cassandra, Lucene, ElasticSearch, Gremlin, Blueprints, Rexster, Frames, ... There is no one source of documentation. I'm either reading TinkerPop documentation or Titan documentation. Then when there are issues, I'm Googling the Cassandra mailing list. Luckily, I'm an expert in TinkerPop and since Titan is native TinkerPop my knowledge transfers. However, there are lots of peculiarities (Titan typing, configurations, exceptions, cluster setup, etc.) that I have to learn and master.
I am fortunate enough to work on projects that use Titan/TinkerPop and thus, as I see these pain points, I continually work to solve them. Over time, we will see many of these complexities hidden. Right now, Titan is a young project and some good packaging polish will happen in good time.
If you are interested in implementing a Titan storage adapter for a new backend datastore and you have questions, you can discuss it in the Aurelius Graphs group:
Titan 0.3 was stress tested with Cassandra at 120 billion edges and is capable of loading 1.2 million edges per second on a 16-machine hi1.4xl cluster (https://twitter.com/aureliusgraphs/status/316255164719828992).
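For a sense of scale, here is a back-of-envelope calculation from those two figures. It assumes the 1.2 million edges per second rate is sustained for the whole load and is spread evenly across the 16 machines; neither assumption is stated in the announcement.

```python
total_edges = 120_000_000_000   # 120 billion edges (from the stress test)
load_rate = 1_200_000           # edges per second across the cluster
machines = 16                   # size of the hi1.4xl cluster

seconds = total_edges / load_rate      # time to load everything at that rate
hours = seconds / 3600
per_machine = load_rate / machines     # assumed even split across machines

print(f"{seconds:,.0f} s (~{hours:.1f} h), {per_machine:,.0f} edges/s per machine")
```

Under those assumptions, a full load would take roughly 28 hours, at about 75,000 edges per second per machine.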
This release provides a complete performance-driven redesign of many core components, and the primary new feature is advanced indexing.
Here are the new indexing features:
* Geo: Search for elements using shape primitives within a 2D plane.
* Full-text: Search elements for matching string and text properties.
* Numeric range: Search for elements with numeric property values using intervals.
* Edge: Edges can be indexed as well as vertices.
See http://thinkaurelius.com/news/