I Dreamed of a Perfect Database (newrepublic.com)
50 points by marcopolis on Dec 5, 2015 | 41 comments



Some perspective on this subject:

- A graph database is essentially a relational database plus join recursion. Many modern relational databases support join recursion; support for graph models is ubiquitous, they just don't call themselves "graph databases". There are other reasons graph-like computation is relatively rare.

- There are some important types of relationships that are effectively not representable in graph data models. Relationships that have a topological nature, such as negative constraints and spatiotemporal relationships, are notoriously problematic.

- General graph-like operations have terrible scalability and performance characteristics as commonly implemented. Consequently, people almost never organize their data models this way unless it is absolutely unavoidable. Even Facebook materializes common traversals as non-graphs so that they do not have to execute them dynamically. The heavy reliance on secondary indexing for performance in most graph databases ensures they will be marginal for large-scale data models.

- Semantic Web fails at scale because no two people map reality to definitions in the same way. This is particularly obvious at global scales because of cultural influences on how we interpret the world. If you've ever done global data model integrations, and I have, you quickly realize that the only values that are approximately consistent globally are physics-based measurements, e.g. "650nm wavelength" (instead of "red"). The nodes have no common definition in practice, just commonly overlapping parts of countless subjective interpretations of the definition, which leads to pollution of the data model in real systems.
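
To make the first point concrete, here is a minimal sketch of "join recursion" in an ordinary relational engine, using Python's built-in sqlite3 module and a recursive common table expression. The `edges` table, the `reachable` helper, and the node names are made up for illustration; this shows the general technique, not any particular product's graph support.

```python
import sqlite3

def reachable(conn, start):
    """Return every node reachable from `start` by following edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT e.dst FROM edges AS e JOIN reach AS r ON e.src = r.node
        )
        SELECT node FROM reach WHERE node <> ? ORDER BY node
        """,
        (start, start),
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")],  # note the a -> b -> c -> a cycle
)
print(reachable(conn, "a"))  # ['b', 'c']
```

The traversal is just a self-join repeated until no new rows appear, which is also why its cost grows with the size of the frontier at each step.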


Hi Andrew - Have you explored holographic algorithms/encodings that use shape to represent elements such that you can succinctly and uniquely define an object/relation in terms of the composition of its constituent parts in a way that enables fast decomposition back into its constituent parts?


> Have you explored holographic algorithms/encodings[...]

In mainstream scientific usage, holographic algorithms are something entirely different; see https://en.wikipedia.org/wiki/Holographic_algorithm.


Vladimir Kornyak touches on some of these ideas in these papers:

1. On Compatibility of Discrete Relations (2005) http://arxiv.org/pdf/math-ph/0504048.pdf

2. Structural and Symmetry Analysis of Discrete Dynamical Systems (2010) http://arxiv.org/pdf/1006.1754.pdf

3. Discrete Dynamical Models: Combinatorics, Statistics and Continuum Approximations (2015) http://mmg.tversu.ru/images/publications/2015-vol3-n1/Kornya...

On a somewhat related note -- applying quantum principles to graph database processing -- see this new paper by Marko Rodriguez (the creator of the Gremlin graph programming language):

4. Quantum Walks with Gremlin (2015) http://arxiv.org/pdf/1511.06278v1.pdf


Graph databases are exciting, but I'm far more interested in the potential of append-only stores. Rather than recording the data itself, you store events (item added, item deleted, etc.).

This allows auditability and for you to look back at the state of the data at any time, but the largest benefit, in my opinion, is that it decouples data from its data structure. This allows you to treat data structures like "caches" that are efficiently structured for how they will be used. If you want, you don't have to choose between relational databases or graph databases or anything else: you can play the same set of events into different structures and query the appropriately structured database for the kind of query you're doing. It also allows you to implement security at the data storage level in a very simple and granular way: you can reject events based on predicates which update themselves as they receive modifications to the permissions, and distribute filtered streams of data to users based on what events they are allowed to see. Overall, the power of this method is very large.
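
A minimal sketch of the idea, assuming a hypothetical log of "follow" events: the log is the only thing stored, and the relational-style view, the graph view, and the per-user filtered stream are all derived by replaying it. The event shapes and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical append-only log: the only thing that is ever stored.
events = [
    {"type": "follow_added", "src": "alice", "dst": "bob"},
    {"type": "follow_added", "src": "alice", "dst": "carol"},
    {"type": "follow_added", "src": "bob", "dst": "carol"},
    {"type": "follow_removed", "src": "alice", "dst": "bob"},
]

def as_rows(log):
    """Replay the log into a relational-style set of (src, dst) rows."""
    rows = set()
    for e in log:
        pair = (e["src"], e["dst"])
        if e["type"] == "follow_added":
            rows.add(pair)
        elif e["type"] == "follow_removed":
            rows.discard(pair)
    return rows

def as_graph(log):
    """Replay the same log into an adjacency map for traversal queries."""
    graph = defaultdict(set)
    for e in log:
        if e["type"] == "follow_added":
            graph[e["src"]].add(e["dst"])
        elif e["type"] == "follow_removed":
            graph[e["src"]].discard(e["dst"])
    return {node: out for node, out in graph.items() if out}

def visible_to(log, allowed):
    """Filter the stream with a permission predicate before handing it out."""
    return [e for e in log if allowed(e)]

print(sorted(as_rows(events)))      # current state as rows
print(as_graph(events))             # current state as a graph
print(sorted(as_rows(events[:3])))  # state as of the third event
```

Looking back in time is just replaying a prefix of the log, and each derived structure can be thrown away and rebuilt.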


Datomic much? :)


For folks who are interested, some other examples are the Kafka/Samza ecosystem and blockchains like Ethereum.


Yes. :)


  All of these tables relate to one another

  Where all this gets particularly interesting is wherever,
  by traversing the relations between table...

  These are the merits of the “relational” database.

That's not what "relational" in "relational databases" means.

He even links to Codd's paper which defines a "relation" explicitly:

  The term relation is used here in its accepted mathematical sense.
  Given sets S1, S2, ..., Sn (not necessarily distinct), R is
  a relation on these n sets if it is a set of n-tuples each of which
  has its first element from S1, its second element from S2, and so on.
  ...


Sorry, this is a terrible article. It reads more as a stream of consciousness about his own mistakes and sad regrets, and does not add any new or interesting insight into anything regarding databases.

There are plenty of new cool things we can do now - and the graph database we're building ( http://github.com/amark/gun ) is doing several of those things. Like realtime by default, like what Firebase and RethinkDB are starting to push. Fully decentralized and fault tolerant, like what Riak and Cassandra tried to do. Graph (and not just triples) so you can have relational documents as ArangoDB and Neo4j allow. And absolutely totally offline-first, like what Pouch/CouchDB wanted to pioneer.

The truth is the world of databases is only getting better and more exciting - in large part due to new algorithms like CRDTs that push what we can do. GUN is one player in that along with others. But this article? Just depressing and overlooking all the new opportunities.


I often see you plug your project, Gun, whenever someone posts about distributed databases. You always write about how great it is. But every time I look at your repo, it's still the same few hundred lines of, I'm sorry, quite shoddily written JavaScript, with huge chunks clearly being temporary placeholders or even commented out. You even have scratch test data in there, intermingled with your main code. Your web site also makes some grandiose claims about how it solves certain synchronization problems, but I can't find any technical explanation of exactly how it is supposed to work.

The lack of code or documentation means it's very hard to take your comments seriously. But please correct me if I'm mistaken.


I attended @marknadal's talk at Distributed Matters Berlin (https://2015.distributed-matters.org/ber/speakers/#284925727...). It promised a lot.

After the talk, I was left with many unclear messages, lots of promises, few detailed explanations, IMVHO many flawed statements, and only one clear outcome: gun is a "script" (I wouldn't call it a database) that offers "conflict resolution" based on data's lexicographical order (!!!).

I probably just misunderstood everything, or I'm simply too inexperienced to appreciate gun. My apologies in advance.


I'm sad to hear I did a poor job explaining the concepts, thank you for this honest review. Next time around, I'll see if I can be more clear.

One minor detail - while lexical sort is used in the conflict resolution algorithm it is only part of the equation (see https://github.com/amark/gun/wiki/Conflict-Resolution-with-G... ). As I mentioned in my talk, lexical ordering is intentionally naive because I rely upon deterministic behavior (not PAXOS, RAFT, consensus, or gossip protocols) which guarantees convergence.


Maybe it was just me. However, if my feedback would help, I'd say that there was a lack of technical detail about the things mentioned. They were only covered from a superficial point of view, and many different complex issues were lumped together and touched on only lightly.

By the way, why do you so readily discard consensus protocols and rely on conflict resolution instead?


What is the best medium for presenting more technical details? In my talk I attempted to use stories (which is very much a layman's approach) to explain algorithms - this may have been too high level. So instead, what is easiest for you to digest and verify? Actual code and working samples/demos that prove the behavior? Mathematics? Case studies by large customers?

-------

I dislike consensus protocols because they are difficult to scale. Deterministic algorithms however are not. Why? Consensus requires communication, and communication takes time. As you add more peers, you then have to do more communication in order to maintain consensus. But as more communication occurs, things bottleneck and get even slower. However, some problems require consensus (like finances, traditionally), and for those GUN would not be a good choice.

Deterministic algorithms only need the same inputs and they then spit out the same output as any other machine (running that algorithm) anywhere in the universe. This is why "immutable data structures" are all the rage lately. This is incredibly scalable because as long as the inputs are received (which might have been sent slowly over the network) then all the databases will converge to the same result (for the same inputs) regardless of whatever current state those databases might be in. And because a database maintains state, this type of guarantee is really important for databases.

WARNING NOTE: Missing input causes the databases to be temporarily out of sync because their inputs are different, however when the inputs are finally received (via retries or what not) the databases will sync up regardless of the ordering. Making sure the inputs can be sent in any order or retried is the idea of idempotency and is explored more with CRDTs and commutativity. This means GUN is "Eventually Consistent" and not Strongly Consistent, so don't use GUN in those scenarios.
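
To illustrate the convergence argument, here is a minimal sketch of a deterministic merge. It is NOT GUN's actual algorithm (see the wiki link earlier in the thread for that); it is a generic last-writer-wins rule with a lexical tie-break, and the `merge`/`replay` names and the update format are assumptions made for this example.

```python
from itertools import permutations

def merge(state, update):
    """Apply one update (key, timestamp, value) to a replica's state.

    The newer timestamp wins; on a tie, the lexically larger value wins.
    Both rules depend only on the inputs, never on arrival order.
    """
    key, ts, value = update
    current = state.get(key)
    if current is None or (ts, str(value)) > (current[0], str(current[1])):
        state = {**state, key: (ts, value)}
    return state

def replay(updates):
    """Fold a sequence of updates into a fresh replica."""
    state = {}
    for u in updates:
        state = merge(state, u)
    return state

updates = [
    ("color", 1, "red"),
    ("color", 2, "blue"),
    ("color", 2, "green"),  # same timestamp as "blue": lexical tie-break
    ("size", 1, "large"),
]

# Every delivery order, including duplicated (retried) messages, is replayed
# on its own replica; the set collects the distinct final states.
results = {
    tuple(sorted(replay(list(p) + list(p[:2])).items()))
    for p in permutations(updates)
}
print(len(results))  # 1: all replicas converged to the same state
```

Because the merge keeps the maximum under a fixed total order, it is commutative and idempotent, which is the property that lets replicas skip coordination. The price is the one stated above: a replica that has not yet received an input is temporarily behind.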


Thanks for the comment but I'm going to provide a hard rebuttal:

1) You assume that you have to have lots of code in order to solve hard problems. If that is your approach to code then I cannot appease you. GUN takes the UNIX/NodeJS philosophy - it is supposed to do one thing well and everything else is solved via extensions and plugins. What GUN focuses on is conflict resolution, which you can see demos of and learn more about here: https://youtu.be/-i-11T5ZI9o , https://github.com/amark/gun/wiki/Conflict-Resolution-with-G... , and towards the end of this tech talk https://youtu.be/kzrmAdBAnn4 .

2) I'm sorry you think my javascript is shoddy, however this is either a matter of arbitrary aesthetic opinion or some objective measure. You provide no objective measure. I'll propose one instead - does it work? And it does, however I'll be the first to admit it is not free of bugs. We're actually dealing with some bad bugs that emerge when you do more complicated permutations of commands, which is why we're in the middle of a big rewrite. I hope you'll find the new version (0.3.x, not out yet) less shoddy. However apart from subjective aesthetics, I don't find your comment to have any justification.

3) What do you mean that we have scratch test data in there intermingled with the main code? I don't think this is true, but I'm not sure what you mean. If you think this is an issue it would be great if you raised an issue on the project. Thanks in advance. :)

Please let me know if you've actually read through the entire wiki ( https://github.com/amark/gun/wiki ) and if there is anything you need to learn more about (which I'm sure there is) please don't hesitate to ask me and I'd be happy to write documentation on it. In fact, we need people like you to report what parts of the documentation are confusing or incomplete in order for us to know to fill it in.

Thanks for your comments, and please don't take me harshly - I just want to be hard on dispelling the terrible assumption that lines of code equate to quality or problem solving. Oftentimes more lines just create more bloat, monolithism, and increased bug complexity.


My main point is that it's impossible to reconcile your lofty claims about Gun with, frankly, the unreadable mess you have in your repo.

It doesn't matter if your ideas are great. You keep putting Gun forward as a kind of shiny, superior successor to Riak, Cassandra and CouchDB, which I'm afraid only makes you look foolish.

You seem to have done the branding before you did the code. First you need to demonstrate solid, mature code. Then you can try competing with the big boys.

A detailed critique of your code would take a lot of effort, but I've made some notes [1]. I find the code to be unreadable and full of bad code smell. With all the mutation and callbacks and events and what not going on, it's just too hard to tell how the code is intended to flow. I don't see a clean API, even an internal one, nor do I see a clean structure. It needs a lot of cleanup [2].

When talking about code, it's rarely just aesthetic. My philosophy is that if one can't even manage the trivially superficial aspects of code (formatting, structuring files, naming variables, and so on), it's not worth talking about its inner workings.

Sorry if this comes across as terribly harsh, but I hope you can see I'm trying to be constructive.

---

[1] A few examples:

* Your code is littered with arbitrary console.log() statements that have no actual utility. They look like the kind of temporary logging I insert whenever I need to trace a difficult behaviour.

* No code is documented, aside from childish comments like:

    } // IT WORKS!!!!!!

* Lots of what looks like global state.

* Lots of anonymously named variables like r, u, n.

* Big chunks of commented-out code (e.g., in radix.js).

* Lots of what looks like test data. Are you telling me this is part of the project?

    rad('user/marknadal', {'#': 'asdf'});

* Lots of mutation. E.g., in s3.js you get some kind of event and then you modify the inputs.

* Syntax formatting is inconsistent and shoddy.

* require() calls in the middle of code.

* You have a test folder, but it's barely got any actual tests. You do have at least one commented-out test in the middle of everything (list.js, line 25).

[2] You may want to look into replacing callbacks with promises, and restructuring it using more classical ES6-based OO, as opposed to the prototype-based, lexically-scoped style you're using now.


If we can find some common ground that we can measure I think that would help this discussion. In my previous post I proposed that we check my claims with whether gun works or not. These were my claims:

Claim 1. That gun is realtime. Claim 2. That gun is offline-first. Claim 3. That gun is a graph database. Claim 4. That gun is decentralized.

I then linked to the following 1-minute video ( https://youtu.be/-i-11T5ZI9o ) to back up my claims 1, 2, and 3. Either I am cheating/lying in the video, or gun does as I claim. Which is true?

Then, once the above has been verified, we can move on to the next issues (which you addressed while ignoring the earlier ones). Next up, I would agree that GUN is not production ready and therefore wouldn't be appropriate to stack up against Riak, Cassandra, and CouchDB. I will agree with you there. But this is merely a matter of time (we're already working on a battle suite test, hopefully getting Jepsen set up, etc.) and unfortunately it is important to do evangelism while you are developing a system so you can have beta testers. We need that. So please help out! Find bugs and problems, report them. :)

Third, regarding code, I don't think we'll be able to find common ground (other than the fact that we are already doing a rewrite, which suggests we agree to some degree). Why? Because I'm not one of those ES6 fan-loving Promise-everything boys. I'm not sure if you are or not, but your comment seemed suggestive of that direction. This won't lead to a fruitful discussion, because it is again a subjective view of the aesthetics of code (I find single-character variables easier to logically understand and maintain, for instance. This makes most people balk, but why? Because I am not an app developer I do not know what people are going to use the data for. I'm a tool developer, which follows more the work and lines of mathematics... `x` is a variable! It could be an integer, a float, or whatever else. This again is a personal preference so please don't downvote me for my personal coding style.)

If you choose not to use GUN because you disagree with my coding style, that is your choice and saddens me. But I will be very strong: please do not use that as an excuse to dismiss my claims. My claims can be checked and verified and I have backed them up.


Respectfully, the article is terrible from your perspective. To someone not steeped in the state-of-the-art goings-on of modern databases, it serves as a reasonable introduction.

E.g., imagine yourself with a passing knowledge of RDBs based on common web development. For that reader, this serves as an eye-opener.

You, personally, may not be the intended audience.


Fair enough. I accept this critique.


I take issue with the use of past tense for the Couch ecosystem. Things in our neck of the woods indicate the hockey stick is kicking in for offline first, especially with the Internet of Things.

Example: http://www.couchbase.com/nosql-resources/presentations/offli...


In the consumer apps space we are seeing airline booking and other location critical verticals pick us up. Here's Ryanair's story. http://www.couchbase.com/nosql-resources/presentations/how-r...


Thank you for correcting me. I obviously have not kept up with the ecosystem (I jumped ship sometime in 2011), so pardon me. I'm excited about the progress!


"without the motive power of capital, things move slowly"

slowly?! We've amassed huge collections of detailed knowledge in mere decades, and it's all searchable and discoverable.

"not one big pool of knowledge"

The real fantasy is centralization and taxonomies of everything. Distributed isn't a lazy thing we've settled on; it's the best approach. Each island grows about as far as it makes sense, and connects to other islands where it makes sense.

Sure, it may be imperfect and some datasets are hidden in corporations, but open data is certainly a healthy and growing ecosystem in no danger of extinction.


> The real fantasy is centralization and taxonomies of everything

Precisely. Currently there are plenty of graph-based databases to choose from - GraphDB, ArangoDB, Neo4j, Allegro, etc etc. They are all pretty good, and some also support rule-based inferencing, i.e. you can put a rule in the database like "if X is a person and belongs to team Y and the manager for team Y is Z, then Z is X's manager".

Systems like that have a few disadvantages:

1. They absolutely implicitly trust all data you put in them. If in the above example you add Z as a member to team Y, then it will infer that Z is Z's manager.

2. You cannot assert negatives, nor can you search based on negative predicates.

The first is the absolute killer for such technologies if you try to deploy them internet-wide and gather "data" from any random source publishing RDF documents.
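
The manager rule and its failure mode can be sketched in a few lines. This is a toy forward-chaining step over hypothetical (subject, predicate, object) triples, not the rule language of any of the databases named above; the predicate names are invented.

```python
# Hypothetical facts, stored as (subject, predicate, object) triples.
facts = {
    ("x", "is_a", "person"),
    ("x", "member_of", "team_y"),
    ("z", "is_a", "person"),
    ("z", "manages", "team_y"),
}

def infer_managers(triples):
    """If X is a person on team Y and Z manages Y, infer (Z, manager_of, X)."""
    people = {s for s, p, o in triples if p == "is_a" and o == "person"}
    inferred = set()
    for x, p1, team in triples:
        if p1 != "member_of" or x not in people:
            continue
        for z, p2, managed in triples:
            if p2 == "manages" and managed == team:
                inferred.add((z, "manager_of", x))
    return inferred

print(infer_managers(facts))

# The rule trusts its input: adding Z as a member of the team makes it
# conclude that Z is Z's own manager.
print(infer_managers(facts | {("z", "member_of", "team_y")}))
```

Nothing in the rule can express "unless X and Z are the same entity" without negation, which is where the second disadvantage comes in.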


ad 1: why not attach a credibility score to relations (probably aggregated from multiple sources) and query against those too?
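
One possible reading of this suggestion, as a rough sketch: each source contributes a credibility score to a relation, the scores are aggregated, and queries carry a threshold. The noisy-OR aggregation used here is purely an assumption for illustration (it treats sources as independent), and the data is made up.

```python
from collections import defaultdict

# Hypothetical assertions: (subject, predicate, object, source_credibility).
assertions = [
    ("z", "manages", "team_y", 0.9),
    ("z", "manages", "team_y", 0.5),
    ("z", "member_of", "team_y", 0.2),
]

def aggregate(rows):
    """Combine per-source scores with noisy-OR: 1 - prod(1 - c_i)."""
    disbelief = defaultdict(lambda: 1.0)
    for s, p, o, cred in rows:
        disbelief[(s, p, o)] *= 1.0 - cred
    return {triple: 1.0 - d for triple, d in disbelief.items()}

def query(rows, min_score):
    """Return only the relations whose aggregated score clears the bar."""
    return {t for t, score in aggregate(rows).items() if score >= min_score}

print(query(assertions, 0.8))  # only the well-supported "manages" relation
```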


What inference rules for unreliable data do you suggest?


Probabilistic Soft Logic, Markov Logic, Natural Soft Logic (https://youtu.be/EX1hKxePxkk?t=23m00s). Although those operate on the semantic level to credit or discredit drawn inferences and take phrases at face value, I assume you can factor in the credibility of sources as well.


is there something wrong with inductive logic? https://en.wikipedia.org/wiki/Inductive_reasoning


Inductive logic needs an implementation. These are things like Bayes Nets, Inference using Gibbs sampling, PSL, etc.

You can sorta, mostly get them to work if you have a clean graph, and are prepared to spend a long time debugging.

The problem is that no one has managed to get them to work well enough to be useful on dirty, web-scale knowledge graphs.


> There is as yet no absolute challenger to the relational model. When people think database, they still think SQL. But if there is a true challenger, it is in the graph model. Because graph data structures power social networks, and social networks are the dominant technological organism of the era.

You can use a fact database model (like Datomic's) and then you don't have to choose between row, column, document and graph based databases, those are projections on top of some index.

If entities are universally unique and attributes namespaced you can get a "database of everything". Or you need at least a mapping between entities from different vendors. That's the hard part :)
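
A rough sketch of what "projections on top of facts" can look like. This is plain Python over hypothetical (entity, attribute, value) tuples, not Datomic's API; the entity ids and the namespaced attribute names are made up.

```python
from collections import defaultdict

# Hypothetical facts as (entity, attribute, value); attributes are namespaced.
facts = [
    ("person/1", "person/name", "Ada"),
    ("person/2", "person/name", "Grace"),
    ("person/1", "person/follows", "person/2"),
    ("person/2", "person/email", "grace@example.com"),
]

def as_documents(fs):
    """Document view: one dict per entity, attribute -> list of values."""
    docs = defaultdict(dict)
    for e, a, v in fs:
        docs[e].setdefault(a, []).append(v)
    return dict(docs)

def as_column(fs, attribute):
    """Column view: the values of a single attribute, keyed by entity."""
    return {e: v for e, a, v in fs if a == attribute}

def as_edges(fs, entities):
    """Graph view: facts whose value is itself an entity become edges."""
    return [(e, a, v) for e, a, v in fs if v in entities]

entities = {e for e, _, _ in facts}
print(as_documents(facts)["person/1"])
print(as_column(facts, "person/name"))
print(as_edges(facts, entities))
```

The views are cheap because the facts share one shape; merging facts from different vendors still needs the entity mapping mentioned above.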


I'm old enough to remember being excited about FOAF too.

The idea came out of the world of blogs; if you have a blog, so the thinking went, you basically have an online identity endpoint. So if there was a simple way to describe relationships between those endpoints, you could write software to follow those relationships, and then... voila! A completely open social network.

The dream collided unpleasantly with reality in two places. First was the problem that running a blog implies a commitment to continual writing that most people are never going to make, so when your first step is "set up a blog somewhere" people think you're asking them to make a huge commitment and run away. And second, FOAF was built on RDF; RDF makes my brain hurt just thinking about it, and I'm a professional nerd! (The refrain was always "it doesn't matter how complicated the format is, users will only interact with it via tools that make it simple." But every time I hear this about a format it reminds me of all the other formats I've heard that about in the past, all of which are deader than doornails now.)


Perhaps I'm stereotyping, but whenever I read a love letter to graph databases, it usually begins with taking MySQL as the limits of the relational world.


The real power of relational databases is:

* it's fantastic for many kinds of queries you didn't have in mind when you created the data model

* it's good enough for most use cases

* it's been around long enough that safety and correctness features like transactions, foreign keys, UNIQUE and NOT NULL constraints etc. are well implemented and understood.

Any DB model that wants to challenge the ubiquity of RDBMS must address these points, and offer something on top. So far I haven't seen anything that would come close. Sure, there are lots of specialized use cases and specialized tools that are great in some areas (like geospatial data, graphs, time series, ...), but nothing that I'd use as a default choice for a project with vague requirements.
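
As a small illustration of the third point, here is a sketch using Python's built-in sqlite3 module. The `teams`/`people` schema and the `try_insert` helper are invented for this example; the point is only that the engine, not application code, rejects bad rows and rolls back the failed transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        team_id INTEGER NOT NULL REFERENCES teams(id)
    );
    INSERT INTO teams (id, name) VALUES (1, 'storage');
    """
)

def try_insert(sql, params):
    """Run one statement in a transaction; report whether it was accepted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

insert_person = "INSERT INTO people (name, team_id) VALUES (?, ?)"
print(try_insert(insert_person, ("ada", 1)))    # True: valid row
print(try_insert(insert_person, ("bob", 99)))   # False: no such team (foreign key)
print(try_insert(insert_person, (None, 1)))     # False: NOT NULL violated
print(try_insert("INSERT INTO teams (name) VALUES (?)", ("storage",)))  # False: UNIQUE
```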


I'm in love with Paul Ford's writing about technology. His What Is Code? cover story for Business Week was amazing: http://www.bloomberg.com/graphics/2015-paul-ford-what-is-cod... https://news.ycombinator.com/item?id=9698870

This is just a short piece, but I think it is so interesting to try to introduce non-technical readers to some of the knotty problems that everyone on HN takes for granted. And, the fact that the world's largest graph database (Facebook's) is built on top of a SQL database really is ironic, and it's great that he succeeds at explaining all the concepts well enough to get that across.

For dinosaurs like myself who still use an RSS reader (I'm a big fan of Inoreader), here is a full-text feed of his New Republic writing, made with Feedity and Five Filters: http://ftr.fivefilters.org/makefulltextfeed.php?url=https%3A...


I think the future "may" be the JCR Standard (Java Content Repository), or at least something built 'on top of' JCR. The only real contender IMO is Apache Jackrabbit Oak. In general what this article describes is a Content Repository, but a real world-wide semantic web has not yet been built with it. My little project meta64.com is a mobile front end for some of this kind of data processing/storage.


"There is as yet no absolute challenger to the relational model. When people think database, they still think SQL. But if there is a true challenger, it is in the graph model."

This article is quite biased towards graph databases with regard to the SQL versus NoSQL tension. This video presents a much more balanced view of SQL versus NoSQL, in my opinion. https://www.youtube.com/watch?v=qI_g07C_Q5I


This article doesn't address what is useful as we migrate into the cloud. First, the data may be larger than any disk image in the cloud. The data could be distributed across hundreds of disks in disparate geographical locations. Plus there may be hundreds of processes or services reading and modifying the data. Certain NoSQL architectures seem to work better then.


A guy writing about "graph databases" without once mentioning IMS, hmm http://www-01.ibm.com/software/data/ims/


I woke up and found Riak. I hope they turn search on by default, but one could argue that perfection is not a thing.


Check out Brodlist http://brodlist.com



