Now that people are considering NOSQL will more people consider no-DB (martinfowler.com)
227 points by fogus on Aug 31, 2011 | 140 comments



After reading the article and all the comments here, and from my own experience, I just don't think it's possible to not have a DB. At best, you write your own basic DB, because you don't need anything fancy.

For example, you write S-expressions to files like Hacker News does. This is clever, because the file system has some of the features of a database system, and files and S-expressions are abstractions that already exist. You do have to manage what data is in memory and what data is on disk at any given time, but the complexity and amount of code are low.

The idea that "event sourcing" somehow keeps you from needing a DB is ridiculous. By the time you've defined the event format, and written the software to replay the logs, etc., which if you're smart will be fairly general and modular, congrats, you've just written a database. At best, you keep complexity low, and it's another example of a small custom DB for a case where you don't need a fancy off-the-shelf DB. Maybe it's the perfect solution for your app, but it's still a database.

"Memory images," as a completely separate matter, are an abstraction that saves you some of the work of making a DB. Just as S-expressions can save you from defining a data model format, and files can save you from a custom key-value store, memory images as in Smalltalk could save you from having to deal with persistence. And if your language has transactions built in, maybe that saves you from writing your own transaction system. In general, though, it's very hard to get the DB to disappear, as there is a constellation of features important to data integrity that you need one way or another. It's usually pretty clear that you're using a DB, writing a DB, or using a system that already has a DB built in. If you think there's no DB, there's a high chance you're writing one. Again, that could be fine if you don't need all the features and properties of a robust general-purpose DB.

Funnily enough, in EtherPad's case, we had full versioned history of documents, and did pretty much everything in RAM and application logic -- a pretty good example of what the article is talking about -- and yet we used MySQL as a "dumb" back-end datastore for persistence. Believe me, we tried not to; we spent weeks trying alternatives, and trying to write alternatives. Perhaps if every last aspect of the data model had been event-based, we could have just logged the events to a text file and avoided SQL. More likely, I think, we would use something like Redis now.


Of course, filesystems are a kind of database as well.


Yes, agreed, which is why I find Fowler's (and others') articles a little hard to take seriously. They ARE writing to disk, and doing a lot of other work to try to keep the data intact in case of failure... which is what a DB does. I'm all for the idea of keeping some data in memory for speed, but moving it all to resident memory is just moving the same components around.


Was there something specific in the data model that made the versioning hard to write? Or was it that, for this to work across the board, the entire model had to be versioned?

It sounds that SQL itself wasn't the problem. Were you looking for versioning alternatives in SQL that weren't up to par?


SQL was a solution more than a problem; we were just hoping for a simpler or more elegant solution.

For example, it's possible that logging all data model changes to a text file would have given us the persistence we were looking for without bridging all our data to SQL. Cutting out SQL from our production set-up would have been an inherent win -- one less process to manage, one less black-box source of complexity, etc.


In many applications, data outlives code. This is certainly the case in enterprise applications, where data can sometimes migrate across several generations of an application. Data may also be more valuable to the organization than the code that processes it.

While I'm no fan of databases, one obvious advantage is that they provide direct access to the data in a standard way that is decoupled from the specific application code. This makes it easy to perform migrations, backups etc. It also increases one's confidence in the data integrity. Any solution that aims to replace databases altogether must address these concerns. I think that intimately coupling data with the application state, as suggested in the article, does not achieve this.


The goal is not to replace databases altogether. The goal is to solve some particular problems very well. Last time I used this approach, for example, we mirrored a bunch of data in a traditional SQL store for reporting and ad-hoc querying, things that databases are great at.

In my view, direct access to data decoupled from application code is a bug, not a feature. With multiple code bases touching the same data, schema improvements become nearly impossible.

I also think data integrity is easier to maintain with a system like this. SQL constraints don't allow me to express nearly as much about data integrity as I can in code. Sure, I could use stored procedures, but if I'm going to write code somewhere, I'd rather it be in my app.


"I also think data integrity is easier to maintain with a system like this."

If you are in the middle of a transaction and you realize that some constraint is being violated, how do you roll it back without interfering with the other transactions?


I can't speak to all systems like this, but the Prevayler approach is pretty straightforward. Most importantly, there are no simultaneous transactions: changes happen one at a time. That seems crazy if you're used to dealing with disk-backed databases, but if everything is hot in RAM, then it's not a problem. In that context, it's pretty easy: when you start executing a change you verify all your constraints before doing anything.
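For anyone unfamiliar with the pattern, this is roughly its shape: a simplified Python sketch, not Prevayler's actual Java API. The journal here is an in-memory list standing in for the on-disk command log, and the names are made up.

```python
import threading

class Prevalent:
    """Prevayler-style system prevalence, simplified: all state lives
    in RAM, and write transactions execute strictly one at a time."""
    def __init__(self):
        self.state = {"accounts": {}}
        self.lock = threading.Lock()  # serializes all writers
        self.journal = []             # stands in for the durable command log

    def execute(self, command):
        with self.lock:
            command.validate(self.state)  # check every constraint first...
            self.journal.append(command)  # ...then journal it for replay...
            command.apply(self.state)     # ...then mutate, which cannot fail

class Deposit:
    """A command object: validation is separate from mutation."""
    def __init__(self, account, amount):
        self.account, self.amount = account, amount

    def validate(self, state):
        if self.amount <= 0:
            raise ValueError("deposit must be positive")

    def apply(self, state):
        balances = state["accounts"]
        balances[self.account] = balances.get(self.account, 0) + self.amount
```

Because validation happens before any mutation, a rejected command leaves the state untouched, and no rollback machinery is needed.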


"Most importantly, there are no simultaneous transactions... but if everything is hot in RAM, then it's not a problem"

Uh, OK. So, you're happy with single-core boxes then, I take it?

Actually, you're regressing to even before that, when there was no pre-emptive multitasking. When the program is done doing something, it yields control to some other task.

Also, I'd like to point out that just because you aren't explicitly doing I/O doesn't mean that you aren't doing I/O. The OS might have paged out some stale data (quite likely, since you aren't managing I/O yourself), and you might be holding the giant lock while it's paging it in.

"I can't speak to all systems like this, but the Prevayler approach is pretty straightforward."

I just want to clarify: so when you encounter a problem, you do some rollback, which automatically moves the state to the last snapshot and rolls forward to the previous transaction, right? No manual steps?

I hope you have a recent snapshot or that will be a long wait (while holding the big lock, I might add).


> Uh, OK. So, you're happy with single-core boxes then, I take it?

Not at all. You use only one core for the core execution of write transactions, but that's a small part of any real system. All cores can read simultaneously. All cores can also do all sorts of other work, including preparing transactions to execute, deserialization of requests, rendering responses, logging, and anything else your app needs to get up to.

The limit is also one core per transactional domain. If you can split your data up into lumps between which you never need transactions, you can happily run one core on each.
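A sketch of that partitioning, assuming the domains really are independent (hypothetical names; real systems would pin each domain's writer to its own thread or core):

```python
import threading

class ShardedPrevalence:
    """One writer lock per transactional domain, so writes in
    independent domains never contend with each other."""
    def __init__(self, domains):
        self.state = {d: {} for d in domains}
        self.locks = {d: threading.Lock() for d in domains}

    def execute(self, domain, mutate):
        # Writers serialize only within their own domain.
        with self.locks[domain]:
            mutate(self.state[domain])
```

The catch, as noted above, is that no transaction may span two domains; the moment one does, you're back to a single lock or a distributed-commit protocol.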

> Also, I'd like to point out that just because you aren't explicitly doing I/O doesn't mean that you aren't doing I/O.

Actually, it does explicitly do I/O. You do it just before every command executes.

> The OS might have paged out some stale data.

I guess that's possible, which would indeed cause a momentary pause, but this approach is typically used with dedicated servers and plenty of RAM, so it's never been a problem in practice for me.

> I just want to clarify: so when you encounter a problem, you do some rollback, which automatically moves the state to the last snapshot and rolls forward to the previous transaction, right?

You mean a bug in our code that causes a problem? Depends on the system, I suppose. Prevayler had an automatic rollback. It just kept two copies of the data model in RAM; if a transaction blew up it would throw out the possibly tainted one. But there are a number of ways to solve this, so I don't advocate anything in particular. Other than heavy unit testing, so that things don't blow up much.


"All cores can read simultaneously."

Unless someone is writing, of course, in which case you have to worry about isolation. So a single writer would block all readers, right?

"You mean a bug in our code that causes a problem?"

No, I mean like "I already wrote some data, but now a constraint has been violated so I need to undo it".


> So a single writer would block all readers, right?

Correct. For the fraction of a millisecond the transaction is executing, anyhow. Since transactions only deal with data hot in RAM, transactions are very fast.

> No, I mean like "I already wrote some data, but now a constraint has been violated so I need to undo it".

That shouldn't happen, and I've used two approaches to make sure. One is do all your checking before you change anything. The other is to make in-command reversion easy, which is basically the same way you'd make commands undoable.

Basically, instead of solving the problem with very complicated technology (arbitrary rollback), you solve it with some modest changes in coding style. Since you never have to worry about threading issues, I've found it pretty easy.
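The second approach, in-command reversion, might look like this (a hypothetical sketch: each mutation records its inverse, so a constraint failure partway through a command can be undone cleanly):

```python
class ReversibleCommand:
    """Each mutation pushes its inverse onto an undo stack, so a
    constraint failure mid-command can roll back just this command."""
    def __init__(self, state):
        self.state = state
        self.undo_stack = []

    def set_key(self, key, value):
        # Record (key, did it exist, old value) before overwriting.
        self.undo_stack.append((key, key in self.state, self.state.get(key)))
        self.state[key] = value

    def revert(self):
        # Undo mutations in reverse order.
        while self.undo_stack:
            key, existed, old = self.undo_stack.pop()
            if existed:
                self.state[key] = old
            else:
                del self.state[key]
```

This is the same bookkeeping you'd do to make commands undoable for users, which is why the two features tend to come for the same price.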


> Correct. For the fraction of a millisecond the transaction is executing, anyhow. Since transactions only deal with data hot in RAM, transactions are very fast.

Transactions don't just read and write. They sometimes compute things, like joins, which can take several milliseconds. These computations often must run within the transaction and would thus need to acquire the lock for several milliseconds.


Do you have a real-world example?

Joins haven't been a problem for me, mainly because this approach doesn't constrain you to a tables-and-joins model of the world. With Prevayler, for example, you treat things as a big object graph, so there are no splits to join back up.

Of course, it could be that some problem is just computationally intense, but I can think of a number of approaches to lessen the impact of that in a NoDB system.


I also think data integrity is easier to maintain with a system like this. SQL constraints don't allow me to express nearly as much about data integrity as I can in code. Sure, I could use stored procedures, but if I'm going to write code somewhere, I'd rather it be in my app.

Surely I could just as easily say 'Sure, I could use a data access layer in my app but if I'm writing a multi-app database, I'd rather the database enforced its own integrity.'?

Nothing's the perfect tool for every job, but I certainly think stored procedures have their place and have the power to handle the bulk of tasks.


Yes, you can definitely do it either way. Years ago as a demo a friend built the heart of a financial exchange in stored procedures. It was very fast, and very reliable. But the same is true about the LMAX system that Fowler describes.

Personally, though, I'd much rather do my important coding in a real programming language. Better tools, more libraries, bigger communities, and no vendor lock-in.


Hmmm.

Modern SQL dialects are Turing complete and frankly pretty rich languages. I know MS SQL Server best so can't speak in detail for others, but the community around that is certainly very substantial. Library support, well, doesn't work quite the same way (yet!) but there are plenty of libraries of code samples available for adapting. Vendor lock-in is a tricky one; by the time you've got to a certain scale of application I tend to think you're programming as much to the API (whether it's the provider's standard API or your own specific API layered over the underlying platform) as to the official 'language'; lock-in can creep up surprisingly easily. Facebook avoided vendor lock-in by writing in open PHP and has since had to write its own PHP compiler to get the performance it needed from the solution it was locked in to.

A former employer used to bulk-process EDI order lines in very large quantities: deduplicating them, dynamically rebatching them according to what was available and what wasn't, updating orders with newer product where the customer had specified 'this or better', cross-referencing against multi-million-row datasets of cataloguing and tagging information to identify how to handle each item. It was a monster; I hate to think about the volumes of data that touched each batch, and with order processing it absolutely had to have transactional integrity. And yet, written in SQL and running on a very average commodity server, it was actually very fast. The data never left the server until it was ready to do so and all the loads stayed happily internal. The implications of trying to implement it on a NoDB solution - the volumes of data being passed around, the amount of data-specific library code the DBMS provides but the underlying language doesn't, which would need writing... It's not pretty.

I don't maintain SQL is the perfect language for everything; that's patently silly. But I do maintain it's a lot more powerful (and with good performance and reliability) than it's given credit for on some very complex operations, and that a lot of the reasons people prefer to work in alternatives boil down to lack of understanding. A little learning of what a modern DBMS is capable of can reap huge rewards in saved work in the 'real programming language', as you put it.


Like you, I believe in the right tool for the job. For many applications, an SQL server is awesome.

I'm sure the developer community for MS SQL Server is reasonably large, but it is much, much smaller than the Ruby community or the Java community. The same is true for library code for each environment.

I think the Facebook example cuts the other way. If they had implemented all their core application logic in MS SQL Server, they would have been well and truly screwed if the performance wasn't enough. With PHP, at least they could write their own compiler; trying to reverse-engineer MS SQL Server is orders of magnitude harder.

Regarding the "volumes of data being passed around" part, that works well with a hot-in-RAM system; no data is passed anywhere. As with doing it all in stored procedures, no data leaves the server.

I do agree that a lot of people don't get full value out of their database. Sometimes that's a very reasonable business choice: vendor lock-in is extremely expensive at scale. But it does often come from ignorance. On the other hand, almost every developer has written a few database-backed applications, but very few have written anything in the NoDB style. Many can't even conceive of coding something without a database. I'd love to see both sorts of ignorance reduced.


Ever hear of ANSI SQL?


Which is not Turing complete. You need database-specific extensions to get that.

Edit: Also, not all databases follow the standard very closely


What have you needed out of ANSI SQL that is a gap in its Turing Completeness? Totally serious. A great many things can be dismissed as not being Turing Complete, so please provide us with some examples of why this is bad in ANSI SQL.


I did not mean that ANSI SQL was bad. However, by not being Turing complete it has fundamental limitations that keep it from expressing certain logic (as you might need to do in a stored procedure). This frequently means that you must use proprietary extensions to SQL (such as PL/SQL) to accomplish these tasks.

My interpretation of the parent post was that it was a response to a comment about vendor lock in. I was only trying to point out that it is not always possible to ensure compatibility between databases by writing strict ANSI SQL.


I'm sorry but I don't even know what you are talking about. Who cares if ANSI SQL is Turing complete?

It stores data just fine and is not vendor specific.


You care about Turing completeness if you are trying to express something that requires it (which frequently needs to be done in stored procedures). Also, SQL is not a data storage engine; it is a query language.

ANSI SQL is not vendor specific, but it is just a standard, not an implementation. As a result you have to rely on vendors to implement the language, and many vendors deviate from the standard. This means that you cannot just write ANSI SQL and expect it to work on all databases.


Aren't vendors finally starting to abandon their proprietary crap in favor of SQL/PSM?


Interesting, I didn't know about SQL/PSM. Although judging by this:

http://en.wikipedia.org/wiki/SQL#Procedural_extensions

It looks like a lot of vendors still only support proprietary extensions.


> In my view, direct access to data decoupled from application code is a bug, not a feature. With multiple code bases touching the same data, schema improvements become nearly impossible.

It's unclear that multiple applications with direct access to said data make schema improvements any easier.

The obvious solution, copying the data for applications that are using the new schema, pretty much guarantees that one or more of the copies are wrong.

> I also think data integrity is easier to maintain with a system like this. SQL constraints don't allow me to express nearly as much about data integrity as I can in code. Sure, I could use stored procedures, but if I'm going to write code somewhere, I'd rather it be in my app.

How do you guarantee that all of the apps that touch that data use the current version of said code?

Code normalization is as important as data normalization.


Very well put. It is not access type (direct vs indirect) that make schema improvements hard, but access preservation: availability. And indeed copying the data often leads to one or more of the copies being wrong, unless special measures are taken in that direction.

> How do you guarantee that all of the apps that touch that data use the current version of said code?

An approach that may be worth considering is to not require all apps to use the current version: allow multiple versions. For some cases this would work, say if the semantics of the newer version are backwards compatible with the semantics of the older version. If the data semantics are preservable, transforming a schema could happen while each data access request to the schema is actively transformed.

But it clearly wouldn't work in all cases. More work to handle that would be needed.

> Code normalization is as important as data normalization.

True, this is the near show-stopper really. In that case, the best one can hope for is preparing the state of the new version (new data in the new schema) and carefully coordinating a quick restart from the old version to the new version.

I would love to hear your thoughts on this. We have been working towards that direction with ChronicDB (http://chronicdb.com) and would welcome feedback.


> I would love to hear your thoughts on this.

An e-mail address in your profile would have made that possible. (Chronicdb looks interesting and complements something that I've been thinking about. I suspect that you've implemented many of the relevant mechanisms.)


> How do you guarantee that all of the apps that touch that data use the current version of said code?

In a NoDB app? It's very easy: only one code base ever directly touches the data, because the data lives in the RAM allocated to the app. You give external access via an API, so integrity is very easy to enforce.


Congratulations, you have reinvented the integrity constraint! Except IBM have been working on this for 40 years, making it reliable and performant. I have yet to see anyone roll their own data integrity layer that comes anywhere close to the major vendors.


Good point. With databases distributing data in-memory across machines, shared memory becomes the database. Don't be surprised if Arc runs HackerNews on distributed memory some day...

But one has to wonder, what happens when you need to upgrade the app? Shutting down the process and destroying the memory image doesn't seem like the best option:

- First, it disrupts connected applications since the process is killed, introducing downtime.

- Second, when starting up again in say version 2, the data that will be loaded in memory still needs to be transformed in the format expected by version 2. This transformation can take time on large data, introducing further downtime.

The challenge would be to eliminate this downtime by combining a solution for both client disruption and state transfer. A data abstraction like a database using SQL can simplify such a solution.


"With multiple code bases touching the same data, schema improvements become nearly impossible." Few companies have procedures in place that allow this; but it is possible if you have the right procedures.


Sure, but the problem becomes harder the larger you get. Look at almost any Internet-wide deployment, though, and you see the alternative: isolate database schemas behind APIs, and rev APIs and schemas separately, as the situation demands.


Isolating databases behind APIs and rev-ing API+schema separately is not enough. When the schema changes, data must be transformed to match the new schema version. As you point out this takes too long with a large database, and it doesn't account for data consistency.

We have been working on building what we hope are the procedures for this with ChronicDB (http://chronicdb.com). But it turned out harder than it seems, and we are not sure it will quite work out. We'd welcome feedback.


And not the last thing is security and permissions to access different parts of data. I see no way to have it easily implemented in the event logging system.


Chmod? Seriously, if you need different permissions for data in an event log, just write multiple event logs - each with only the data it needs - and store them with different permissions. This assumes that they can be properly decoupled, but since you're the one writing the event log, you can set it up however you want.


Yeah, right. Adding more pieces to the puzzle makes it more interesting to solve!


When people set out to design a SQL database, they usually end up updating and deleting records. This is bad because it destroys history, and nothing that you can add to your SQL architecture will fix it at a fundamental level.

By basing your system on a journaled event stream, you start with a foundation of complete history retention, and you can build exactly the sort of reporting views you need at any time (say, by creating a SQL database for other applications to query).
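For example, a reporting view can be derived from the journal at any time, along these lines (hypothetical event shapes, using SQLite as the derived reporting store):

```python
import sqlite3

# Hypothetical event stream: the journal is the source of truth,
# and the SQL table is a derived view that can be rebuilt at will.
events = [
    {"type": "order_placed", "order_id": 1, "amount": 50},
    {"type": "order_placed", "order_id": 2, "amount": 75},
    {"type": "order_cancelled", "order_id": 1},
]

def build_reporting_db(events):
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER, status TEXT)"
    )
    for e in events:
        if e["type"] == "order_placed":
            db.execute("INSERT INTO orders VALUES (?, ?, 'open')",
                       (e["order_id"], e["amount"]))
        elif e["type"] == "order_cancelled":
            db.execute("UPDATE orders SET status = 'cancelled' WHERE id = ?",
                       (e["order_id"],))
    return db
```

If your reporting needs change, you throw the table away and replay the journal into a new shape; no history is lost because the view was never the system of record.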


When people set out to design a data driven application, they usually end up updating and deleting records.

FTFY...

It's not hard to build history into a SQL table design. You can even store events in a... wait for it... SQL database. I have built numerous systems backed by SQL databases that have complete history retention. Questions like "who had id X on this date 3 years ago?" are easily answerable with basic, standard SQL.
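One common way to do that is a history table with validity ranges. A minimal SQLite sketch (hypothetical schema; the sentinel date marking the current row is a convention, not a standard) answering exactly that question:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE id_history (
    id TEXT, owner TEXT, valid_from TEXT, valid_to TEXT)""")
# A valid_to of '9999-12-31' marks the currently active row.
db.executemany("INSERT INTO id_history VALUES (?, ?, ?, ?)", [
    ("X", "alice", "2006-01-01", "2009-06-30"),
    ("X", "bob",   "2009-06-30", "9999-12-31"),
])

def owner_on(db, ident, date):
    # Half-open ranges: valid_from <= date < valid_to.
    row = db.execute(
        "SELECT owner FROM id_history "
        "WHERE id = ? AND valid_from <= ? AND ? < valid_to",
        (ident, date, date)).fetchone()
    return row[0] if row else None
```

ISO-8601 dates compare correctly as strings, which keeps the query portable across dialects.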

I certainly don't believe SQL databases are perfect or the tool for every job, but in many cases they work just fine until you get into very large datasets. Admittedly, I only deal with databases in the 100s of GB range so I have yet to personally run into the scaling problems that a Google or Facebook have and the SQL backed systems I have built work just fine.


It's not hard, no, but it usually doesn't happen in the average application. That's the issue: it's not built in, it's not standardized, and every SQL database is fully mutable by default.

If your system operates in this journaled/event-sourcing way at the most basic level then you have the ultimate future-proof storage layer. You could decide to completely change the way the data is stored and represented (in-memory or otherwise) at any time, as long as you have that raw history.


That's one of the things ChronicDB does. It makes historical values available in the average application.


You lose some features of SQL when you have historical values.

For example, if you want only customers to be able to create orders, you can set a foreign key between customer and order. But if you never delete orders and customers, the database will allow you to insert new orders against old customers, unless you apply a more complex constraint.


This can be handled by deleting orders from the logical representation, yet preserving them in the historical layer.


Deletes on enterprise SQL systems are usually prevented (the preferred pattern is "mark for delete + purge" similar to a VM's garbage collection). The Application ignores "marked" data as deleted.

That leaves tracking the remaining inserts and updates, which is a well-understood problem. It's called auditing. Here is a simple script that will auto-audit a SQL Server database... variations in other SQL dialects are likely just as straightforward.

http://www.geekzilla.co.uk/ViewECBC0CC3-1C7E-4E7E-B243-F2F25...


You can have a design where your previous version of a record gets automatically copied into another table along with the timestamp of the operation. Then you can slice this history however you want. All with no additional app code.
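In SQLite, for instance, that design is a single trigger (hypothetical schema; other dialects have very similar syntax):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customers_history (
    id INTEGER, name TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP);
-- Copy the previous version of the row on every update; no app code needed.
CREATE TRIGGER customers_audit BEFORE UPDATE ON customers
BEGIN
    INSERT INTO customers_history (id, name) VALUES (OLD.id, OLD.name);
END;
""")
db.execute("INSERT INTO customers VALUES (1, 'Acme')")
db.execute("UPDATE customers SET name = 'Acme Corp' WHERE id = 1")
```

The history table fills itself in regardless of which application issues the update, which is exactly the property an in-app audit layer cannot guarantee.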

But I wouldn't write off the NoDB approach for various transitional data, or data that isn't meant to live long anyway, like tweets.


Why would the lifetime of the data be at all relevant? Most noDB (a term I don't like much) stores are built to be highly durable.

There are a lot of questions that make the choice of data store a difficult one, but I'm not sure that plays into it at all.


I used to work on an application that did all the typical insert/update/delete operations on the core data tables and retained a journaled event stream in a separate audit database sufficient to regenerate the entire database from scratch (which was done at least a couple of times).

I suppose it would be equally possible to think of the main data tables as a "reporting view" in the sense you use here, except that the application was 1000:1 or more in update frequency to read frequency, and all the reads were performed on the main data tables, so that's kind of a "tail wagging the dog" view of the app.

For an application with different requirements, your view of things might be quite useful, of course.


The storage system of Postgres, one of the earliest RDBMSs, uses MVCC. Early versions even allowed you to roll back or query historical data.

http://wiki.postgresql.org/wiki/MVCC


Perhaps I am an old dinosaur, but this article merely annoyed me.

"The key element to a memory image is using event sourcing, which essentially means that every change to the application's state is captured in an event which is logged into a persistent store."

That is a key element of a database. It's called a logical log.

"Furthermore it means that you can rebuild the full application state by replaying these events."

Yup, logical log.

"Using a memory image allows you to get high performance, since everything is being done in-memory with no IO or remote calls to database systems. "

This is _exactly_ what sophisticated old-school databases do. You can have them write to disk on commit, or just to memory, and have a thread take care of the I/O in the background.

"Databases also provide transactional concurrency as well as persistence, so you have to figure out what you are going to do about concurrency."

Righty-ho.

"Another, rather obvious, limitation is that you have to have more memory than data you need to keep in it. As memory sizes steadily increase, that's becoming much less of a limitation than it used to be."

So why not store your old-school DB in memory?

I can understand the argument that you don't want to lock into a big DB vendor's license path, but the technical arguments here look distinctly weak to me.

Maybe old-fashioned DBs are hipper than people think?


These are all good points but the core of Fowler's article is that the persistence is against the application's object structures directly, with no translation to relational concepts needed (note I am the author of a very popular object-relational library, so I'm not in any way opposed to object-relational mapping...it's just interesting to see this approach that requires none). That it's stored in memory and is reconstructed against an event log are secondary to this.


Initially he makes it sound like there is no translation into a different model but at the end of the article he takes it all back and for good reason:

"Also it's important to keep a good decoupling between the events and the model structure itself. It may be tempting to come up with some automatic mapping system that retrospects on the event data and the model, but this couples the events and model together which makes it difficult to migrate the model and still process old events."

So, you're right that he doesn't envision a translation to a relational model but it's not just object structures either.


Right, but the event system in question could be built up nicely in a couple of hours most likely, and the level of "translation" would be minimal compared to OR mapping - no columns/rows/joins/tables/anything else like that.

With such an application I'd probably still be writing the events themselves to a relational database for archiving and potentially sending out report-oriented data as well. I'm not sure how all of that would work out re: ultimately the whole app needs to be stored in an RDBMS anyway for various reasons but it seems interesting to try.


Well, that put it much better than the article did.

Since the points I've highlighted argue that he's talking about database features, I'd call that a database without impedance mismatch, like Erlang's Mnesia.


When you say old-school DB do you mean something like mysql?


Yes, MySQL is the kind of database that canonically uses approaches like physical logs and logical logs to provide ACID transactions (and, in MySQL in particular, replication.)

The interesting thing that Prevayler and such things did was that they expanded the use of logical logs beyond simple relational tables to much richer sets of state.


Well I guess one difference between what he's proposing and mysql is that mysql forces you to write data abstractions that can fit into mysql. Same goes for NoSQL DBs. Using his approach you don't have to worry about that.

Not sure that justifies dumping DBs altogether, but it's still an interesting advantage.


No matter how skilled I become as a developer, there is always something lurking around the corner to make me feel more naive than ever. As I was reading this article, I realized that my whole career and knowledge about the way applications work is based around the one core idea that when non-binary data needs to be persisted, you use a database.

The idea that you can reliably use event sourcing in memory to persist your data is as foreign to me as it is impressive. Is anyone familiar with major applications (web apps, ideally) that use this method for their data persistence?


You're already familiar with a couple of things that can be built this way: word processors and multiplayer game servers. In both cases SQL databases are too slow and too awkward.

Financial trading is another area where databases are too slow. I know of one place that uses this approach to keep pricing data hot in RAM for their financial models. And Fowler previously documented using this for a financial exchange:

http://martinfowler.com/articles/lmax.html


I wonder about this concept. The reality is that powering something like NASDAQ means having enough RAM for everything, plus being able to accurately reproduce the state of the data following a crash from input kept in a durable store. That durable store effectively means I/O to the disk, which is the same as, well, just writing to a DB to begin with.

Of course, Fowler talks about 'snapshotting' the data, which, again, makes me wonder whether all this juggling of resident memory and the systems needed to make it happen haven't already been solved by...um...databases.


I don't know if it uses "event sourcing" per se, but doesn't HN use in-memory & serialized Lisp data structures instead of a DB?


As did Viaweb.


This is the field we choose.

Expand your thinking in the abstract about what a database is. As 71104 mentions, a file system is also a database. What you are thinking of as "a database" is really a specific type of key-value store that is located on disk. But the fact that most DBs are on disk has nothing to do with the concept itself.


It seems like email maps onto this model fairly well. An event (mail) comes in and gets written to the log (mbox). Starting the server (session) entails reading the events from file. It's not a perfect fit, of course.

The big players are backing their services with custom data stores, though.


I've written some programs like this, even to the point of replaying the entire input history every time my CGI script got invoked. It's surprising what a large set of apps even that naïve approach is applicable to, and there are some much more exciting possibilities under the surface.

To the extent that you could actually write your program as a pure function of its past input history — ideally, one whose only O(N) part (where N was the length of the history) was a fold, so the system could update it incrementally as new events were added — you could get schema upgrade and decentralization "for free". However, to get schema upgrade and decentralization, your program would need to be able to cope with "impossible" input histories — e.g. the same blog post getting deleted twice, or someone commenting on a post they weren't authorized to read — because of changes in the code over the years and because of distribution.
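A tiny sketch of that "pure function of the input history" idea (the event encoding and all names here are invented): state is a fold over the event list, and making each handler idempotent is what lets "impossible" histories, like the same post deleted twice, pass through harmlessly.

```java
import java.util.*;

// State as a pure fold over the event history. Duplicate or
// "impossible" events (deleting a post twice, re-adding an
// existing post) are tolerated by making handlers idempotent.
public class EventFold {
    // Hypothetical event format: "add:<id>" or "del:<id>".
    public static Set<String> replay(List<String> history) {
        Set<String> posts = new LinkedHashSet<>();
        for (String event : history) {
            String[] parts = event.split(":", 2);
            switch (parts[0]) {
                case "add": posts.add(parts[1]); break;    // re-adding is a no-op
                case "del": posts.remove(parts[1]); break; // deleting twice is a no-op
                default: break; // unknown events from older code versions are skipped
            }
        }
        return posts;
    }

    public static void main(String[] args) {
        // The same post deleted twice -- an "impossible" history -- is harmless.
        List<String> history = List.of("add:p1", "add:p2", "del:p1", "del:p1");
        System.out.println(replay(history)); // prints [p2]
    }
}
```

The fold is also the O(N) part the comment mentions: new events only require applying the handler to the current state, not re-running the whole history.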

I called this "rumor-oriented programming", because the propagation of past input events among the nodes resembles the propagation of rumors among people: http://lists.canonical.org/pipermail/kragen-tol/2004-January...

I wrote a bit more on a possible way of structuring web sites as lazily-computed functions of sets of REST resources, which might or might not be past input events: http://lists.canonical.org/pipermail/kragen-tol/2005-Novembe...

John McCarthy's 1998 proposal, "Elephant", takes the idea of writing your program as a pure function of its input history to real-time transaction processing applications: http://www-formal.stanford.edu/jmc/elephant/elephant.html

The most advanced work in writing interactive programs as pure functions of their input history is "functional reactive programming", which unfortunately I don't understand properly. The Fran paper http://conal.net/papers/icfp97/ is particularly influential, and there's a page on HaskellWiki about FRP: http://www.haskell.org/haskellwiki/Functional_Reactive_Progr...


Wouldn't this system have a bunch of drawbacks:

- Long startup times as the entire image needs to be loaded and prepared.

- It would be hard to distribute the state across multiple nodes

- What happens in case of a crash? How fault tolerant would this be?

- Does this architecture essentially amount to building a sort-of-kind-of datastore into your already complex application? Without a well-defined, well-tested existing code base, is this just re-inventing the wheel for each new project?

- How do you enforce constraints on the data?

- How do transactions work (debit one account, [crash], credit another account)?

- How do you allow different components (say web user interface, admin system, reporting system, external data sources) to share this state?

Just curious.

EDIT:

- Isn't this going to lead to you writing code that almost always has side-effects, causing it to be really hard to test? How would you implement this system in Haskell?


- The startup times can be a problem if you have a lot of data. Modern disks are pretty fast for streaming reads, though, and you can split the deserialization load across multiple processors.

- Mirroring state is easy; you just pipe the serialized commands to multiple boxes.

- It's very fault tolerant. Because every change is logged before being applied, you just load the last snapshot and replay the log.

- It didn't seem that way to me.

- In code. In the system I built, each mutation was packaged as a command, and the commands enforced integrity.

- Each command is a transaction. As with DB transactions, you do have to be careful about where you draw your transaction boundaries.

- Via API. Which I like better, as it allows you to enforce more integrity than you can with DB constraints.
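The fault-tolerance point above (load the last snapshot, replay the log) can be shown with a toy model; here the whole "model" is a single balance and the log is a list of deltas, with all names invented. A real system would deserialize a snapshot file and a command journal instead.

```java
import java.util.*;

// Crash recovery sketch: state = last snapshot + replay of the log tail.
// The "events" here are hypothetical credit/debit deltas.
public class Recovery {
    public static long recover(long snapshotBalance, List<Long> logTail) {
        long balance = snapshotBalance;
        for (long delta : logTail) balance += delta; // re-apply each logged mutation in order
        return balance;
    }

    public static void main(String[] args) {
        // The snapshot said 100; the log recorded +50 and -30 after it was taken.
        System.out.println(recover(100, List.of(50L, -30L))); // prints 120
    }
}
```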


Thanks for the informative response. Just a couple more questions:

> - The startup times can be a problem if you have a lot of data. Modern disks are pretty fast for streaming reads, though, and you can split the deserialization load across multiple processors.

Reading data at even a GB/second from disk (which is currently not possible) is going to mean a second spent per GB of data just to read it, let alone deserialize it. And that's reading a snapshot, not replaying old transactions.

> - Mirroring state is easy; you just pipe the serialized commands to multiple boxes.

That's not distributing the load. I'm talking about having more data than fits in a reasonable amount of RAM (say 1TB). Also, mirroring is nice for when you want read-only access to your data; you'll have the same problem as any other data store when you want multiple writers. Also, is replication synchronous or asynchronous (which end of CAP do you fall on)?

>- It's very fault tolerant. Because every change is logged before being applied, you just load the last snapshot and replay the log.

So it's going to go at the speed of the disk then (http://smackerelofopinion.blogspot.com/2009/07/l1-l2-ram-and...). Don't get me wrong, this is still faster than writing to the network, but writes are still way slower than reads.

My other question is how much of a pain in the ass is it to debug such a system? I suppose if you have a nice offline API to look at your data, change something, revert back, etc, it would work well, but if it's deep within your normal application, it could become nightmarish.


> Reading data at even a GB/second from disk (which is currently not possible) is going to mean a second spent per GB of data just to read, let alone deserialize.

If that's just saying that startup time can be an issue, I agree. There are a variety of techniques to mitigate that, though. The simplest is to compress snapshots and/or put them on RAID, boosting read speed. The most complicated is just to have mirrored servers and only restart the one not in use right now.

> I'm talking about having more data than fits in an reasonable amount of RAM (say 1TB).

For something where you need transactions across all of that? This architecture's probably not a reasonable approach, then. The basic precondition is that everything fits in RAM. However, sharding is certainly possible if you can break your data into domains across which you don't require consistent transactions.

> So it's going to go at the speed of the disk.

Sort of.

Because it's just writing to a log, mutations go at the speed of streaming writes, which is very fast on modern disks. And there are a variety of techniques for speeding that up, so I'm not aware of a NoDB system for which write speed is the major problem.

Regardless, it's a lot better for writes than the performance of an SQL database on the same hardware.

> My other question is how much of a pain in the ass is it to debug such a system?

It seemed fine. A big upside is that you have a full log of every change, so there's no more "how did X get like Y"; if you want to know you just replay the log until you see it change.

Last I did this we used BeanShell to let us rummage through the running system. It was basically like the Rails console.


- Mirroring state is easy; you just pipe the serialized commands to multiple boxes.

What? No, that's ridiculous. That's how inconsistencies crop up. Unless you plan on locking the entire system during each command.


One box is the master; the others are slaves. And yes, the easy way to do this is to have the system hold a single write lock.

That seems ridiculous if you are thinking like databases do, in terms of taking away the pain of all those disk seeks needed to write something. But if everything is hot in RAM, executing a command is extremely fast. Much faster than a database.

If that still isn't fast enough, you can split your data graph into chunks that don't require simultaneous locking and have one lock per chunk. For example, if you are making a stock exchange like the LMAX people were, you can have one set of data (and one lock) per stock.


The lock doesn't need to just cover one write. It needs to cover the whole transaction. The canonical example is that of incrementing a counter. Replica synchronization aside, without a transaction lock (or some other guarantee of transaction consistency), at some point you will read a counter value which another client is in the middle of updating.

The first "fix" to this that comes to mind is timestamping data, rolling back transactions which try to write having read outdated data. Do extant NoDB systems do this?


Yes, these approaches provide proper transactions, but the approach is much simpler than you imagine.

Imagine you have a graph of objects in RAM. Say, a Set of Counters. Imagine you want to change the graph by incrementing a counter. To do this, you create an object, the UpdateCounterCommand, with one method: execute. You hand the command to the system, and it puts it in a queue. When the command reaches the front of the queue, the system serializes it to disk and then executes it. Exactly one write command runs at a time.

For a real-world example, check out Prevayler. It provides all the ACID guarantees that a database does, but in a very small amount of code.
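A minimal sketch of that write path (this is not Prevayler's actual API; all names here are invented): every mutation is a command, the command is logged before it runs, and a single lock serializes execution.

```java
import java.util.*;
import java.util.function.*;

// Prevalence-style write path (sketch only, not Prevayler's real API):
// log the serialized command first, then apply it to the in-memory
// model, with exactly one write command running at a time.
public class Prevalence<M> {
    private final M model;
    private final List<String> log = new ArrayList<>(); // stands in for an append-only file

    public Prevalence(M model) { this.model = model; }

    // The single write lock: 'synchronized' serializes all mutations.
    public synchronized <R> R execute(String serializedForm, Function<M, R> command) {
        log.add(serializedForm);     // write-ahead: persist the command first...
        return command.apply(model); // ...then execute it against the live object graph
    }

    public synchronized List<String> log() { return List.copyOf(log); }

    public static void main(String[] args) {
        Prevalence<Map<String, Long>> p = new Prevalence<>(new HashMap<>());
        p.execute("inc:hits", m -> m.merge("hits", 1L, Long::sum));
        p.execute("inc:hits", m -> m.merge("hits", 1L, Long::sum));
        System.out.println(p.model.get("hits")); // prints 2
        System.out.println(p.log().size());      // prints 2
    }
}
```

Because the log is written before the model changes, a crash mid-command loses at most the command in flight; replaying the log reproduces the rest.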


Mostly agree, but to this point:

> - Isn't this going to lead to you writing code that almost always has side-effects, causing it to be really hard to test? How would you implement this system in Haskell?

You'd have to figure out how to isolate the IO monad as much as possible, but this is no different than interacting with a database in Haskell. And Haskell would give you nice features like STM to address other concerns as well.


Re: Haskell it would probably look a lot like (exactly like?) Happs-State: http://happs.org/


For those who want to take this kind of approach (object prevalence) in Common Lisp see http://common-lisp.net/project/cl-prevalence/

Sven Van Caekenberghe (the author of cl-prevalence) and I used this approach to power the back-end/cms of a concert hall back in 2003. A write-up of our experiences can be found at http://homepage.mac.com/svc/RebelWithACause/index.html

The combination of a long-running Lisp image with a remote REPL and the flexibility of the object prevalence made it a very enjoyable software development cycle. It's possibly even more applicable with the current memory prices.

I especially liked the fact that your mind never needs to step out of your object space. No fancy mapping or relationship tables, just query the objects and their relations directly. I guess that's what Smalltalk developers also like about their programming environment.


We started with cl-prevalence and then of course (NIH syndrome) implemented our own approach to this back in 2003, which you can find at http://bknr.net/ . We used it back then to run eboy.com, and it still powers http://quickhoney.com http://www.createrainforest.org/ and http://ruinwesen.com/ amongst others. Those transaction logs + images are by now some 6+ years old, and have gone through multiple code rewrites and compiler changes and OS changes and what not. It is good fun, has drawbacks, has advantages, and definitely widens your horizon.


Yes, Smalltalk users can just use image persistence or Sandstone from Ramon Leon.

http://book.seaside.st/book/advanced/persistency/sandstone


Using the no-DB approach is particularly tempting with a language like Clojure. Clojure can slice & dice collections easily and efficiently. It has built-in constructs for managing concurrency safely.

I actually have a couple Clojure apps that rely on a hefty amount of in-memory data to do some computations. Even the cost of pulling the data from Redis would be too expensive. The in-memory data grows very slowly, so it's easy to maintain. Moving faster-growing data in-process would be trickier, but this article makes me want to try.


This has limited use because of:

Maintenance: I can easily give a 10% raise to everyone with a single SQL statement. Fowler's method requires that I first create an entire infrastructure (transaction processing, ACID properties) in code for this particular application. And it had better be as reliable as the transaction processing available in modern relational databases (so says my boss) or I'll be looking for a new job.

Support: you get to teach the new guy how "Event Sourcing" works for this application A and also applications B, C, ....

That said, I _have_ done this with great success. But the work involved a single application (a minicomputer-based engineering layout system). The ease with which versioning could be included was a selling point.

And don't get me started on reporting or statistics.


The "give everybody a 10% raise" case can be looked upon either as a bug or a feature. Sometimes it's nice that anybody can do anything; sometimes it isn't.

As to creating the infrastructure and worries about reliability, there are a number of frameworks for this. E.g., Prevayler. It gives you all the ACID guarantees, but has about three orders of magnitude less code than a modern database.

Supporting it could definitely be a problem. That's true for anything novel, so I'd only do this where the (major) performance benefits outweigh the support cost.

Some kinds of statistics are easier with this. For example, if you want to keep a bunch of up-to-date stats on stocks (latest price, highs, lows, and moving averages for last hour, day, and week) it is almost trivially easy in a NoDB system, and much, much faster than with a typical SQL system.
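As an illustration, keeping such running stats is just a constant-time update per tick once everything lives in RAM. This is a sketch with invented names, tracking latest/high/low/mean rather than full moving averages:

```java
// Running per-stock stats kept hot in RAM (sketch; all names invented).
public class TickStats {
    public double latest;
    public double high = Double.NEGATIVE_INFINITY;
    public double low = Double.POSITIVE_INFINITY;
    public double sum;
    public long count;

    // O(1) update per trade -- the stats are always current, no query needed later.
    public void onTick(double price) {
        latest = price;
        if (price > high) high = price;
        if (price < low) low = price;
        sum += price;
        count++;
    }

    public double mean() { return sum / count; }

    public static void main(String[] args) {
        TickStats s = new TickStats();
        for (double p : new double[]{10, 12, 9, 11}) s.onTick(p);
        System.out.println(s.latest + " " + s.high + " " + s.low + " " + s.mean());
        // prints 11.0 12.0 9.0 10.5
    }
}
```

The SQL equivalent would recompute aggregates per query (or maintain them with triggers); here the aggregate simply is the data structure.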

For other stats and reporting, though, dumping to an SQL database is great. For many systems you don't want to use your main database for statistics anyhow, so a NoDB approach mainly means you start using some sort of data warehouse a little earlier.


I am 100% agreeing with the article, with one caveat.

Database engines are not just for storing - each is basically a "utility knife" of data retrieval - indexing, sorting and filtering are available via (relatively) simple SQL constructs. If your app uses an index right now, ditching the DB will mean re-implementing it manually. It's not hard, but it's extra code.

So basically, the DB engine might still be a necessary "library", at least for data retrieval. A middle-of-the-road take on this is e.g. using an in-memory SQLite instance to perform indexing, etc. - seeding it at run-time to help with data searches, but still not using it for storing persistent information, and discarding the data at the end.


Having built a couple of systems like this, I didn't find this to be a big problem. You have to organize your data in memory somehow, and that tended to be along lines that made for fast access. Occasionally I'd have to index something, which meant adding a hashtable here and there. It also allowed me to index and organize in ways that SQL doesn't make available.

The area where I still most needed SQL was for ad-hoc querying and reporting. I dreamed of building a relational-to-object mapper, but settled for a) XPath queries against our snapshot files, and b) dumping to an SQL database for reports.


The SQL constructs are great, but the biggest advantage to relational databases is that the engine handles your data consistency issues for you. Consistency isn't just about rolling the datastore back to a specific moment in time -- you have to handle locking, concurrent reads/writes, etc.

If you're building a trading platform that handles 6M transactions/second, you have the money to handle this in the application layer and the load to justify the expense. But for many other tasks, you may be wasting money or putting data at risk.


I agree. Unless you know you are building something that will start out needing millions of transactions per second, you are more likely over-designing if you are building a bespoke database.

Standard tools are useful because you can get to working code fast ... this is why LAMP is still such a powerful framework upon which to build. While it may make sense to consider adding a search indexer (Solr) or key-value cache (Redis), for almost every use case, rewriting data storage is a waste.

Also, to paraphrase Ted Dziuba, it probably doesn't matter if your product doesn't scale, because nobody cares, or will ever use it. So I think it is better to get something up and running quickly to see if anyone cares before you bother trying to optimize for the rare case where your product turns out to be the next Twitter.


Actually, the NoDB approach that Fowler describes handles consistency much more simply: mutations are executed serially. If all your data is hot in RAM, then changes to the data are very quick. Thus, there's no locking, no concurrent writes, and no need to worry about transaction isolation. If you're a Java user, the Prevayler framework he mentions provides this in a couple thousand lines of code.


When you're 3-4 orders of magnitude ahead of the game because you've ditched network round trips and disk accesses, indices become less ( not completely un- ) necessary.

The impedance mismatch between database and application is a lot of code too.


I think that the best thing DBs provide is the separation of skills. I can fully concentrate on the programming side and just be aware of the DB side, and the DBAs will handle setup, replication, migration, analytics, ad-hoc queries, backup, etc.

If, on the other hand, I had to do it all myself, I'd most probably have lost my last hair.


I'd encourage everybody to try this out; building an app like this really broadened my way of thinking about system design.

Compared with a database-backed system, many operations are thousands of times faster. Some things that I was used to thinking of as impossible became easy, and vice versa. Coming to grips with why was very helpful.


what implementation did you use? what others exist? thanks.


Isn't this essentially what prevayler http://en.wikipedia.org/wiki/Prevayler is?


The first time I did it myself. The next couple of times I used what the other reply mentions: Prevayler.


I think it's interesting that you can move more of your "persistent" state into in-memory storage and then write out snapshots throughout the day. Online game servers often rely on state being in memory rather than being queried on demand. Achieving high performance otherwise is difficult.

However, I wouldn't call this "no-DB." Rather, it's "less-DB." Ultimately, historical and statistical data needs to be stored, and databases are great for that (and for a stats team).


I spent about a year as a maintainer of FlockDB, Twitter's social graph store. If you don't know it, it's basically a sharded MySQL setup. One of the key pain points was optimizing the row lock over the follower count. Whenever a Charlie Sheen joins, or someone tries to follow-spam us, one particular row would get blasted with concurrent updates.

Doing this in-memory in java via someAtomicLong.incrementAndGet() sounds appealing.


> Doing this in-memory in java via someAtomicLong.incrementAndGet() sounds appealing.

Just for fun, in Clojure:

    (def current-id (atom (long 0)))
    (defn get-id [] (swap! current-id inc))


I didn't finish the article because I read the one on Event Sourcing (http://martinfowler.com/eaaDev/EventSourcing.html), pretty good pattern. I like that every time he describes a new one (to me), I feel like I have to use it.


"I feel like I have to use it". Sure, as long as its in a prototype and I assume that that is what you mean. The problem is that its this "see a new thing and feel like I have to use it" that seems to be the single biggest source of accidental complexity in production software. So by all means use something new but not because you feel like it, rather because you have thought long and hard about it, tested it and can really justify why you want to use it in your situation.


Event sourcing has a lot of power, but it also offers up some unique challenges. If you are interested in applying it, I'd check out CQRS: http://cqrsinfo.com. I've also got a few blog posts on the subject: http://lucisferre.net/tag/cqrs/


Nothing new here. I remember working with the TED editor on a PDP-11. The machine crashed sometimes. After a restart, TED would restore the text by replaying all my key presses.

Another example: vector graphics editors, which replay vector graphics primitives instead of storing pixel bitmaps.


What about ROLLBACK? And no, going back in time by replaying logs is no substitute, because you lose other transactions that you want to keep (and perhaps already reported to the user as completed).

What about transaction isolation? How do you keep one transaction from seeing partial results from a concurrent transaction? Sounds like a recipe for a lot of subtle bugs.

And all of the assumptions you need to make for this no-DB approach to be feasible (e.g. fits easily in memory) might hold at the start of the project, but might not remain valid in a few months or years. Then what? You have the wrong architecture and no path to fix it.

And what's the benefit to all of this? It's not like DBMS do disk accesses just for fun. If your database fits easily in memory, a DBMS won't do I/O, either. They do I/O because either the database doesn't fit in memory or there is some better use for the memory (virtual memory doesn't necessarily solve this for you with the no-DB approach; you need to use structures which avoid unnecessary random access, like a DBMS does).

I think it makes more sense to work with the DBMS rather than constantly against it. Try making simple web apps without an ORM. You might be surprised at how simple things become, particularly changing requirements. Schema changes are easy unless you have a lot of data or a lot of varied applications accessing it (and even then, often not as bad as you might think) -- and if either of those things are true, no-DB doesn't look like a solution, either.


In an event-based system, especially a large distributed one, ROLLBACK as a single command to revoke all previous attempts at state mutation becomes impossible to support. Instead of supporting distributed transactions you have to change to a tentative model. The paper Life beyond Distributed Transactions: an Apostate's Opinion (Available here: http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf ) describes this well.

Basically instead of making a transaction between 2 entities, you send a message to the first reserving some data, a message to the second reserving the data and once you get confirmation from both (or however many entities are involved in the transaction) you send a commit to them.

These reservations can be revoked though. Your rollback has to be managed by an "activity".

Ex: Bank transfers. You have the activity called BankTransfer. It manages the communication between entities and the overall workflow. It starts by sending messages to entities Account#1 with 100$ in it and Account#2 also with 100$. To #1 it says debit 500$. To #2 it says credit 500$. #2 responds first and says Done. #1 responds second and says Insufficient Funds. BankTransfer sends another message to #2 saying Cancel event id 100 (the crediting).

Other activities that want to read the state of #1 will see $100 in it. But say the (as yet unconfirmed) transfer had been $50 rather than $500, and another debit of $75 came in; it would respond insufficient funds. At this point it's the activity's job to decide what to do. Wait and try again? Fail entirely and notify any other entities relevant to the workflow? That's up to the business rules. Also, since the credit has not yet been confirmed, reading the balance on #2 would still say $100, not $600.

Of course, depending on your use case you may want the read to return the balance with unconfirmed transactions. That's entirely up to the application code and business rules but the example should be explanatory as to how rollback is implemented.
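The workflow above might look roughly like this sketch (all class and method names invented): reads see only confirmed balance, and each reservation is either confirmed or cancelled by the activity.

```java
import java.util.*;

// Tentative-commit sketch of the bank-transfer activity described above.
// An account tracks its confirmed balance plus unconfirmed reservations;
// reads see only confirmed money. All names here are invented.
public class Account {
    private long balance;
    private final Map<String, Long> pending = new HashMap<>(); // reservation id -> delta

    public Account(long balance) { this.balance = balance; }

    // Reserve a debit or credit; a debit fails if the confirmed balance
    // minus other outstanding debit holds can't cover it.
    public boolean reserve(String id, long delta) {
        long held = pending.values().stream()
                .filter(d -> d < 0).mapToLong(Long::longValue).sum();
        if (delta < 0 && balance + held + delta < 0) return false; // insufficient funds
        pending.put(id, delta);
        return true;
    }

    public void confirm(String id) { balance += pending.remove(id); }
    public void cancel(String id)  { pending.remove(id); }
    public long balance()          { return balance; } // reservations are invisible to readers

    public static void main(String[] args) {
        Account a = new Account(100), b = new Account(100);
        boolean debitOk  = a.reserve("t1", -500); // insufficient funds -> false
        boolean creditOk = b.reserve("t1", +500); // fine, but only tentative
        if (debitOk && creditOk) { a.confirm("t1"); b.confirm("t1"); }
        else { if (debitOk) a.cancel("t1"); if (creditOk) b.cancel("t1"); }
        System.out.println(a.balance() + " " + b.balance()); // prints 100 100
    }
}
```

The "activity" is just the coordinating code in main here; in a real distributed system the reserve/confirm/cancel calls would be messages between nodes.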

Eventual consistency is the only scalable way to go for very large systems.


My understanding of Rollback in Event Sourcing is that there IS no rollback.

If an event causes the model to be in an invalid state, another event must be triggered to rectify the model into a valid state. (Simplistically speaking)


Why not go for NoDisk solution too - just RAID your memory and back it up with ultracapacitors?


You don't even need the capacitors or battery backup if your memory is sufficiently distributed.


Hmm, not so sure about that - power outages can affect several city blocks and then you're introducing latency?


Put your memory on separate continents. Yes, it introduces latency, even more latency than a disk seek, but not that much.


Running Redis in journaling mode is essentially this, too. Mongo can run like that as well.


Re: Redis, only if you never compact the append-only file.


Any time you have someone who is not a programmer who wants to maintain code, you will have a DB. And this is almost all the time.


While this article wants to establish additional layers above the filesystem, I always wondered how comparable modern filesystems are to key-value datastores.

As far as I can see, they seem to be comparable to B+tree-indexed key-value stores. A key would e.g. be "/home/user/test.txt". Thanks to the B+tree "indexing" you can do a prefix scan and list folders (e.g. "ls /home/user/" --> all keys starting with "/home/user/").

In the case of e.g. ReiserFS they actually use B+Trees. They have a caching layer managed by the OS. Most of them have journaling which would be the equivalent of a "write ahead log".

Map reduce based "view" generation can easily be done by pipes and utilities like grep. We might be even able to do some sort of simplistic filtering/views/relations using symlinks.

I guess the main difference is that they aren't optimized for this database-like behavior from a performance standpoint and that the network interfaces to them are SMB/AFP/NFS.
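As a sketch of that mapping (invented names, using plain java.nio): put is a file write, get is a file read, and listing a directory plays the role of a prefix scan.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// The filesystem as a key-value store (sketch). Keys are relative paths,
// values are file contents, and listing a directory is the prefix scan
// described above. All names here are invented.
public class FsKv {
    private final Path root;

    public FsKv(Path root) { this.root = root; }

    public static FsKv inTemp() {
        try { return new FsKv(Files.createTempDirectory("fskv")); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // put: write the value as the file's contents, creating the
    // intermediate "index" levels (directories) on the way down.
    public void put(String key, String value) {
        try {
            Path p = root.resolve(key);
            Files.createDirectories(p.getParent());
            Files.writeString(p, value);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public String get(String key) {
        try { return Files.readString(root.resolve(key)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // scan: "ls prefix/" == all keys under that prefix, like a B+tree range scan.
    public List<String> scan(String prefix) {
        try (Stream<Path> s = Files.list(root.resolve(prefix))) {
            return s.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        FsKv kv = inTemp();
        kv.put("home/user/a.txt", "alpha");
        kv.put("home/user/b.txt", "beta");
        System.out.println(kv.get("home/user/a.txt")); // prints alpha
        System.out.println(kv.scan("home/user"));      // prints [a.txt, b.txt]
    }
}
```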


Facebook were using the filesystem for storing photos and then moved to Haystack, which is essentially an append-only log similar to Bitcask. The problem with using the filesystem as a KV store is that for every item stored, you're storing a whole lot of filesystem-specific metadata: created, updated, permissions, etc.


That sounds interesting. However, as soon as you have several distinct applications that share e.g. the same master data, how do you interface them? You will have to design the in-memory transactional store as a kind of global component. Then, not much is left until you end up with a real database.

I am working on a team that is building an in-memory SQL database. It features a custom language that makes it possible to push the time-critical, data-processing parts of the application directly to the database, which allows for the same speed as this no-DB approach. But you don't have to build your own DB and do everything yourself (correct persistence, backup, transactions...)


UUIDs should let you merge any two streams of events without conflict (unless there are other business-layer data constraints that would be violated).


EventSourcing/CQRS (Command/Query Responsibility Segregation) is gaining a bit of traction in the .NET community. There are some great presentations[1], blogs[2][3] and projects[4] related to this architecture.

[1]: http://www.infoq.com/presentations/Command-Query-Responsibil...

[2]: http://www.udidahan.com/?blog=true

[3]: http://blog.jonathanoliver.com/

[4]: https://github.com/joliver/EventStore/


Greg Young, himself, asserts that CQRS is NOT an architecture. And also asserts that CQRS itself has nothing to do with Event Sourcing.

http://codebetter.com/gregyoung/2010/02/16/cqrs-task-based-u...

"CQRS is not eventual consistency, it is not eventing, it is not messaging, it is not having separated models for reading and writing, nor is it using event sourcing."

It is gaining a lot of traction simply because it seems like a complicated and cool way to solve an uncommon problem. I fear that Event Sourcing will soon become an over-engineered hammer for the wrong nail.

If it were me, I would use CQRS principles, with a relational backend as my source model. Then, when the need for scale arises, use ETL to move to either a non-relational DB or no-DB for queries.


This seems like an excellent idea. Is it possible to preserve most of the APIs/design patterns we're used to working with for (no)SQL when using Event Sourcing?


Having a database model and an application model in separate processes, and mapping between the two is expensive in many ways.

There are many ways to get rid of this problem without getting rid of the database. For example, putting all knowledge of the domain in the database via stored procedures, couchapps, and object stores.


Answering directly to the subject: I do hope so. SQL too often introduces only a layer of complexity between the server-side application and the storage, while most of the time an application could be designed to just use the filesystem, which is a database in its own right, by the way: a big, usually efficient lookup table that maps keys (file paths) to values (file contents).

Why store passwords through SQL when a server application could just use a specific directory containing one file for each user, each file named with the username and containing their password (without any file format, just the password, possibly hashed or encrypted)? The operating system's I/O cache should be able to handle that efficiently, and the advantage would be the elimination of the dependence on another piece of software, the DBMS.


One reason would be that you expect more than a few thousand users.

More generally, for all but the very simplest apps I'm afraid I can't agree with you (though, to clarify, this isn't really addressing the topic of the original post). Years ago I had a summer job maintaining a set of Perl scripts that persisted online store inventories to flat files. It was horrible. I've never found Postgres to add much complexity to a project, but on the storage side it provides me with the reassurance that smart people have thought hard about issues of consistency, stability and performance -- which millions have tested -- and on the query side, I can do various clever things at amazing speed if ever that's required. Why would I not go for that?


Most file systems don't gracefully handle 100k files in a single directory. It's not what they are designed for. Disk block sizes also mean that each password may be consuming several kB. The space waste on a modern system might not matter, but inode exhaustion might be a real issue on Unix-type filesystems.


> each file named with the username and containing his password (without any file format, just the password, possibly hashed or encrypted)?

Because your advertisers want to know how many users signed up last month, last six months, and last year. When you only consider one use-case for your data, it's easy to consider using NoSQL or the file system to store your data but in doing so you fail to imagine all the other ways you might want that very same data.


ctime, mtime, atime.
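In other words, the file metadata already carries the signup date. A hedged sketch of that idea, using `st_mtime` as a stand-in for creation time (true creation time, `st_birthtime`, is not available on every platform):

```python
import os
import time

def signups_since(user_dir, seconds_ago):
    """Count user files modified within the last `seconds_ago` seconds.
    st_mtime approximates signup time only if files are never rewritten."""
    cutoff = time.time() - seconds_ago
    count = 0
    for name in os.listdir(user_dir):
        st = os.stat(os.path.join(user_dir, name))
        if st.st_mtime >= cutoff:
            count += 1
    return count
```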


Add one other criterion to that -- say location -- and it's already useless.


Well -

I have written several systems that used that sort of approach. Pretty soon you realize you just reimplemented SELECT, and you did a buggy, half-baked job of it.

If you have money to burn on building your own speed and reliability, dropping the database is a good idea. Otherwise, I have simply written too many half-baked hardwired select queries to recommend it.

At this point, I'd rather do some kind of in-memory SQLite with a persistent MySQL/PgSQL backend.
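A rough sketch of that in-memory-with-persistent-backend idea using Python's `sqlite3`, with a plain file standing in for the MySQL/PgSQL backend the comment mentions; the `backup()` call copies the whole in-memory database to disk:

```python
import sqlite3

# Work against an in-memory SQLite database for speed...
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE users (name TEXT PRIMARY KEY, pw TEXT)")
mem.execute("INSERT INTO users VALUES ('alice', 'hash1')")
mem.commit()

def snapshot(path):
    # ...and periodically snapshot it to a durable store.
    disk = sqlite3.connect(path)
    with disk:
        mem.backup(disk)  # copies the entire in-memory DB to disk
    disk.close()

snapshot("backup.db")
```

How often to call `snapshot` is the usual durability/performance trade-off: anything written since the last snapshot is lost on a crash.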


This was basically the point of view behind ReiserFS, and mp3.com funded the Namesys guys for a while on that basis.

Also, see maildir and various things associated with qmail.


It looks more like a different DB-engine implementation than a no-DB system... you still need structured data persisted somewhere; the difference here is in how you store it and how you buffer it, but IMHO all of that can simply be seen as an in-memory "DB engine".


I personally use SQLite all the time. I prefer it to any of the other solutions.


Two years ago I used an IMDB (In-memory database or Main memory database) in a project. I think it was CSQL. I think this is a nice way to have full ACID and great performance!!


Is "event sourcing" similar to the acid-state business used in Happstack?

http://happstack.com/index.html


Martin Fowler to usurp Prevalence pattern: https://gist.github.com/1186975


I use Redis for this. I imagine it would be possible to create certain js objects that automatically persist to Redis.


I've always wanted "no-DB" to the level of it being part of the platform/language.

I've always thought that software-transactional memory and persistent distributed heaps would get us there. Unfortunately the nearest things have been Redis and Terracotta plugged into Clojure. It should be:

Insert? new an object. Delete? dispose an object. Look up? Hash table.

Solved problems that just require persistence.
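For a taste of how close a language can already get, Python's `shelve` module is a rough (non-transactional, non-distributed) approximation of that interface, so it is only a shadow of STM plus a persistent distributed heap:

```python
import shelve

# A persistent hash table: assignment is insert, `del` is delete,
# indexing is lookup -- roughly the interface the comment asks for.
with shelve.open("state") as db:
    db["user:alice"] = {"karma": 42}   # insert: just assign
    db["user:bob"] = {"karma": 7}
    del db["user:bob"]                 # delete: just del
    karma = db["user:alice"]["karma"]  # look up: hash table

# The data survives the process: reopening the shelf sees the same keys.
```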


There have been things that are a lot nearer than that. ObjectStore and GemStone, for example. And of course there were transparently persistent platforms like KeyKOS.


There are some neat little libraries that help with this in Clojure. The basic idea is that when you introduce changes through transactions, the actual transaction (code) is appended to disk (which is very fast), and this becomes your "database" file. So, what is persisted is just a list of state changes that are "replayed" to restore state.

The individual transactions could also easily be distributed to multiple nodes via a message queue.
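The same append-and-replay idea can be sketched outside Clojure; here is a hedged Python version with made-up names (`record`, `replay`), ignoring fsync and log compaction:

```python
import json

LOG = "changes.log"

def apply_change(state, change):
    # Each change is a small dict describing one state transition.
    op, key = change["op"], change["key"]
    if op == "set":
        state[key] = change["value"]
    elif op == "del":
        state.pop(key, None)
    return state

def record(change):
    # Appending one line to a log file is very fast.
    with open(LOG, "a") as f:
        f.write(json.dumps(change) + "\n")

def replay():
    # Rebuild current state by replaying the log from the start.
    state = {}
    try:
        with open(LOG) as f:
            for line in f:
                state = apply_change(state, json.loads(line))
    except FileNotFoundError:
        pass
    return state
```

Since each `record` call is a self-contained message, shipping the same dicts over a message queue to replicas is exactly the distribution step the comment describes.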


link?


...and on a mildly related subject: more people should consider LDAP.


You mean using data stores that support access through LDAP as general purpose databases?


Under certain circumstances, using LDAP as a directory or data store, and not just for authentication, can make sense: especially if you want to benefit from its very well standardized, open and stable interface, if you need some sort of multi-master scenario, or if you want very rigid control over who can see which portion of the data. The most popular LDAP servers offer a lot of very cool ways of "modelling" and managing your data.

One drawback to keep in mind is that LDAP is generally not meant for lots and lots of writes, so it is by no means a substitute for a DB. But it is great for looking up data, and if that data fits a sort of "file card" paradigm anyway, there are far more reads than writes, and several different applications should be able to access it, then all the better.

The major and most popular applications of LDAP, however, are certainly always somehow connected to authenticating users, and that is also where it really, really shines, which was another reason I brought it up. If applicable, I would personally prefer managing users and their logins in an LDAP server over keeping all that in a database.

Luckily, nowadays most (web) applications offer some sort of support for LDAP anyway, however dodgy those implementations sometimes are. (One of my favorite examples here is Netscape Navigator/Mozilla/Thunderbird and the address-book schema shenanigans...)

I just think it gets too little credit or news these days, but that probably stems from the fact that it is a pretty stable system without lots of innovations: it has been around a looong time, it is not so "sexy" anymore, and most HN hackers wouldn't have to deal with it most of the time anyway.

But I cannot recommend enough looking into LDAP, playing around with it, and learning how to get a directory going. It is a bit confusing (sometimes frustrating) at first because it is so different from typical databases, but it is fun once you get the hang of it and learn to appreciate its simple and efficient beauty, and some of the things you can do in huge directories with e.g. the Sun LDAP server are nothing short of amazing.


I worked with LDAP servers for a few years, but never liked it much. Perhaps I missed something. What can you do in huge directories that is so amazing?



