A New Approach to Databases — Simulating 10Ks of Spaceships on my laptop (paralleluniverse.co)
143 points by pron on Feb 27, 2013 | 64 comments



Correct me if I'm wrong, but doubling the performance of the simulation from a dual-core to a quad-core processor doesn't necessarily mean you have "linear scalability" unless you can demonstrate that the trend continues as more cores are used (e.g. 8, 16, and 32 cores).

I think the ideal way to demonstrate this kind of claim, while eliminating the effect of other factors such as memory performance, is to use a beefy server with many cores (say N >= 16), adjust your benchmark to use 1, 2, 3, ..., N cores, and then plot the line to see if it's really linear.

Two dots make a line in geometry, but you need more dots to project the trend.
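A sketch of such a benchmark harness (hypothetical, in Java since that's what the demo uses): fix the total workload and sweep the worker count from 1 to N, recording throughput at each step so the scaling curve can actually be plotted instead of inferred from two data points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical scaling benchmark: run a fixed workload with 1..N worker
// threads and record throughput at each thread count.
public class ScalingBenchmark {

    // Stand-in for one unit of simulation work (a real benchmark would
    // run one simulation tick here instead).
    static long workUnit() {
        long x = 0;
        for (int i = 0; i < 10_000; i++) x += i;
        return x;
    }

    // Ops per second achieved when `threads` workers share `totalOps` units.
    static double throughput(int threads, int totalOps) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        List<Future<?>> futures = new ArrayList<>();
        int opsPerThread = totalOps / threads;
        for (int t = 0; t < threads; t++) {
            futures.add(pool.submit(() -> {
                for (int i = 0; i < opsPerThread; i++) workUnit();
            }));
        }
        try {
            for (Future<?> f : futures) f.get(); // wait for all workers
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return totalOps / seconds;
    }

    public static void main(String[] args) {
        int n = Runtime.getRuntime().availableProcessors();
        for (int threads = 1; threads <= n; threads++) {
            System.out.printf("%2d threads: %.0f ops/s%n",
                              threads, throughput(threads, 200_000));
        }
    }
}
```

Plotting these numbers on a machine with enough cores shows immediately whether the curve stays linear or flattens out.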

Time to spawn an EC2 Cluster Compute Eight Extra instance!


You are absolutely right, but we'll do that as part of the next step, when we show scaling beyond a single machine. Big servers, and more than one.


Cool! This looks very interesting!


It seems like the author considers 10K spaceships a lot. That is something like 1 MB of data. Not that much, even for a single core.

Also, there should be multiple phases: first everybody shoots, then everybody is hit, then blast force is applied. Otherwise some spaceships get an unfair advantage and can destroy their opponents before those have a chance to return fire.


It doesn't sound like a lot if it is processed linearly. However, 10,000 readers/writers on a single shared resource is a lot.

Naive implementations would have a single global lock on the resource, resulting in lots of contention. They appear to be saying that by converting to callbacks, they are able to determine which transactions will interfere with each other and limit the contention to just those portions. Deadlock would be a serious possibility here, same as in regular databases.

The size of the data is largely unimportant for this problem; it's the contention that's the issue.

If the spaceships didn't have to interact through the global state then you could have sharding. You can change the problem by adding "planetary systems", and saying that ships in one system can't interact with another. Instant shard.
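The "instant shard" idea can be sketched as a simple partition by system (the names here are illustrative, not SpaceBase's API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of sharding by "planetary system": if ships in one system can
// never interact with ships in another, each system's bucket can be
// processed independently with no shared state.
public class PlanetarySharding {
    record Ship(String id, String system, double x, double y) {}

    // Group ships by the system they belong to; each bucket is a shard.
    static Map<String, List<Ship>> shard(List<Ship> ships) {
        Map<String, List<Ship>> shards = new HashMap<>();
        for (Ship s : ships) {
            shards.computeIfAbsent(s.system(), k -> new ArrayList<>()).add(s);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<Ship> fleet = List.of(
            new Ship("a", "sol", 0, 0),
            new Ship("b", "sol", 1, 1),
            new Ship("c", "vega", 5, 5));
        Map<String, List<Ship>> shards = shard(fleet);
        // Each shard can now be simulated on its own core or machine.
        System.out.println(shards.size()); // 2
    }
}
```

The price, of course, is the constraint itself: ships can never cross or interact across system boundaries, which is exactly the kind of application restriction being debated here.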


Right. That's the whole point: shards are a solution that imposes constraints on the application, meaning it's not really a solution.

Also, SpaceBase guarantees no deadlocks.


According to the documentation, it does this by preventing callbacks from initiating transactions on the same SpaceBase store?

:) :) :) Isn't that a constraint on the application? :) :) :)

I think I'm going to have to look at the demo and see how a shot that triggers an explosion then does the query to push the surrounding ships.

I would expect something like:

  Ship A:
     Query(origin+direction+range, callback Shot())

  Shot():
     if Exploding:
        Delete()
        Query(origin, range, callback Boom())

  Boom():
     Update location
Except that's not allowed?

An AOE which causes cascades of death would be cool to see too.

Nifty idea.


I guess we must have missed all the places in the documentation where this is mentioned. As you see in the code, this limitation is no longer in place.


The real-world case for this is EVE Online, which has a fair amount of trouble handling 10K spaceships in a single battle. If I'm ever working on an MMO again I'm going to want to take a hard look at this tech. Some MMOs wouldn't benefit from it; on the other hand, I'd like to have the ability to do so rather than artificially constraining my game design space.


The data per ship in EVE is far more than the couple of data points in this simulation.

I once wrote a ship builder and simulator for EVE and man there's a lot of variables to take into account: the ship, the pilot, modules, implants, boosters, etc.

Still, like you said, very interesting technology and hopefully helpful to CCP!


Yeah, this wouldn't handle the whole battle, but even just off-loading the positional information would be huge.


First, consider the numbers: 20K ships * 5 times a second * (1 query + 1 transaction) = 100K queries + 100K transactions per second (each query returns about 5 neighboring ships). And the point isn't sheer speed but scalability: not running the simulation as fast as possible, but demonstrating the database's capabilities.
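For what it's worth, the arithmetic checks out (numbers taken from the comment above):

```java
// Back-of-the-envelope check of the quoted load: 20K ships, each issuing
// one query and one transaction, five times per second.
public class LoadEstimate {
    // Operations per second for one operation type (query or transaction).
    static long opsPerSecond(long ships, int ticksPerSecond) {
        return ships * ticksPerSecond;
    }

    public static void main(String[] args) {
        long queries = opsPerSecond(20_000, 5);
        long transactions = opsPerSecond(20_000, 5);
        System.out.println(queries + " queries/s, " + transactions + " transactions/s");
        // 100000 queries/s, 100000 transactions/s
    }
}
```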


But it's also a valid point; it's important to know if it scales well for larger datasets. I think the premise is good, we need better tools for creating multithreaded, concurrent applications. Looking forward to your future posts.


"But if the new Moore’s law gives us exponential core-number growth, to exploit those cores we need to grow the number of shards exponentially as well, meaning we’re placing an exponential number of constraints on an application that probably only needs to accommodate a linearly increasing number of users, and that’s bad."

I'm no expert on parallel programming, but isn't this sloppy thinking? Like the implication is that we must have an exponential increase in cores in order to accommodate a linear number of users...

More generally, is anything about this approach really novel? Databases can already run transactions and queries asynchronously. Any language that supports concurrent programming can take advantage of this, even if the database isn't issuing a callback. And ruling out concurrent languages that are capable of implementing this approach but include some language features the authors don't like seems like a side issue.

Not just trying to snark... if someone can explain what I'm missing I'm open to it!


> Like the implication is that we must have an exponential increase in cores in order to accommodate a linear number of users...

As the article says, you can just ignore those extra cores. But, if you want to do cooler stuff with your data you'd better use them, and if so, sharding is a dead-end.

> More generally, is anything about this approach really novel?

I said it's a repurposing of an old idea, that of the old RDBMSs, where the application was the database. The idea is not about concurrent transactions but about uniting the application and the database, letting the database run and parallelize the application's logic, and be the engine of scaling. I think most people see the database simply as a data-storage component and try to make it fast and scalable enough. We say: let the database run the application.

> And ruling out concurrent languages that are capable of implementing this approach...

I haven't ruled out any language. In fact, the demo is in Java, which certainly doesn't qualify. Any language can do this. I only said that in order to go forward, in order for the common programmer to take advantage of modern hardware, we must use languages built specifically for modern hardware. I've named two that I think fit the bill, but if other languages end up becoming the language for the new age, it's all the same.


Thanks for your response. I think I understand your point re: cores, although I wouldn't have used the argument quoted above, instead I would just make the point in your response - we have all these cores, gotta use them.

Since I found some things to nitpick I want to say I definitely like the approach you are coming at this from -- in terms of how you apparently view sharding, NoSQL, database as a processing engine, etc.

But I still think you are presenting it as something novel (just look at the title) when what you are really doing is promoting a long-established but sorely neglected philosophy... a good philosophy, which needs to be promoted. The novel aspect seems to be supporting callbacks in your Spacebase product, which isn't particularly essential to this philosophy.

Thanks for the article.


I think callbacks are absolutely essential. How else is the db supposed to schedule and run your business logic?


Okay I have been thinking about your response for a while...

Database engines already possess sophisticated scheduling logic for query processing. We are constantly taking advantage of that, in different ways, when we delegate to the database.

And even though we are supposed to have all these business layers and whatnot, I think centralizing logic in a DB is a pretty common practice.

So I have been asking myself, is there a significant functional difference between the db doing a callback to the program, versus other methods of interacting with that built-in scheduling engine?

My supposition is that in the absence of such a capability, one could always get mostly (or fully) equivalent behavior by some combination of

(A) having the app fire queries (and consume their results) asynchronously

(B) giving queries some other means of initiating a response, e.g. writing to a table that acts as a message queue.
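Option (B) could be sketched like this, with an in-memory queue standing in for the "table that acts as a message queue" (all names are hypothetical):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of alternative (B): instead of the database calling back into
// the application, a query appends its result to a queue that the app
// drains asynchronously. A real system would use a table as the queue.
public class QueueTableSketch {
    static final BlockingQueue<String> resultQueue = new LinkedBlockingQueue<>();

    // A "query" that enqueues its result rather than invoking a callback.
    static void runQuery(String query) {
        String result = "result-of:" + query; // placeholder for real query work
        resultQueue.add(result);
    }

    public static void main(String[] args) throws InterruptedException {
        runQuery("nearby-ships");
        // Elsewhere, a consumer thread dequeues results and applies
        // the business logic that a callback would have run.
        String r = resultQueue.take();
        System.out.println(r); // result-of:nearby-ships
    }
}
```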

To me it seems like such a transformation wouldn't even change the balance of code between app vs DB all that much, it would just change the calling mechanism, and maybe you need these couple of extra modules to manage your query submissions and asynchronously dequeue stuff.

So I admit this qualifies as more than syntactic sugar, but I'm not sure it goes much beyond that. Of course, many prized language features don't go beyond that at all, and calling/concurrency idioms are important.

I like the idea of centralizing logic in the DB and I can see how your approach allows this in a much more elegant and straightforward manner. So I think I get it. Whether it's a totally new architecture, I dunno...

I come out of this thinking that if there were a way to improve your article, it would perhaps be to avoid the claim that this allows us to take advantage of the database to "schedule and run" whatever business logic is implemented in the DB via its highly optimized query-processing facilities. All programs effectively do that in their use of the database, which is why the scheduling capabilities existed before your approach.

The core thread of what you are doing/advocating, as I understand it, is something like this. (a) Centralize business logic in the DB. (b) Stop being ashamed of it. (c) Support calling conventions that make it easier.

If you made the argument that way, I would have seen it as a slam dunk. And I would expect that to be a controversial argument on HN.

Interested in your thoughts.

(BTW, I bet one could do these sorts of callbacks using the relatively recent capability of SQL Server to run Stored Procedures written in .NET.)


Yes, well, you summarize most of my points but one: the importance of callbacks is not just in the database triggering application actions (as would be done with asynchronous queries or the message bus you mention), but in scheduling application code to run on appropriate CPU cores.

Our ultimate goal isn't to make a new programming model for its own aesthetic (or non-aesthetic) sake, but in order to take advantage of multiple cores -- namely, for the sake of performance. So the database does not "control" the application -- it actually parallelizes it.

And as for SQL Server, yes, I guess it's a similar concept but a different kind of database. I advocate that the database should be part of the application; they should share the same heap in the same process, and only then should the application let the database "drive". That is why I also don't quite agree with your wording of "centralizing logic in the DB": the DB and the application are one, so the DB is certainly no more (or less) centralized than the app. Once you get that, there is no point in being ashamed of it. Just as you let your web-app container parallelize your presentation layer, you let the database parallelize your business logic. In both cases the middleware shares the same process as the app, and the leap of faith required isn't big at all.
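A rough sketch of that parallelization claim, in Java (the demo's language). The conflict detection here, disjoint key sets, is a hypothetical stand-in for the spatial analysis a real engine would do; the point is that transactions known not to conflict can run on different cores at once:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: a scheduler that batches mutually non-conflicting transactions
// and runs each batch in parallel on a thread pool.
public class CallbackScheduler {
    record Txn(String id, Set<String> keys, Runnable body) {}

    static boolean conflicts(Txn a, Txn b) {
        return !Collections.disjoint(a.keys(), b.keys());
    }

    // Greedily batch transactions; flush a batch when a conflict appears.
    static void run(List<Txn> txns, ExecutorService pool) {
        List<Txn> batch = new ArrayList<>();
        for (Txn t : txns) {
            boolean clash = false;
            for (Txn b : batch) if (conflicts(b, t)) { clash = true; break; }
            if (clash) { flush(batch, pool); batch.clear(); }
            batch.add(t);
        }
        flush(batch, pool);
    }

    // Submit one batch and wait, so conflicting work never overlaps.
    static void flush(List<Txn> batch, ExecutorService pool) {
        List<Future<?>> fs = new ArrayList<>();
        for (Txn t : batch) fs.add(pool.submit(t.body()));
        for (Future<?> f : fs) {
            try { f.get(); }
            catch (Exception e) { throw new RuntimeException(e); }
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        run(List.of(
            new Txn("t1", Set.of("shipA"), () -> log.add("t1")),
            new Txn("t2", Set.of("shipB"), () -> log.add("t2")), // runs alongside t1
            new Txn("t3", Set.of("shipA"), () -> log.add("t3"))  // conflicts with t1: waits
        ), pool);
        pool.shutdown();
        System.out.println(log.size()); // 3
    }
}
```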


Thanks for your response! I think it addressed my comments well.

It does make me think about the fact that almost every major DB today runs out of process and that the market seems to have selected this pretty definitively. It seems like only recently some products have started running in-process for performance reasons. But if I understand the trade-offs here, only a single application can run in-process with the DB. Also, the DB is not protected from crashing when the app crashes. But this is the 1990s explanation... any thoughts on how that applies to your situation?


> almost every major DB today runs out of process and that the market seems to have selected this pretty definitively.

True. That's why I've tried to show the performance to be gained by letting the DB become integrated with the app. This is a result of the DB having just the right kind of knowledge about the relationships among domain objects to allow it to parallelize business logic.

> But if I understand the trade-offs here, only a single application can run in-process with the DB. Also, the DB is not protected from crashing when the app crashes.

Well, if the application and DB are one, so what? Note that we're only talking about OLTP DBs here. I don't see much of a reason for analytics DBs to be unified with the app. How often do you need more than one app working with the same OLTP data?

However, if you absolutely must, there is no reason why the in-process DB shouldn't store its data in shared memory, thus allowing other DB instances to access the information. There are ways to ensure that if one app crashes the memory is left in a consistent state.


>> How often do you need more than one app working with the same OLTP data?

If you are running a modern web startup, it's probably not an issue. But in a typical business... often IMO. And it was one of the original use cases that drove the adoption of databases in the first place.

Would be interested in seeing you blog about the last sentence of your last reply.

Thanks for engaging my comments.


OT: your overlay that appears at the top of the screen if you scroll down far enough means that when you click a footnote you can't actually see the footnote without scrolling up a little.


I'm confused. Is the author trying to convince me to give up the expressivity, portability, and readability of my Python app-layer codebase for the performance gains to be had by writing my logic in stored procedures?

Having moved one codebase OUT of T-SQL stored functions and onto Django's ORM, I can say that this article does not have me convinced the way forward is to move the code back INTO the DB. If my data model is going to be dictating my development toolset, it better be giving me some amazing performance gains and be as good as the toolset I've come to enjoy in a post-ruby-on-rails-world.

What am I missing?


You don't have to give up anything. These "stored procedures" are not what you think they are, and neither is the database. The idea is that the db and the application become one - not that you have to program in plsql or something. You could continue to use python or ruby.

And yes, the performance gains are amazing.


OT, but would you like some constructive criticism on usability of the article? The lack of contrast made it difficult to read for me; clicking on a footnote scrolls in such a way that the footnote text is hidden by the header; and having a link that says "Discuss on Hacker News" and also a comment section under the article sends a mixed message.


What are the pricing options for SpaceBase? I can't seem to find them.


Email us at info@paralleluniverse.co and we'd be happy to give you the details.


Ugh, maybe I'm the only one who hates this, but if I have to spend time/effort just to get a price for something, it's already off the table as a choice.

This makes it seem like either the price is too high and you know it, or the pricing scheme is too complicated and should be simplified.


No, we actually kinda hate it, too :)

It's just that as a young startup, it's very important for us to learn about our customers. At this stage, it might be beneficial for us to offer you a fantastic deal if we think there is much to gain from the partnership. Also, learning about the problems our customers face is more important to us right now than sheer sales volume. That's why we want you to talk to us. But we're sorry for the inconvenience.


I work freelance. When someone wants a price from me, they want a number. It might not be the number they want to hear. It might not be an accurate number. A lot of the time, it won't be the actual number used, and it will probably not have any relation to the final invoice. There are an infinity of reasons why you might not give someone a number when they ask for a price, and they're all irrelevant.

It's okay to qualify the number. "Ten million dollars, but order now and..."

If you'll note, your customer just said the same thing I did. Listen to him.


The problem sounds not dissimilar to hydrodynamic simulations, for which people have been using MPI for >20 years. I'm curious why you didn't use it. Can you speak to this?


Because our purpose was not to show how fast a simulation can run. We haven't tried to come up with a new approach to simulations. We've tried to present a new approach to databases, and show an example where a naive implementation with the right database can have decent performance.

Also, the very same code would work if actions were triggered by a network request rather than by a main loop. We don't assume anything about how the spaceships are coordinated. If you coordinate them to run in lockstep, you could get better performance. But, again, we've tried to show the viability of the naive implementation, where nothing is assumed.


Is "real-time spatial database" just a fancy term for another distributed, in-memory datastore? There are many implementations of distributed data structures, and many are used for in-memory analytics. I realize SpaceBase also minimizes remote lookups/puts in the average case. I just want to understand the "spatial" aspect they try to integrate into their product.


The spatial aspect comes from the ability to efficiently answer queries such as "what is the set of objects within the following cube?", etc. Look up R-trees if you're curious.
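For concreteness, here are the semantics of such a query, sketched as a brute-force scan; an R-tree answers the same question without touching every object:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustration of a spatial range query: "what is the set of
// objects within the following cube?". A spatial index (e.g. an R-tree)
// would prune most of the search space instead of scanning everything.
public class CubeQuery {
    record Point(String id, double x, double y, double z) {}

    static List<Point> withinCube(List<Point> pts,
                                  double minX, double minY, double minZ,
                                  double maxX, double maxY, double maxZ) {
        List<Point> hits = new ArrayList<>();
        for (Point p : pts) {
            if (p.x() >= minX && p.x() <= maxX &&
                p.y() >= minY && p.y() <= maxY &&
                p.z() >= minZ && p.z() <= maxZ) {
                hits.add(p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Point> ships = List.of(
            new Point("a", 1, 1, 1),
            new Point("b", 9, 9, 9));
        System.out.println(withinCube(ships, 0, 0, 0, 5, 5, 5).size()); // 1
    }
}
```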


I'm no database expert but won't the limitations of scaling across machines largely limit the benefits of this system?


Probably not -- if you're using the right data structure, that is. I've written about the theoretical performance here, http://highscalability.com/blog/2012/8/20/the-performance-of..., and pretty soon we'll publish some empirical results.


I wonder if this type of approach might push programming-language developers to integrate databases into the language itself, beyond just giving programmers an interface to the database.


This is more a UI/theme issue than anything else, but clicking a footnote places that citation behind the sticky header/search bar in FF17


>First, this means using a programming language built for concurrency. At present, there are two such languages in wide(ish) use: Erlang[2] and Clojure.

How exactly does Clojure qualify while Go and Haskell do not?


Oh, Haskell qualifies, it's just not in wide(ish) use yet. Also, I don't know whether it supports mutable concurrent constructs (as Clojure does with refs and STM). Without those you don't have concurrent writes.

As for Go, it allows sharing of mutable state, and so does not prevent races and does not make it clear when modifications to said mutable state are visible. It supports concurrency well but is not built around it.
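For illustration only: the kind of lock-free, retry-on-conflict update that constructs like Clojure's refs or Haskell's TVars provide can be approximated in Java with a compare-and-set loop (this is a simplification of STM, not an equivalent):

```java
import java.util.concurrent.atomic.AtomicInteger;

// STM-style optimistic update: read a snapshot, compute the new value,
// commit with compare-and-set, and retry if another thread won the race.
public class OptimisticUpdate {
    static final AtomicInteger hull = new AtomicInteger(100);

    // Apply damage atomically without taking a lock.
    static int damage(int amount) {
        while (true) {
            int current = hull.get();
            int next = Math.max(0, current - amount);
            if (hull.compareAndSet(current, next)) return next;
            // another writer committed first; loop and retry with fresh state
        }
    }

    public static void main(String[] args) {
        System.out.println(damage(30)); // 70
        System.out.println(damage(90)); // 0
    }
}
```

Real STM goes further: it composes multiple such updates into one atomic transaction, which a single CAS loop cannot do.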


Haskell has marvelous concurrent constructs. It's actually one of the few places where I end up thinking "wow, I'm glad I get to use threads here" and I'm not thinking it sarcastically.

http://hackage.haskell.org/packages/archive/base/4.6.0.1/doc...

http://hackage.haskell.org/packages/archive/stm/2.4.2/doc/ht...


The RedMonk ranking from September 2012 [1] seems to contradict your statement about Haskell's prevalence. Note that it edges out Erlang on both metrics.

[1] http://redmonk.com/sogrady/2012/09/12/language-rankings-9-12...


What is RedMonk, and how is their ranking less B.S. than rankings according to Netcraft?


With OTP it's pretty easy to make an Erlang program with race conditions.


>Oh, Haskell qualifies, it's just not in wide(ish) use yet.

Seems to be more widely used than Clojure.

>Also, I don't know whether it supports mutable concurrent constructs

Yeah, Control.Concurrent.STM has vars, queues, channels, etc.


Clojure runs on the JVM, which means it's 'politically usable' in many more places than Haskell.


I don't dispute that this is true, but I still don't understand it. How have we got to a state where native code isn't politically acceptable?


Native code is just too expensive to write/maintain and is not justified unless you need the extra constant factor performance it gives you, which is quite unrelated to scalability.

In this case, at least, industry seems to be doing it right. C++ is reserved for the industries that really need it (gaming, embedded) and enterprise work focuses more on programmer productivity instead.


So Haskell is tarred with C++'s brush despite being safer than Java...


Haskell is completely managed; it's not very native at all. It has a garbage collector, right? Since everyone is compiling to native instructions these days, GC (and the memory safety required for it) is about the main difference.

At the end of the day, each of your VMs is running on hardware; the point about Clojure being JVM-friendly is more about ecosystem and interop than some efficient-execution concern. You could say Haskell is interoperable with everything and nothing, and be right on both counts.


> the point about Clojure being JVM-friendly is more about ecosystem and interop than some efficient execution concerns.

If you had only written that, you wouldn't have been downvoted. Almost everything else you've written here is wrong.


It will also be dead simple to integrate with SpaceBase, which is Java.


Yes, that is certainly the theory. Have you seen that actually happen in practice? Java shops almost exclusively continue to use Java and refuse to allow the dozens of other JVM languages. Every measure I can find shows Haskell is more widely used than either Clojure or Erlang.


Can you share your Haskell vs. Erlang data?


Just the usual suspects (RedMonk, langpop, TIOBE).


As a professional practitioner of all 3, Haskell is emphatically not used more widely than Clojure or Erlang.


How does being a "professional practitioner of all 3" qualify one to make that assertion? And how do you reconcile it with the fact that every objective measure available shows otherwise?


Link to the objective measures?

The qualification is admittedly subjective, but Haskell is not as practical for use in many use cases, whereas Erlang fits a niche very well and the others (scala, clojure) are more general purpose. My assertion is based on what I know all my peers in the industry to be working on, and the sizes of the projects they are working on. If you have more data than this empirical data set, I'm all ears.

Edit: I see you reference TIOBE and langpop.

This is a *horrible* way of assessing industry usage, just FYI. Niche languages, for example, may see relatively little general usage but may be the de facto answer to a particular type of problem. Many people use Erlang to solve their concurrency problems. Few use Haskell (although that part is improving, and I love Haskell a lot).


>but Haskell is not as practical for use in many use cases

Except that is entirely nonsense. Your misconceptions about the languages in question do not imply a particular amount of usage by other people.

>the others (scala, clojure) are more general purpose

I can't even imagine how one could make such an absurd statement. All that demonstrates is that you have not actually used haskell as you claim, or you only used it 20 years ago.

> My assertion is based on what I know all my peers in the industry to be working on

Which is ridiculous, just like claiming haskell is more widely used than C by only looking at the financial industry.

>This is a *horrible* way of assessing industry usage,

It is a poor way, but it is infinitely superior to the "cause I said so" method you are employing. Erlang and haskell rate close together in all of them, so it certainly makes no sense to pretend one is more widely used than the other. Clojure rates miles behind both, and it is equally absurd to pretend it is so far behind in every measure but is actually ahead based on nothing.


Oh really? Haskell is not suitable for practical use? Try getting any software firm to adopt it and explain to them how I/O works with monads, and what applicative functors are. You will be met with resistance every step of the way.

I love Haskell and use it nearly every day for mathematics and a number of side projects. Please step off your high horse and stop throwing all this ad hominem around. You're giving Haskellers a bad name everywhere.


Haskell BEAUTIFULLY qualifies; it's just that STM and the other abstractions have been in GHC since around 2005-2007, so they're hardly "novel" in that community.


Indeed. All four are reasonably mature, have excellent concurrency and parallelism stories, and have been used practically for this purpose.

The author is right about there being few languages that do this, though.





