Subtly Bad Things Linux May Be Doing To PostgreSQL (rhaas.blogspot.in)
326 points by r4um on April 3, 2014 | 91 comments



Issues like the ones raised are why all sufficiently advanced database engine designs tend to evolve toward a kernel bypass architecture. From the perspective of a database engine, operating systems do a lot of "dumb" things with resource management and scheduling that are essentially impossible to avoid, even though the database engine has enough context to do them intelligently on its own. OS bypass in a database can have substantial performance benefits and add robustness for many edge cases. The obvious downside is that you basically end up rewriting the operating system in userspace, minus the device drivers, which is an enormous amount of work. It is not an incremental kind of design approach and it makes portability difficult. PostgreSQL actually does a really good job of trying to be fast and robust without going to bypass internals.

I've always asserted that the reason we've never seen a proper kernel bypass database engine in open source is that the Minimum Viable Product is too complex. A bare, stripped-down, low-level database engine that does full bypass of the operating system is usually at least 100kLoC of low-level C/C++, and that is before you add all the features a database user will actually want. That is a big initial investment by some people with fairly rare software design skills.


I assert that open source databases don't do it because it's a bad idea. It's the kind of thing you do when you have a lot of spare engineering resources and not many innovative ideas.

Not only is the initial development expensive, so is the maintenance burden. It makes every new idea cost more to implement.

Postgres has been extraordinarily innovative, offering things like transactional DDL, advanced indexing, first-class extensibility, serializable snapshot isolation, per-transaction durability, sophisticated constraints (e.g. non-overlapping ranges), etc. These features are possible because postgres didn't get bogged down reimplementing and maintaining a filesystem.

And what would all of that work gain, anyway? +25% single-node performance? Not a very strategic direction to go for databases. Better to improve JavaScript/json support and get a good logical replication solution in place.

That being said, there are some low level features that are really worth doing. Robert Haas (the author) did some great work with lockless algorithms, which has achieved great concurrency with a manageable maintenance burden.


You both overestimate the engineering complexity and underestimate the benefits. I've designed and worked on a couple of different bypass kernels, as well as on PostgreSQL internals, over the years.

You are correct that the initial development is steep. However, once the infrastructure is there it really is not much different than working with the operating system infrastructure and you gain a level of predictability and stability in terms of behavior that saves engineering time. Also, bypass implementations have almost no locking internally (either "lock-free" types or heavier types) which reduces complexity considerably.

Some bypass kernel code bases allow you to compile with the bypass implementation disabled, using highly-optimized PostgreSQL-like internals. I've seen and run quite a few comparative benchmarks on the same design with and without bypass enabled, as well as absolute benchmarks against engines like PostgreSQL. We don't have to guess about single node performance.

Broadly speaking, a properly designed bypass kernel buys you 2-3x the throughput of a highly optimized non-bypass kernel in my experience. If it was only 25% no one would bother. Furthermore, for massively parallel databases, you essentially require a bypass kernel to design a well-behaved system due to the adaptive operation scheduling requirements.

I agree that it is a lot of work but it is also entirely worth it if you need to either (1) maximize throughput on a single node or (2) build a well-behaved massively parallel database kernel. The differences are not trivial.


Any number we have is going to be sensitive to the workload, so I think it's unfair to say 2-3x without a lot of context.

Also, you dismiss ideas that help the database and the OS work together better. For instance, I did "synchronized scans" for postgres. It coordinates sequential scans to start from the block another scan is already reading, improving cache behavior and dramatically reducing seeks. This could have been done by lots of extra code controlling the I/O very carefully (as at least one paper seemed to suggest was a good idea). But I chose to do it the simple way, just start the scan off in the same place as another scan, and concurrent scans got almost ideal behavior -- each ran in about the same time as if no other scan was in progress (with no overhead in the single scan case).
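
The core of the idea fits in a few lines. A hypothetical sketch (not the actual postgres code; the names and slot scheme here are invented, and real code would need locking around the shared table):

  #include <stddef.h>

  #define SYNC_SLOTS 64

  struct scan_slot {
      unsigned int rel_id;      /* table being scanned (0 = slot unused) */
      unsigned int cur_block;   /* block the scan last reported */
  };

  static struct scan_slot slots[SYNC_SLOTS];   /* lives in shared memory */

  /* A new sequential scan asks: is someone already scanning this table?
     If so, start at their current block and ride the same readahead. */
  unsigned int sync_scan_start(unsigned int rel_id)
  {
      for (size_t i = 0; i < SYNC_SLOTS; i++)
          if (slots[i].rel_id == rel_id)
              return slots[i].cur_block;
      return 0;                                /* no other scan: start at 0 */
  }

  /* Running scans periodically publish their position. */
  void sync_scan_report(unsigned int rel_id, unsigned int block)
  {
      size_t i = rel_id % SYNC_SLOTS;          /* crude slot choice */
      slots[i].rel_id = rel_id;
      slots[i].cur_block = block;
  }

A scan that joins partway through just wraps around at the end of the table to pick up the blocks it skipped.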

Linux is clearly interested in allowing more hooks and making them more useful. From an engineering standpoint, that makes more sense to me.

Two other points:

* I'm a little skeptical that such a bypass can easily be made resilient to some strange/degenerate cases.

* You say that the reason an open source system won't do it is because the MVP is too expensive. But the MVP for a cost-based optimizer is also very expensive, and postgres has one of those. I think that was a much better investment than investment in the filesystem/scheduling layer.


Jeff, I am familiar with your work, I lurked on the PostgreSQL hackers mailing list for years when I was hacking on that database. :-) I am not dismissing the coordination of OS and database, it just has really deep limits because the OS must hide information critical to optimizing database throughput.

While the increased throughput is a complex function of hardware, workload, etc, it is also consistently substantial. The reason why it works is simple: the database processes have a nearly omniscient view of hardware and state, and there is (in modern designs) only a single process per core. Consequently, even if you have thousands of concurrent high-level database operations, each process can dynamically select and continuously reorder the low-level operations to (nearly) optimally maximize the throughput for the execution graph at that moment, because the execution is completely cooperative. You can do the “synchronized scan” optimization for CPU caches that you do for disk systems. You can schedule around any conflicts in the execution graph, and even the impact of outside CPU interrupts can be detected and optimized around. And it is easy to track the aggregate costs of these choices. To the extent possible, every clock cycle is spent on end-user database work instead of database internals overhead.
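
To make the shape of that concrete, a toy sketch (purely illustrative; no real engine is this simple): one worker per core owns a queue of short run-to-completion operations and picks what runs next itself, so the hot path never touches the kernel scheduler.

  #include <stddef.h>

  struct op {
      int cost;                  /* stand-in for the engine's cost model */
      void (*run)(void *arg);    /* short step that never blocks */
      void *arg;
  };

  /* One of these loops runs per core. Because ops yield by returning,
     the worker can continuously reorder what executes next. */
  void worker_loop(struct op *queue, size_t n)
  {
      while (n > 0) {
          size_t best = 0;
          for (size_t i = 1; i < n; i++)
              if (queue[i].cost < queue[best].cost)
                  best = i;
          queue[best].run(queue[best].arg);   /* run to completion */
          queue[best] = queue[--n];           /* O(1) swap-remove */
      }
  }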

So minimal processing stalls, micro or macro, and no context-switching or coordination overhead. All combined with incredible locality knowledge (by inference) that is not available if you let the OS manage things for you.

On your other two points:

- Bypass is generally more resilient partly because the software has more explicit and immediate knowledge of the nature of the fault and can do something sensible about it. Obviously you have to handle faults when they occur. A lot of OS behavior when faults occur is pathological from the standpoint of optimizing databases. It is like memory management in C; it requires extra effort but also adds extra power if you handle it well.

- Postgres has expensive capability add-ons to an existing, useful system so it is more incremental in nature. The problem with OS bypass database kernels (and I learned this the hard way) is that (1) they are huge in terms of LoC long before rudimentary functionality is available and (2) it takes many years of atypical software design experience to be competent at trying to write one. It could be done, but it would require a critical mass of a tiny demographic willing to do a lot of work. My argument in this regard was less about inevitability and more about statistical probability.

I spent a lot of years hacking on and customizing Postgres. I recommend it to anyone and everyone that will listen because it is a great piece of engineering and would still use it for many OLTP systems. But it does leave a lot of performance on the table for a variety of reasons that probably make sense for a portable, open source project. The fact remains that I can design and have built bypass kernels that are substantially faster largely by exploiting the optimizations bypassing offers.


Perhaps it's just because I've never seen a good implementation of a bypass, and I might agree if I had seen one. Like many things, maybe it just takes the right people to make it successful.

Postgres leaves a lot of performance on the table in much more basic ways, too, so I certainly am not suggesting that postgres is anywhere near optimal.


Stupid (and somewhat tangential) question: how do bypass kernels work with virtualization, if at all?


Is there a middle ground here somewhere, where the kernel developers create some sort of DB-specific hooks that allow some of the kernel bypass mechanisms to be implemented?

What are the key things that a kernel bypass version does different? Can these be separated out in a concise way which would lead to multiple DB implementations being able to use these same interfaces? Essentially for any major DB system, you'd want the kernel tailored anyway - you're not going to be doing much else on your DB server (are you?)


Or you could, you know, fix the OS like they are trying to do in TFA.


The argument is that "fix" is relative. Things that improve Postgres's performance may negatively impact other applications. The suggestion, therefore, is that Postgres takes control of these tasks for its own purposes, and then it doesn't have to worry about the implications for other systems or wait on anyone else.


Maybe it's possible to port a database to a model like OpenMirage's [1], building against all the runtime it requires but running on top of a hypervisor?

[1] http://www.openmirage.org/


And one of the reasons that software like Oracle is so complicated and expensive. Oracle spent years getting the OS kernels out of the way, partially for platform consistency/supportability sake and partially for performance. They obviously still depend on the kernel for a variety of things, but memory management, networking, filesystems, etc. can all be done in Oracle-space at this point.


But what it buys Oracle is not performance, but rather bragging rights for being a little faster at a moment in time.

However 3-6 months later, you'll get comparable performance from improved kernel, CPU and disk speeds. Are those 20% in performance for 6 months worth the premium oracle is charging (which, in part, reflects their harder work)? For most customers most of the time the answer is no.

If you depend on performance, you don't use Oracle in the first place - Vyahu, OneTick, kdb, Vertica are the speed demons (as well as TimesTen which was acquired by Oracle - but is distinct from their "standard" offering)


These issues are actually a perfect counterpoint to bypassing kernel caching and scheduling. There is no way to overcome the first issue in userspace, the second is triggered by what amounts to too-aggressive caching in userspace, and the third is something that you should not be doing anyway (I'm not exactly sure if there is something better the kernel can do in that case, except doing write-through caching on writes, which has its own performance implications for other users of O_DIRECT).


I think you aren't understanding what a bypass kernel does. It literally takes control of the physical resources to the extent that kernel interfaces exist to allow it and, at least in the case of Linux, the level of control possible is quite high. That means taking control of the CPU, physical memory, disk I/O, network I/O, etc so that they can be scheduled and managed from userspace, all at the same time, starting at bypass kernel initialization. For obvious reasons, it is usually a bad idea to run other non-trivial processes on the same machine because they will tend to be resource starved.

Once you have these resources, you can organize them and use them as you see fit. Because it is not going to the kernel for any resources or buffering or scheduling or memory etc, there is little opportunity for the OS to do the wrong thing with resources that already are tightly controlled by the runtime. However, this is also why it is an "all or nothing" kind of situation.


Interesting. When I read your first comment, I didn't realize you meant "bypass kernel" in as wide a scope as you describe here. Usurping CPU scheduling, in particular, was something I didn't think was common practice. Can you name a (presumably commercial) DBMS that represents this kind of "self-manage everything" architecture?


Most of the big "enterprise" OLTP databases are designed this way; the less portable they are, the more likely they are doing deep bypass optimizations. DB2, SQL Server, and similar are bypass designs. Oracle used to be a weird hybrid, due to portability requirements, but since they took control of the hardware I would assume recent versions are mostly pure bypass.

Most commercial analytical databases are not bypass, due in large part to the fact that most of them are based on Postgres, ironically.


How does this work? I don't see how it's possible to fight the scheduler or how caching works via userspace.

Maybe I don't understand this stuff, but maybe the kernel should have some bypass API for high performance applications, instead of coders finding curious ways to fight it.


You can bypass filesystem caching on Linux using O_DIRECT. The tradeoff (besides the obvious lack of caching) is that there are specific alignment restrictions on the length and address of the userspace buffers and file offsets, and these restrictions may vary by filesystem and by kernel.
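
A minimal sketch of what that looks like in practice (assumes Linux/glibc; the 4096-byte alignment is a common case, not a guarantee, and some filesystems refuse O_DIRECT entirely):

  #define _GNU_SOURCE               /* O_DIRECT */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t align = 4096;    /* assumed; varies by fs and kernel */
      const size_t len   = 4096;    /* must also be a multiple of align */
      void *buf;
      int fd;

      if (posix_memalign(&buf, align, len) != 0)
          return 1;
      memset(buf, 'X', len);

      fd = open("directfile", O_WRONLY | O_CREAT | O_DIRECT, 0666);
      if (fd < 0) {
          perror("open(O_DIRECT)");
          return 1;
      }

      /* Aligned buffer, aligned length, aligned offset: the write goes
         straight to the device, bypassing the page cache. */
      if (pwrite(fd, buf, len, 0) < 0)
          perror("pwrite");

      close(fd);
      free(buf);
      return 0;
  }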


You can bypass the scheduler by binding processes to processors. Some runtimes do this -- and sometimes get much worse performance as a result. My colleagues and I have seen many programs that think they know better than the OS do much worse for trying.
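
For reference, the mechanism itself is tiny -- a sketch of pinning on Linux (whether it helps or hurts is exactly the workload-dependent question above):

  #define _GNU_SOURCE               /* cpu_set_t, sched_setaffinity */
  #include <sched.h>
  #include <stdio.h>

  /* Pin the calling process (pid 0 = self) to a single CPU. */
  int pin_to_cpu(int cpu)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      if (sched_setaffinity(0, sizeof set, &set) != 0) {
          perror("sched_setaffinity");
          return -1;
      }
      return 0;
  }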


The kernel does have bypass APIs, or APIs that can be used for that purpose. However, you can't use just a little; once you start down that path you need to bypass everything.

To be clear, while the bypass APIs are simple to use you actually have to know what you are doing since you become responsible for doing things the OS used to do for you. I/O scheduling, disk caching, process scheduling, memory management, etc all have to be reimplemented in userspace.

It is why I mentioned that the skill set required to do bypass kernels is fairly rarefied. You can't just reimplement what the OS already does; you need to implement something that is different from the OS design but better, for this use case, at providing the functionality the OS provides. You are essentially writing a purpose-optimized OS without the device drivers.


How about a special filesystem api for databases?


At this point why aren't you just building the database on top of a minimal OS? Then run that in a VM, say


> Then run that in a VM

You think that VMs aren't subject to the host OS scheduler, caching, and memory allocation quirks?


>scheduler

eh, it could pin threads at realtime priority

>caching

vm wouldn't use the OS cache

>memory allocation

it can easily allocate memory up front
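
Mechanically, that recipe on the host looks roughly like this (hedged sketch; SCHED_FIFO and mlockall usually need root or CAP_SYS_NICE/CAP_IPC_LOCK, and a spinning SCHED_FIFO task can starve the rest of the machine):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
      /* Realtime priority: normal tasks can no longer preempt us.
         50 is an arbitrary mid-range value for illustration. */
      struct sched_param sp = { .sched_priority = 50 };
      if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
          perror("sched_setscheduler");        /* probably needs root */

      /* Allocate up front, touch every page, and pin it so the host
         never swaps it out from under the guest. */
      size_t len = 256UL * 1024 * 1024;        /* arbitrary example size */
      char *buf = malloc(len);
      if (buf == NULL)
          return 1;
      memset(buf, 0, len);                     /* fault every page in */
      if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
          perror("mlockall");                  /* probably needs root */

      /* ... guest work would run here ... */
      free(buf);
      return 0;
  }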


I love that infinitely nerdy stuff like this can still make it to the HN homepage (there's still hope!).

Stuff like this really needs to make it into the PG tuning guide (https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...). The only place where it will ultimately be seen by a worthwhile audience.


Indeed, this should really go into the PostgreSQL docs, possibly even be raised on their mailing list to get it eventually included into the official docs, maybe somewhere around here:

http://www.postgresql.org/docs/9.3/static/runtime-config.htm...


Nerdy stuff? Wtf?

What are you, a school bully like the ones we see on American shows?


> I love that infinitely nerdy stuff like this can still make it to the HN homepage (there's still hope!).

I can't wait for "Subtly Bad Things Linux May Be Doing To 2048"


This isn't helping.


> infinitely nerdy stuff

Since when is "in depth" and "technical" equivalent to "infinitely nerdy"? We're professionals using that kind of information for work, not nerds doing infinitely nerdy stuff.


Since when is "anti-social" and "nit-picky to the point of absurdity" equivalent to "expressing critical thought"?


As a professional using this kind of information, I have to strongly disagree. This is infinitely nerdy stuff, and I wouldn't have it any other way :)


I highly doubt he/she meant it in a derogatory way.


Being a user of both Linux and PostgreSQL, I'm very interested in this issue, but I only understand some of the words...

Could somebody wiser than me tell me whether I should be concerned, and what the possible implications of these decisions are? Should I invest in alternative platforms?


On the NUMA front Linux is, as far as I know, leaps and bounds ahead of FreeBSD, Solaris and Windows. All four platforms offer ways to tune process and memory allocations by hand, but Linux is the only platform that puts a lot of work into making automatic NUMA scheduling actually work. It is actually a pretty difficult problem. Something to read if you are interested:

http://queue.acm.org/detail.cfm?id=2513149
http://lwn.net/Articles/591995/
http://lwn.net/Articles/568870/


These are very minor issues and will be noticed by very few.

If you are running a huge database and need every possible bit of performance this will matter, otherwise it's not something to worry about.


You probably don't need to worry about it, it's just further proof that the postgresql devs do it right, and you're in safe hands.


Back in ye olden times when I was an Informix DBA, we had to worry about stuff like this for storage.

It was always a fight with the storage guys, because they wanted to use their fancy Veritas File System for optimizing disk utilization, and us prima-donna DBAs wanted raw LUNs and allow the database engine to manage our disk, because it maximized our transaction throughput. Some DBAs even wanted whole disks allocated, so they could control where data lived from a disk geometry POV. There were (mostly) valid arguments for doing this, most of which have gone away over the years.

This is an issue like my disk issue -- corner cases that need to be thought about in situations where you are investing lots of engineering effort into your databases. If you don't have a couple of angry DBAs whom you're always arguing with, you don't need to worry about this.


It is something that could be helpful to have at the back of your mind if you're tasked with optimizing postgres and OS settings for big workloads.

But it's one of a myriad of little things, not something that could inform a platform decision. It's much more interesting for kernel devs than it is for postgresql users.

I think what this shows is that issues related to interactions between RAM, caches and CPU cores are becoming a lot more complex on all platforms.


Same here. I was a FreeBSD user, but a few years ago it was hard to find a VM host that supported FreeBSD. I may switch back if this issue is NOT addressed.


The first issue is relevant only on systems that have more than one NUMA node, which is probably every meaningful physical server and essentially no VM (at least on Xen, multiprocessor VMs are a single NUMA node), as it does not make much sense to advertise NUMA topology to guest VMs.

The second issue is relevant for postgresql mostly only if you use very large shared_buffers, which anyway is not recommended for general workloads. Writing a page that exists on disk and was not read a short time before is not an especially common thing to do.


NUMA can absolutely bite you in virtual servers, but without access to the hypervisor you'll never know why it's happening (JVMs straddling NUMA regions have caused me pain in the past, when the guest was split across memory regions).


The point is that kernel inside VM guest knows nothing about NUMA, so it cannot do any kind of NUMA optimizations hence such optimizations cannot hurt performance as they do not happen at all.


Actually, KVM can allow you to create NUMA domains inside the guest, for better or worse.


If you're running in a VM, the VM may impose its own constraints on the disk write and virtual memory issues.


Anybody know of a FreeBSD comparison?


+1 I'd love to see a FreeBSD kernel hacker chime in here.

ZFS enables some interesting things for pgsql:

* http://open-zfs.org/wiki/Performance_tuning#PostgreSQL - it seems like the primarycache setting prevents the double buffering problem that Linux' page cache has

* http://citusdata.com/blog/64-zfs-compression

I run pgsql on FreeBSD/ZFS and have no complaints but am not taxing the system.



I love ZFS, but Oracle is holding it up which is very annoying.


How do you mean? I trust OpenZFS more and in fact would consider it negligent if Oracle wasn't cherry picking OpenZFS bug fixes. Several original Sun engineers including ZFS co-founder Matt Ahrens are involved with OpenZFS.


Yes, as I have elaborated upon at length[1], the ZFS engineers have long-since left Oracle -- and the open community's ZFS (that is, OpenZFS) has become the ZFS of record.

One clarification: Oracle can't actually cherry-pick back OpenZFS bug fixes because they are (ironically) violating the CDDL by not making available source code. This isn't an issue for the code for which they hold copyright -- but that doesn't include any of the bug fixes and features that we've seen in OpenZFS since 2010. And yes, it is absolutely negligent, but of a different sort than you intended...

[1] http://www.youtube.com/watch?v=-zRN7XLCRhc


Cheers Bryan, that's a great one of your talks and thanks for the interesting licensing comedy.


Do you mean they're holding back progress? With OpenZFS on the scene, it's highly likely that the original Sun/Oracle implementation will become irrelevant to everyone apart from users of Oracle's Solaris releases.


Aside from the NUMA stuff, the disk buffer issue probably depends on if you're using ZFS or not.


And also whether the kernel is reporting the true block size of the device; xref advanced format drives (what the drive reports and what the drive does can be two different things), SSDs (minimum write size is typically much bigger than the reported block size), SANs. The true underlying block size can sometimes be determined by benchmarking the latency of different write sizes and also unaligned writes of different sizes.
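
One hedged way to run that probe, sketched in C (heuristic only: under O_DIRECT, writes smaller than the device's real granularity either fail outright or pay a read-modify-write penalty, and both show up here; the file name and sizes are arbitrary):

  #define _GNU_SOURCE               /* O_DIRECT */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  /* Time `reps` synchronous direct writes of `sz` bytes at offset 0. */
  static double time_writes(int fd, const char *buf, size_t sz, int reps)
  {
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < reps; i++) {
          if (pwrite(fd, buf, sz, 0) < 0)
              return -1.0;          /* EINVAL here is itself a signal */
          fdatasync(fd);
      }
      clock_gettime(CLOCK_MONOTONIC, &t1);
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void)
  {
      void *mem;
      if (posix_memalign(&mem, 4096, 64 * 1024) != 0)
          return 1;
      memset(mem, 'X', 64 * 1024);

      int fd = open("probe.dat", O_RDWR | O_CREAT | O_DIRECT, 0666);
      if (fd < 0) {
          perror("open(O_DIRECT)");
          return 1;
      }

      /* A sharp per-byte latency step between two sizes suggests the
         true block size sits at that boundary. */
      for (size_t sz = 512; sz <= 64 * 1024; sz *= 2)
          printf("%6zu bytes: %.3f s\n", sz, time_writes(fd, mem, sz, 100));

      close(fd);
      return 0;
  }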


Dragonfly is probably the best BSD to run for database performance.

http://www.dragonflybsd.org/performance/


That chart is 18 months old. FreeBSD 10 got a number of SMP-related improvements (and some for ZFS as well, if I'm not mistaken). It'd be interesting to see an up-to-date comparison with FreeBSD 10 (and maybe 11-CURRENT), DragonFlyBSD 3.6, and NetBSD 6.1 (and maybe a couple of Linux distros for good measure).


FreeBSD has something analogous to SHMMAXPGS and SHMMAX for the shared memory that postgresql's settings depend on: the kern.ipc.shmmax and kern.ipc.shmall settings in sysctl.conf.


Not relevant since PostgreSQL 9.3, as PostgreSQL now uses Posix shared memory / mmap rather than System V.
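
For anyone curious what the change means concretely, a minimal sketch of the POSIX style (shm_open + mmap, which is not subject to the old SysV SHMMAX/SHMALL limits; 9.3's actual implementation differs in its details):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t len = 1 << 20;   /* 1MB example segment */

      /* Create a named POSIX shared memory object and size it. */
      int fd = shm_open("/demo_seg", O_CREAT | O_RDWR, 0600);
      if (fd < 0 || ftruncate(fd, len) != 0) {
          perror("shm_open/ftruncate");
          return 1;
      }

      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
      }

      /* Another process can shm_open("/demo_seg") and mmap the same pages. */

      munmap(p, len);
      close(fd);
      shm_unlink("/demo_seg");
      return 0;
  }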


I don't think this has any direct influence on performance.

It was necessary to increase the default limits for larger shared_buffers, etc on pgsql prior to 9.3 where SysV shared memory was used, but this was common to many *nix operating systems.


Thanks a lot, really. I have been using Linux for a while now. Thanks for the update.


Point 2 is not actually the case, as long as your write does not partially fill a pagecache page (pagecache pages are the same size as the architecture's native page size - 4K on x86, x86-64 and arm).

You can demonstrate this with a program like the following:

  /* Headers added; pwrite, fdatasync, and posix_fadvise need them. */
  #define _POSIX_C_SOURCE 200112L
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
    int i;
    char pattern[512*1024];   /* an exact multiple of the 4K page size */
    int fd;

    for (i = 0; i < sizeof pattern; i++)
        pattern[i] = 'X';

    fd = open("testfile", O_RDWR | O_CREAT | O_EXCL, 0666);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    while (1)
    {
        /* Overwrite whole pages, flush them out, then evict them from
           the page cache so the next iteration cannot hit cached data. */
        pwrite(fd, pattern, sizeof pattern, 0);
        fdatasync(fd);
        posix_fadvise(fd, 0, sizeof pattern, POSIX_FADV_DONTNEED);
    }

    return 0;
  }
...then watch vmstat or iostat while it's running. Plenty of writes, no reads.

On the other hand, if you subtract one from the size of 'pattern', you'll see that you also get reads (as partially writing the last page requires a read-modify-write cycle).


I was under the impression from reading another article that setting the correct kernel I/O scheduler helps with a related concept: http://www.cybertec.at/postgresql-linux-kernel-io-tuning/


For 1: I don't remember exactly which machines use NUMA, but I thought it was limited to the first Opterons (and the behaviour makes sense)

2: Not sure; this may be specific to the FS, or something to do with the behaviour of MMAPed files. However, I don't know how you guarantee that what you're writing corresponds to a single block in the FS (unless you're writing directly to /dev/sda, and even then)


1 affects any multi-socket Intel system from Nehalem onward, and any multi-socket AMD system from the Opteron onward.

So basically most physical servers.


This is slightly OT, but I always wondered: what is the overhead of fetching cached stuff from the file system vs. having a built-in cache? Is there a way to circumvent going through the kernel and avoid a context switch?

In other words: could a database have only minimal built-in caching and instead rely on the OS cache?


Postgres does rely on the OS cache. The "effective_cache_size" parameter is there for you to tell postgres how big you are expecting your OS cache to be and it is supposedly used when planning queries, presumably so it can rely on a plan having cached reads.
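
E.g. in postgresql.conf (the values here are purely illustrative, not recommendations):

  # postgresql.conf
  shared_buffers = 4GB            # postgres's own buffer pool
  effective_cache_size = 12GB     # planner hint: expected size of the OS cache

Note that effective_cache_size doesn't allocate anything; it only tells the planner how much of the data it can assume will be sitting in the OS cache.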


The database has a better understanding of the workload and therefore can make more intelligent caching decisions. The file system cache does a good job; however, it's not as effective at predicting what should be cached.


Some metrics in graph form backing up the post would be great, so we mere mortals could understand which way to go in order to achieve maximal performance.


For (1), you can use numactl to change the default behavior. For #s (2) and (3), I'm pretty sure this is FS dependent?
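
For example, running postgres under `numactl --interleave=all` spreads its allocations across nodes; the same policy is available programmatically through libnuma (a hedged sketch, assuming libnuma headers are installed; build with -lnuma):

  #include <numa.h>                 /* libnuma; link with -lnuma */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support on this system\n");
          return 1;
      }

      /* Interleave the buffer's pages round-robin across all nodes,
         instead of backing them all from the allocating CPU's node. */
      size_t len = 64UL * 1024 * 1024;
      void *buf = numa_alloc_interleaved(len);
      if (buf == NULL)
          return 1;

      memset(buf, 0, len);          /* pages get placed as they fault in */
      numa_free(buf, len);
      return 0;
  }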



Not to detract from the very intelligent and reasoned posting, but what tiny percentage of people honestly still use fat-ass RDBMS as their primary datastore and would be better off performance tuning it at the kernel IO level than actually analyzing their load and subsequently sharding or migrating their data structures to less behemoth-like datastores? Yes, RDBMS are easy to hire developers and DBAs for, are well supported and full-featured. However, in this day and age using them just feels a little ... lazy ... for most workloads.


That's a very obtuse point of view. I'm curious, sociologically: what field do you work in, and what is your exposure to data?

Consider an inventory system for a big box retailer. I can't think of anything better than a fat-ass RDBMS as the primary data store. Sharding sounds like a horrific idea. There are myriad workloads like this.

Personally, I've seen pgsql handle terabytes of data just fine and it wasn't really noteworthy or a source of problems to even bring up considering something else. YMMV but it's a good idea to use logic and reason to dictate architecture instead of following the shiny thing or hubris.


Well, yes and no.

Everybody knows that relational databases don't scale because they use JOINs and write to disk.

Also, relational databases weren't built for web scale. MongoDB handles web scale. You turn it on and it scales right up.

And before you knock shards, shards are the secret ingredient in the web scale sauce. They just work.

Furthermore, relational databases have impetus mismatch, and Postgresql is slow as a dog. MongoDB will run circles around Postgresql because MongoDB is web scale.


Are you being intentionally sarcastic? Because this reads a lot like http://www.mongodb-is-web-scale.com/

Edit: Whoops, just read your reply :)


His post was a textbook example of Poe's Law

http://en.wikipedia.org/wiki/Poe's_law

"without a clear indication of the author's intent, it is difficult or impossible to tell the difference between an expression of sincere extremism and a parody of extremism"


Thank you for this. To all the haters: davidw is quoting a funny animated poke at people who lightly consider a DB problem and throw out the NoSQL mantras without even understanding what they themselves are saying and implying.



I was actually referring to this: http://www.mongodb-is-web-scale.com/ - which is what contingencies' comments reminded me of, but I guess people either didn't get it or thought it was a bit stale. C'est la vie.

I am a happy Postgres user and always default to it unless I am really sure a project calls for something else.


It cracked me up!

And same here, I tell people to start their datastore selection with looking for a reason NOT to use Postgres.


You know what is web scale? WebScaleSQL is. :)

http://webscalesql.org/


Sure, if you are running stats across everything in a nontrivial and frequently changing way, then you have a great ally in an RDBMS. But I don't believe many people do that, because usually that sort of stuff is pretty damn predictable, executed offline, or can be consolidated from shards.

However, if you have any of the following: (1) vastly different security requirements for different parts of your datastore (2) vastly different backup schedules or temporal sensitivities (3) privacy requirements deriving from different legal jurisdictions (4) wish to scale by running on commodity hardware (5) cannot tolerate any downtime whatsoever ... and probably many other cases ... then in my experience you are going to meet some serious issues with conventional RDBMS, at least with the vast majority of configurations.

I'm all for logic and reason too... but your comments seem closer to name-calling and a single example.


Apart from your point on "wish to scale by running on commodity hardware", I'd say that relational databases handle all of those other things pretty well - it might cost you an arm and a leg for the licenses, hardware and network connections, but those goals are achievable.

Anyway, in a lot of environments it's applications that drive the choice of database engine - not the other way round.


I counter that many people have met each of your numbers for the past 20 years using commercial RDBMS.

I can't think of anything that is magnificently easier or better at solving your numbers, especially all together. #4 seems less relevant, is it really cheaper than operationalizing a distributed system? These days, likely for situations where consistency can be relaxed. Not so for many business workloads.

Can you enlighten us with some example products for your numbers?


Haha, I went out and these comments got downvoted to Pluto. Honestly though, I haven't heard a decent argument in response other than "lazy is good". Sure, but architecturally you're basically in either the "engineers run the architecture" camp or the "it's an architecture of convenience for business purposes" camp. I'm in the former; I'd like to hope that some nontrivial subset of the participants here are in the former, but most are no doubt in the latter. People get upset when you slight their world. That's understandable. The TL;DR is: even if people made a lot of stuff happen 20 years ago, it doesn't justify using the same methods today, and discussing the tradeoffs is constructive, not dismissive.


The reason you're getting downvoted so heavily is that you lobbed heavy accusations without any backup (projects, whitepapers, journal submissions please). Distributed databases are still a specialty today, mainly because they have inherent tradeoffs. If you don't understand how hard those tradeoffs are you SHOULD NOT be using a distributed database by default. I was hoping maybe you had something tangible to share.


If you look at what I actually said, I was expressing some skepticism with regards the payoff from investing time on very low level optimizations on conventional RDBMS for most workloads versus sharding the database and/or migrating to other storage models. That's a tangible line of thinking to consider. Note that I did not at any point say "someone's PhD asserts...", talk in absolutes, or slam RDBMS as a potentially viable or proven option.


"However, in this day and age using them just feels a little ... lazy ... for most workloads."

IN DEFENSE OF BEING LAZY AS A PROGRAMMER

The essential mission of a computer programmer is to use computers to solve problems. Being lazy can come in one of two forms:

1) Solving problems badly or not solving them at all, or
2) Relying on someone else's solution instead of coming up with your own.

Using a RDBMS is Type-2 Lazy. Now, I want you to get out a pen and paper and write this next bit down, because it is the most important thing you will ever learn:

EVERYBODY SHOULD BE TYPE-2 LAZY BY DEFAULT, ONLY DEVIATING FROM THIS IF THERE IS A COMPELLING REASON NOT TO.

Why?

1) Other people's solutions have been used, which means they've been tested in real-world use. Things you haven't thought of yet because you don't yet have a working solution have been at least discovered, because people are using it. Sometimes they're even addressed.
2) Other people's solutions may have tools, documentation and communities built around them, making them easier to learn about, use and work with.

There are two decades of work put into Postgres itself, and even longer periods of work put into the general field of relational databases. Corner cases you can't even conceive of have been encountered and patched for. The entire codebase of Postgres contains large amounts of accumulated wisdom on how to store data in a safe and retrievable fashion. And large communities have sprung up, to provide you with tools and wisdom on how to use it to best suit your needs.

NoSQL databases are useful for certain workloads and setups. It would be absolutely wrong to dismiss them out of hand. Having said that, anyone whose DEFAULT PREFERENCE is to eschew traditional RDBMS as a data store in favor of software that has been around for less than a quarter of the time that even the newer of the popular RDBMS systems have been around because using well-tested solutions is LAZY needs to have a restraining order keeping them at least 100 yards away from a keyboard.


Absolutely agree. The compelling reason can be business requirements as previously noted, eg. scalability, security, law. Unfortunately if you're doing something global and non-trivial that's the rule rather than the exception, in my experience.


Don't feed the troll.



