DynamoDB to Postgres: Why and How

hesdeadjim · on June 21, 2017

DynamoDB was a constant source of anxiety for me and my team. We fit the usage pattern perfectly, so the author's points didn't apply to us.

Our primary problem was having to constantly manage and change our provisioned capacity so that we could respond to usage spikes and keep our bill from becoming stratospheric. Worse, anytime we'd add new functionality -- such as nighttime analytics jobs that scanned the table or daily backups, it meant revisiting the parameters for our auto-scaling or risk getting throttled and having our application randomly start spewing errors.

DynamoDB's billing and provisioning model is awful to deal with. When your team is having serious conversations about the repercussions of your data values going from 4kb to 8kb, you very quickly begin questioning your decision to use DynamoDB in the first place.

Zaheer · on June 21, 2017

FYI DynamoDB in last few weeks added auto-scaling: https://aws.amazon.com/blogs/aws/new-auto-scaling-for-amazon...

Additionally they somewhat recently added an optional caching layer: https://aws.amazon.com/dynamodb/dax/

hesdeadjim · on June 21, 2017

Nice they finally added auto-scaling. It seems to function pretty similarly to the tool we built, with the same (documented) limitations.

DAX would have definitely addressed many of our issues, had it been available at the time. Glad they added that.

azinman2 · on June 21, 2017

So does that address all your issues?

hesdeadjim · on June 21, 2017

No, for our usage needing to worry about provisioned capacity at all was becoming a deal-breaker. That especially became tedious to estimate and plan for as we had more variability in the size of our values we needed to store.

eropple · on June 21, 2017

The high score I ever saw at a company I worked for was about $85K in DynamoDB spend. That's an eight, a five, and three zeroes after it. It was a feature so otherwise silly and simple that I don't want to flame the developers thereof by describing it, but it might have been the most gratuitous burning of cash I've ever seen at a startup.

hesdeadjim · on June 21, 2017

Ha, yea I've heard of numbers like that. I remember talking to someone at re:Invent who described their high five figures monthly bill in almost reverent terms -- it made sense after he described their 200+ MySQL server catastrophe DynamoDB replaced.

We actually ended up engineering our system to use a variant of event sourcing. This let us dump DynamoDB and use RedisLabs and S3 in such a way that we were able to handle 1mil DAU with a 10:1 write/read ratio for around $5k/mo -- including front end/backend instances.

kornish · on June 21, 2017

> This let us dump DynamoDB and use RedisLabs and S3 in such a way that we were able to handle 1mil DAU with a 10:1 write/read ratio

This sounds really interesting – do you have a writeup of the system architecture anywhere?

twohlix_ · on June 21, 2017

I would also be up for some reading regarding optimizations/architecture choices for a 10:1::write:read setup.

hesdeadjim · on June 21, 2017

It's definitely a write-up I've wanted to make before.

The short version of event sourcing is that instead of looking at your data model in standard CRUD terms, represent your data as an ordered list of changes. Updating an address field on a user model would become something like ChangeEvent{address=...}, rather than a full read/write of a user blob.

To derive the current state of the object, you would replay and apply all events to a blank object. Obviously over time the list of changes could become enormous, so what you do to optimize is periodically create a snapshot of the current full state of the object and only replay events that happened after it was made.

For our persistence layer we used Redis as the input queue for all events. Each object we stored required a list to push changes into, an integer counter to represent the sequence ID of the most recent event in the list, and a string value containing the location in S3 of the most recent snapshot (if it exists). Performance was incredible for writes because a) it's Redis and b) RedisLabs supports semi-transparent sharding.

We used S3 to store the snapshots. We called the creation process a "compaction" because besides storing the most recent full state of the object, we would also store off the list of changes used to make it and purge those events from Redis. This kept our Redis storage usage tightly bounded and our costs down.

This compaction process was also 100% safe with no chance of race conditions or data loss. Our algorithm would fetch the list of changes from Redis, create and serialize the current state of the object as a snapshot, and then write it and the list of changes to S3. Only upon success of those S3 writes would it then issue a command to Redis to purge that object's list of changes. To prevent race conditions on this part of the operation, we would send the sequence ID we are compacting from and in a Redis script we would atomically check to make sure no more-recent compaction has happened, update the snapshot pointer, and flush the list of changes only up to the sequence number we compacted from (changes were free to be added to the list while a compaction process was in progress).

The killer part of this whole setup was that those snapshots and change lists stored in S3 were completely immutable and formed a linked list of the entire history of changes for an object. We could visualize and inspect the entire change history of an object, find bugs and retroactively fix them by changing how an event is applied, and if needed for read performance, cache the S3 snapshots in memcache without any concern for needing to invalidate them.

Concurrent writes to an object became less of an issue as well. Instead of needing to do a CAS write or a transaction, you would just push the change you wanted to make to the object, read it back, and verify it was actually applied. No locking needed.

The major downside to this architecture is that any querying you need to do will always be weakly consistent and require an external index. We had a small Elastisearch cluster for some of our objects that had fields needing indexing.

One last major advantage of this setup was that it was trivial to "clone" production to your local dev machine. Because snapshots were immutable, we set up a tiered database layer that would first try and load snapshots locally, and failing that, would use a read-only IAM user to access S3 to pull down a snapshot from production. All you would then do is load a backup from a RedisLabs database locally and boom, you have an exact copy of production that you can freely modify because any new writes all happen locally.

Believe it or not, the most unexpected cost from this architecture was PUT requests to S3. Depending upon the frequency of your compactions it can start to add up. However, if usage declines, that cost also declines with no impact on the availability of your data.

kornish · on June 22, 2017

Fascinating – thanks very much. As you alluded to, this does sound like an ideal use case for Kafka these days.

What understanding of your access patterns did you have before you decided to engineer the system that you did? Also, I'm curious about developer costs vs compute costs. How much would have been paying for a less developed but less efficient solution?

hesdeadjim · on June 22, 2017

We had the luxury of years of prior experience within the company of building mobile game backends so we had a very good idea of what the access patterns would be. That said, the new product we were building the backend for was filled to the brim with "social" features that presented new scalability and performance concerns.

I was able to justify the existence of a small dedicated backend team by claiming I could deliver 10x the performance at 1/10th the cost of our most popular game's backend. Given that the monthly server costs of that game could be measured in multiple engineer salaries, this proposition was quite attractive. While definitely a bit of hyperbole, we actually ended up hitting those targets when it was all said and done.

eternalban · on June 21, 2017

"Simplification 2: Streams Meet Tables":

https://www.confluent.io/blog/introducing-kafka-streams-stre...

hesdeadjim · on June 21, 2017

Yep, bingo. I would definitely put some time into learning Kafka if I were to rebuild this type of architecture again.

eropple · on June 21, 2017

That sounds rad. What was the failure case if Redis choked?

hesdeadjim · on June 21, 2017

It never happened, but we actually did see S3 errors sometimes where it would just fail to read an object. The higher level application logic was built to retry (with back-off) in situations like that. So if Redis choked, it'd be handled the same.

linkmotif · on June 21, 2017

> I’m not a database guy, I’m a node guy.

Is this really a distinction your "scrappy startup type" can afford to make?

> Although since version 3.4.x Mongo passes all Jepsen tests, when our original database was chosen the current version of Mongo at the time did not and was deemed unreliable by the team

Does not pass all Jepsen tests => deeemed unreliable :( Wow. Of all possible reasons, THAT'S the reason MongoDB wasn't chosen? Jepsen, man. Pretty sad so many people just look at Jepsen results like this. Jepsen, really, is about the tests more than the results. It highlights various failure modes that may be difficult or uncommon to replicate in production. These failure modes are important to know about to understand the product you are using, but are not really the end-all when it comes to deciding whether that product is worth using!

Anyway, show me a database that passes Jepsen and I'll show you a database that only accepts writes to a single node.

andreimatei1 · on June 21, 2017

> Anyway, show me a database that passes Jepsen and I'll show you a database that only accepts writes to a single node.

I know one - CockroachDB :). Relational, strongly-consistent, SQL, elastic, multi-active replication - the dream! (I work on it)

https://www.cockroachlabs.com/blog/cockroachdb-beta-passes-j... http://jepsen.io/analyses/cockroachdb-beta-20160829

Thaxll · on June 21, 2017

“For instance, on a cluster of five m3.large nodes, an even mixture of processes performing single-row inserts and selects over a few hundred rows pushed ~40 inserts and ~20 reads per second (steady-state).”

If someone can explains those abysmal low numbers ...

andreimatei1 · on June 21, 2017

Well, those numbers, if I remember correctly, refer to insert rate in the face of some extreme amounts of contention - tons of overlapping reads and writes (that's how the test was trying to trigger consistency violations). And moreover, they were measured while the Jepsen test framework was messing with the cluster. Contention is a problem for every database, and particularly so for CRDB. In the absence of contention, we routinely see thousands of queries per second per node.

Since the time of that analyses, we have done a significant amount of work for speeding up these high-contention scenarios, with quite dramatic differences in some cases we looked at. We pretty much changed our transaction execution model from a more "optimistic" one where transactions can abort each other and induce thrashing, to something resembling more the traditional row locks. So, hopefully, even for these atypical uses cases we should generally perform much better.

AlisdairO · on June 21, 2017

"These, of course, are pitiful numbers for a database. However, it’s important to note that these numbers are not representative of what real-world applications can expect to see from CockroachDB.

First, this is a test in which the Jepsen nemesis is continuously tweaking the network configuration to cause (and heal) partitions between different nodes. You don’t commonly see behavior like this in a real world deployment.

Second, tests like these deliberately try to reach high levels of contention, in which many transactions are trying to operate on the same keys. In real-world situations contention tends to be much lower as each user is operating mainly on their own data. The performance of all transactional systems suffer under high contention (although CockroachDB’s optimistic concurrency model fares worse than most)"

arjunnarayan · on June 21, 2017

Reposting my comment from when this came up on our 1.0 launch:

IIRC, in the case that aphyr refers to for these specific numbers, the reads are scans that span multiple shards[1], while the writes are writes to single shards.

[1] even though aphyr says it's just a hundred rows, the tables are split into multiple shards because aphyr in this case was specifically testing our correctness in multi-shard contention scenarios. In production you wouldn't have multi-shard reads crop up until you were doing scans for tables that were hundreds of thousands of rows in size[2]. It's easier to picture if you think of this performance speed difference in the scenario where you are doing full table scans spanning multiple shards on multiple computers while the underlying rows are being rapidly mutated by contending write transactions. The transactions get constantly aborted to avoid giving a non-serializable result, and performance is suffering. We agree that the numbers in this contention scenario are too low, and we are actively working on high-contention performance (and performance in general) leading up to our 1.0 release[3].

[2] Specifically, we break up into multiple shards when a single shard exceeds 64mb in size.

[3] You can follow along on one of the PRs that address this specific performance issue here: https://github.com/cockroachdb/cockroach/pull/13501

didip · on June 21, 2017

CockroachDB peeps,

you guys should really publish a new blog post with updated performance benchmarks.

Super low throughput numbers is deterring a lot of potential users, imho.

linkmotif · on June 21, 2017

Hehe ;) Not to mention—it's Postgres compatible!!

hmottestad · on June 21, 2017

RethinkDB also passes jepsen.

a-robinson · on June 21, 2017

For single-key operations only, but that is still much better than the average system he's looked at.

linkmotif · on June 21, 2017

It's also defunct so... :)

jjuel · on June 21, 2017

Open Sourced is definitely different than defunct...

manigandham · on June 21, 2017

Without a company sponsoring and providing support, it is effectively dead for any serious commercial use.

linkmotif · on June 21, 2017

I know I know I was just kidding :D

StreamBright · on June 21, 2017

I am a little hesitant using a database that is younger than 10 years of age (same goes to programming languages, frameworks, etc.).

tootie · on June 21, 2017

I think if you're swimming a sea of options you look for any port and Aphyr's blog gives you all sorts of ammo to eliminate choices even if they're only failing on microscopic edge cases that will never affect your small-scale project.

lomnakkus · on June 21, 2017

Do you really think network partitions are an edge case? Jepsen doesn't actually do anything particularly edge-casey -- it just introduces network partitions and verifies that the cluster doesn't do any crazy things.

(There are also relatively common cases it doesn't test, e.g. hard power-off.)

manigandham · on June 21, 2017

Jepsen is not microscopic edge-cases. It is the heart of the distributed system promises that database vendors make but routinely do not meet.

If you would rather go back to trusting marketing speak instead, good luck to you.

tootie · on June 22, 2017

I think the fact that most of the DBs that failed the Jepsen test are also running thousands of successful production installs is the counterargument.

manigandham · on June 23, 2017

That's not much of an argument and "successful" is pretty subjective. Many teams have found issues where vendors have claimed otherwise, very common with mongodb and elasticsearch.

There are also thousands of problems that happen with no public reporting and no vendor is going to willingly release details of their support cases, especially when they're being paid to fix their own problems.

There's nothing but positive outcomes from jepsen and holding databases to strict standards to see if they actually meet the goals they state.

neovintage · on June 21, 2017

The really important part of this post is the section articulating why not Dynamo. It uncovers a big question I ask of all of my customers when they're trying to choose data stores at any point during the life of their application.

Do you know your query patterns for your application?

If you know how your application / product need to pull data from the data store, you way more information that will allow you to pick the right data store to fit the problem at hand.

When you don't have any prior query patterns to go off of, picking a relational database to start your application is a good choice. You need the flexibility to allow your schema to change over time. The schema and query patterns will eventually reach a point where changes don't happen all that often and then you can look to other datastores if you need to scale. Plus, are you really going to have THAT much data when you start?

Disclosure: I work on Heroku Postgres. Could be a little biased....

lilbobbytables · on June 22, 2017

Similar to the article author, many just don't have experience with relational db's and have only heard that they aren't flexible and don't scale.

They don't realize just how easily performant Postgres can be - and often so easily by sticking basic indexes where they are needed. Similarly they may not realize the amount of data and load PG or MySQL can handle just fine.

paladin314159 · on June 21, 2017

> Assuming you know how your data needs to be queried when you create your table, then the hash/range key combination could be the answer to your query problems. However we did not.

DynamoDB's purpose is to allow you to easily scale a known query pattern. I definitely don't recommend it as a first solution when you don't know how the data is going to be accessed. Or when everything fits on one PostgreSQL box.

We've had a lot of success starting out with a big PostgreSQL box and migrating use cases to DynamoDB when a table gets too big and has a query pattern that fits (essentially K/V lookup with some simple indexes).

gamesbrainiac · on June 21, 2017

Good news is that postgres is going to get multi-master replication soon in postgres 10, so a lot of the painpoints when it comes to scaling our writes, are hopefully going to be solved.

bradhe · on June 21, 2017

> are hopefully going to be solved.

Right, because multi-master replication schemes have such a long lineage of being operationally simple. /s

paladin314159 · on June 21, 2017

Yeah, that's the real power of DynamoDB, despite all of its shortcomings. It's operationally simple (given you have the money).

crypto5 · on June 21, 2017

> Good news is that postgres is going to get multi-master replication soon in postgres 10

Where did you read this?..

anarazel · on June 21, 2017

I think the poster of that claim is confusing "plain" logical replication and multi-master replication. There's a lot of infrastructure in common between both, but the conflict resolution part of async mm didn't make it it into postgres 10...

b777 · on June 21, 2017

How does multi-master replication solve write scalability issues? I'm not familiar with the postgres implementation, but don't all writes have to happen on all servers? And if all writes need to happen on all servers, how does that improve write scalability as the load write load on each server is the same as it would be without multi-master replication.

anarazel · on June 21, 2017

> How does multi-master replication solve write scalability issues? I'm not familiar with the postgres implementation, but don't all writes have to happen on all servers?

It doesn't generally. You often can increase the total throughput because independent transactions can be batch committed when replicated, which can increase throughput noticeably, but that's more like a constant factor improvement than a real scalability improvement.

What it does however often allow you to do is to:

a) Address latency issues with writers distributed across the globe. If you have masters on each continent your total write rate can often be massively higher than if you have application servers from each of those continents write to a centralized database.

b) Partially replicate your data globally, partially not. Quite often you'll want to have some data available everywhere (typically metadata, security related data, etc), but a lot of the bulk data only available in certain places.

darkr · on June 21, 2017

> a) Address latency issues with writers distributed across the globe

How? Surely to provide ACID compliance all writers need to co-ordinate a lock if they want to update a record. Same with incrementing IDs etc.

anarazel · on June 21, 2017

> > a) Address latency issues with writers distributed across the globe

> How? Surely to provide ACID compliance all writers need to co-ordinate a lock if they want to update a record.

Well, there's different systems out there. What you want for something like this is asynchronous multi-master. Which, as you observe, allows for conflicts to occur, which need to be resolved somehow.

> Same with incrementing IDs etc.

The solution for those is usually to have each node assign sequences with different start values, have a system that distributes "chunks" of values (that was what BDR, an async MM version of postgres used to do), or avoiding sequences in favor of UUIDs or similar.

manigandham · on June 21, 2017

> postgres is going to get multi-master replication soon in postgres 10

No it's not. Postgres barely has decent replication as is. Logical replication is a good step beyond (while still missing major functionality) but nowhere near what it takes for multi-master replication.

anarazel · on June 21, 2017

> No it's not. Postgres barely has decent replication as is.

You're overly negative whenever postgres comes up on HN. Obviously the v10 stuff is not multi-master and the project definitely is not claiming it has, but saying that Postgres' replication "barely has decent replication as is" is a bit absurd given that it works quite well in a lot of high-throughput scenarios in a lot of companies, and has so for 5+ years. There's a lot of valid complaints about it you can make (especially around flexibility and operational issues), but it's definitely not just "barely decent".

I'm not saying that you have to be fan. It's fine to say the replication isn't good enough for everything, or just ignore postgres.

> but nowhere near what it takes for multi-master replication.

FWIW, there's async MM built on top of the relevant pieces, and it's used in production. It's not ready for wide audiences, but arguably (and I think one can very validly disagree about that) it's closer than "nowhere near".

manigandham · on June 22, 2017

What is the point of this comment? And then linking to my past comments? Is it wrong to have perspective and criticism or to express it multiple times?

I get that you're a committer but "overly negative" is as much your opinion as what I say, and my comments are the truth as I see it. I've even spent an hour talking to Ozgun about databases and moving to Citus before so my statements come from being a user, operator and maintainer of dozens of different database technologies in production for years, including postgres right now, and from all that experience it is definitely the most behind when it comes to replication/HA compared to anything else I use.

All of those threads have plenty of upvotes and other user comments so it's clear I don't stand alone here. Perhaps instead of policing my statements, time would be better spent on improving the features and experience of the product.

anarazel · on June 22, 2017

> What is the point of this comment?

To let you know that I find your comments overly negative, and that they might have a bigger impact if formulated in a bit more relative manner? I mean:

> Postgres is probably the weakest of all relational databases when it comes to scalability, both vertical and horizontal.

Doesn't strike me as a particularly useful comment, given the (to me) obvious hyperbole. If you'd instead, at least every now and then, explain what precisely you find lacking, people - like me - could actually draw useful conclusions.

I only said something because I remembered your name from a couple times reading quite negative and undifferentiated comments. And I linked to the comments because I searched for them to validate whether my memory of the past comments was/is correct.

> Perhaps instead of policing my statements, time would be better spent on improving the features and experience of the product.

I do that. Just about every day.

manigandham · on June 23, 2017

That postgres is behind in scaling (multi-core, replication, sharding, multi-master) against all modern relational databases is verifiable truth. Your judgement is clouded by your proximity to the project. Here's just 1 prior thread between us: https://news.ycombinator.com/item?id=13841496

And if you want more feedback:

-> Add (and enable) multi-core queries.

-> Build-in pgbouncer (or better connection management).

-> Enable 1-line command to setup basic replication (whole node or per db level) without having to edit anything else. Look at redis or memsql for inspiration.

-> Enable 1-line command to easily failover, remove hotstandby requirements.

-> Add DDL support to logical replication because I guarantee this will be the #1 source of problems for anyone that tries to use this.

-> Fix the poor defaults in all the configs.

-> Document the existence of pg_conftool and make it a first-class tool that can also edit pg_hba.conf automatically so I no longer have to deal with any config files. Store any other settings in the database cluster itself. Inspiration from sql server on linux or cockroachdb here.

IT/admin UX at this point is more important than the diminishing core upgrades. Postgres is good enough internally but has sorely fallen behind on usability with obsolete defaults, convoluted configs and way too many 3rd party tools needed for basic scaling. Saying use X tool is effectively the same as just use Y database that already does it. MySQL, MariaDB, even SQL Server are not standing still.

Solid in-cluster and cross-dc replication is the reason we stuck with memsql (+ built-in columnstore), not because it was possible but because it actually works easily and reliably with their fantastic memsql ops program. I've shared all of this feedback online over the years (along with many others) so the lack of understanding is really more a reflection of the postgres team and poor priority planning.

anarazel · on June 25, 2017

Just saw this now, I do wish HN had a proper notification feature...

> That postgres is behind in scaling (multi-core, replication, sharding, multi-master) against all modern relational databases is verifiable truth.

No, it's really not. It's hardly best of class in them, but the worst? Meh. That's just my point, you're making way to strong statements.

Just to start at your first point: multi-core. Mysql has just about no intra query parallelism, whereas postgres 9.6 has some limited support, and postgres 10 is decent, although far from great. Postgres' inter query parallelism was for a long while better than mysql, then mysql was better for a couple years, in the last couple years it's gone back and forth and depends on the specific use-case. So how is postgres verifyably worse than all other "modern" (whatever that means) relational databases?

> Your judgement is clouded by your proximity to the project.

That's quite possible.

> Here's just 1 prior thread between us: https://news.ycombinator.com/item?id=13841496

I do remember that, but there you're again arguing super reductionist. You're arguing that having to do

  CREATE PUBLICATION all_tables FOR ALL TABLES;

before subscribing to replication somehow makes logical replication fundamentally too complicated ("overbuilt mess").

Yes, the logical replication stuff has some serious limitations, and that's true. A significant subset already has patches out there for the next version, but that doesn't help current users.

The streaming replication support existing since 9.0 is pretty decent however, and has gotten a lot easier to use in the last releases, and especially with the saner defaults in v10.

> Add (and enable) multi-core queries.

There's some massive progress in v10.

> Build-in pgbouncer (or better connection management).

Indeed.

> Enable 1-line command to setup basic replication (whole node or per db level) without having to edit anything else. Look at redis or memsql for inspiration.

It's two commands for both physical and logical replication now. Not perfect, but not too bad either.

> Add DDL support to logical replication because I guarantee this will be the #1 source of problems for anyone that tries to use this.

Yup, I've argued just that point, and I've implemented it in a precursor of what's now in core for logical replication. I sure hope that's going to largely get into v11.

> Enable 1-line command to easily failover, remove hotstandby requirements.

Not sure what you mean with "hotstandby requirements"? The defaults? Those have been fixed in v10. Not being able to do certain things on a standby?

> Fix the poor defaults in all the configs.

We're making some progress, e.g. that everything's now by default ready to streaming replication, but there's a lot more to do. It's sometimes hard because the many different usecases ask for different defaults.

> Solid in-cluster and cross-dc replication is the reason we stuck with memsql (+ built-in columnstore)

What specifically are you seeing lacking w/ cross-dc replication? I completely agree that there's UX/complexity issues around replication, but I don't see how those affect cross-dc specifically? There's plenty of people quite happily using cross-dc replication, once it's set up.

> I've shared all of this feedback online over the years (along with many others) so the lack of understanding is really more a reflection of the postgres team and poor priority planning.

Or perhaps also a question of funding, time and communication style.

anarazel · on June 21, 2017

https://news.ycombinator.com/item?id=14578213

https://news.ycombinator.com/item?id=14373424

https://news.ycombinator.com/item?id=13653708

https://news.ycombinator.com/item?id=13423951

...

m00x · on June 21, 2017

Wouldn't multi-master replication be bad for write throughput? Depending on the scheme, it will have to be committed to all machines for every write, therefore increasing write operation complexity.

If you want to scale up your writes, sharding would be a better solution.

mark242 · on June 21, 2017

Dismissing CouchDB because it only supports pre-sharding is extremely shortsighted. It takes very little time to replicate from one db with 2 shards to another db with 3 shards, even on N nodes. Copying the *.couch files is the absolute wrong way to manage this especially in a production environment when those files will be constantly changing.

It is far, far easier to add nodes to a Cloudant/Couch2 cluster, set up realtime replication from db1 -> db2, then once the replication has caught up, at your leisure you switch your database connection in your app.

idbehold · on June 21, 2017

I'd also like to remind anyone using DynamoDB: it cannot store empty strings. Let that sink in for a second. If you think you're going to be able to store JSON documents you'll be in for a nasty surprise when you try to insert a valid JSON document and either the request fails or it converts any empty strings in your JSON document into nulls.

koolba · on June 21, 2017

An empty string (zero bytes) isn't valid JSON.

'""' vs '' (single quotes to illustrate the wrapper for the total content).

idbehold · on June 21, 2017

   {"foo":""}

That's valid JSON that cannot be stored in DynamoDB because the value of "foo" is an empty string. See previous discussion: https://news.ycombinator.com/item?id=13170746

koolba · on June 21, 2017

Ah I didn't notice the distinction between storing the serialzied form (which would be non empty) and storing the record itself.

Yes that does suck quite a bit and seems like a pretty bad decision.

luord · on June 21, 2017

My own experience working with Postgres, discovering nifty features or trying out new ones, and articles like this make it clear for me that I'm unlikely to ever need anything else. By the time any of the applications I work on could grow to the size in which Postgres wouldn't be a good option, I'd hopefully be in the process of selling it anyway.

I'm quite ok with that. It's such an awesome piece of software.

aluskuiuc · on June 21, 2017

I'm surprised the author didn't consider using DynamoDB Streams to replicate to a PGSQL database for the relational queries while keeping the primary key-value lookups on DDB.

manigandham · on June 21, 2017

What does that gain other than more expense and overhead? SQL can do key/value just fine.

dchuk · on June 21, 2017

It doesn't really seem like this guy was qualified to make such a pivotal decision like choosing the database your entire system will run on given his self-submitted breadth of experience in the opening paragraphs...

takeda · on June 21, 2017

You could say I'm a Postgres fanboy, but even to me he did not sound convincing.

The guy did not seem to know what he was talking about. Almost felt like he picked Postgres by throwing darts at possible options. Yes, he picked the right choice :), but I feel like he will misuse it[1] and few months later write article how bad it was.

[1] Heh, I've seen Postgres tables that had 4 columns:

- id

- date added

- date updated

- thrift encoded blob containing the data (that was before JSONB, but even then the data was highly relational and that was silly)

pedalpete · on June 21, 2017

I've just started using DynamoDB on a project. Before I made the choice, I discussed some architecture issues with a friend who works at a big fast growing unicorn start-up. They use MongoDB for the majority of the app the user sees, they use ElasticSearch for search, postgres for reporting. Data gets exported or updated in the data-sources that best suit the need of that data, rather than trying to find a single data solution that fits every use of the data.

jungturk · on June 21, 2017

This (polyglot persistence in Fowler-speak) is typical in situations when you have access patterns that aren't well-solved by a single data store.

It will likely introduce some serious consistency issues that need to be attended to, but that pain might be better than the alternatives.

orf · on June 21, 2017

I submitted this article because of a single paragraph:

> When it came to the different types of NoSQL databases, it was pretty clear that a document store best suited our needs. We needed to be able to query based on the data so key/value was out, there was no need to have the ability to make all of the complex connections between data of a graph db, and nobody actually knows what a column db is (jk but our data is not the right use case for it). So document it was. We would have to change our data layout a bit to really work well with most document databases, our current data structure was similar to what you would expect for a relational db with tables etc since dynamo also uses a table structure. I wanted something that was ACID, none of that eventual consistency nonsense. That took a lot of them out of the race right there.

I don't get it. He needed arbitrary querying, relations, ACID, tables. And yet... his first thought was to go to a document-based no-sql database and is only later hit with the sudden realization that his relational data is bet suited in a relational database.

What has gone so wrong here that an otherwise smart and educated engineer makes these kinds of fundamental mistakes? Is it the success of no-sql marketing departments, and the apparent un-sexiness of traditional relational databases? Is it education? The job market? A fear of SQL (that you never really have to write anyway...)? Is it the quick-fix non-relational databases give you?

As far as I can see there are very very very few nails that a PostgreSQL hammer cannot hit. I guess that's part of the point of the article, but why does it need an article to explain this?

How can we stop these kinds of articles from needing to be written?

> The change drastically increased the speed of our API, some requests are up to 6x faster than they were with Dynamo. The account details request, our previously slowest call, took around 12 seconds with Dynamo and was reduced to 2 seconds with Postgres!

sigh

Edit: Not sure why this has been moved to the bottom of the thread. I wasn't trying to single this particular developer out, I was merely commenting on this troubling trend. Apologies if this came across in the wrong way.

toomuchtodo · on June 21, 2017

"I’m not a database guy, I’m a node guy."

This school of thought that a SWE can know the entire stack, infra to app, is coming home to roost. You are the technical debt.

Learn your fundementals, and if a relational database isn't part of your foundational knowledge, learn it now or get out.

orf · on June 21, 2017

I get what you mean, but to me that's the same as saying "I'm not a HTTP guy, I'm a web guy".

I mean, sure, he may be a "node guy", but he uses node to make web applications. A large part of that is persistence, i.e databases. For a lot of applications it's one of the most important part of your stack, if not the most important part. It's your treasure trove, it's your crown jewels. Your app is just a thing that shuffles state around in the database.

He knew a fair bit about NoSQL databases, but why did he not even consider a relational one? I'm not trying to single this specific guy out, merely asking what people think is causing the trend of engineers over-using non-rel databases for relational data.

Edit: didn't see your edit. you're spot on.

eropple · on June 21, 2017

I spend a lot of time, as a consultant, talking to people who've already made technical decisions that are biting them in the ass and it seems to mostly boil down to good press and an impressive of the five-minute demo. The mental effort required to learn new stuff is scary and always looks worse than it really is; easy demos hook novice and many-years-a-novice developers to the point where, well, of course it seems obvious not to use RDBMSes, they're hard and you might have to spend time on learning them and you might even look silly or dumb (gasp!) when you're doing it.

From the outside, Postgres looks hard. It isn't hard, but it looks it, because it was not designed around a five-minute demo. I use it for everything until I have a very compelling reason not to. But I have an established library of scripts and systems to make it easy, because I am both an infrastructure guy and a software developer so the idea of "I'm going to write a Vagrantfile in two minutes flat and drop in this provisioner" is something that's top-of-mind for me. (Or I just steal the Vagrantfile from another project, whatever.) But I've made that investment and it's paying off--if one hasn't made that investment, it's scary And then doing it in production? Whether it's a hosted service like RDS or doing it yourself, you have to answer a lot of capital-Q Questions along the way and that's effort that seems unrelated to actually getting your software done.

It's not just "why didn't this person think about a RDBMS?". It's that all of these easy-answer-but-not-really systems (fill in your own) are capturing the minds of developers. The quest for abstractions and ways to make ourselves better able to do our jobs is a good one. The uncritical adoption of tools that might not actually do that is not, but that's a cultural thing not limited to datastores.

StreamBright · on June 21, 2017

Same here. We have several clients who got moved to Postgres from a randomly selected unfit technology. Funny thing is with things like Solr you can build a really fast search index as well on the top of doing ACID transactions in the db. This covers almost every aspect of data engineering. There is obviously the scaling limit with Postgres but most of our clients are in the single db instance range by the size of their data.

eropple · on June 21, 2017

Yup yup. Our consistent advice to clients:

- By the time you reach the limits of a Postgres setup, you will either be making so much money that porting it is a couch-cushions deal or your company is a smoking crater in a hillside.

- SPAs are fine. SPAs plus server rendering is often a pain in the rear. Walk, then run.

- If you can't express concretely why you want a container fleet, you don't want a container fleet. Your cloud provider's autoscaling will suffice. (This advice is different for non-cloud clients, but the overhead of leaving space in a container fleet for new deploys or for scaling is deadweight and turning around a container in thirty seconds is probably not as important as the blogs you read tell you it is.)

collyw · on June 21, 2017

> If you can't express concretely why you want a container fleet, you don't want a container fleet.

I think this applies to most relatively new (hype cycle) technologies.

eropple · on June 21, 2017

It does, but relatively few of those technologies have the desperate money being shoveled into their marketing in the way that containers in particular (and non-RDBMS datastores, to a lesser extent) have. So those tend to get called out explicitly because no, unless you are profoundly out of the ordinary, you don't need a hot new Kubernetes cluster on day one.

StreamBright · on June 21, 2017

Exactly. I have seen this container hype exploding in recent years. A quick example: I worked for a stratup who spends ~60% of their AWS budget of operating Kubernetes and roughly ~40% on their own services. I asked their chief IT guy why is that and he could not justify it with anything other than Kubernetes is cool.

dccoolgai · on June 21, 2017

This radiates truth. I would like to read more of the things you write.

eropple · on June 22, 2017

I certainly drop enough stuff around here. But maybe I should get back into blogging. ;)

Although my business partner and I have talked about a "fundamentals of cloud system design" course on Udemy or something...

StreamBright · on June 21, 2017

NodeJS and MongoDB is frequently used together and I think some fraction of the NodeJS developers thinks that NoSQL is just the better, newer, faster solution to data storage. I regurarly run into customers who hire us to make their data model work and some non-trivial cases end up being moving them to PostgreSQL. Data engineering actually is a more complex topic than some people think and it really bites you when you are actually using the solution for a while.

star-trek-fleet · on June 21, 2017

He likely has no formal training in database and therefore are not exposed to things alike.

Side topic: AWS did agressive move to throw it's resources to University, and pays off hansomely. Those new blood coming to our profession now has the concept of infrastructure ~= AWS (mileage varies between people, but that's the idea).

venture_lol · on June 21, 2017

It's way too easy for anyone to pick up Node and CRUD a few pieces of data into a NOSQL store and voila, I am an architect :)

Selecting the right framework, platform, model for current needs with reasonable forward looking support takes experience and strong formal fundamentals.

collyw · on June 21, 2017

Unfortunately job ads imply than many places prefer the former.

jopsen · on June 21, 2017

Often people don't want to do database admin stuff, dynamodb has none of that.

On the bright side, relational is coming back with RDS, cloud-sql, spanner and cockroachdb. With any luck NoSQL will be forgotten :)

misterbowfinger · on June 21, 2017

[flagged]

dccoolgai · on June 21, 2017

Easy there, Hoss. This guy is just giving an honest account of the decisions he made and how it turned out. Doesn't say anything offensive or exclusive at all. Certain attitudes are harmful to the industry and people wanting to learn and share their experiences...

collyw · on June 21, 2017

"I’ve been writing node for 4 years and like any other true node fanboy, if I need to use a database, I go straight to NoSQL."

Yes, thats the way an engineer should be making decisions about arguably the most important part of the application.

dccoolgai · on June 21, 2017

So everyone that makes a bad decision and openly takes their time to write about it so others can learn from their mistakes is a "brogrammer"? Great engineers become great engineers by making mistakes and learning from them.

collyw · on June 21, 2017

My problem is with the reasoning behind the decision (or rather the complete lack of it).

(And it wasn't me that used "brogramer").

virmundi · on June 21, 2017

Keep in mind that the Postgres hammer is a non-standard hammer. Sure it is pretty SQL compliant (except there are no dirty reads thanks to MVCC), but if you use JSON or the arrays or the many other wonderful features of Postgres, you are now tied to Postgres. You are also slowly inching your way into Document Stores.

orf · on June 21, 2017

The same can be said about any database though. No one is like another.

SQL as a standard is roughly followed, and PostgreSQL is very compliant. That's more than can be said about any non-relational database - each has it's own query language.

I use JSON fields and arrays quite a bit, but that doesn't mean I need a document store. It means I have small portions of data that is non-relational, or is better suited to use arrays/json than strict tables, but I still want everything else consistent, enforced, and also have all the other bells and whistles Postgres gives you (FTS, partial indexes, rich types everywhere, ACID, the list goes on and on).

Being tied to Postgres is often better than being tied to a document store. At least you can dip your toes into non-rel as little or much as you like (where it makes sense), rather than diving in at the deep end and realizing you can't swim.

virmundi · on June 21, 2017

What about ArangoDB? You can perform joins, even across the cluster. It's a document store. You don't have transactions across the cluster (yet, I'm pretty sure I'm right). You get sharding and distribution automatically. Again, at this point you're out of the SQL world, why not go fully out?

eropple · on June 21, 2017

I can't think of a datastore whose purely SQL-compliant feature set is worth using; to that end, using any RDBMS with a reasonable feature set is going to be a product lock-in of some kind. And I've been around and babysat a lot of people's Mongo stuff; I'll die on the hill that being tied to Postgres is a hell of a lot better for the 10th, the 50th, and the 98th percentile of developers (and companies) than a MongoDB or the like.

virmundi · on June 21, 2017

When I design systems I try to stay SQL compliant. This has benefited me in the past: From SQL Server to Oracle. Internally one can use optimizations like indexes unique to a DB, but in the end the client view is entirely only SQL.

tootie · on June 21, 2017

The biggest pro to Dynamo of Postgres in an AWS world is just price. It's very attractive to try to shove a square peg in a round hole to save money. Of course, you end up costing yourself far more in the long run if it's such a bad fit.

tdurden · on June 22, 2017

The author will hopefully look back on this post and cringe a little. All the requirements point to a relational database, but that is not realized quickly because "if I need to use a database, I go straight to NoSQL"

williamstein · on June 21, 2017

This closely mirrors my experience a few months ago: https://news.ycombinator.com/item?id=12649712