Flexible schemas are the mindkiller (mataroa.blog)
79 points by l0b0 8 months ago | 64 comments



I once worked at a place that had many different specialised applications serving different needs and holding different data.

The incoming CEO, who wanted to make his mark (let's call him Derek) and had an impressive background in marketing, decided that the right thing to do was to rationalise everything into one huge database.

They employed a small army of consultants who covered the entire office (it was a big office) in huge database schema diagrams. This took several months. Eventually they "discovered" that the only data that the schemas had in common was a reference, a description, and a created date field.

So they immediately cancelled the project and we all breathed a sigh of relief. Haha, only kidding. Of course they didn't. The right solution was to go schemaless. Enter MongoDB. They would achieve full flexibility going forward, and wouldn't have to subsidise a small diagrammatic wallpapering industry as a bonus.

At this point I was no longer involved and just watched the slow train wreck from afar. Years later all the applications had finally been migrated to use the new OneDatabaseToRuleThemAll™ system. I don't think any new functionality was delivered over that period. There were persistent performance issues for the larger data sets. Changing anything required the code to support all previous possible schemas because they never migrated any data (it's schemaless, it's all super flexible!).

I think Derek left to go ruin things somewhere else, with a huge success story on his CV.


Firstly, AAAAAAAAAAAAAAAAAAAAAA.

Secondly, yes, Derek left for another job immediately afterwards. And he came to us quite highly recommended.


I think Derek is now busy slapping some random AI onto his current database woes, to further improve his CV.


Let's call him another name...


One useful piece of rhetoric is using "schema-on-read" instead of "schemaless". It helps make it clearer that the schemas aren't eliminated by using a "schemaless" data store, you're just pushing the responsibility of data validation into all of the data-consuming applications. Shift right!
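A minimal sketch of the difference, in Python (the document shape and the helper name read_user are made up for illustration): with schema-on-write the database rejects bad data once, at insert time; with schema-on-read, every consumer re-validates every shape ever written, forever.

  # Schema-on-write: the database enforces the shape once, at insert time, e.g.
  #   CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER NOT NULL);
  # Schema-on-read: every consuming application carries code like this instead.
  def read_user(doc: dict) -> tuple[str, int]:
      name = doc.get("name") or doc.get("Name")   # some old documents used "Name"
      age = doc.get("age", doc.get("years_old"))  # field renamed at some point, data never migrated
      if not isinstance(name, str) or not isinstance(age, int):
          raise ValueError(f"unreadable user document: {doc!r}")
      return name, age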

I think I encountered "schema-on-read" vs. "schema-on-write" in the book Designing Data-Intensive Applications.


Author here. Many of my readers have suggested I read Kleppmann, but I haven't looked at it yet since there are so many good books to get through. I didn't realize this was covered in there, but it has shifted the book to near the top of my list, after some TDD and XP stuff.


Thanks for the tip. Persuasion isn't my forte and I don't think I've ever seen a pro-schema argument win against a let's-make-table-soup team lead.

It is so frustrating when people inevitably converge on an obvious schema that they just don't want to let the database know about - or develop the Things table that re-implements database schemas poorly. Or the old table that is clearly several other tables mashed together.

This data is going to be the longest-lived and probably most valuable artefact of the company. Putting a schema on it is nearly free, and it improves correctness and performance. Yet still it isn't possible to get consensus that telling the database the data's schema might help.

People don't realise how inefficient JSON blobs are either. I swear there must be AWS databases losing a significant fraction of their IO budget to re-parsing what should be column names with every data point read. I can only hope that Postgres devs have rolled their eyes and optimised the case where people don't know how to express tables in SQL.
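A back-of-the-envelope illustration of the overhead (field names invented for the example; Postgres's jsonb is a parsed binary format that avoids re-parsing text, though the keys are still stored in every row):

  import json, struct

  row = {"device_id": 42, "temperature_c": 21.5, "humidity_pct": 40.2}

  # JSON stores the key strings again in every single row, and plain-text
  # JSON gets re-parsed on every read.
  as_json = json.dumps(row).encode()

  # With a declared schema, the column names live once in the catalog;
  # each row is just the values (here: one int32 and two float64s).
  as_columns = struct.pack("<idd", 42, 21.5, 40.2)

  print(len(as_json), len(as_columns))  # roughly 60 bytes vs 20 bytes per row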


> there is a certain type of person that probably has all the mental horsepower required to be a phenomenal engineer that simply gets stuck on their "elegance"

and

> I do not understand how they are both smarter than me in many respects, and then still don't understand how stupid this all is.

This is such a good question. The absolute worst codebases I've worked on have been created by brilliant individuals. I remember spending months tracking down bugs leading to inconsistencies in the universal "Things" table. "Everything is a thing! Right?!" Wrong...


Just thought of another one! This worked surprisingly well and was really annoying at the same time.

The huge application consisted of a single PHP file. If you guessed that it was tens of thousands of lines long, you'd be wrong. It was only a few hundred lines, if not less!

The trick was storing code in the freaking database. The mother of all index.php's would simply query for the correct piece of code based on query parameters and eval it! They had an editor where you could make live code changes to the page, and even some rudimentary version control. Amazing times!
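For the curious, a minimal Python rendition of the pattern being described (table and route names invented for illustration; please do not do this):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE pages (route TEXT PRIMARY KEY, source TEXT)")
  db.execute("""INSERT INTO pages VALUES ('home', 'print("hello from the database")')""")

  # The mother of all index.php's, in miniature: look up code by request
  # parameter and run it. Live-editable, "version controlled" in a table,
  # and a remote-code-execution hole by design.
  def handle(route: str) -> None:
      (source,) = db.execute("SELECT source FROM pages WHERE route = ?", (route,)).fetchone()
      exec(source)  # the PHP original used eval()

  handle("home")  # prints: hello from the database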


I'm experiencing moderate discomfort just reading this


I think a lot of devs go through a "I'll just store code in the DB and evaluate at runtime!" phase. Some of them never get through it though.


Before the web, the typical "client/server" application was a Windows program with a GUI that invokes queries and stored procedures directly on the database and shows the results.

If someone is tasked to port a similar application to the web, writing a PHP application that does the same is not a bad idea.


To be fair, once upon a time, it was actually thought of as a good idea to use databases for implementing version control.


The pros use JDSL and check out their code live from Subversion.


> The absolute worst codebases I've worked on have been created by brilliant individuals.

+1 to this. Worked on a codebase where the (lone, unsupervised, I'm assuming) developer had created their own ORM and object model. Impressive! Except neither had been updated in the (5+, IIRC) years since the original dev left because they were hideously complex and fragile and thus increasingly a drag on getting anything done (which is why management were Very Keen on a total rewrite in A.N.Other language.)


That could be a symptom of a coder who has been successful in projects that hold everything in RAM, and doesn't understand, or underestimates, databases.

"Everything is a thing" works well in RAM; we have nicely working programming languages based on this. What could go wrong if everything is just a thing in the durable storage?


> there is a schema, it lives in their incomprehensible code

I’ve started to think of this as a diffuse schema. I.e. the schema never really goes away in “schemaless” databases. It just spreads throughout the application in helper functions, mappings, and backward compatibility hacks.

Perhaps the distinction is between diffuse and consolidated or implicit vs. explicit schema. Are there any similar models or articles that have further described this realization?


I am surprised no one has mentioned the (to me) obvious explanation for Derek's behaviour: wilful obfuscation, at least partly aware of the consequences, with a calculated maximisation of the consulting fee.

If the incentive structure is set up in such a way as to provide an easy local maximum for the individual that is not a good outcome for the company, then it is a management failure.

Having a situation where all individuals up the management chain are doing short-term local maximisation that medium-term leads to a bad outcome for everyone is a societal failure.


Derek wasn't paid spectacularly well (though above the national average), and also slapped a bunch of copyrighted code into the codebase, then uploaded medical data to GitHub. All of which could have instantly ended his career if the other engineer had the time to review his work sooner.

I think there are some (many?) people that willfully obfuscate - I have been this person when dealing with narcissists or saving management from themselves - but I just don't see a version of this where he was a very savvy operator.

To quote Patrick McKenzie, the correct calibration is that some amount of evil exists in the world, but the whole world isn't evil. I.e., this is Incompetent Derek, and at some other company is Evil Derek.


Tangentially related: it now takes our data eng team 6 months to mirror some tables (into databricks) due to data vault modelling... presumably to handle schema changes. And then at the end of it everything is riddled with duplicates and missing data because they don't know what they're doing. But none of the source-systems can do schema evolution anyway, so we know the schemas would never change.

I think as an industry we should stop warning juniors of 'premature optimisation' (kids aren't even choosing the right data-structures/algos/architectures and are getting terrible perf), and instead warn them away from premature scalability and premature 'flexibility'.


Obviously, if you actually have a defined schema, then using anything like EAV or JSON support in your relational database is a bad, if not outright terrible, idea. Your queries get a lot more complex and you lose most of the type safety a rigid schema provides.

But there are cases where you need flexibility, and the very categorical dismissal of EAV and anything similar is not particularly helpful if you find yourself in the situation where you need that kind of feature. It's a lot better today with good JSON support in relational databases, but even that doesn't give you good index support unless you give up on some of the flexibility. EAV is actually superior in that aspect if you don't put all your values into a string column.
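A sketch of that last point (hypothetical table and column names, SQLite for brevity): giving EAV one nullable value column per type keeps values comparable and indexable in their native type, instead of coercing everything through strings.

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("""
    CREATE TABLE attributes (
      entity_id  INTEGER NOT NULL,
      name       TEXT    NOT NULL,
      int_value  INTEGER,          -- exactly one of these three is set per row
      real_value REAL,
      text_value TEXT,
      PRIMARY KEY (entity_id, name)
    )
  """)
  db.execute("CREATE INDEX attr_by_int ON attributes(name, int_value)")
  db.execute("INSERT INTO attributes VALUES (1, 'age', 29, NULL, NULL)")

  # Range queries can now use the index rather than casting strings:
  adults = db.execute(
      "SELECT entity_id FROM attributes WHERE name = 'age' AND int_value >= 18"
  ).fetchall()
  print(adults)  # [(1,)]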


Author here. This is fair. Karwin's book also acknowledges there is a time and place for EAV, though I judged that it would detract from a post that was already quite lengthy.

There's a saying I heard a while ago which I really like, which is "reverse all advice". Almost all advice, such as "be more careful about what you're eating" could be good for one person and bad for another. In our industry, I see EAV applied stupidly far more often than I see it used appropriately. In fact, I've never seen it used appropriately, even though such cases exist, so I err on the side of "your prior should be heavily anti-EAV - please take note of all the skulls down this path and think carefully before proceeding yourself".


Lone wolf developer in a small organisation with minimal supervision. Seen it before. I maybe even was one once, way back.


I'm that one. At least I put my code in repositories though.


Shoving triples into SQL is braindead, but there's a multi-billion-dollar industry around triple stores, graph databases, and RDF.

Datomic is in the same group, and I'd consider Rich Hickey to be one of the best programmers there are.


Author here! I love basically everything Rich Hickey does (which is probably not a sign that he is always right, but that he's so much smarter than me that he can convince me of anything). Simple Made Easy was so influential on my early career (which I am still in) that I probably am far too suspicious of all "easy" pathways.

However, Derek was no Rich Hickey. Such deviations from the norm should be left to the type of person that can create Clojure, and very far away from the people that upload PII to GitHub. If we could create a culture where we slapped people's hands on instinct when they reached for MongoDB, I think we'd be better off.


I'm not arguing against that, and I hoped to make that clear with the first sentence of my comment. The post felt like a rant against triple normalization in general, and not that particular bastardization of it, so I tried to give some counter examples.

The PII thing is on an entirely different page, and indeed sweat inducing :D


Oh for sure, I did read your first paragraph that way. I think what I was trying to communicate (badly) is the message that I want to ingrain in engineering culture writ large - don't let anyone do this unless they're Rich Hickey.

That's calibrated too far the other way but there'd be less damage.


Totally! I think "don't work against a tool, use it like it was designed, or use a different tool" is good advice in general.


There is a difference between abusing a relational database versus using something that is key-value from the ground up.

Notice how in the example you have:

  1  Name  Ludic
  1  Age  29
  1  Profession  Tortured Soul
The key is not unique. There is no primary key. So to hunt down all the properties of User 1, you have to do a query for all the records whose Key is 1.

I think that doesn't happen in keyword-value stores. Your key 1 has to be unique: it retrieves one blob, and that's it. You have to stuff all the properties into that blob somehow: for instance, by treating it as an array of the keys of other blobs.

Derek could have used multiple tables. That still gives you all the flexibility. Just that if someone wants to invent a new property, they have to add a table.

Then we have

  Name table:       Age table:     Profession table:
  1 Ludic           1 29           1 Tortured Soul
The keys are unique: we fetch key 1 from each table, and we have the three properties. Not great compared to fetching one row with three fields, but better than stuffing everything as rows into one table.
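A runnable version of that layout (SQLite, with the names taken from the example above): each property table has a proper primary key, so reassembling the entity is keyed lookups or one join, rather than a scan for every row WHERE key = 1.

  import sqlite3

  db = sqlite3.connect(":memory:")
  for table in ("name", "age", "profession"):
      # one table per property; the key is unique within each table
      db.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, value)")
  db.execute("INSERT INTO name VALUES (1, 'Ludic')")
  db.execute("INSERT INTO age VALUES (1, 29)")
  db.execute("INSERT INTO profession VALUES (1, 'Tortured Soul')")

  # Each lookup hits a primary key; the join reassembles the "row".
  row = db.execute("""
      SELECT name.value, age.value, profession.value
      FROM name JOIN age USING (id) JOIN profession USING (id)
      WHERE id = 1
  """).fetchone()
  print(row)  # ('Ludic', 29, 'Tortured Soul')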


> I think that doesn't happen in keyword-value stores.

It absolutely does happen in triple stores, though, where data is commonly stored in subject-predicate-object form, and the subject's identifier is certainly meant to be unique to that subject, but not per triple.


> Notice how in the example you have:

> 1 Name Ludic

> 1 Age 29

> 1 Profession Tortured Soul

> The key is not unique. There is no primary key. So to hunt down all the properties of User 1, you have to do a query for all the records whose Key is 1.

In this example, the primary key would be ([User ID],[Key]). The primary key does not need to be a single field.


You have just invented 6th normal form. This is not a bad thing.


It's not a bad thing on SSD storage maybe. On spinning platter hard drives, we may have to suffer a head seek time to get each field of the object from a separate record in the separate table. Whereas the fields of a record can be stored close together.


“Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation)… Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed…”

Roll the related data up into materialized views for read performance.
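A small sketch of that idea (table names illustrative; SQLite has no MATERIALIZED VIEW, so this emulates one with CREATE TABLE AS, where Postgres would use CREATE MATERIALIZED VIEW plus REFRESH):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE name (id INTEGER PRIMARY KEY, value TEXT)")
  db.execute("CREATE TABLE age (id INTEGER PRIMARY KEY, value INTEGER)")
  db.execute("INSERT INTO name VALUES (1, 'Ludic')")
  db.execute("INSERT INTO age VALUES (1, 29)")

  # Precompute the joined, denormalized rows so reads are a single lookup,
  # while writes still go to the normalized tables.
  db.execute("""
      CREATE TABLE person_read_model AS
      SELECT id, name.value AS name, age.value AS age
      FROM name JOIN age USING (id)
  """)
  print(db.execute("SELECT * FROM person_read_model WHERE id = 1").fetchone())  # (1, 'Ludic', 29)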


Yes, BUT. For analytical workloads you often want the locality to be within columns, and not within rows, we call these things column stores ;).

Database design has many tradeoffs, and there is no free lunch.
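In toy form (illustrative only): a row store keeps each record's values adjacent, which is good for point lookups; a column store keeps each column's values adjacent, which is good for scans and aggregates that touch one column across millions of rows.

  # Row-major: fetching one whole record is one contiguous read.
  rows = [("Ludic", 29), ("Derek", 45)]

  # Column-major: computing AVG(age) scans one tight array and
  # never touches the names at all.
  names = ["Ludic", "Derek"]
  ages = [29, 45]
  print(sum(ages) / len(ages))  # 37.0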


Datomic is a tragically unoptimizable black box, one that spills candied joys when you start using it, gradually transitioning to the unspeakable horrors of Pynchon's "The Disgusting English Candy Drill".

IMO the real tragedy is that Datomic predated the open-source-but-you-can't-host-it-or-otherwise-operate-it-for-profit licenses that have proliferated recently.

If I ever get the time, I'll write a compiler. And if I have time left over on sabbatical, I'll reimplement Datomic as a Postgres extension (goddamnit).


> I do not understand how they are both smarter than me in many respects, and then still don't understand how stupid this all is.

Met this kind of person three times so far, been asking myself the same question. I suspect they live too much in their own head.


Derek is not smarter; he's an idiot.

Only a complete idiot would use copyrighted code from their previous job, and work for months without putting anything into version control.

And, above all, not have anything actually working, like forms not actually saving data, after months of work.

Nope, bozo.


I too have had the pleasure of working on a project that had tens of microservices without defined interfaces, sharing a database (and occasionally reaching into other services’ tables directly) with hundreds of thousands of lines of code and an Angular SPA. Oh and about a dozen poorly-written tests in total.

It was obviously a complete trainwreck, but it worked enough to convince multiple executives at different companies to actually pay for this crap. Sure it went down every week and was leaking data left and right, but it looked legit.

I clearly don’t appreciate the architecture or the reliability of it, but I do admire the tenacity of the people able to build such a monstrosity day in and day out, somehow wrangling it into shape against all odds. That’s the sort of mental willpower I wish I had.


Arrogance is a hell of a drug. The Dereks I met were all, in their own way, insufferably arrogant. Probably to pave over insecurities, but I don't really care as to the causative mechanism. The effect in each case was that they are so full of themselves, that they displayed an inexhaustible energy in maintaining and defending their Rube Goldberg shitfests.

Only, of course, to scram once the going got rough.


Yeah, arrogance is indeed a unifying trait of these kinds of people.


And yet by all accounts management thinks he is great and chances are Derek is out there somewhere in the world right now as a well-paid journeyman. That’s where this industry is bonkers: it can’t tell good from bad, productive from unproductive, and it creates trauma and distrust for everyone.


"Velocity is down. Maybe you need a second scrum master to help the team prioritize and align?"

In the words of TFA's author:

AAAAAAAAAAAAAAAAA


Yeah, I need a better vocabulary for this. It's like seeing someone who has the height to play for the NBA, and they spend all day on the court, but somehow never take the time to learn how to dribble or shoot.

Note: I don't know anything about basketball.


This is all very true, but what’s weird is they seem to lack the pain response to their own creation. This hell of accidental complexity doesn’t seem to bother them, which means they can somehow handle it. I suspect this is what the author meant.


I hadn't reasoned my way to this exact mechanism, but I need to reflect on this. You're right that they seem to have no pain response to their own work. Unfortunately, I don't have a theory of mind for this because I feel like my own work leaves me in agony about 50% of the time.


I suspect the idea of effort being valued in itself has something to do with it. The more effort it takes, the more worthy it must be. The alternative is to admit that the work was wasted, is wasted, and will continue to be wasted until the org dies or someone gets so disgusted with the status quo that they invoke the gods of sunk cost, flushing the whole thing down where it belongs.


Yeah. I've seen these types too, and what I find when I start working on fixing their sorry excuses for systems is that they don't even work. Because they're not geniuses: just as I can't work confidently with their crap because the spaghetti is too complex to keep track of, they can't either.

So once you write a test harness to make sure your refactors don't break things, you start to notice that the things you're testing are wrong. The code doesn't work correctly; it is littered with bugs.

Actually just finished a refactor like that. In a fairly small refactor I found multiple things that weren't working correctly. It was sending duplicated messages, it was incorrectly labeling messages with duplicate sequence numbers, and lots more.

I wrote equivalent code to replace it, coming in at roughly half the number of lines and getting rid of a complete shitload of unnecessary method parameters and other pointless stuff. Also fixed a hot path with an unnecessary O(n^2) complexity (where n could be pretty high) and a bunch of other dumb crap.


I’m working with a Derek right now. Highly motivated and highly incompetent. Usually it’s one or the other. How do you deal with people like this?


I don't think you can, unfortunately. I've started my own company which does very low revenue at the moment, and that's the only solution I have found - handpicking the whole team from scratch. I've spoken to a few executives and they have convinced me that performance managing someone out of the organization is so difficult that it's basically an empty threat in many countries.


The usual recommendation is to get rid of them https://quoteinvestigator.com/2014/02/28/clever-lazy/


Those are some great quotes!


This is such a good article. Poor Derek probably lacks communication skills or any firm mentorship or a spine.


@lucidity can you give an example of a database test with views?


Oof, not off the top of my head. I think that's something I wrote about in an older post about an engineering interview I sat, but if you hit up my email, I can speak to one of my co-directors who has done loads of stuff in this space.


I have a question about this, that I probably can't explain well in a single comment. Normally I have well-defined schemas for my data, and obviously prefer this for its uncountable benefits.

However, over the years I have occasionally and reluctantly used the 'Derek table', for importing domain data with a dynamic schema, aware that I will be paying dearly for it.

My question is: What alternatives are there..? Other kinds of databases, maybe those used for 'big data'?

I must clarify:

(1) I know I could instead create proper db tables on the fly - this way, I _can_ have varying columns depending on each new set of domain data I import. However, IF I do this, I will now have to dynamically build my SQL queries to refer to these varying tables and column names. The BUILDING of those SQL queries is not the thing to fear, but their execution is, since each new variant is a previously unseen db execution plan, and some of those will hit weird performance problems. I am painfully aware of this, because I have worked (and still work) 'lifeguard duty' on a production database, where I routinely had to investigate how a user had yet again created a dbms-choking query with exactly this technique.

(2) In the Derek approach, the usual move would be to violently retrieve 'all the query-relevant records' (e.g., the contents of 'project'/'document'), and then do the actual calculation in code, e.g. C#. This of course has the downsides of

(2.1) - we must haul huge (relatively speaking) amounts of raw data off the db, since we are not summarizing it to the calculation end result beforehand. But this is also the 'benefit': we know we won't bother the db further than this initial and rather simple haul/read. (I AM aware I could also query the Derek monstrosity directly, but I have seen enough of that to know to avoid it, so as not to bring the dbms to its knees.)

(2.2) This is an extension of 2.1: Since we are working with the raw data, we must 'pay' both for moving the big chunk of data over the network, and also/often for having it in working memory on the actual external processing/calculation server (this may not be true, if the calculation can be done piece-wise working on a stream). And, of course, there is the entire cost of 'a single table cell is now a whole db row'.

Echoing poor Derek, the benefit of the described approach is that it takes a relatively simple approach and relatively simple code to build the solution this way, at a tremendous cost in resources/efficiency. If I did 'the right thing', I would have to write considerably more, and more complex, code to handle the dynamic DDL/schema processing and dynamically work with the real DB schema.

Back in the 90's, I would by necessity have 'done the right thing', since the Derek approach was doomed performance-wise then. But now, in the 2020's, we have so much computing power, we can survive - for a time - by wasting extravagant amounts of resources.

To recap/PS: Whenever I have done this, it has been for a small subset of specific data; I have never done it in the insane "one single table", with dynamic tables too. My case has always been 'dynamic fields'.

Also, for context: The 'calculation' to be done typically amounts to what could generously be referred to as a big pivot-table operation on a heterogeneous set of data (which is why, expressed in proper SQL, the query would be rather verbose and unwieldy, accounting for all those heterogeneous and possibly-present fields on the different source tables/datasets).
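For concreteness, a toy version of the pivot-in-code step described above (the real thing is in C#; entity IDs and field names here are invented):

  from collections import defaultdict

  # The 'haul': (entity_id, field_name, value) rows pulled off the db.
  rows = [
      (1, "name", "Ludic"), (1, "age", 29),
      (2, "name", "Derek"), (2, "age", 45), (2, "notes", "see CV"),
  ]

  # Pivot the heterogeneous rows into one record per entity, with whatever
  # fields each entity happened to have.
  entities: dict[int, dict[str, object]] = defaultdict(dict)
  for entity_id, field, value in rows:
      entities[entity_id][field] = value

  print(entities[2])  # {'name': 'Derek', 'age': 45, 'notes': 'see CV'}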


Amazing writing! Reminds me of the Cuckoo's Egg for some reason.


Do read the rest. The one about him saving his employer a million bucks & getting a gift card as a thank you is riotous

https://ludic.mataroa.blog/blog/tech-management-has-bestowed...


Very kind of you to say this, thank you! I'd never heard of Cuckoo's Egg but it sounds like it might resonate with me.


Read it immediately


Agreed, really enjoying their writing. And I keep trying to guess which country they are in.


Melbourne, Australia. :) I don't hide this so much these days because it was stopping me from meeting cool people, and all the people that would get angry at me don't read Hackernews.


So Derek left after 8 months without delivering anything useful. The data was in a format that prompted immediate conversion to a correct relational database model.

You then took over and worked for ~8? months, taking good pay, working from home for a substantial portion, complaining about a less-than-perfect air conditioner, and left as soon as your paperwork got approved.

By the end of your time there, did you deliver something that they could put into use?

Of course those first months are painful, as management was unaware of the unusable state of the code.



