Schema on write is better to live by (hrishioa.github.io)
93 points by hrishi on Aug 21, 2021 | 43 comments



I've come around to almost the opposite approach.

I pull all of the data I can get my hands on (from Twitter, GitHub, Swarm, Apple Health, Pocket, Apple Photos and more) into SQLite database tables that match the schema of the system that they are imported from. I call this family of scripts Dogsheep - https://dogsheep.github.io/

For my own personal Dogsheep (https://simonwillison.net/2020/Nov/14/personal-data-warehous...) that's 119 tables right now.

Then I use SQL queries against those tables to extract and combine data in ways that are useful to me.
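A rough sketch of what that looks like in practice - one query stitching together a day of activity from two different sources (table and column names here are invented for illustration, not the actual Dogsheep schema):

    -- hypothetical tables standing in for two imported sources
    select 'tweet' as source, created_at, full_text as summary
      from tweets
     where date(created_at) = '2021-08-21'
    union all
    select 'checkin' as source, created_at, venue_name as summary
      from checkins
     where date(created_at) = '2021-08-21'
     order by created_at;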

If the schema of the systems I am importing from changes, I can update my queries to compensate for the change.

This protects me from having to solve for a standard schema up front - I take whatever those systems give me. But it lets me combine and search across all of the data from disparate systems essentially at runtime.

I even have a search engine for this, which is populated by SQL queries against the different source tables. You can see an example of how that works at https://github.com/simonw/datasette.io/blob/main/templates/d... - which powers the search interface at https://datasette.io/-/beta


This is also the direction a lot of data engineering is going, where systems are moving from ETL to ELT (that is, loading data with minimal transformations as soon as possible and doing the complicated transforms after that). This is e.g. why tools such as dbt are getting popular.

But I suspect OP's point is probably that you do at least impose a schema on import rather than letting it just be featureless JSON blobs or EAV tuples, which makes the job of understanding that data later so much easier.


+1 to both. "Store things in a schema that reflects the origin" is schema-on-write, just a very particular kind of it.


> systems are going from ETL to ELT

Interesting. I did this for medical records, mid-aughts.

The phrase I used was "transform & translate at the last possible moment". Ideally in the reporting step (server side rendering the HTML).

Since all of our customers kept changing their minds, totally their prerogative, my strategy minimized cost of change and turnaround time. Prior products would have to "rerun the data" from scratch, for schema changes. We'd make the same query tweaks, but at the end instead of the beginning of the pipeline. (Hardest part was always nailing down the requirements, expectations. Then the code basically wrote itself.)

This strategy made people's heads explode. Old schoolers just couldn't abide it. Despite our small team running circles around everyone else. Data must be normalized! All the same! We had to be cheating somehow. They did everything possible to destroy us, and eventually succeeded.


I’ve had a lot of success having a Postgres table with a json blob and then a bunch of columns of interest, populated by a trigger.
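A minimal sketch of that pattern, with invented table and field names (assume the payload is an API response; EXECUTE FUNCTION needs Postgres 11+):

    -- keep the raw payload; a trigger copies out the fields queried most often
    create table api_events (
        id         bigserial primary key,
        payload    jsonb not null,
        event_type text,
        created_at timestamptz
    );

    create function api_events_extract() returns trigger as $$
    begin
        new.event_type := new.payload->>'type';
        new.created_at := (new.payload->>'created_at')::timestamptz;
        return new;
    end;
    $$ language plpgsql;

    create trigger api_events_extract_trg
        before insert or update on api_events
        for each row execute function api_events_extract();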


This works surprisingly well.

I end up just doing both - having one table store the blob, and another with the copied values (shopify_sales, shopify_sales_jsonb).

Databases are generally pretty good at handling large BLOBs/off-page heap values, but I prefer not to tempt the fates of buffer pool management.


Have you heard of steampipe? It's a postgresql foreign data wrapper that exposes apis as virtual tables. It was recently discussed on an hn thread about pg FDW: https://news.ycombinator.com/item?id=28227905

I feel like this is a pretty good strategy, since you can query the API directly -- which seems to be surprisingly effective -- or drop down to materialized views if you need to cache the results for heavily repetitive queries.
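A sketch of how that might look (the exact table and column names depend on the Steampipe plugin, so treat these as assumptions):

    -- query the API-backed foreign table directly...
    select full_name, stargazers_count
      from github_my_repository
     order by stargazers_count desc
     limit 10;

    -- ...or snapshot it into a materialized view for heavily repeated queries
    create materialized view repo_stats as
        select full_name, stargazers_count
          from github_my_repository;

    refresh materialized view repo_stats;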


Just wanna leave this [0] here: simonw's keynote on Datasette and Dogsheep at csv,conf,v6 was excellent and entertaining.

[0]: https://www.youtube.com/watch?v=UFn82w-97kI


Datasette looks really useful, the introduction [1] is well done!

1 - https://www.youtube.com/watch?v=7kDFBnXaw-c


You do realize that the person you're replying to - simonw here on HN - is Simon Willison, the creator of Datasette!


I do! I hadn't come across it yet, and it's always great to see a creator introduce his work well, great work all around!


What are some useful ways you've used your own data?


To be honest it's mostly for fun, and to help me dogfood https://datasette.io/ and come up with new features for it.

But just a moment ago I was trying to remember the name of the Diataxis documentation framework - I was sure I'd either tweeted about it or blogged it, so I ran a personal search in Dogsheep Beta and turned up this tweet: https://twitter.com/simonw/status/1386370167395942401

Someone asked me the other day who they should follow on Twitter for Python news, so I searched all 40,000 tweets I have favourited for "Python", faceted by user and told them the top four users.
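Roughly, that kind of lookup boils down to a query like this (table and column names simplified for illustration, not the exact Dogsheep schema):

    select screen_name, count(*) as favourites
      from favourited_tweets
     where full_text like '%python%'
     group by screen_name
     order by favourites desc
     limit 4;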


This is very different from what the article is talking about. This falls 100% under “schema on read”.


If you're taking a JSON API response structure and creating an exact 1-to-1 mapping to a SQL schema, you're doing schema on write.

If you store the JSON in a JSONB column, and then create projections as needed by parsing fields from the BLOB, then you have schema on read.
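A minimal sketch of the schema-on-read side, with invented names - the raw response is kept verbatim and fields are projected out only when a query needs them:

    create table api_responses (
        id      bigserial primary key,
        fetched timestamptz default now(),
        body    jsonb not null
    );

    -- projection happens at read time
    select body->>'id'                     as order_id,
           (body->>'total_price')::numeric as total
      from api_responses
     where body->>'status' = 'paid';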


My code automatically creates a column for each key in the JSON object, adding any columns that are missing.

If a key represents a nested object, I dump JSON for that object into a text column (since SQLite doesn't have a separate JSON type).

What I'm doing doesn't feel different enough from "dump it all in a JSON column" to justify different terminology.

Here's an example of my results: https://github-to-sqlite.dogsheep.net/github


I consider schema-on-read to be the property of a system that retains the original response of a remote system call (mostly) unmodified, such that any derived values can be re-created to account for mistakes/omissions.

Doesn't necessarily mean that you have to project your schema dynamically every time from the original data, just that you have that option. If you're able to (mostly) re-create the original response and recover from any error, I think that's close enough to schema-on-read though the pedant could disagree (easy examples of fp precision come to mind).

I think of schema-on-write in the context of data from a remote system being copied to a local system such that mistakes/omissions require fetching data from the remote system to reconcile.

It sounds like you've also tackled dynamic schema migration from JSON graphs to SQLite relations, minimizing relation count by keeping nested graph objects as their JSON string representation.


That's a really clear explanation, thanks!

I think what I'm doing counts as schema-on-read then: the tables I am creating can be used to recreate the exact API response that I recorded, because there is a one-to-one mapping between the columns and the keys in that original JSON.


You're using the database definition of “schema”, which is orthogonal to the definition in the article. Take the example of the browser's bookmarks list: the browser stores it with a DB schema, yet an ever-growing list of bookmarks has no “conceptual/knowledge base” schema.


Yeah, that's why the parent comment says they are doing the opposite from the article.


(commenting specifically in the context of a personal information collection)

Schema-on-write increases the cost (mental load, time) of finding useful things in the first place, which isn't always what you want. I happen upon an interesting HN article (like this) and (like the author) dump it in my "later" pile - often that's during a little ADHD-brain distraction, where I just browse and read for a few mins before jumping back into a piece of work. I specifically DO NOT want extra cognitive load at that point.

The balance is always a bit of both - sometimes organise, sometimes defer it. But always allocate some time for "emptying" those queues, lists, inboxes, etc - that's the critical part of staying organised and usually the hardest.

The worst temptation is to automate the process - eg. "I'll auto-import all my data into organised tables" or "I'll use ML to automatically tag stuff" - that's fine, but the purpose is to help you learn and remember, and that automation will just create even more data for your "to look at later" pile.

I'm having some joy using Obsidian - it's so quick for creating interlinked documents and notes, task lists, etc, and from a technical PoV they're all just plain Markdown documents stored in Dropbox - future-proof and device-proof.


This has exactly been my experience. Automation hasn't helped much - we're still not at the point where (as someone else here pointed out) you have a system that can mimic your mind and add the metadata for you.

Clearing out the lists makes a big difference. I'm reminded of a video I watched [1] where deleting photos more often made it easier for the author to remember his life. I've been trying it more, and he's right - it works!

1 - https://www.youtube.com/watch?v=GLy4VKeYxD4


Sure, delete the cruft, but if you've got "too many" photos and videos, maybe it's not you: maybe it's your photo app.

I believe I have a novel approach to browsing your library that you might like. My users love it.

https://forum.photostructure.com/t/love-random-shuffle/790


This hits home in lots of ways and I'm going through a similar struggle and realizations.

Thing is, I used to have an excellent memory. I could recall a four year old blog post I read once and instinctively know whether it would help me with my current task (or enough of it to solve it right away).

But now my memory is failing me (overload? stress? who knows). I still instinctively know I've read something related to something, but I can't remember enough for it to help me, nor to find it again. I can't remember the punch-line, or if there even was one. It makes for the most boring and cringe anecdotes you've ever heard. "Oh, yeah! I read about that, there was a thing and then there was a conclusion. I'm not sure on the thing and the conclusion could go either way." Worthless. Yet I still go for it, because I'm used to remembering enough of it for it to be helpful/relevant. Now I can barely trust what I do think I remember.

I've come to the same conclusion. Schema on write. But as is noted, "It is a lot of work." And I'm struggling, because previously the very useful and rewarding "hoarding" (I never set out to hoard information, it was just a side effect driven by curiosity) was dirt cheap. Now it takes a lot of work to condense, and it isn't at all clear that the effort is worth it. For sure it depends. But even the act of looking up condensed information tilts the scale a bit: you ~need to know that the thing you are looking for exists in the first place, or it might just become another distraction. But it doesn't matter, I'm too tired for it anyway and my mind refuses to adapt. Wanting more, but I can't cope and get nothing done.

More than a schema and order, I need balance, perhaps by (forcefully?) constraining myself. Someone on HN tried text-mode-only Linux (I recommend the read/skim: https://dev.to/jackdoe/tty-only-1ijn). Maybe. I am lost. (hence the ramble)


I'm motivated by the same experience. My memory is fading a little, perhaps due to the sheer volume of things I'm trying to remember, or age.

The part about anecdotes hit home, where I'll remember the description for something but not the name, or the name and how it's connected but not enough to really explain it. "I remember it made sense when I was reading it, you'll need to trust me" is not that great.

Agree with the subcomments: the work involved becomes smaller and smaller as your brain gets better at it - somewhat like journaling. If you're willing to put in the extra work at the beginning, it's also a great memory booster. Your brain learns to understand that you intend to remember something, so the connections feel stronger. I definitely recommend it. Even something like a short journal at the end of the day makes a big difference.

The biggest thing is being okay with letting some things go - which has been hard for me, yet has made a huge difference. If you don't have the time to properly put it down to be remembered, accept that it's likely to be lost and move on. If it's important or interesting enough, your mind will remember without the need to save it.

It's a tiny, tiny slice of things that I can ever pull out of the knapsack. Either I've forgotten enough that I don't have enough metadata (or even that they exist), or I remember enough that I can just find it again.


Schema on write becomes part of the daily ritual. Block out time to check emails, block out time to clean off your desk, block out time to save all your links and notes from the day. Ignore the voice in your head telling you this isn't productive time, because it is definitely productive. The ability to clock out with a clear head is liberating. And even the process of going through links, notes, etc. and categorizing ("filing") them helps me remember things anyway.


Schema on read doesn't mean the schema shouldn't be written. Write the schema, keep it safe, but decouple the schema from the data. That's it.


I somewhat disagree with this.

I've tried to organize notes and documents schema-first several times, and it's never worked. For instance, consider a system of notes where each note has tags: I write a note, come up with some tags for it, then I write a second similar note and end up thinking "hang on, I feel like I chose the wrong tags for the first one." So I have to come up with new tags for both.

Either I'm constantly revising how I've tagged older notes, or I have to think so hard about how I'm going to tag each new note before I write it that I never want to write any notes. Before I write anything, I need to think up a set of tags which are specific enough to be useful but general enough to re-use.

Instead, these days, I use Logseq [1], and I'm really liking it. In Logseq, everything is bullet lists. If you have a new note to write, you just append it to your root list. If two notes feel like they belong together, indent them under a parent. If any list starts becoming too bloated, you can move a subtree to a new page.

A list like

    Main List
    - Entry 1
    - Entry 2
    - Entry 3
becomes

    Main List
    - Entry 1
    - Category A
      - Entry 2
      - Entry 3
which becomes

    Main List
    - Entry 1
    - <hyperlink to "Category A">
    
    Category A
    - Entry 2
    - Entry 3
This way, the schema grows organically over time, from the bottom up. Instead of having to think up a system of classification before I start writing, I just write, and then classify later as information accrues.

Logseq has tags and block embeds and many other features too, but the core nested list model is what has really attracted me to it. I'm sure it's not the only note app that works along these lines (I'm always open to suggestions), but it's open-source and it works quite nicely.

[1]: https://logseq.com/


Definitely trying out logseq, thanks!


It's very simple:

Schema on write is upfront costs for lower per unit costs later.

Schema on read is no upfront cost and higher per unit costs thereafter.

The value of schema on write is based on how much lower the per unit costs will be, how many units you expect, and how much the upfront costs will be.
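As a back-of-the-napkin way to put numbers on that (figures invented purely for illustration):

    total cost, schema on write:  C_upfront + n * c_write
    total cost, schema on read:               n * c_read     (with c_read > c_write)
    break-even volume:            n* = C_upfront / (c_read - c_write)

    e.g. 40 hours of upfront modelling, 0.5h vs 0.1h per downstream query:
         n* = 40 / (0.5 - 0.1) = 100 queries before schema on write pays off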

Most people in ELT circles completely discount the per unit costs (many vendors are indeed incentivized to maximize these costs).

And the flipside is that a lot of ETL is somewhere between YAGNI and unnecessary extravagance.

But broadly, anyone who recommends one or the other without showing even a back-of-the-napkin analysis of these costs is not a serious data engineer.

PS: Three other real-world metaphors for schema on read vs schema on write:

* renting vs a mortgage

* eating out every night vs learning how to cook

* (pre pandemic) living far away from work to save money vs living closer to work.


To OP - You sound kinda like me. I went over to Ecco Pro when I started doing programming, as the amount of information coming into my head was way too much and I needed something with low-overhead, seamless moving and organization of data. I found a single-panel outliner (like Ecco Pro, Omni Outliner...) works great for my brain.

Every time I download a notetaking app to try a new system, I pray it has a single-panel outlining feature, but nope. I can't count the number of times I considered writing my own.

Now that you've mentioned Notion, I checked and it has an outline mode built right in. Woo woo!! The only thing left to sell me is Linux support, which is my primary DT env.

Thanks for the post!


I do miss Ecco Pro; Notion has been a decent replacement. If you want offline and caching, Dendron might be better.

The apps are improving, but for now the web interface is pretty good for a cross-platform experience.


This is kind of the "organize your file structure versus throw everything in a single directory and rely on full text search" argument from the old days when everything important was saved in a file.

I think if most of what you're organizing is just queued things to do-- videos to watch, articles to read-- your problem isn't organizing those, it's throughput (focus).


Sometimes you don't ever need to watch the full video. Maybe you just want to remember the existence of a specific snippet from 47:31-49:55.


I think the schema is already there. You can read and understand it, right? So it has more structure for you than just entropy.

How to capitalize this is indeed a harder question. My mind gravitates more towards something like GPT3. If you fed it all your memories (expensive, I know) and gave it a prompt of "I was thinking about a funny gif with a dog and a mug in a room on fire", it'd for sure get you the this_is_fine.jpg. But the answer is not quite what I typically see myself wanting. Instead, I want to navigate this graph (the understanding necessary to make the connection from the prompt to the jpg in the first place) myself, and see what's interesting and non-obvious there. That's what I interpret as "being able to talk with your zettelkasten".


Some thoughts: the schema is only partially there, at least for me. I can read and understand the content, but if I'm removed enough from when I first read or stored it, I've lost the context in which I was seeing it and the relationships within.

Having something that models the mind is the absolute goal - that would be amazing. It's definitely possible, and a lot of this labour might be the equivalent of needing a good sense of direction and map-reading skills before we had GPS.


If you believe in ELT over ETL, you want both.

Schema on read is good for raw data stages. You want to get the data from external systems into the database no matter what. You don't have control over it, and it can change at any time. Just get it into a blob, variant or text column in a raw table and then transform it to your schema later. This way you can implement validation logic for the raw data within your database, instead of doing that outside in your ETL system. So if you believe in ELT, just get the data in first.

Of course you don't want to use stringly typed tables for any serious data analysis. You need to type the columns to make calculations performant and to keep the schema stable over the data's history. Instead of doing your typing logic - converting string and variant columns to native database types - in an ETL script, you do it in the data warehouse.

The transformation into types has to happen somewhere, and if you skip schema on read, that only means you never adopted ELT and are stuck with both your database AND an ETL system. Adopting ELT would mean using only one system - the data warehouse - for both.
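A sketch of the two stages, with invented table and field names (Postgres-style syntax; a warehouse would typically use a variant column instead of jsonb):

    -- stage 1: land the raw payload untyped, exactly as the source system sent it
    create table raw_orders (
        loaded_at timestamptz default now(),
        payload   jsonb not null
    );

    -- stage 2: type and validate inside the warehouse, not in an external ETL job
    create table orders as
    select (payload->>'id')::bigint              as order_id,
           (payload->>'total_price')::numeric    as total,
           (payload->>'created_at')::timestamptz as created_at
      from raw_orders;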


That is arguably why it’s called “in formation” in the first place


this doesn't sound right...but I don't know enough about etymology to dispute it


I’m bookmarking this thread for later. So much useful information here that can help me in the future :)


This is super interesting! I've always had a strong preference for schema on write, both in databases and in life - prefer to organize the cupboards when I first set up house, rather than dumping everything in and hoping for the best. So I'm definitely very inclined to immediately accept your basic premise: schema on write is substantially better.

But I've ended up in several heated discussions on this, in the context of SQL vs NoSQL. The one argument I've been slightly persuaded by is: schema-on-read is significantly more flexible than schema-on-write. In most of my actual programming applications I still use SQL, because in coding I think it's better to prioritize planning and structure over flexibility.

If there's one area where flexibility is necessary, it's real life. When I first start researching something new, I don't usually have enough knowledge to actually structure my schema effectively, and the usefulness degrades. Of course, you can update the schema. This isn't always great. If your schema needs to change constantly, it wastes a lot of time. Depending on complexity, it can also just be a massive cost. Eg: when I have the kitchen set up all nicely, but then we get a new blender with five different attachments and now I need to find an empty shelf for them. Took about an hour to reshuffle everything satisfactorily.

All that said, I'd still say that schema-on-write is better than schema-on-read. Some structure is almost always better than no structure.

However, I've recently been reading a book that I think gives an interesting different insight to this problem - Designing Data-Intensive Applications, by Martin Kleppmann. I've always considered the main categorization of databases to be schema-on-write vs schema-on-read, but this gives a completely different method: databases are either document-based, relational, or graph. Relational databases we're all generally familiar with, while document-based is similar to today's NoSQL.

Graph databases have fallen out of favour, but I actually think that they might be the best at representing the human information gathering process. They have a structure that's provided on write, but isn't always consistent across entries - because it's flexible, and can be added to very easily. This lets us expand the schema as we gather more information and our view of the world changes, without needing to rearrange our past knowledge. I also feel like a graphlike structure better represents how we think.

Honestly, the main useful point I got from the book is that schema-on-write doesn't need to mean lack of flexibility. That is the case with most of the relational databases we use today, so it's what I've come to expect. But that shows a lack of imagination on my part, rather than any inherent restriction.


I agree completely about graph DBs. I used a third-party graph database that was schema-heavy on nodes and edges, and it was amazing for mapping business processes, as the customers could actually understand the data model and we did not need any DBAs. I wrote several custom enterprise systems based on this graph backend that were very successful in their deployment and upkeep.

The company that has this DB never sold it on its own, and it's only available with their $$$$ CAD/CAM/3D/PDM/PLM software packages used by the big guys (e.g. Boeing, Honeywell...).


Great book! Graph databases are useful, but I'm not sure they're a great fit for direct human use - the problem is that listing links and finding connections is tiring for us to do.

That said, with enough automation and context-aware software, graph dbs start looking like a great idea.

SQL vs NoSQL is a perfect example of schema-on-write vs schema-on-read, even though with jsonb in Postgres and schema validation in Mongo you can use either one both ways.



