Building and scaling Notion's data lake (notion.so)
239 points by alexzeitler 5 months ago | 82 comments



Hi all—I'm the EM for the Search team at Notion, and I want to chime in to clear up one unfortunate misconception I've seen a few times in this thread.

Notion does not sell its users' data.

Instead, I want to expand on one of the first use-cases for the Notion data lake, which was by my team. This is an elaboration of the description in TFA under the heading "Use case support".

As is described there, Notion's block permissions are highly normalized at the source of truth. This is usually quite efficient and generally brings along all the benefits of normalization in application databases. However, we need to _denormalize_ all the permissions that relate to a specific document when we index it into our search index.

When we transactionally reindex a document "online", this is no problem. However, when we need to reindex an entire search cluster from scratch, loading every ancestor of each page in order to collect all of its permissions is far too expensive.

Thus, one of the primary needs that my team had from the new data lake is "tree traversal and permission data construction for each block". We rewrote our "offline" reindexer to read from the data lake instead of reading from RDS instances serving database snapshots. This allowed us to dramatically reduce the impact of iterating through every page when spinning up a new cluster (not to mention save a boatload in spinning up those ad-hoc RDS instances).
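
To make the "tree traversal and permission data construction" step concrete, here is a minimal sketch in Python. It is not Notion's implementation: the `BLOCKS` table, the user names, and the rule that grants are purely additive down the tree are all assumptions made for illustration. The point is only that resolving each ancestor chain once and caching it is what makes a bulk, offline pass cheap compared with loading every ancestor per page.

```python
# Hypothetical, simplified block table: block_id -> (parent_id, directly granted users).
BLOCKS = {
    "workspace": (None, {"alice"}),
    "page_a": ("workspace", {"bob"}),
    "page_b": ("page_a", set()),
    "page_c": ("workspace", {"carol"}),
}


def denormalize_permissions(blocks):
    """Return block_id -> every user with access, inherited grants included.

    Assumes grants only accumulate from ancestors (no overrides or restrictions).
    """
    resolved = {}

    def resolve(block_id):
        # Each block is resolved once; later lookups reuse the cached result.
        if block_id in resolved:
            return resolved[block_id]
        parent_id, direct = blocks[block_id]
        inherited = resolve(parent_id) if parent_id is not None else frozenset()
        resolved[block_id] = frozenset(direct) | inherited
        return resolved[block_id]

    for block_id in blocks:
        resolve(block_id)
    return resolved


effective = denormalize_permissions(BLOCKS)
print(sorted(effective["page_b"]))  # ['alice', 'bob']
```

The recursion is fine for a toy tree; a real job over billions of blocks would be expressed as a distributed join or iterative traversal instead.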

I hope this miniature deep dive gives a little bit more color on the uses of this data store—as it is emphatically _not_ to sell our users' data!


This is a fantastic post that explains a lot of the end product, but I'd love to hear more about the journey specifically on denormalizing permissions at Notion. Scaling out authorization logic like this is actually very under-documented in industry. Mind if I email you to chat?

Full disclosure: I'm a founder of authzed (W21), the company building SpiceDB, an open source project inspired by Google's internal scalable authorization system. We offer a product that streams changes to fully denormalized permissions for search engines to consume, but I'm not trying to pitch; you just don't often hear about other solutions built in this space!


Curious - what do you guys use for the T step of your ELT? With nested blocks 12 layers deep, I can imagine it gets complicated to try to de-normalize using regular SQL.

Have you explored a pattern like https://runtrellis.com or https://unstructured.io/ for unnesting?


Hey! While you're here...

> Iceberg and Delta Lake, on the other hand, weren’t optimized for our update-heavy workload when we considered them in 2022

Curious about your thoughts here. Have you followed Iceberg's progress? Do you think it'd be a tougher decision in 2024 between Hudi and Iceberg?


Interesting! Now I'm curious how you handle live permission changes and indexes with stale permission data.


(I’m not on the search team, but I did write some search stuff back in 2019, explanation may be outdated)

The blocks (pages are a block) in Notion are a big tree, with your workspace at the root. Some attributes of blocks affect the search index of their recursive children, like permissions: granting access to a page grants access to its recursive child blocks.

When you change permissions, we kick off an online recursive reindex job for that page and its recursive subpages. While the job is running, the index has stale entries with outdated permissions.

When you search, we query the index for pages matching your query that you have access to. Because the index permissions can be stale, we also reload the result set from Postgres and apply our normal online server-side permission checks to filter out pages you lost access to but that have stale permissions in the index.
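
The two-step lookup described above can be sketched roughly like this. Everything here is a hypothetical stand-in: `INDEX` plays the role of the search index, `SOURCE_OF_TRUTH` plays the role of the authoritative permission check, and the page ids and users are made up.

```python
# Hypothetical stand-in for the search index, with denormalized (possibly stale) permissions.
INDEX = [
    {"page_id": "p1", "text": "quarterly roadmap", "allowed": {"alice", "bob"}},
    {"page_id": "p2", "text": "roadmap draft", "allowed": {"alice"}},
]

# Hypothetical authoritative permissions: bob lost access to p1, the index has not caught up.
SOURCE_OF_TRUTH = {"p1": {"alice"}, "p2": {"alice"}}


def search(user, term):
    # Step 1: cheap candidate lookup against the index, using its own permission data.
    candidates = [
        doc["page_id"]
        for doc in INDEX
        if term in doc["text"] and user in doc["allowed"]
    ]
    # Step 2: re-check each candidate against the authoritative store and drop stale hits.
    return [pid for pid in candidates if user in SOURCE_OF_TRUTH.get(pid, set())]


print(search("bob", "roadmap"))    # []
print(search("alice", "roadmap"))  # ['p1', 'p2']
```

A post-filter like this can only remove results. A page the user was just granted access to would not show up until the reindex job updates the index.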


Neat, thanks for sharing!


They didn’t say the quiet part out loud, which is almost certainly that the Fivetran and Snowflake bills for what they were doing were probably enormous and those were undoubtedly what got management’s attention about fixing this.


Found this comment (from Fivetran's CEO, so, with that in mind) regarding this article enlightening regarding the costs they were facing here https://twitter.com/frasergeorgew/status/1808326803796512865


Snowflake as destination is very very easy to work with on fivetran. Fivetran didn't have S3 as destination till late 2022. So it literally forces you to use one of BQ, Snowflake, redshift as destination. So fivetran CEO's defence is pretty stupid.


They weren't that quiet about it:

> Moving several large, crucial Postgres datasets (some of them tens of TB large) to data lake gave us a net savings of over a million dollars for 2022 and proportionally higher savings in 2023 and 2024.


I'd like to see more details. 10s of TB isn't that large -- why so expensive?


Fivetran charges by "monthly active rows", which quickly adds up when you have hundreds of millions to billions of rows that are constantly changing.

https://fivetran.com/docs/usage-based-pricing
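
A rough sketch of why this kind of pricing adds up, assuming a row counts once per month however many times it changes (check the pricing page above for the exact rules). The `changes` log and the `block` table name are hypothetical.

```python
from collections import defaultdict

# Hypothetical change log: (month, table, primary_key) for every insert/update/delete.
changes = [
    ("2024-06", "block", 1),
    ("2024-06", "block", 1),  # same row edited again: still one active row that month
    ("2024-06", "block", 2),
    ("2024-07", "block", 1),  # a new month counts the row again
]


def monthly_active_rows(changes):
    """Count distinct (table, primary key) pairs touched in each month."""
    active = defaultdict(set)
    for month, table, pk in changes:
        active[month].add((table, pk))
    return {month: len(keys) for month, keys in active.items()}


print(monthly_active_rows(changes))  # {'2024-06': 2, '2024-07': 1}
```

With a data model where most rows are touched at least once a month, the count approaches the size of the whole table.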


yep, and Notion's data model is really bad for this pricing. Almost every line you type is a "block" which is a new row in their database.


They’re likely paying for egress from the databases as well.


DBA salaries, maybe?


Maybe cloud hosted


I thought the quiet part was that they are data mining their customer data (and disclosing it to multiple third parties) because it’s not E2EE and they can read everyone’s private and proprietary notes.

Otherwise, this is the perfect app for sharding/horizontal scalability. Your notes don’t need to be queried or joined with anyone else’s notes.


Also whether this data lake is worth the costs/effort. How does this data lake add value to the user experience? What is this “AI” stuff that this data lake enables?

For example, they mention search. But I imagine it is just searching only within your own docs. Which I presume should be fast and efficient if everything is sharded by user in Postgres.

The tech stuff is all fine and good, but if it adds no value, it's just playing with technology for technology's sake.


I too was surprised to read that they were syncing what reads, at a glance, to be their entire database into the data lake. IIUC the reason that Snowflake prioritizes inserts over updates is because you're supposed to stream events derived from your data, not the data itself.


This ^. This switch from managed to in house is a good example of only building when necessary.


They seem to be doing lots of work but I don’t understand what customer value this creates.

What does a backing data lake afford a Notion user that can’t be done in a similar product, like Obsidian?


The whole point of a data warehouse is that you can rapidly query a huge amount of data with ad hoc queries.

When your data is in Postgres, running an arbitrary query might take hours or days (or longer). Postgres does very poorly for queries that read huge amounts of data when there's no preexisting index (and you're not going to be building one-off indexes for ad hoc queries—that defeats the point). A data warehouse is slower for basic queries but substantially faster for queries that run against terabytes or petabytes of data.

I can imagine some use cases at Notion:

- You want to know the most popular syntax highlighting languages

- You're searching for data corruption, where blocks form a cycle

- You're looking for users who are committing fraud or abuse (like using bots in violation of your tos)
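
The cycle case is a good example of a query that is easy to state but needs a pass over every row. A minimal single-machine sketch in Python, assuming a hypothetical `PARENTS` mapping of block id to parent id (a real job would run over the lake with a distributed engine):

```python
# Hypothetical block_id -> parent_id mapping; None marks a root block.
PARENTS = {"a": None, "b": "a", "c": "b", "x": "z", "y": "x", "z": "y"}


def blocks_in_cycles(parents):
    """Return the ids of blocks that sit on a parent-pointer cycle."""
    done = set()      # blocks whose ancestor chain has already been walked
    in_cycle = set()
    for start in parents:
        path, position = [], {}
        node = start
        # Follow parent pointers until a root, a finished block, or a repeat on this path.
        while node is not None and node not in done and node not in position:
            position[node] = len(path)
            path.append(node)
            node = parents.get(node)  # a missing parent is treated like a root
        if node is not None and node in position:
            # We came back to a block on the current path: everything from there on is the cycle.
            in_cycle.update(path[position[node]:])
        done.update(path)
    return in_cycle


print(sorted(blocks_in_cycles(PARENTS)))  # ['x', 'y', 'z']
```

Each block is walked once, so the work is linear in the number of blocks.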


From the article: "Unlock AI, Search, and other product use cases that require denormalized data"


1st paragraph: "Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion’s data lake."


Beyond the features that the sibling comment mentioned, this kind of data isn’t really for end users. It’s a way that you can package it up, “anonymize” it, and sell the data to interested parties.


For someone like Notion, they probably aren't selling this data. The primary use case is internally for analysis (eg product usage, business analysis, etc).

It can also be used to train AI models, of course.


That "probably" is doing a lot of heavy lifting. That said, whether they sell it or not, it's all that data that is their primary value store at the moment. They will either go public or sell, eventually. If they go public, it'll likely be similar to Dropbox; a single fairly successful product, but failing attempts to diversify.


"Selling" is a load-bearing word, too. They're probably not literally selling SQL dumps for hard cash. But there are many ways of indirectly selling data, that are almost equivalent to trading database dumps, but indirect enough that the company can say they're not selling data, and be technically correct.


Is that why they’re putting images in Postgres? I don’t understand that design decision yet.


Notion employee here. We don't put images themselves in Postgres; we use S3 to store them. The article is referring to image blocks, which are effectively pointers to the image.


I... Don't think they are? If you look at the URL for images in notion, you can see the S3 hostname.


> Data lake > Data warehouse

These aren't something I would like to hear if I'm still using Notion. It's very bold to publish something like this on their own website.


Those are just different words for "database". What do you care what kind of database your Notion data is sitting in?


A "data lake" strongly suggests there's a lot of information the company needs to aggregate and process globally, which should very much not be the case with a semi-private rich notebook product.


They literally explained in the article why they have a data lake instead of just a data warehouse: their data model means it's slow and expensive to ingest that data into the warehouse from Postgres. The data lake is serving the same functions that the data warehouse did, but now that the volume of data has exceeded what the warehouse can handle, the data lake fills that gap.

I wrote another comment about why you'd need this in the first place:

https://news.ycombinator.com/item?id=40961622

Frankly the argument "they shouldn't need to query the data in their system" is kind of silly. If you don't want your data processed for the features and services the company offers, don't use them.


> Frankly the argument "they shouldn't need to query the data in their system" is kind of silly.

Neutral party here: that's not what they said.

A) Quotes shouldn't be there.

B) Heuristic I've started applying to my comments: if I'm tempted to "quote" something that isn't a quote, it means I don't fully understand what they mean and should ask a question. This dovetails nicely with the spirit of HN's "come with curiosity"

It is disquieting because:

A) These are very much ill-defined terms (what, exactly, is a data lake, vs. a data warehouse, vs. a database?), and as far as I've had to understand this stuff, and as a quick spot check of Google shows, it's about making it so you're accumulating more data in one place.

B) This is antithetical to a consumer's desired approach to data, which I'll describe parodically as: stored individually, on one computer, behind 3 locked doors and 20 layers of encryption.


At the scale of Notion, with millions of users, they’d have that much data.

I’ve seen 100TB+ workloads at smaller companies. Not unusual.


The concern isn't the scale, it's the use. What is there to _process_ when they're supposed to only store and retrieve to show to users?


The data doesn't have to be the content of user's notes. Think of all the metadata they're likely collecting per user/notebook/interaction – the data's likely useful for things like flagging security events, calculating the graph of interconnected notes, indexing hashed content for search (or AI embeddings?) ... these are just a few use-cases that come to mind from the top of my head.


Of which security and stability seems like the only reasonable use cases. Indexing content for search globally? Embeddings? They just can't help themselves, can they? All that juicy data, can't possibly leave it alone.


Great, you build only store and retrieve functionality. How:

1. Do you identify which types of content your users use the most?

2. Do you find users who are abusing your system?

3. Do you load and process data (even on a customer by customer basis) to fine tune models for the QA service that you offer as an optional upgrade? Especially when there could be gigabytes of data for a single customer

4. Identify corrupt data caused by a bug in your code that saves data to the db? You're not doing a full table scan over hundreds of billions of records across almost 500 logical shards in your production fleet

These are just the examples I came up with off the dome. The job of the business is to operate on the data. If you can't even query it, you can't operate on it. Running a business is far more than just being a dumb CRUD API.


Fwiw, you should be able to answer #1 and #2 without hitting the main db if you've got good observability into your system.


Observability data comes from a (drumroll) database! Most analytics products that can answer these questions are just time series data warehouses.


a database, obviously, but are you really storing metrics and logs next to customer data in the same database, or did you skip over the part where I used the word “main”?


Could you expand on this?


What's there to expand on? Do you not realize how bad of a look it is for a company to publicly admit, on their own blog, the amount of time and engineering effort they spent to package up, move, analyze, and sell all their customer's private data?

This is why laws like CCPA "do not sell my personal information" exist, which I certainly hope Notion is abiding by, otherwise they'll have lawyers knocking on their door soon.


Where do they say they sell it? Citation needed; that's a legal and reputational minefield that I don't think they would admit to, like you said.


I would challenge you to find any broker who sells data (like the T-Mobile location data scandal) who says plainly and clearly they sell user data.


This is not answering the question.


Right, yes, tone aside that’s very helpful- at first I didn’t understand the implication of the blog post for implementing customer hostile solutions, but you’ve helped me understand it now.


That’s definitely something you want to do. Datalake can be home for raw and lightly refined data in an “analytics” database such as big query or just raw parquets. This is fast for large queries but slow for small queries. So you want refined data in a “regular” database like Postgres or mssql to serve all the dashboards.


Given how infuriating their implementation is of an in-app database, perhaps it's not that surprising.


This was a nice read, interesting to see how far Postgres (largely alone) can get you.

Also, we see how self-hosting within a startup can make perfect sense. :)

Devops that abstract away things in some cases to the cloud might just add to architectural and technical debt later, without the history of learning from working through the challenges

Still, it might have been a great opportunity to figure out offline first use of notion.

I have been forced to use anytype instead of notion for the offline first reason. Time to check out the source code to learn how they handle storage.


> Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion’s data lake.

Are they using this new data lake to train new AI models on?

Or has Notion signed a deal with another LLM provider to provide customer data as a source for training data?


(I work at Notion, but not on the data platform team)

We do not and will never sell customer data to anyone. We do not train AI models on customer data. As we state in our privacy policy for AI features (https://www.notion.so/notion/Notion-AI-Supplementary-Terms-f...):

> Notion does not use your Customer Data or permit others to use your Customer Data to train the machine learning models used to provide Notion AI Writing Suite or Notion AI Q&A [added: our AI features]. Your use of Notion AI Writing Suite or Notion AI Q&A does not grant Notion any right or license to your Customer Data to train our machine learning models.

We do use various data infrastructure, including Postgres and the data lake, to index customer content both with traditional search infrastructure like Elasticsearch, as well as AI-based embedding search like Pinecone. We do this so you can search your own content when you're using Notion.

We wrote this article to explain how Notion's AI features work with customer data: https://www.notion.so/help/notion-ai-security-practices


It’s not a direct answer but from what Notion tell us about their own business:

* The team are based in the US, specifically California, and Notion Labs, Inc is a Delaware corporation.

* Their investment comes from Venture Capital and individual wealth. The investors are listed on Notion’s about page and are open about how they themselves became rich through VC funded tech companies.

There is a very open sense of panic in tech right now to climb to the top of the AI pile and not get crushed underneath. I would be amazed if there were any companies not enthralled by — and either already embracing or planning to embrace — the data-mining AI gold rush.

Notion is a great product but one would be naive to use it while also harboring concerns about data privacy.


This is one of the best blog posts I've seen that showcase the UPDATE-heavy, "surface data lakes data to users" type of workload.

At ParadeDB, we're seeing more and more users want to maintain the Postgres interface while offloading data to S3 for cost and scalability reasons, which was the main reason behind the creation of pg_lakehouse.


I'm not familiar with S3 on datalake setup. When replicating a db table to S3, what format will be used?

And I'm wondering if it's possible to update the S3 files to reflect latest incoming changes on the db table?


The file format is often Parquet. The "table format" depends on what data lake you're using (e.g. Iceberg, Delta, etc.).

If you know Python, here's[0] a practical example of how Iceberg works.

0 - https://www.definite.app/blog/iceberg-query-engine


Great article, thank you for sharing! I have a question I’d like to discuss with the author. Spark SQL is a great product and works perfectly for batch processing tasks. However, for handling ad hoc query tasks or more interactive data analysis tasks, Spark SQL might have some performance issues. If you have such workloads, I suggest trying data lake query engines like Trino or StarRocks, which offer faster speeds and a better query experience.


(Notion employee)

AWS Athena packages Trino, I’ve been using it for some queries like “find all blocks that contain @-mentions”. It’s a great tool.


Side-ish note, I really enjoyed a submission on Bufstream recently, a Kafka mq replacement. One of the things they mentioned is that they are working on building in Iceberg materialization, so Bufstream can automatically handle building a big analytics data lake out of incoming data. It feels like that could potentially tackle a bunch of the stack here. https://buf.build/blog/bufstream-kafka-lower-cost https://news.ycombinator.com/item?id=40919279

Versus what Notion is doing:

> We ingest incrementally updated data from Postgres to Kafka using Debezium CDC connectors, then use Apache Hudi, an open-source data processing and storage framework, to write these updates from Kafka to S3.

Feels like it would work about the same with Bufstream, replacing both Kafka & Hudi. I've heard great things about Hudi but it does seem to have significantly less adoption so far.
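
For anyone unfamiliar with what the CDC-to-lake step has to do, here is a toy sketch in Python of folding a stream of change events into a keyed snapshot. It is only the upsert idea, not how Hudi stores or merges data internally. The events are hand-written and shaped loosely like Debezium's envelope (`op` of `c`/`u`/`d` with `before`/`after` row images); the `id` key and the row contents are made up.

```python
# Hypothetical change events, loosely shaped like a Debezium envelope.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "text": "hello"}},
    {"op": "c", "before": None, "after": {"id": 2, "text": "draft"}},
    {"op": "u", "before": {"id": 1, "text": "hello"}, "after": {"id": 1, "text": "hello world"}},
    {"op": "d", "before": {"id": 2, "text": "draft"}, "after": None},
]


def apply_changes(snapshot, events, key="id"):
    """Fold ordered change events into a copy of the snapshot, keyed by primary key."""
    table = dict(snapshot)
    for event in events:
        if event["op"] in ("c", "u"):
            row = event["after"]
            table[row[key]] = row  # insert or overwrite: an upsert
        elif event["op"] == "d":
            table.pop(event["before"][key], None)  # tolerate deletes of unseen rows
    return table


print(apply_changes({}, events))  # {1: {'id': 1, 'text': 'hello world'}}
```

The hard part at scale is doing this against immutable files in S3, which is where the update-heavy workload mentioned in the article becomes the deciding factor.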


Is there any advantage to having both a data lake setup and Snowflake? Why would one also want Snowflake after doing such an extensive data lake setup?


Many BI / analytics tools don't have great support for Data Lakes, so part of the reason could be supporting those tools (e.g. they still load some of their data to snowflake to power BI / dashboards)


We've solved that issue with Trino. Superset and a lot of other BI tools support connection to it and it's a very cost efficient engine (compared to DWH solutions). Another way to go even cheaper is using Athena, if you're on AWS.


Athena packages Trino - it’s in part a managed Trino service.


They are several versions behind, support for delta was added just recently. Also consider that with Trino you can build a cache layer on Alluxio, making it really fast (especially on NVMe disks).


Saving money, 100%, and also lower latency on distributed access. Accessing file-partitioned S3 doesn't require you to spin up a warehouse and wait for your query to sit in a queue, so if every job runs in something like k8s you don't have to manage resources, and auto-scaling in Snowflake is a "paid feature".

I believe just not having to handle a query queue system is already a win.


For one, Snowflake is expensive (you pay for the convenience and simplicity) and the data in there is usually stored in S3 buckets that Snowflake owns (and they don't pass along any discounts that they get from AWS for the cost of that storage).


What's the difference between a data lake and a database with a filesystem?


I thought data lake was more of an advertising phrase, to show off how much data you have/can handle


Web scale


Add data warehouse to this list too.


Normalisation, relations, the filesystem, ..? I'm more curious about your view of a 'data lake' which is met by adding a filesystem to a db?


> Normalisation, relations,

What does that mean?


OK, thanks, when E2EE ?


> Iceberg and Delta Lake, on the other hand, weren’t optimized for our update-heavy workload when we considered them in 2022

"when we considered them in 2022" is significant here because both Iceberg and Delta Lake have made rapid progress since then. I talk to a lot of companies making this decision and the consensus is swinging towards Iceberg. If they're already heavy Databricks users, then Delta is the obvious choice.

For anyone that missed it, Databricks acquired Tabular[0] (which was founded by the creators of Iceberg). The public facing story is that both projects will continue independently and I really hope that's true.

Shameless plug: this is the same infrastructure we're using at Definite[1] and we're betting a lot of companies want a setup like this, but can't afford to build it themselves. It's radically cheaper than the standard Snowflake + Fivetran + Looker stack and works day one. A lot of companies just want dashboards and it's pretty ridiculous the hoops you need to jump through to get them running.

We use iceberg for storage, duckdb as a query engine, a few open source projects for ETL and built a frontend to manage it all and create dashboards.

0 - https://www.definite.app/blog/databricks-tabular-acquisition

1 - https://www.youtube.com/watch?v=7FAJLc3k2Fo


[flagged]


what do you dislike specifically? They've invested a lot of energy into it when last I checked


They advertise markdown support, the ability to export markdown, and the ability to import markdown...

However, what they don't say, is that the export and import format aren't compatible and are different subsets of markdown with different features.

If I export a notion page as markdown, then re-import that same markdown document back into notion, I get something wildly different.

All I want is to not use the notion editor (which lags and sometimes crashes my browser), and to instead use my local text editor which has served me well for everything else.

Failing at that, I want to edit plain text, like I can in github comments or wikipedia pages. Like, the fact that if I write 'foo`', and then go back and edit a backtick in before the word 'foo', it doesn't code-format it, but if I do it in the opposite order and type '`foo`', it code-formats it, makes it very clear I'm not editing text, and there is weird hidden state, which is annoying to reason about.

Just let me edit something like markdown directly, with an optional preview window somewhere, and that would _also_ be vastly better than the mess they have.


That's probably the problem. It's over engineered and just does things I didn't want or need it to do. I just want to type words and paste things in without having a bunch of bullshit happen.


Thank you for the clarification! It's great to hear more about the efficient data management practices at Notion. Your team's innovative use of the data lake to streamline the reindexing process while ensuring user data privacy is impressive. Keep up the excellent work!



