The first approach (the 'It’s "obviously" the only way to go' one) is called an adjacency list.
The second (the 'vastly simpler method') i don't recall seeing before. It has some fairly obvious deficiencies, but it is clearly enough in some cases.
The third ('namespacing') is called a materialized path.
I love this comment. One of the worst feelings at my old job was putting a ton of effort into describing a problem, only for somebody else to recognize it as a known, studied thing that already has a name.
It’s really hard, in my experience, to come across the existing name for something while you’re still figuring out all the angles of a problem for yourself.
Of course, you still have to do all the work of describing the problem. But if you iterate through your problem and solutions within a conversation you can get to what you need (a conventionally understood term) faster than writing a fully-fledged blog post.
I copied the context of the article into ChatGPT (v4) and asked it "What are the names of these methods that are conventionally understood in the wider industry?"
It suggested terminology like: "adjacency list", "materialized path" and "nested set".
When I work on a problem that’s new to me, I ask around, explaining the problem, checking if someone recognizes the domain, if it is known.
When someone explains to me what I’m working on is a solved problem, I take great joy from it since I already understand the issue some, I can criticize or take great energy from learning something for sure.
I believe how to feel about it is a choice btw. As movie says “Always Look on the Bright Side of Life”.
That's generally how I do it as well. One of the positives about remote work is I can go on my general programming slack channel at my work and write out my problem and get better visibility than going around my local area in my office.
Also I end up getting to meet programmers from different projects/departments that I would have never met if I was only working in office.
> It’s really hard, in my experience, to come across the existing name for something while you’re still figuring out all the angles of a problem for yourself.
Absolutely! Discoverability is a huge challenge. There is so much amazing work that's been done in the past that is virtually impossible to find out about.
I only know this stuff because I was working with databases fifteen years ago and an older colleague encouraged to me to learn more about it.
I don't know why you should feel bad about that. If anything, you should feel good that you independently recognized a problem so significant that it was given a name.
I assume most problems I face are solved, and that I just don't know shit. So I scramble to research the solutions, and I'm shocked when what seem to be widely-encountered problems are NOT, in fact solved.
One recent example I encountered was API definition. I was tasked to define a new API for my company's products, and celebrated when I discovered OpenAPI. I was a bit surprised to find that only the very latest version of the standard (3.1) was competent enough to be useful.
And despite 3.1 being ratified for years, today there are still no usable code-generation tools that support it. I wasted weeks studying and trying to fix various tools, after studying reams of redundant and conflicting documentation in different repositories... thinking it was my problem. No. It's just a hideously broken mess.
Today I'm dealing with the same thing in SwiftUI... and again have reached the conclusion that the programming paradigm it pushes has not been thought through. Its rushed and immature state shows in its kitchen sink full of overlapping and rapidly-deprecated approaches to problems that were solved in traditional application structures a decade and a half ago.
Just typing that out, I wonder if I failed to learn from my first example and wasted too much time on the second. But if you're a thorough person, you have to satisfy yourself that you've been diligent in trying to inform yourself of best practices.
I think we’ve all had that one friend/classmate in c plus plus 101 who always “cried wolf” saying there’s a problem with the compiler making the instructor proclaim that 99.9999% of the time if your code fails to run it is not a bug in the compiler. We’ve been taught/trained to find the fault in our own work.
Which, to be fair, is really the correct thing to teach since in the vast majority of cases that's where the fault is!
On the other hand, I find a lot of bugs. Not in things as refined as compilers, usually, but if I had a dollar for every time I've heard, "Well, nobody else has reported this before," I could probably buy a tank of gas. And that's saying something today.
Latest example: Amazon's Web site rejects all phone numbers entered on an Apple Silicon Mac. I was setting up a new address, and it said "remove invalid characters from phone number field:" https://i.imgur.com/mjwiCqc.png
This happened in both Safari and Firefox. Amazon support couldn't figure it out in 1.5 hours of chat. I went to an Intel Mac and entered the exact same number, no problem.
That’s concerning. I’d understand iPhone or iPad or such sandboxed devices but not this. If you are so inclined, can you see if you can reproduce this on a simple html page? Do you think the problem is client side or on their servers?
I like neocities.org for small experiments like this but you can use anything you’d like.
Thinking it through, I assume it's on the client side. Maybe something is wrong with the JavaScript that's doing the validation... but do Safari and Firefox use the same engine?
I don't know what else could vary based on platform. One thing I can't control for is browser and OS versions, because the older Mac I have (the one that works in this case) can't be updated to the latest.
It must be sending garbage characters. But the vast, vast majority of sites work. It's really mystifying.
As far as other sites are concerned... I just remembered one that's not quite as clear-cut. Zoro.com complains that my credentials are wrong, but if I reset my password it complains that I'm trying to use my current password... so it's obviously not wrong. As usual, they just threw up their hands... and sent me a 10% coupon... which of course I can't use because their site is broken. Just tried it again, and it's still unusable. Didn't try the older computer yet.
On that note, the Online Encyclopedia of Integer Sequences [0] is a fantastic resource for certain algorithm-stuff where you want to figure out if mathematicians have already named something.
Every generation wants to reinvent the wheel, because otherwise they'd feel helpless and it's way more fun. Further than that I think that every generation wants to make their own mistakes, regardless of how well documented they were.
Overall this is healthy, a new set of eyes to an old problem can yield new solutions. In some cases though, it's a facepalm situation. Like the whole Javascript world of today.
That's certainly the reason I wrote my own static site generator for my website.
I knew I was reinventing the wheel but I also knew the learning results would be immense. I usually work in real-time systems and it was great to have a project to learn how to do offline data-in data-out types of problems. The two types of problems can be very different.
Also it gave me a good excuse to finally actually do something with a functional language.
I'd never recommend my static site generator to the general public though.
You nailed it. We hire younger grads who just want to cram everything into a nosql document and not give any thought to data modeling. Of course, they do all the logic of displaying the tree in code, which is sad as modern relational databases along with some CTE’s can handle so many use cases for free and in an elegant fashion.
IMO it is because SQL isn't a well liked/popular language (eg. the nosql movement) and it is a lot easier to avoid these days with all the various DBs. Personally I wish Prolog had gained more traction back in the day, it would have made a great alternative.
I'm thinking performance - it used to be a big deal to optimise database performance. Now most databases can live in memory so even if you're doing full table scans all the time it won't be that noticeable, and by the time the database has enough data in it that performance becomes an issue every one who built it has left.
The focus on micro services to, probably has an impact - the back end can be changed if your database is having issues and the services stay the same.
Also databases are hard and a large chunk of developers only know javascript and html.
You don't see many DBA jobs any more, a lot of that is moving to the cloud so they just pay more money and throw CPU's at the problem rather than optimise the database.
Also agile/scrum (as its practiced anyway) doesn't really allow for a database design up front as it used to be done.
Prevalence of abstraction layers (ORMs), document stores, key-value stores, etc.?
Someone somewhere decided they didn't like SQL and engineered a way to postpone having to know SQL until their ORM performance dragged their systems to a halt.
2. A move away from databases as store of all knowledge to databases as implementation detail of applications, which IMHO was a good thing, but tended to make them smaller, simpler and more hidden away
CREATE TABLE test (path ltree);
INSERT INTO test VALUES ('Top');
INSERT INTO test VALUES ('Top.Science');
INSERT INTO test VALUES ('Top.Science.Astronomy');
And then a simple search with:
ltreetest=> SELECT path FROM test WHERE path <@ 'Top.Science';
path
------------------------------------
Top.Science
Top.Science.Astronomy
Caveat programmer: one of the quirks of ltree is that intermediate paths - which would be parent nodes, when drawn as a tree - are not required to exist. In the example above, deleting the Top.Science record will not prune the Top.Science.Astronomy record.
So whilst the labels in ltree values do imply a logical tree, by describing materialized paths through it, they do not enforce the presence of records matching all the implied parent nodes.
Depending on your application this may be exactly what you wanted, or exactly what you didn’t want. In the latter case, you’re gonna need to put something around it to maintain integrity.
In case you ever work with an Anatomist, and they are visibly freaking out at you destroying everything it is because they order parts anterior to posterior and it is sooooooo obvious they wont ever feel the need to mention it ...
As far as I remember it isn’t that simple. Moving operations now require extra queries and reshuffling of order. Doable, just far less elegant for a very common use case.
My suggestion would be to write your tree using an approach that is friendly to whatever constraints you need to impose or whatever shuffling operations you'll need to perform, and then use a view or computed column to materialize your paths as an ltree. Then you can use the ltree and accompanying index type for efficient path queries, but you still have the power to manage the sorts of details you mention when you need them.
Put another way, ltrees are a denormalized representation; we denormalize to optimize lookups. Other approaches that use foreign keys and such are more normalized representations; we normalize to give ourselves the power to enforce constraints and to optimize update operations. But as long as we can trade some storage and write amplification for performance on lookups, we aren't tied to one approach or the other, we can mix and match.
The problem here is that most often the value in the structure is the hierarchy of the data and NOT the displayed tree. You will want to do stuff with the data like traverse it, show relationships, reorder it, etc.
IMO, it's a dangerous thing to hold VISUAL information in a data structure in a database. Seems shortsighted.
It's kind of a strange response when the author already explicitly said "people think they always need to formally encode parent-child relationships but sometimes they don't, sometimes nested display is all that's needed".
What's the response here? "No, that can't possibly be true"?
You Aren't Gonna Need It is a celebrated building heuristic for a reason. You Should Always Assume You Need It simply isn't true.
I worked with some YAGNI zealots and this was a routine experience. I adopted the philosophy of IYHTAYPGNI, "If You Have To Ask, You're Probably Gonna Need It".
There's YAGNI and then there's Suitable for Purpose. If you display a thing like a tree, then people are going to expect to do tree-like things to it, like collapse views and reorder items. You might as well just git gud at tree operations and save yourself the hassle once and for all of chasing down edge cases every time a new feature request comes in for things that should have been natural extensions but you omitted "cuz YAGNI".
Just because something has a tree structure, it needs to be able to support all tree operations? Why is that such a compelling argument to you? What if it just doesn't?
The thing is, once you learn how to do tree operations, they aren't that hard to do.
I'm not talking about learning how to balance a binary tree in the most efficient way possible. I'm talking about the basic ops of breadth- and depth-first traversals of arbitrary trees.
I feel like we live in an era of software development that took a stupid meme about "inverting a binary tree" as a job interview question and--not knowing anything about trees--turned it into received wisdom that "trees are hard".
But they just aren't. And yet so many people live in fear of them and go to great lengths to avoid having to learn how to use them.
It’s more like: if something has a tree structure, and the project scope is not 100% finalized, then more likely then not tree operations are going to eventually be requested. The cost savings from the hack is so low versus a standard tree representation that it’s unlikely to be worth the risk this occurs
The cost savings is low but the risk is high? How do you figure? It sounds like both options are easy (one is just even easier) so there's not much risk if we have to switch.
The irony is that they're still using parent IDs in their data. Instead of storing it in a dedicated column with an optimized data type, it's being prepended to the data string. It might not be a number, and it might not be in an ID column, but it remains an identifier that points to another (expected) value. The change of format doesn't disqualify it from being a parent ID.
You should be able to easily reconstruct the parent/child relationships from the OP's order/indent encoding scheme though (at least if you ensured that only "valid" indents are stored, no child without a parent, etc)
So I think the easiest way would be to store as order/indent first and later migrate to parent/child when you implement features that need it.
The only change I'd do is to define "indent" more abstractly as the item's depth in the tree and not literally the number of spaces you want to render. That makes it a lot easier to spot invalid data, makes later migration easier - and also gives you more UI flexibility, because it allows you to change the rendering method or make it configurable per user (nested <ul>'s/<ol>'s, tabs, 8 spaces, 4 spaces, 1 space, etc)
And your key has consistent path separators like a/b/c
Then looking up parents and children is incredibly easy. At worst it's a linear scan of the array, but if it's in order you can even just look at previous items in the list until you're at the parent.
I have recently been faced with a challenge where I need to arrange the top level hierarchy in descending order while simultaneously arranging the second level hierarchy in ascending order with using such structure.
I started a company that dealt with a lot of tree like data. It is possible to transform your tree structure into an indented list in O(n) time. This used to be one of our interview questions at the time. There are a number of ways to store your data in various SQL databases that allow you to quickly get and render segments of the tree as well without recursive queries.
Once you understand those concepts, then storing your data correctly as trees has a ton of benefits over indenting like this.
What people are saying in this comment section is that you're probably going to need it. You might not need it now, but the PM of today is a short-sighted person and the future always gets here.
Sure, but if your current solution is 20 lines of code it's trivial to refactor it later if you need it.
Of course the average dev won't do that, they'll just add a hacky workaround as they always do and end up with a buggy horrible mess but thats irrelevant. They'll do that anyway if they're that type of dev.
And honestly I despise this "managers are short sighted" excuse. We're the developers, we do the work. If I open a repo and see a horrible buggy mess with your name on it I'm judging you, I don't give a crap who your manager was.
"one way to get tree-structured data from a relational database with a SQL query is to write a recursive CTE (Common Table Expressions), which are just as fun as they sound."
I promise you that CTEs, even recursive ones, are not scary and that once you get the hang of them are actually fun.
I don't find CTEs particularly fun, copying and pasting the whole tower of CTEs into another SQL window to debug the one I'm interested in is decidedly not a form of entertainment I pursue.
You only need one recursive CTE typically and you can encode it into a view that you join against. It just sucks for performance when you have millions of nodes.
I found using recursive CTEs incredibly slow for assembling tree data from a normalized representation. It would make getting the results of a query at least d times slower for assembling the path of nodes d deep in the hierarchy. The upside was cheap tree edit operations but those were a lot less frequent than reads.
> I learned the hard way long ago: more often than not, people don’t actually want (or need) trees, they just need the appearance of them.
I've been noticing that this is one way in which HN and Reddit differ. On HN, a child comment is a nextSibling of the parent comment, with indent set to parent's indent plus one to simulate the appearance of trees. On Reddit (or at least old.reddit.com, don't know about the new site), a child comment is actually nested inside the parent comment.
That's definitely how I interpret it, as using the word nextSibling [0] in camel case is quite specific to DOM manipulation.
While "the way it looks" might be all that matters to most users, perhaps there are cases where semantic markup is critical (screen readers? browser extension development?).
I can't imagine it's stored this way in the backend. Every operation on the data would be a complicated mess of inferring the tree structure then translating back to the implied tree format.
Loop over next siblings and hide them one by one, until you hit an entry with an indentation level that is less than or equal to that of the current entry?
The idea behind the article is simple - use the structures fit for the problem.
The narrative, though, is imho faulty. You dont need CTE to retrieve the tree from the db - you can fetch a flat list and construct the tree locally, as you would most likely need to anyway for following manipulations.
The very same argument can be made about people who use rdbms to store the list - store it in a text file. Why pay the network latency cost?
Or alternatively, the proposed structure would not work with sufficiently big tree if we wanted to move branches around, changing their depth. That would impose linear cost.
State your intention at the beginning, instead of explaining three different examples that are later on invalidated in the conclusion section with "if you need a tree, use a tree". But adding this to the beginning of the article would be way less clickbaity.
I had a similar realization about OpenGL a bunch of years ago: I don't need to draw a world of hierachical 3D objects, I just need to draw a list of sorted triangles. That flipped a switch in my head and made a bunch of optimizations very easy.
True, simplicity goes a long way in post-2000 3d games. Even in games where there's complex entity hierarchies, it often has to be collapsed into a flat structure when added to render queue (for things like transparency sorting). "Flat lists of things" is also pretty much also the foundation of ECS/DOD.
Another way to make a fake tree is to store a JSON blob. If the data only has internal relationships, that can be easier than trying to keep the sort number unique and in order.
But what do you do if the child needs to refer to the parent, or the parent needs to know its children. An illusion of a tree is only useful insofar as you don't really care about it being a tree.
I structure things as trees because I want to be able to traverse them. When a user sees your tree and expects to be able to do tree operations on it, but discovers its just an illusion and they can do no such thing, then dissatisfaction sets in.
"Fake" is literally the first word of the title. If a fake version of something had all of the qualities of the real thing, then it wouldn't be fake. It would be real.
Also, it's mentioned at least three other times in the article before the conclusion. Just one example: "Do your items actually have to have a parent-child relationship, or do they just need to look like they do?" That's pretty clear.
The article describes fake trees thus Fake in the title, yes. The article also asks the question you quote, true.
But it also goes on with multiple examples that support their approach whilst neglecting storing data as real tree ("That sounds like a lot of work and potentially awkward-to-work-with data structure, especially if stored in a database" and "one way to get tree-structured data from a relational database with a SQL query is to write a recursive CTE (Common Table Expressions), which are just as fun as they sound"). You don't prove your point by using the most extremely examples of the "bad" approach.
The article is very vocal about shortcomings of storing trees as trees, but quiet about possible issues with the approach used.
It looks more like a emotional rant rather than an informative piece.
>But what do you do if the child needs to refer to the parent, or the parent needs to know its children.
In the namespacing example, it's trivial: the parent is the version of the name of the current node with one "namespace" less. E.g. /bar/boop/bleep's parent is /var/boop.
Similarly, the children of /bar are any nodes starting with "/bar/".
Funny, I'm using the same technique in my block editor[1]. Rather than using the complex table/tree-like data structure of QAbstractItemModel[2], I'm using QAbstractListModel[3] (flat list) with a property called indentLevel[4] for each delegate. It works out great. I still need to manage parent child relationships (it makes rendering changes easier using recursive function) so I keep variables for that, but the "fake tree" look, as I like to call it, simply uses the indentLevel property which is determined by the amount of whitespace/tabs at the start of a line in the underlying plaintext data.
This reminds me of the MPTT (Modified Pre-order Tree Traversal) method used in a lot of older SQL-based software, like phpBB (and no doubt still used today.) I won't attempt to explain exactly how it works, but it stores a tree structure in a way that makes many kinds of queries possible even when treating the table more or less like a flat list. It was very useful for being able to snag subtrees in relatively simple SQL queries.
Also, tangentially, I am reminded of the array representation of the binary heap structure. An array-based binary heap maintains (as far as I am aware) the same properties as a pointer-based binary heap, but the structure is entirely implicit based on index.
There's a lot of ways to skin a cat for sure. Reaching for pointer-based trees of objects is of course totally fine, but there's a lot of ways to normalize data, and if you have a decent idea of what you're looking for you can definitely save a lot of trouble.
This trick is particularly useful for rendering things that have somewhat of a hierarchy, but aren't implicitly hierarchal. Like a table of contents for a HTML page, targeting the various header (h1-h6) elements. They're not (generally) descendents of each other, and parsing them is simpler to just go as a list of items with a depth.
I used this for the table of contents on my personal site:
Although this post doesn't establish the premise sufficiently to comment on the data or its schema, I would venture that much of the time what you want is a GROUP BY query on columns of the table, rather than a column of pointers.
What distinguishes a Foo from a Bar? Are 1, 2, and 3 (or I, II, etc.) actual data values, or simply row numbers arising from a sort on something else?
If you're writing the back end for OmniOutliner, and you must use a relational database, then maybe you'll use tricks like these. Otherwise, I would see whether you can come up with a schema that better models your data.
I maintain an old cms that stores the id of the parent, left sibling, and right sibling.
Making any changes to the order or grouping is wrought with peril because you never know if everything is going to be updated properly. When it isn't, it's an ordeal to correct.
Having a parent id and sort order is straightforward and about as simple as it gets. I really don't see why you'd need to do it any other way outside of a thought experiment.
> They have a parent node, that really isn't describing a linked list.
Indeed - you're absolutely right.
Not sure if the comment I responded to was edited to add the parent node after my post or if I just overlooked it, but I should have quoted that I was just referring to the left and right siblings (i.e. two linked lists):
> an old cms that stores the id of the ... left sibling, and right sibling.
Tangentially related, a quick and dirty habit I've picked up over the years when explaining a taxonomy or other hierarchy is using my local filesystem.
A few mkdir's followed by the tree command. Paste it into email, slack, pastebin, etc and ship it off. Takes less than two seconds and gets your point across better than verbally explaining.
A more stupider solution if all you want to do with it is saving/loading it : serialize it however you want (e.g. a json) and save that as a string?
interesting note : using the dot naming is also how the note taking VsCode extension Dendron handles names hierarchy. The file foo.bar.md is considered a subnote of foo.md
Another underrated benefit is this is potentially a lot faster. A flat array, which you can use to store this, is a lot more cache friendly to a CPU than a data structure with a lot of pointers. (Although this mostly requires a language like C++ where you can control the memory layout)
I have tree logic in a c++/c/rust/javascript and have reused it several times for these UI tasks where we want a collapsible list. It's pretty easy to map most UI "tree" widgets to it as they mostly work the same.
I would be _very_, very cautious with this. The problem is precisely that the system presents itself in a way that implies more than it actually has going on. Users will naturally assume there is a real parent-child relationship in the underlying data model.
A soon as someone asks for <thing that requires the parent-child relationship that is implied to already be there>, then you have to admit that it's a smoke-and-mirrors trick and you can't actually achieve that with the data you have.
Avoid storing visual figments in your database, they'll come back to haunt you.
Nobody would store "Foo 1 a" as name of the deepest item, they would want to build the name dynamically based on the concatenation of the parents "Foo" and "1", in which case the system doesn't work at all.
Otherwise, most operations on the structure are impossible to perform (a simple rename in the middle of a branch).
The sorting is according to the "sort" column, not the "name" column (they are not explicit about this but it's clearly what they mean, and it solves your problem).
I think the benefit could mostly be for the implementation of the GUI to change the tree (that they do mention).
If you don't feel like using a hack, don't. (They make that clear, too.)
The second (the 'vastly simpler method') i don't recall seeing before. It has some fairly obvious deficiencies, but it is clearly enough in some cases.
The third ('namespacing') is called a materialized path.
And there is at least another way to represent trees - nested sets: https://www.ibase.ru/files/articles/programming/dbmstrees/sq...
All of these were well-trodden back in the days when people took relational databases seriously. For example, see: http://www.dbazine.com/oracle/or-articles/tropashko4/
It seems this is lost knowledge now.