Another way to look at it is the functional core, imperative shell pattern.
Wrapping up your dict in a value object (a dataclass, or whatever that is in your language) early on means you handle the ugly stuff first. Parse, don't validate. Resist the temptation of optional fields. Is there really anything you can do if the field is null? If not, don't make it optional. Let it crash early on. Clearly define your data.
If you have put your data in neat value objects, you know what is in them. You know the types. You know all required fields are there. You will be so much happier. No checking for null throughout the code, no checking for empty strings. You can just focus on the business logic.
Seriously so much suffering can be avoided by just following this pattern.
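A minimal sketch of what that boundary might look like in Python; the Repo type and the payload field names here are made up for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class Repo:
    # Value object: constructed once at the boundary, trusted everywhere else.
    name: str
    stars: int

def parse_repo(raw: dict) -> Repo:
    # Crashes here, at the edge, if the payload is malformed --
    # not three layers deep inside the business logic.
    return Repo(name=raw["name"], stars=int(raw["stargazers_count"]))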
I like to think about this as securing the perimeter. Inside, everything is typed, static analysis constrains what can happen, and I am never surprised as long as the code type checks. Outside, data is probably garbage. All the effort goes into locking down the interface. Pydantic is ok for this, although I find it too intrusive for my taste, and I think mixing arbitrary validity predicates with structural correctness is a mistake. Still, I’d much rather walk into a codebase that uses Pydantic than one that assumes its inputs are valid, because confidently writing business logic that can assume its inputs are correct is incredibly liberating.
The "loosey-goosey" approach to data in coding is one of my biggest pet peeves. Some people absolutely insist on making everything as dynamic as possible, and then wonder why we end up with a buggy mess. I always found it very natural to move as much as possible into the type system, because why wouldn't I want the machine to find all my inevitable mistakes for me?
Here’s an out-there take, but one I’ve held loosely for a long time and haven’t shed yet: dicts are not appropriate for what people mostly use them for, which is named access to member attributes.
dict is an implementation of a hash table. Hash tables are designed for O(1) lookup of items. As such, they are arrays much bigger than the number of items they store, to allow hashing items into integers and sidestepping collisions. They're meant to act like an index that contains many records, not a single record.
A single record is more like a tuple, except you want named access instead of title = movie[0], release_year = movie[1], etc. And Python had that, in NamedTuple, but it was kinda magical and no one used it (shoutout Raymond Hettinger).
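For illustration, the movie record above as a typing.NamedTuple (the field names are mine):

from typing import NamedTuple

class Movie(NamedTuple):
    title: str
    release_year: int

movie = Movie("Alien", 1979)
movie.title    # named access
movie[0]       # positional access still works too, for better or worse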
Granted, this rant is pretty much the meme with the guy explaining something to a brick wall, in that dicts are so firmly entrenched as the "record" type of choice in Python (but not so in other languages: struct, case class, etc., and there JSON doesn't just deserialize to a weak type; but I digress).
NamedTuples are great, but they let you do too much with the objects. You probably don't want users of your GitHubRepo class to be able to do things like `repo[1]` or `for foo in repo`. Dataclasses have more constrained semantics, so I reach for them by default. In my ideal world they would default to frozen=True, kw_only=True, slots=True, but even without those they're a big improvement.
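What that might look like, assuming Python 3.10+ for kw_only and slots (GitHubRepo is the hypothetical class from above):

from dataclasses import dataclass

@dataclass(frozen=True, kw_only=True, slots=True)
class GitHubRepo:
    name: str
    stars: int

repo = GitHubRepo(name="cpython", stars=50_000)
repo.name             # fine
# repo[1]             # TypeError: not subscriptable
# for x in repo: ...  # TypeError: not iterable
# repo.name = "x"     # FrozenInstanceError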
Dicts in Python are for when you have a thing and you aren't sure what the keys are. Dataclasses are for when you have a thing and you're sure what the keys (attributes) are. The trouble is when you have a thing and you're sort of sure, but not entirely sure, and some things are definitely there but not everything you might be thinking of.
I think most modern Python codebases are using dataclasses or something like Pydantic. I think dicts are mostly seen, like the author suggests, because something you hacked up to work quickly ends up turning into actual software and it's too much work to refactor the types.
dicts are used internally in the language to look up class and module attributes. They are optimized for this use case. How can it be wrong to use them that way when the very fabric of the language depends on it?
namedtuple is widely used in Python code, especially before the introduction of dataclasses.
A hash function will always be more expensive than a pointer lookup, especially considering a pointer lookup is still needed after the hash function runs.
No matter what you do, an array lookup will always be quicker than a hash lookup; in a lot of cases even a linear search will be quicker.
Structs in other languages are a pointer-plus-offset lookup, which to my knowledge is also true of Python classes using __slots__. There's no reason to use a dict if you know the contents of the data; use a dataclass with slots=True, purely because there's no hash function run on every lookup into the data structure.
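A quick sketch of how you might check this yourself with timeit; absolute numbers will vary by interpreter and version, and whether the gap matters is workload-dependent:

import timeit
from dataclasses import dataclass  # slots=True needs Python 3.10+

@dataclass(slots=True)
class Point:
    x: int
    y: int

d = {"x": 1, "y": 2}
p = Point(1, 2)

# Slotted attribute access is a fixed-offset load; the dict lookup
# has to hash the key first.
print("dict  :", timeit.timeit(lambda: d["x"]))
print("slots :", timeit.timeit(lambda: p.x))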
It’s not wrong to use dicts, it’s just bad practice when you could use something like a dataclass or pydantic model instead.
Dicts are useful for looking things up; if you have a bunch of objects that you need to access and modify, you should use a dict.
If you are using the dict as a container, like car = {"make": "honda", "color": "red"}, you should use a proper object instead: a class, dataclass, or Pydantic model, based on whether you need validation, type safety, etc. This drastically reduces bugs and code complexity, helps others reason about your code, gives you access to better tooling, etc.
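For instance, the car example as a Pydantic model (v2 API):

from pydantic import BaseModel

class Car(BaseModel):
    make: str
    color: str

car = Car.model_validate({"make": "honda", "color": "red"})
car.make                                  # typed attribute access
# Car.model_validate({"make": "honda"})   # raises ValidationError: color is missing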
Subclassing NamedTuple is very ergonomic, and given they're immutable unlike data classes I often reach for them by default. I still use Pydantic when I want custom validation or when it ties into another lib like FastAPI.
And yet we are overwhelmed by JavaScript nonsense... I get it: it's so easy to get up to speed with tiny snippets, but it quickly becomes a hot mess.
Yes, decades ago I was also fascinated by Python and its ease of doing stuff (the compiler doesn't complain that I missed something), but with time I grew fond of statically typed languages... they simply catch swaths of errors earlier...
I had to deal with it ~1-2 years ago, so for my fairly ancient self it feels "recent", though considering that there were JS frameworks popping up every couple of months and getting dropped a year later, my timeframe may be off ;)
> Dynamic languages demand self discipline, they teach you to respect runtime and think ahead of execution time.
Ah... yes... because static languages don't do that by forcing you to properly model everything. And as a bonus you can easily navigate between everything and not fear that you'll miss something while refactoring...
For better or for worse, Python doesn't do typing well. I don't disagree that I prefer well defined types, but if that is your desire then I think Python is perhaps not the correct choice of language.
I tried using it, but beartype quickly became a pain with having to decorate things manually. Then I found typeguard, which goes even further, and never looked back. Instead of manually decorating each individual function, an import hook can be activated that automatically decorates any function with type annotations. Massive QoL improvement. I have it set to only activate during testing, though, as I'm unsure of the overhead.
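If I have the typeguard API right, the import-hook setup is roughly this (the package name is a placeholder):

# conftest.py -- activate typeguard for the test run only,
# so the runtime checks never slow down production.
from typeguard import install_import_hook

# Must run before "myproject" is imported anywhere; every annotated
# function inside it then gets checked automatically.
install_import_hook("myproject")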
That's only because your list has different types. It's a badly formed API, and if you really need to support that use case then you can use maps and reflection to handle it.
But the same issue exists as other dynamic languages, how do you know what the type is of the item you are accessing?
If you know the array will be laid out exactly like that before you make the request, you can always create a custom parser to return a struct with those fields named what they actually are instead of arbitrary data.
The only valid way to parse that dynamically is to try and fail in a loop which is inefficient enough that you should stop using whatever API returns that monstrosity.
And if you got that JSON back in Python, how would you do anything with it? This API is essentially useless. You can deserialise it, sure, but then what?
Can someone educate me on why dicts are uncool for the explained reasons, while Clojure (which seems to be highly recommended on HN) seems to suffer the same issues when dealing with a map as a parameter (Ring requests etc.)?
I know how to deal with missing values or variability in maps, and so do a lot of people.. what am I missing here?
Dicts are great when the data is uniform and dynamic, like an address book mapping names to contact info. You never assume that a key must be in there. Lookups can always fail. That's normal for this kind of use-case.
When the data is not uniform (different keys point to differently-typed values), and not as dynamic (maybe your data model evolves over time, but certain functions always expect certain keys to be present), a dict is like a cancer. Sure, it's simple at first, but wait until the same dict gets passed around to a hundred different functions instead of properly-typed parameters. I just quit my tech job at a company that shall remain nameless, partially because the gigantic Ruby codebase I was working on had a highly advanced form of this cancer, and at that point it was impossible to remove. You were never sure if the dict you were supplying to some function had all the necessary keys for the function it would eventually invoke 50 layers down the call stack. But changing every single call-site would involve such a major refactor that everybody just kept defining their functions to accept these opaque mega-dicts. So many bugs resulted from this. That was far from the only problem with that codebase, but it was a major recurring theme.
This should be the top answer. It's not about using dicts in their primary use case, it's about abusing them as a catch all variadic parameter for quick prototyping and "future expansion"
I think the problem is that different data containers have completely different interfaces.
If getting a field of your object had the same syntax as getting a value from a dict, you could easily replace dicts with smarter, more rigid types at any point.
My dream is a language that has the containers share as much interface as possible so you can easily swap them out according to your needs without changing most of the code that refers to them. Like easily swap dict for BTreeMap or Redis.
I think the closest is Scala, but it fell out of favor before I had a chance to get to know it.
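In Python you can approximate this convergence by hand; a hypothetical sketch of a record type that still answers dict-style key access, so existing call sites keep working while you swap the raw dict out:

from dataclasses import dataclass

@dataclass
class Request:
    path: str
    method: str

    # Keep old dict-style call sites working during the migration.
    def __getitem__(self, key: str):
        return getattr(self, key)

req = Request(path="/users", method="GET")
req.path        # new, typed access
req["path"]     # legacy dict-style access still works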
Maps aren't nearly as problematic in Clojure because data is immutable by default, on top of a functional paradigm where your program is basically a big composition of functions and the language is built around using maps. In Python I largely agree with the author. In Clojure I love my maps.
Here is Rich Hickey with an extreme counter example although I would argue he's really demonstrating against getters and setters.
https://www.youtube.com/watch?v=aSEQfqNYNAc
In Clojure, maps don’t have either of the flaws highlighted in the article. They are neither opaque (they are self-describing, with namespaced keys) nor mutable.
As a result, they are very powerful and simple to use.
The issue is that the concrete types are implicit. Depending on the language, runtime, or type system, expressing the type in a “better” way might be very hard or un-ergonomic.
I think one really nice thing about Python is duck typing. Your interfaces are rarely asking for a dict as much as they’re asking for a dict-like. It’s pretty great how often you can worry about this kind of problem at the appropriate time (now, later, never) without much pain.
There’s useful ideas in this post but I’d be careful not to throw the baby out with the bath water. Dicts are right there. There’s dict literals and dict comprehensions. Reach for more specific dict-likes when it really matters.
Duck typing is so fragile… Once you have implementations that depend on your naming or property structure, you can’t update the model without breaking them all.
If you use a real type, you never have to worry about this.
If you use type checking, the breakage occurs when you introduce the change: the author of the change is the one who can figure out what it means if 'foo' is no longer being passed into this function.
If you're duck typing, you find this out in the best case when your unit tests exercise it, and in the worst case by a support call when that 1/1000 error handling path finally gets exercised in production.
Exactly… with strong typing, you can do the refactor automatically, because the IDE knows everywhere that symbol is used. (For codebases in your control—for third party users, you can indicate that something has been deprecated or renamed via a warning or other language feature)
Dicts can be a problem, but this particular example isn't that great, like in this diagram from the article:
External API <--dict--> Ser/De <--model--> Business Logic
Life's all great until "External API" adds a field that your model doesn't know about, it gets dropped when you deserialize it, and then when you send it back (or around somewhere else) it's missing a field.
There's config for this in Pydantic, but it's not the default, and isn't for most ser/de frameworks (TypeScript is a notable exception here).
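In Pydantic v2 that switch is the extra setting; the default ("ignore") silently drops unknown fields, while extra="allow" keeps them on the instance so they survive the round trip:

from pydantic import BaseModel, ConfigDict

class Repo(BaseModel):
    model_config = ConfigDict(extra="allow")  # default is extra="ignore"
    name: str

repo = Repo.model_validate({"name": "cpython", "new_api_field": 42})
print(repo.model_dump())   # {'name': 'cpython', 'new_api_field': 42}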
Unfortunately some APIs assume that they will get all the fields as part of the update. If a field doesn't exist in the input they receive, they drop the original value during the update.
I don't deal with external APIs often, but this is a development nightmare. You can't just magically let data flow through your system without knowing about it, because this is not how programming works. Your API has a contract and your code is written to support that contract; if the contract changes, it should either be a very consciously decided breaking change that is versioned somehow, or an unversioned non-breaking change. Apparently whatever data is added like this is completely meaningless to your program, so why do you need to be in charge of passing it back to the API?
Changing your API and assuming everything just keeps working is a nonsense cowboy attitude to software compatibility, even if some frameworks bend over backwards to support it through magic that's hidden from the developer. Furthermore, many programming languages are simply incapable of doing this, and this approach to APIs is immediately restricting those languages from use.
Finally, transforming objects to an internal domain model is really the cornerstone of a lot of recent well-thought-out programming discipline, and this API design is throwing that in the garbage. It's explicitly asking you to mess up your service architecture, spreading bad architecture like a virus to all systems that interact with the API.
In typescript using plain JS objects is very straightforward.
Of course you have to validate the schema at your system boundaries.
But you'll have to do this either way.
So: if this works very well in TS, it can't be dicts themselves; it must be the way they integrate into, and are handled in, Python.
This leads me to the conclusion that arguments presented in the article might be the wrong ones.
(But I still think the conclusion the article arrives at is okay. I just don't think the article makes a strong case about whether to prefer dataclasses or typed dicts.)
TypedDicts solve the linting problem, but refactoring tools haven't caught up (unlike e.g. ForwardRef type annotations, which are strings but can be transformed alongside type literals).
TypedDicts "aren't real" in the sense that they're a compile-time feature, so you're getting typing without any deserialization cost beyond the original JSON. Dataclasses and Pydantic models are slow to construct, so that's not nothing.
Pydantic v1 was slow enough for them to write a lot of the core logic in Rust for Pydantic v2, and for the previous sloth to have been an argument people launched against it if you look back at threads on here and Reddit comparing it to other libraries.
Pydantic's JSON parsing is faster than the built-in module, on par with orjson, but creating model instances and run-time type checking net out to be much slower. I linked msgspec's benchmarks in the previous post.
I generally support this. When dealing with API endpoints especially, I like to wrap them in a class that ends up looking something like this. I also like having nested data structures as their own class sometimes too. Depends on complexity and need, of course.
class GetThingResult
  def initialize(json)
    @json = json
  end

  # single thing
  def thing_id
    @json.dig('wrapper', 'metadata', 'id')
  end

  # multiple things
  def history
    @json['history'].map { |h| ThingHistory.new(h) }
  end

  # ... two dozen more things
end
Python has made its rise as an antithesis to Java thinking. Classes used to be seen by some in the community as an anti-pattern. [0]
The coding style used to focus on "Pythonic-ness," which meant using Python's expressiveness to write code in such a way that type information could be inferred without explicitly stating the type.
Most developers will carry their previous language paradigms into their new ones. But if types, DDD (Domain-Driven Design), and classes are what you're looking for, then Python isn't the best fit. Python doesn't have compiler features that work well with those paradigms, such as dead code removal/tree shaking.
However, starting out with dictionaries and then moving over to dataclasses is a great strategy.[1]
As a small note, it's kind of ironic that the statically typed language Go took inferred typing with their := operator, while there is now a movement in Python to write foo: str = "bar".
This has merit in some cases but let me try to make a counterpoint.
You lose the algebra of dicts, and it's a rich algebra to lose, since in Python it's not just all the basic obvious stuff but also powerful things like dict comprehensions and ordering guarantees (3.7+ only).
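A sketch of some of that algebra, all built in for dicts and with no direct dataclass equivalent (the merge operator needs Python 3.9+):

a = {"x": 1, "y": 2}
b = {"y": 3}

merged = a | b                                   # {'x': 1, 'y': 3}
upper = {k.upper(): v for k, v in a.items()}     # dict comprehension
subset = {k: a[k] for k in a.keys() & {"x"}}     # key-set algebra
print(list(a))                                   # ['x', 'y'] -- insertion order guaranteed (3.7+)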
You tightly couple to a definition - in the simple GitHubRepository example this is unlikely to be problematic. In the real world, coupling like this[1] to objects trying to capture domain data with dynamic structures is regularly the stuff of nightmares.
The over-arching problem with the approach given is that it puts code above data. You take what could be a schema, inert data about inert data, and instead use code. But it might also be an interesting case to consider as a slippery slope - if you can put code concerns above data concerns then maybe soon you will see cases where code concerns rank higher than the users of your software?
[1] - by coupling like this I mean the “parse don’t validate” school of thought which says as soon as you get a blob of data from an external source, be it a file, a database or in this case a remote service, you immediately tie yourself to a rocket ship whose journey can see you explosively grow the number of types to accurately capture the information needed for every use case of the data. You could move this parsing operation to be local to the use case of the data (much better) rather than have it here at the entry point of the data to the system but often times (although not always) we can arrive at a simpler solution if we are clever enough to express it in a style that can easily be understood by a newbie to programming. That often means relying on the common algebra of core types rather than introducing your own types.
You also make a nightmare of dynamically adding middleware, which could piggyback on a generic dict but has no meaningful way to insert itself into your type maze.
Python dataclasses are a good start for internal use. They are just a bit of a pain to serialize/deserialize natively. When it comes to that, I prefer to use Pydantic objects and have all the goodies, at the cost of some complexity.
I'm a big fan of using Protobuf for the third-party API validation task. After some slightly finicky initial schema definition (helped by things like json-to-proto.github.io), I can be sure the data I'm consuming from an external API is strongly typed, and the functions included in Protobuf which convert JSON to a Proto message instance blow up by default if there's an unexpected field in the API data being consumed.
I use it to parse and validate incoming webhook data in my Python AWS Lambda functions, then re-use the protobuf types when I later ship the webhook data to our Flutter-based frontend. Adding extensions to the protobuf fields gives me a nice, structured way to add flags and metadata to different fields in the webhook message. For example, I can add table and column names to the protobuf message fields and have them automatically be populated from the DB with some simple helper functions. It avoids me needing to write many lines of repetitive field-mapping code.
Knew it was Python before the first line of code. Python lacks ceremony-free data syntax; that's why people use dicts. Dataclasses have to be named, initialized, and imported, which is tedious. Much easier to just foo({name, age}) and let the typings match, but Python doesn't have that. The lack of "POPOs" is a design mistake.
Yup! I find Elixir makes it really intuitive to know when to represent a collection as a map and when to use a list of tuples. And it's easy to transform between the two when needed.
Yes, usually my APIs in Elixir receive their arguments as a well-typed map, not stringly keyed, and transform them to structs which the core business logic expects.
It's a bit of an odd article because the second part kind of shows why dicts aren't a problem. You basically just need to apply the most old school of OO doctrines: "recipients of messages are responsible for how they interpret them", and that's exactly what the author advocates when he talks about treating dict data akin to data over the wire, which is correct.
If you're programming correctly and take encapsulation seriously, then whatever shape incoming data in a dict has isn't something you should take issue with; you just need to check whether what you care about is in it (or not) and handle that within your own context appropriately.
Rich Hickey once gave a talk about something like this talking about maps in Clojure and I think he made the analogy of the DHL truck stopping at your door. You don't care what every package in the truck is, you just care if your package is in there. If some other data changes, which data always does, that's not your concern, you should be decoupled from it. It's just equivalent to how we program networked applications. There are no global semantics or guarantees on the state of data, there can't be because the world isn't in sync or static, there is no global state. There's actually another Hickey-ism along the lines of "program on the inside the same way you program on the outside". Dicts are cool, just make sure that you're always responsible for what you do with one.
I assume you're basically referring to this quote from the article?
"Ignore fields coming from the API if you don’t need them. Keep only those that you use."
IMO this addresses only one part of the problem, namely "sanitize your inputs".
But if you follow this, and therefore end up with a dict whose keys are known and always the same, using something "struct-like" (dataclasses, attrs, pydantic, ...) is just SO much more ergonomic :)
> Ignore fields coming from the API if you don’t need them. Keep only those that you use.
This is great if you know what you need from the start. If you only find out what you need after passing your data through multiple layers and modules of your system then you need to backtrack through all your code to the place of creation.
If you have immutable data structures, then you have to backtrack through every place where a new structure is created from a previous one, so you can pass your additional data through all of them.
So if your data travels through, let's say, 3 immutable types to reach the place you are working on, then even if you know exactly where the new field you need originates, you have to alter 3 types and 3 places where data is read from one type and crammed into another.
If you have a dict that you fill with all you got from the api there's zero work involved with getting the new piece of information that you thought you didn't need but you actually do. It's just there.
> convert [dicts] immediately to data structures providing semantics [...] You can simplify your work by employing a library that makes “better classes” for you
Python seems to have many different kinds of "better classes" - the article mentions `dataclass` and `TypedDict`, and AFAIK there are also two different kinds of named tuple (`collections.namedtuple` and `typing.NamedTuple`).
What are the advantages of these "better classes" over traditional classes? How would you choose which of the four (or more?) kinds to use?
To me, the proliferation of "better classes" implies there's a problem with Python's built-in classes - but what's wrong? Are they just too flexible and/or too verbose? Or actually deficient in some way?
People enjoy the flexibility and many Python systems rely on duck-typing via dicts, etc.
So people are trying to force Python to be something it isn't, in adherence to their ideology, but it fails to gain consensus because there's a sizable cohort that use Python precisely because it isn't those things.
So we get repeated implementations, from each ideologically motivated group.
Hard disagree on most of this. The immutability dogma for one (changing data is "the worst felony you can commit to your data"). Computing IS manipulation and transformation of data. The contortions people go through to try and sidestep that seem delusional.
Plus all this 1995-era OOP and domain-driven-design crap, "business logic" and data layers and all this other architectural rigidity and usually-needless complexity, layers of boilerplate (and then tools to automate the generation of that), etc.
If your function takes a dict, and is called from many different places, document the dict format in the function comment. Or yes, create a dataclass if it saves more trouble than its additional boilerplate and code and maintenance causes. But take it case by case and aim for simplicity. Most of the time I call out to an API in python, I process its JSON/dict response right after the call, using maybe 10% of the data returned. That's so much cleaner and simpler than writing a whole Data Object Layer, to be used by my API Interface Layer, to talk to my Business Logic layer, etc.
I work with people who are ambivalent about this and believe using random dicts in a variety of places is a valid way to write Python code.
For these kinds of people, no amount of rational evidence or argument is going to convince them this is bad. They practically make an identity out of eschewing anything that seems too orderly or too designed.
(Luckily, at work, most of us on our team like `Pydantic` and also (some of us more than others) type-checking, so these people are dragged along)
In bioinformatics, one of our main dataflow platforms, Nextflow, is built with unnamed tuples in mind. Implementing the ability to conveniently pass data with HashMaps instead of unnamed tuples was a huge boost to usability for me.
i really want to go on a rant about the general state and historical choices regarding data formats and data structures in bioinformatics, plus all the wheel reinvention.
but i’m also trying to move on and do things differently today.
let’s just say the situation is displeasing and leave it at that.
I largely moved away from dictionaries when switching to swift. Usually we only use them now when going from legacy code.
For the example in the article with JSON, Swift has the Codable protocol, which cleaned all my client code for our back end from the old NSJSONSerialization code.
I don't agree with the premise, but I have to admit that the rest of the article is well written in listing the different ways to give types to dicts when needed.
If you want an example of a language where the exact opposite advice is taken at all times (with all the pitfalls described in this blog post), give Clojure a whirl.
This isn't arguing against them in general, but against the unfortunate Javascript-esque abandonment of specified semantics.
In particular, whenever anyone thinks that "deep clone vs shallow clone" is a meaningful distinction, that means their types are utterly void of meaning.