I live in the Python bubble, so I hadn't realized until this post how few people knew about this.
This post is massively popular despite covering a feature we've had since Python 3.6, released in 2016, that was posted on HN at the time and is featured in most popular tutorials.
A good reminder that most of the world doesn't revolve around my favorite language. And that information doesn't spread as fast as you'd think.
> a feature we've had since Python 3.6, released in 2016
No, you had that feature in one implementation. Now it's in the language specification. That's vastly different, because only now can you rely on it without fearing it could go away with the next release.
First, CPython is 99% of Python deployments. So much so that if a script works on it but not on another implementation, people often consider the latter broken.
As this post proves that most people don't even know about this feature, you can be pretty sure the vast majority of people don't know about pypy, micropython, etc.
Secondly, even if you want to nitpick, Python 3.7 made it official more than a year ago. We are currently in the 3.9 alpha.
In fact, 3.7 didn't touch the implementation; it merely declared "yep, good idea, let's keep it that way".
There's also the indication that this is official, and not just an implementation quirk. I still used OrderedDict in 3.6 because there was no guarantee 3.7 (or later) wouldn't silently break ordering on the standard dict. At least now it's unlikely to change because it will be considered a breaking change.
BDFL declared Python dict to be ordered in 2017 — It can't be more official than that for Python https://mail.python.org/pipermail/python-dev/2017-December/1...
i.e., Python 3.7 (whatever the implementation, it must keep insertion order). CPython has kept the order since Python 3.6, PyPy even before that.
I knew about it, because everyone was talking about it.
The origin of the actual information is irrelevant. It was basically interesting gossip for a few weeks.
There's a great talk by David Beazley[0], whose title I don't remember right now, where he bashes all previous Python versions except the latest at the time[1], which was 3.6. And he says "since I'm not a core developer, I can tell you this: rely on dicts being ordered so much that eventually they'll make it official". Guess he "won" in the end ;)
[0] Well, he usually gives great talks anyway
[1] Alright, he literally started many talks doing that
Raymond Hettinger does or did a few talks on dictionaries too. He covers the history of dictionaries in Python and how they evolved. It was mentioned in the article's comment section.
I have lived in Python for ~80% of my career (~50% currently), although a large chunk of that was writing 2.7 code compatible with 2.6 (thanks, RHEL). For the last year and a half I have written 3.6+ exclusively, and I didn't realize that regular dicts were now ordered.
I don't use dicts if I require ordering, but if I did I would probably still reach for collections.OrderedDict...
I wonder what the implications are of dict ordering with the "new" (new for me...) function args and **kwargs... from the hip, it sounds pretty handy.
The implications are that now you get to see the order in which keyword args were specified by the caller. And in classes, you get to see the order in which attributes were defined. Both can come in very handy.
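For instance, a minimal sketch of the **kwargs side of this (the ordering guarantee comes from PEP 468, Python 3.6+):

def show(**kwargs):
    # kwargs preserves the order in which the caller passed the arguments
    for name, value in kwargs.items():
        print(name, value)

show(b=2, a=1, c=3)
# b 2
# a 1
# c 3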
This is awesome, because the ordered map is the best data structure out there for easy & predictable programming, possibly only barring the array.
There are so many cases where it's a benefit for map entries to retain order (and none where it's a problem). PHP really got this one right (and immediately messed it up by mixing ordered maps with arrays into a big soup, but hey, PHP). And so did JS, sorta-kinda-by-accident.[1]
When I went from PHP to Ruby back in the day, Hashes not being ordered definitely was my biggest gotcha. Everything about Ruby (except the deployment) was nicer, but the hashes... I've spent serious amounts of extra thinking just to make code work that would've worked out of the box in PHP.
Yay for ordered maps! Every language should have them, they're the best! There's just so much ergonomics packed in such a simple API.
[1] JS objects are ordered maps, iff the keys are strings that cannot be parsed as an integer (really) (yeah, it's nuts) (but still better than unordered maps!!)
So currently (now that there are Symbols in JS), the order is: array-indexed properties first (integers), then string-key properties, in insertion order, then Symbol-keyed properties in insertion order.
The reason for the funny behavior in treating integer keys differently is that property keys are always treated as strings, so obj["3"] and obj[3] can't be distinguished, and arrays are also ordinary objects, so setting obj[3] was made to do the same thing on any object rather than special-casing arrays and non-array plain objects...
Today, JavaScript is a small, reasonably elegant scripting language with reasonable semantics, buried in a medium-sized, reasonably expressive scripting language with reasonable but different semantics, added 20 years later. The kitchen-sink disease is far enough along that it'll probably never recover the appeal it once had as a beginners' language. It still has the advantage of backwards compatibility, and it's become more practical as a compilation target.
> The reason for the funny behavior in treating integer keys differently is that property keys are always treated as strings, so obj["3"] and obj[3] can't be distinguished, and arrays are also ordinary objects, so setting obj[3] was made to do the same thing on any object rather than special-casing arrays and non-array plain objects...
Note that special-casing arrays is precisely what implementations used to do (i.e., for non-Array objects, the order in most VMs was simply "properties in insertion order").
V8 was the first JS VM to stop special-casing Array (and this was done before Chrome went public, and was the behaviour in the first Chrome beta). It turned out that virtually no websites relied on "array index" properties appearing in enumeration order (there was some breakage, but it was relatively few and far between, and believed to be worthwhile for the gains in cache-hits when accessing properties), and this also allowed the more compact array representations to be used for them.
I forgot about that when writing my comment but yes, V8 made that change along with "shadow classes" or whatever the object specialization stuff was called. Heady days for JS performance.
In retrospect if something was going to be standardized I'd have preferred it to be the older behavior which was simpler to explain, but so it goes.
SpiderMonkey actually had inline caches before V8 shipped if I'm not mistaken; the motivating factor for the behaviour change in V8 was just to improve the cache-hit ratio.
In any other language, the result of similar code would be "zero, one, two". You'd need to `ksort` this thing to get it to behave like a normal bog-standard boring array.
Everybody's writing PHP code pretending that it has arrays when really, it does not. The gotchas are way bigger than the gotchas in Python <3.7's unordered maps. It's a mess.
Why wouldn't you just push/append to the end of the array with '$arr[]'? There is no need for explicit keys in PHP. The gotcha is writing code in PHP pretending you are using a different language.
I literally just learned "bog-standard" 10 minutes ago and now I see it here. And no, don't Baader-Meinhof me, I would have noticed this weird term in random text in the past.
The order inserted makes the most sense in most cases. Hand-crafting integer-based array indexes out of order is rare. And as you say, a simple ksort call guarantees sort order.
ksort($arr);
echo join(", ", $arr); // separator comes first; the reversed legacy argument order was removed in PHP 8
// one, two, three
The PHP documentation refers to them as "associative arrays", which is technically correct, but I agree I'm not sure how they could have conflated these concepts so badly.
I believe the "associative array" terminology came from Perl. Though Perl's associative arrays don't have any guaranteed order. And Perl also has normal arrays.
>You start with an _empty_ array, then (somehow) set the third element of that array? That makes very little sense.
Well, there's no reason a language could not initialize an empty array to some initial capacity, either on declaration or when you add an indexed element.
PHP doesn't appear to be initializing to any capacity. So that after:
$arr[2] = "two"
I would expect:
count($arr) === 1
And honestly, sure. A language could be designed to pad the array with empty values if you specify an index greater than the upper bound (or less than the lower bound). I believe a sibling comment mentioned that Matlab does something like that.
But is that really better behavior or just different behavior? I don't really see an important difference. My point was that the criticism was fairly weak. PHP, indeed, has a very useful abstraction in its concept of "array". Most of the hate should be directed at the name, not the construct.
In that situation, Matlab allocates an array of length 3, fills it with “empty” values (depending on the type), and then sets the third element to the value. That’s what I’d have expected to happen here too...
In practice, it's almost never an issue. The standard library has push/pop/shift/pad functions available.
For example, if you had numeric indices and wanted to guarantee that index order equals insertion order, you'd create an array using $my_array = array_pad([], 20, null); If it's an array that you didn't create, ksort(...) will order the array for you as expected.
>There are so many cases where it's a benefit for map entries to retain order (and none where it's a problem)
Don't know if this could actually be the case in practice, but theoretically the ordering could allow for a timing attack to glean some bit of information when performing a linear scan of the map (size of the map, relative location of the data, etc).
Just a contrarian thought given the definitive statement of "none where it's a problem". I'm generally of the same opinion as you though; mostly if not entirely harmless, potentially helpful in certain applications.
>Don't know if this could actually be the case in practice, but theoretically the ordering could allow for a timing attack to glean some bit of information when performing a linear scan of the map (size of the map, relative location of the data, etc).
An attack where the attacker has access to your ...program code and can run instructions there?
In that case, leaks from a "timing attack" would be the least of your worries...
If they just provide an input to your program somehow externally, then whether you put that input into a map or an ordered map or not is an implementation detail. You could make your program rid of the "timing attack" in 100s of ways... (or have one, in 100s of ways). That doesn't make an ordered map more unsafe than any of the 1000s of ways to have a timing attack.
I think it's also worth noting that requiring an ordering prevents a performance optimization.
This doesn't impact most high-level Python code, and an ordered dict is a very reasonable default. But there's a reason why Google's C++ map intentionally randomizes iteration order
(Hint: it allows the hash map and hash function to be extremely high performance while allowing themselves the flexibility to change the hash function)
The title of the post is misleading: it means "ordered by insertion order" (not sorted).
Previously it was unpredictable, and that's why collections has OrderedDict, which makes sense instead of trusting an implementation detail of CPython from 3.6.
I wonder why OrderedDict isn't outright deprecated (or reimplemented as a trivial wrapper over dict).
You can get the first/last key of a 3.8 dict with dict.keys() and reversed(dict.keys()) which you can then delete to reimplement OrderedDict.popitem.
Deleting a key and reinserting it will let you reimplement move_to_end.
The only cool thing that OrderedDict's implementation might be useful for is to move to front (or anywhere else not the back) but it doesn't expose that api.
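A rough sketch of those reimplementations (assuming Python 3.8+, where dicts support reversed()):

def popitem(d, last=True):
    # grab either end of the insertion order
    key = next(reversed(d)) if last else next(iter(d))
    return key, d.pop(key)

def move_to_end(d, key):
    d[key] = d.pop(key)  # delete + reinsert lands the key at the end

d = {"a": 1, "b": 2, "c": 3}
move_to_end(d, "a")
print(list(d))     # ['b', 'c', 'a']
print(popitem(d))  # ('a', 1)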
When an OrderedDict is compared with another OrderedDict, it also checks whether the order of items is the same (thereby also violating LSP, but that's another story); dict instances don't do that.
It's not the same thing. Compare two OrderedDicts that weren't constructed in the same order and the comparison will fail. Compare two dicts that weren't constructed in the same order and the comparison will pass.
The difference is whether order is part of its identity.
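A quick illustration of order being part of OrderedDict's identity but not dict's:

from collections import OrderedDict

a = OrderedDict([("x", 1), ("y", 2)])
b = OrderedDict([("y", 2), ("x", 1)])
print(a == b)              # False: order counts for OrderedDict equality
print(dict(a) == dict(b))  # True: plain dicts compare as sets of pairs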
A dict is still just a set of pairs, not a list of them. It just happens that it also guarantees now that if you _iterate_ over the set you'll walk the keys in insertion order.
Good point, but close enough! My point about how it makes the data structure more awesome stands.
FWIW I don't think people agree on whether "element order is part of the value's identity" is part of the definition of "ordered map". There are plenty of other languages & libs with ordered maps where the order isn't part of the identity but that still use the term "ordered map" in the name or the docs, e.g. PHP's arrays or ImmutableJS's OrderedMap.
Could you explain some examples of what this is useful for? What sort of algorithms or operations do you have in mind where you both want insertion order, and also key-based lookup in the same data structure?
I've made heavy use of all kinds of maps, and of queues and channels and arrays, but I don't recall ever noticing a situation where I wanted the properties of both mixed into the same data structure.
I'd love to learn more about useful tools to add to my toolbox!
For "algorithms", I mean, some really do care about insertion order. Like idk, if you have a priority queue, then you generally want FIFO ordering when the priority is the same? It like a pretty obvious desire in most cases... imagine a thread scheduler or a packet scheduler or what have you.
But generally it's about determinism and avoiding loss of information, not just whether a particular algorithm needs it. For example, you'd want serialize(obj) and serialize(deserialize(serialize(obj))) to produce the same output, otherwise you e.g. might not be able to cache stuff. But for a data structure like a hashtable, it's pretty tough (not logically impossible, but rather pointlessly difficult) to make that happen without preserving insertion order.
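For instance, insertion-ordered dicts make this JSON round trip stable for free (a minimal sketch):

import json

obj = {"b": 2, "a": 1}
s1 = json.dumps(obj)
s2 = json.dumps(json.loads(s1))
print(s1 == s2)  # True: keys come back out in the order they went in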
As another example, it's incredibly handy for a user to see items in the order in which they were inserted. Like say you're parsing command-line arguments, and the command is ./foo --x=y --w. If the user sees {x: y, w: None} then that tells them w was passed after x. That can be extremely useful for debugging; e.g. maybe you expected the caller to specify w earlier, in a different context and for an entirely different reason. Seeing that it came afterward immediately tells you something is wrong. But when such information is lost it's harder to debug code.
If --w can be specified in two places for "an entirely different reason", then an unordered data structure is simply inappropriate, full stop. That's not a question of debugging, that's a question of correctness.
I wrote a program that read in a proprietary data format, ran a few transformations, then outputted CSVs. Using an ordered map was super-useful, because it gave me deterministic output between runs. Otherwise I ended up with CSVs with differently ordered rows...
I think there are a lot of cases where you might want both.
For example, suppose you have a list of IDs accessing some system: you may want to count the raw numbers, but then also have some transformations thereof that are still ordered by count.
Example

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

l = getListOfAccessIds()
counts = Counter(l)
total = sum(counts.values())
# most_common() is sorted by count, so the new dict is still ordered by most common
fractions = {k: v / total for k, v in counts.most_common()}
plt.semilogy(list(fractions.values()))
plt.plot(np.cumsum(list(fractions.values())))
# still works like a dict
print(fractions[id_of_interest])
Very few algorithms really care, rather, it's when you're composing them that maintaining insertion order avoids a lot of boilerplate in having to cart that information around.
For the same reason, though, it's potentially a bad idea. All these algorithms that generate dicts generally won't promise to maintain insertion order; they just happen to by chance. Then consumers come to depend on it and are surprised when it inevitably changes.
I'm pretty sure they didn't in Ruby ~1.8 though. It's been a while :-) Can't remember exactly which version, but it bit me more than once, so I'm pretty sure :)
Ruby hashes are currently ordered and have been for quite a while now. Ruby 1.9 maybe?
But back in the day they indeed were not, for appropriate values of back in the day. :)
It did make a LOT of things more convenient and less buggy when they made Ruby hashes ordered, I recall. I didn't expect it would matter much, but found myself loving it. I believe the Ruby maintainers investigated and determined they could do it with very little performance hit. It turns out that a repeatable and predictable order that also matches insertion order is what a lot of people end up assuming, whether they realize it or not; everything is just smoother when that holds.
> There are so many cases where it's a benefit for map entries to retain order (and none where it's a problem)
Of course there’s a trade off. There has to be.
In this case it prioritizes usage patterns oriented around insertion and enumeration over usage patterns involving repeated removal or modification of items.
As items are removed, the insertion-ordered index will either fragment, require removal operations that are not constant time, or be implemented using a data structure with per-item storage overhead (such as a linked list).
Also, there are two possible, reasonable things for an ordered map to do when you write a new value to an existing key: leave the key in its original insertion position, or move it to the end. Whichever one the implementation chooses, people who want the other will need to maintain their own key-ordering state alongside what the collection is doing for them.
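For what it's worth, Python's dict picked the first option: overwriting keeps the key's original position, and you need a delete + reinsert to move it:

d = {"a": 1, "b": 2}
d["a"] = 10          # overwrite: "a" stays where it was
print(list(d))       # ['a', 'b']
d["a"] = d.pop("a")  # delete + reinsert moves it to the end
print(list(d))       # ['b', 'a']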
In JS's defense, it really tried as well. Fortunately though they failed harder at it, which is why arrays and objects are probably more different from one another than Eich had intended there in that mad 2 week language design frenzy. Good for us!
Back when JS got popular and the majority of web programmers knew PHP, there were dozens of tutorials out there trying to convince you not to do stuff like this:
var a = new Array();
a["moo"] = "five";
a["foo"] = "six";
After all, well, it works! Like in PHP! So what's the problem? :-) Most people felt the same way about these tutorials as people feel about monad tutorials nowadays.
Am I the only one who thinks this is a stupid decision? This will silently break code that starts to rely on this behaviour and then gets executed on Python 3.5 and lower.
I would consider changing how a builtin works to be a major breaking change. It would have been fine if this were a change between 2 and 3, but on a minor version? That's insane.
Code that assumed the order was arbitrary would expect to handle any arbitrary order, including a happens-to-be-sorted one.
Code that assumed it was random, like actually inserted by random(), was already broken, because that simply isn't the case.
Code that assumed the order would stay constant was relying on implementation-specific behavior, and could potentially break on any version update; as with any reliance on implementation-specific behavior, you'd break if the dictionary code ever got touched -- even if it were for a bugfix.
Code that sorted the dictionary keys before iterating is now slightly inefficient due to the extra work of sorting an already-sorted list.
Usually languages don't break things in such an implicit way, though. If I try to use f-strings in Python 3.5, I get a very explicit syntax error. If I rely on insertion order in Python 3.5, I get a potentially difficult-to-diagnose bug.
Why does it matter if it's explicit or not. If python doesn't support forward compatibility, you should know that if you write code for 3.7, it's not gonna work in 3.5. Doesn't seem like a big deal to me.
If you’re the only one writing it and you’re the only one running it, it’s probably fine. But if I’m putting a file out there that will only work in 3.7, it’d be nice if any potential users of that file would get a good error message if they try to run it on 3.5, rather than wrong results.
I could potentially assert a version, but do I really want to do that every time I write something that might be used somewhere else?
And yes of course I could add a line of documentation, but there is a 100% chance I’d still get bug reports from people on 3.5.
If I rely on a Python version and I expect other people to use the code, I add a version check at the top. I hate those packaging tools that insist on installing stuff into your system and creating a FrankenDebian when really all I want to do is run a single .py file standalone once. Often I have to do shenanigans like "python3 -c 'from sometool import __app__'".
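Such a check might look like this (a minimal sketch):

import sys

if sys.version_info < (3, 7):
    sys.exit("this script relies on ordered dicts and needs Python 3.7+")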
If you want to install it, go ahead and copy or symlink it in your ~/bin or whatever you fancy (that's your personal preference anyway unless I'd specifically package it for some OS like Debian). I don't want to have to use some setup.py that I have no clue where in my OS it installs things.
> I don't know how this custom file works that is duplicated and delivered with each project.
It's not duplicated and in most cases it's not even delivered as part of the installation.
> If you want to install it, go ahead and copy or symlink it in your ~/bin or whatever you fancy
That's exactly what pip will do if invoked with `--user`.
> I don't want to have to use some setup.py that I have no clue where in my OS it installs things.
It installs it to a single place. Run `python3 -m site` and look at `USER_BASE`.
To avoid a lot of this, use pipx[1] to keep things even more isolated.
> Often I have to do shenanigans like "python3 -c 'from sometool import __app__'".
You're doing things wrong because you don't know the tooling. You'd also typically just do `python3 -m sometool`.
Things that are distributed as a single file are either so simplistic, with no other dependencies, that you can just make do, or written by someone who doesn't know what they are doing, in which case you're going to have a bad time.
Yes, that would be a good idea. But if you ever run scripts you didn’t write, there’s the potential people didn’t do this, and you have the potential for hard to discover bugs. The language should be designed such that bugs are difficult to encounter, this is an instance where it wasn’t.
Major version changes are for API compatibility. If a change makes code incompatible with all other 3.* releases, it should be a major version increase.
I suppose Python doesn't even have backwards compatibility within the same major release, as we saw with the addition of the async keyword in Python 3.5. Many older Python 3 packages broke because they expected it to be a legal identifier for a variable name.
tensorflow didn't work on 3.7 for a solid 8 months because some people at Google very unwisely decided that `async` and `await` were great choices for variable names, despite PEP 492 landing in 2015.
that's because tensorflow is advertisement for Google and while it's technically open-source, it doesn't stand for any kind of community-project, it's all there to show off (and ingrain in its users) the way Google wants things to Go (just look at the byzantine Bazel-build-processes - tensorflow taking hours to build and pytorch about 10minutes...).
But in the case of downgrading, I'm fairly sure there are a number of other breaking changes that mean you can't trivially downgrade minor versions. Like f-strings, which were only introduced in Python 3.6 as I recall. The async keyword only exists as of 3.5 as well, I think?
I think the argument is that if you run code with f-strings and walruses on Python 3.5, the code will break noisily. Whereas if your code implicitly relies on ordered dicts, it could break silently. Syntax errors rarely cause subtle, hard to track bugs.
Sure, but you can't safely take everything from a higher version to a lower version in any case; if insertion order became guaranteed due to a bugfix, and wasn't backported, you'd be in the same boat.
The only way to consistently code cross-version is to start with the lowest you plan to support (assuming the higher versions are actually backwards-compatible).
Does any language guarantee that code is both backwards and forwards compatible?
The issue seems to be silent incorrect behavior. What happens if you attempt to run Python code containing f-strings using an older Python version? It raises an exception; that's good! What happens now if you write code for 3.7 that takes advantage of the new ordering and someone grabs it from your repo and runs it using 3.2? It would happily give incorrect results and no one is the wiser.
OrderedDict is slow and expensive though: it maintains ordering through a doubly linked list.
It has useful features for manipulating ordering, but while I've regularly had use for maintaining insertion ordering, I can't remember ever needing to move items around within a map.
If memory serves me correctly, ever since dicts became ordered the OrderedDict simply became a subclass of dict, so it will have exactly the same performance characteristics.
OrderedDict was always a subclass of dict maintaining additional information (which is not free, it has to store and manipulate two pointers per entry).
It remains so today; OrderedDict is not an alias or trivial facade over the normal dict, because it has to maintain its doubly linked list and implement a bunch of additional operations based on it, e.g. pop first or move to end.
The trouble is, I publish new code that advertises itself as working on 3.x, and then it turns out it is being used by a person who only has the version prior to this change.
That said, Go made a similar change (from insertion-order to explicitly-randomized) and world didn’t end. So there’s that.
I thought Go made the change from undefined behaviour with an underlying implementation that was insertion order in a map with 8 or fewer entries, to similarly undefined behaviour with an implementation that randomised lookups in those cases. Any code that has ever relied on any kind of ordering in Go maps will almost certainly be wrong, even random ordering, because the distribution of the "random" ordering is biased.
Yeah I think this is probably my main issue. I don't think it's reasonable to ask users of your code to always use 3.7+ instead of 3.6 if they are usually expected to be compatible. And it's also unnecessary to break such compatibility for something like preferring dict over OrderedDict anyways. At least I would try to avoid any such issues by still using OrderedDict.
That said, I have no idea about the internals of dict. I assume no performance was sacrificed for this change.
It actually improves performance. Or at least, it comes along with a set of performance improvements that give you ordering for free. Raymond Hettinger has a great talk on it: https://www.youtube.com/watch?v=npw4s1QTmPg&t=1s
Using OrderedDict is actually nice in this case, even if the default dict has the same ordering. That way you're explicitly saying you rely on that behaviour and it makes reading the code easier.
Python is nothing like its guiding principles. That's why it's Zen -- the principles are a collection of contradictory statements, given what you will encounter in real world Python. You're meant to be confused and meditate on it.
Well... full support for 3.6 ended in December of 2018 (now it only has security fixes), the older versions are already unsupported.
Also, this change was implemented in 3.6, but in 3.7 they officially documented it as a language feature (i.e., all other Python implementations also need to preserve the order).
Furthermore, there's a lot more that is not backward compatible after 3.6 or 3.7, and if you're writing a library that targets other versions, I would hope that you have tests for all said versions.
> Code written for Python 3.7 might break on older versions of Python
That's a truism, for all versions of Python. If you use a feature of Python version X, you should not be surprised that it doesn't run on versions below X that lack that feature!!!
If you write a Python library, use a feature of Python version X, and don't mark the library as requiring Python >= X, you are doing it wrong and are a horrible person.
Dicts have been effectively ordered since 3.6. Iteration order was literally randomised (at process start) before 3.6. I'm also not sure whether the behaviour under deletion was changed between 3.6 and 3.7 so it's possible that there are subtle differences there.
Portability isn't affected; if they claim compatibility with python3.7, then they claim their dicts have insertion-ordered keys.
If they claim compatibility with only up to python3.6, they can have whatever order they choose.
The only issue with portability is that, I think, the main reason it was made a guarantee is that CPython found that the new, presumably optimized, implementation came with insertion order for free, so they went ahead and guaranteed it. That might not be an optimal strategy in other areas, but other implementations are forced to follow along anyway.
But actually moving CPython code to, say, IronPython should not be impacted, unless IronPython lies about its compatibility.
The insertion order dict implementation comes from Raymond Hettinger who is amongst other things a core CPython developer. pypy pulled the trigger on using it first (and probably has optimisations CPython doesn't). PHP also used it before CPython did, IIRC. And possibly Rust (as the third-party indexmap).
> If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond. This allows the creation of (value, key) pairs using zip(): pairs = zip(d.values(), d.keys()).
Sorry, yes, I was hasty. Ordered dicts support such a thing with intervening modifications:
oldkeys = list(D.keys())
D[z] = foo(z)
now if you take
zip(oldkeys, D.values())
then you're guaranteed to iterate over oldkeys, with the proper values associated with those keys -- if z was an oldkey, its value got updated; otherwise, it comes after oldkeys and gets dropped out of zip.
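Concretely (a small sketch of the guarantee):

D = {"a": 1, "b": 2}
oldkeys = list(D.keys())
D["b"] = 20  # updating an existing key keeps its position
print(list(zip(oldkeys, D.values())))  # [('a', 1), ('b', 20)]
D["c"] = 3   # a brand-new key lands after oldkeys...
print(list(zip(oldkeys, D.values())))  # ...and zip drops it: [('a', 1), ('b', 20)]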
The subtlety of this is what I, and perhaps others, find the most jarring.
This won't break existing code. It will break new, good code that gets back-ported from, e.g., a 3.7 tutorial to a 3.5 environment, without any syntax errors.
> Changed in version 3.7: LIFO order is now guaranteed. In prior versions, `popitem()` would return an arbitrary key/value pair. [1]
If they added a new `popordereditem()` method, good 3.7 code would use that, and an attempt to run that code on 3.5 would throw a reasonable, useful error message. If they wanted to play it safe like Rust or Go, they'd add the ordered method and make popitem() deprecated or make it artificially use random/arbitrary order so you can't accidentally depend on a new implementation detail and have a test case work.
Also, your case 4 does deserve some protection, because while it's bad code, it's hard to test for bad-but-working code. Implementation-specific or undefined behavior that works is the worst kind of problem, because it's the hardest to test against. Compile-time syntax errors are the easiest; runtime errors are testable with a solid test harness; some problems can be identified as warnings by linters; but sloppy code requires manual inspection to detect.
Actually, no, I take that back: Implementation-specific or undefined behavior that works sometimes! is the worst kind of problem. That's what this code enables; if your test case is `dict({'one': True, 'two': True})` and you have a unit test where you popitem() to assert that you get 'two' and then 'one' it will pass the test harness on 3.7, it will pass on 3.6 because of implementation-specific behavior, and it will pass on 3.5 because the hash map arbitrarily puts those particular items in order. But it will silently get it wrong when you pass in user-supplied data. Shudder.
I'm of two minds. In principle, no, I don't love the idea of a change like this happening on a minor version.
But, from a pragmatic angle, it strikes me as a genius move. As the article said, this was a natural by-product of a performance enhancement that was made in 3.6. In principle, that was fine, because there was officially no predictable ordering, so a change to how it was being ordered in practice shouldn't materially affect anyone. And nobody really paid much notice to it then.
And that's where the danger lies - there's risk there, because, as you said, someone who's used to Python 3.6 might have problems switching to 3.5. And that's probably a greater risk if they left it out of the spec, because people would have continued to not pay it much notice. By making it official in 3.7, though, they've sent out a message that the Internet hate machine has ensured that everyone will hear. That probably, in practice, actually reduces the risk.
And, while many understandably see this as violating the spirit of semver, it unambiguously does not break the letter: Minor changes are only supposed to avoid breaking the software's compatibility with code written against an old version. There was never any rule saying that old versions have to be compatible with code written against the new version.
More important (in my opinion), adding a random seed to your hash prevents collision attacks, where an attacker tries to ruin your performance by sending many values (e.g. HTTP params) that would hash to the same bucket.
Many (most?) programming language runtimes now do this, especially the web-focused ones. But it's completely orthogonal to insertion-ordered iteration (which is implemented by a list).
That surprised me when I first saw it. If someone took a minute to read the docs (e.g. Python's), they would never build something on an order that's documented as not guaranteed (e.g. order in dictionaries). But seemingly people do.
I had some code break while porting from 2 to 3 because it assumed dict order, and it was part of the official Python build system: the old spark parser for the AST module.
My main issue with it is that if someone figures out an even better way to do dictionaries but it doesn't preserve insertion order, then that technique can't be used. But I certainly wouldn't call it a stupid decision, the problem is that once a behaviour is observable then people absolutely will start relying on it.
I might have preferred odict be added and dict left alone, so that you didn't have to import OrderedDict when really you just want the new dict. If dict and odict mapped to the same thing for a while, fine, but dict could diverge again if desired.
The problem is that when you behave a certain predictable and deterministic way users will leverage that even if it's not specified.
That is actually why the Python maintainers decided to make this behaviour official: the ordering was the consequence of changes in implementation details, but over 3.6's lifecycle they feared it would cause compatibility issues as users would start relying on the ordering properties of CPython and that would be an issue for alternate implementations (though pypy had switched to the same implementation even earlier so was also insertion-ordered even when running in Python 2.7 mode).
Providing stronger guarantees was considered useful and unlikely to be severely detrimental in the long run, so the project decided to make it part of the language spec.
You could add an ordered dict under a different name and also have dict be the more efficient ordered version, so that people get the efficiency gain for free but don't have to worry about forward compatibility.
Maintaining insertion order is strictly a product of adding an additional list to whatever more efficient map you implement. Add something to a dict, push it onto the end of the list. When iterating over the keys, use the list instead of the dict structure.
It does come at a cost, but I think Python aims for ease of use over runtime speed and memory efficiency, so it seems perfect for them.
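A toy sketch of that scheme (CPython's real dict uses a more compact dense-array layout; this is just the idea):

class InsertionOrderedMap:
    def __init__(self):
        self._data = {}   # any efficient map
        self._keys = []   # side list recording first-insertion order

    def __setitem__(self, key, value):
        if key not in self._data:
            self._keys.append(key)
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        # iterate via the side list, not the underlying map
        return iter(self._keys)

m = InsertionOrderedMap()
m["b"] = 2
m["a"] = 1
print(list(m))  # ['b', 'a']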
> I would consider changing how a builtin works to be a major breaking change.
This one is unusual because it won't break old code being brought forward, only the other way around, and theoretically there are lots of things going the other direction that would break (though most of them are explicit, not silent).
That's not what people call backward compatibility.
Backward compatibility is the ability to run old code with new interpreters. This is not broken here.
What's broken is the ability to run new code on old interpreters. But this is already broken at every Python update (new operators, methods, syntactic sugar...). We could call it reverse backward compatibility.
No, this change is backward compatible but not forward compatible. Old code (written for old interpreter) works in new interpreter; but new code (written for new interpreter) is broken in old interpreter.
That is an original line of thinking in the world of software, “Let’s not add a feature to a new version because it wouldn’t work if used in an older version”.
Right. In fact, if it explicitly broke something there'd be less of an issue. Instead, it will appear to work/interpret but you'll get unreliable results. That makes the bug more nebulous.
I think you are giving it more weight than it deserves. Writing Python code, I would not want to depend on the internal order of dict elements without a very good reason. If I were writing some low-level code that could benefit from it, maybe I would use this feature, but I would then encapsulate it and insert a version check to complain.
This is like many of the evaluation-order guarantees in the latest C++ standards. Good to know, might be useful for performance tuning, automatic code generators, etc., but irrelevant most of the time. My 2c.
Not only f strings, but any change really.. New operators, new methods on standard library objects... Python never guaranteed compatibility of new code with old interpreters.
If you rely on this feature, surely you know it’s new to 3.7 and document your project accordingly...? It’s not like code magically appears, somebody has to write it.
3.5 is EOL soon, and it's already uncommon to write code using a new interpreter that targets an older interpreter - there's more than just ordered dicts that you'll be missing if you do so.
People will just do what they've always done - they'll be aware of the differences, or they'll test.
There's nowhere I have seen that mentions dicts are ordered without mentioning in the same paragraph the version since which this has been the case, so anyone who knows they're ordered will be aware.
Code that assumes no order cannot be broken by a specific order, unless it was set up to assume everything but that specific order, in which case it was already broken but only randomly showing its broken nature.
Actually, dicts have already been ordered for a long time (since Python 3.6), but it wasn't guaranteed. It was just the result of a new implementation of the dict structure.
I can't imagine how it can break something. In which case would an unordered dict be an advantage? The biggest downside could be performance, but I don't think that's the case here.
For anyone who came to unit testing with the XP era and happened to use Java: Java 5 introduced changes to hash table ordering that somehow resulted in JUnit 3.0 running some tests in a different order.
It's important to discover coupling between your tests, this just isn't the way most people want to do it.
I don't see exactly how one would rely on dicts not being sorted, as was already answered by other comments.
And beyond this, even if maintaining backward compatibility is important, I don't think it should prevent software from being improved, and dicts being ordered is pretty awesome. I'm pretty sure relying on unsorted dicts might involve pretty hacky code.
I recently stopped writing a z-order test application because I remembered that dicts are not sorted, so it made my goal a little pointless. Meanwhile C++'s std::map is sorted. I think there was already sorted dict in python, but I don't remember.
EDIT: I'm wrong, I'm mixing "ordered" and "sorted"
I agree, but more from the point of view of: what if a new unordered implementation that's better is discovered 10 years from now? Would a new type have to be added?
I think you bring up a good point. If someone hands me a Python script and tells me it's 3.7, I would only hear "3" and expect it to run or fail, not to run with differing outputs!
Dot-upgrades do allow for breaking changes, but this one can break silently, and I'm not sure such breakage belongs in Python, nor where it should belong.
Personally I'm happy to get ordered dicts sooner rather than later.
I think if you are a library developer targeting different versions of Python, you generally know what features you can and can't use from new versions of Python (ie not many), and ideally you are also testing with each version in CI.
No, I agree, it's stupid. To me it feels like a kind of conceptual backwash from JS. The whole P3 fiasco looks really silly from far enough away.
It's like, you have a great marriage with everything working out and some fine children who are doing great too, but you get a divorce anyway to pursue your dreams of being a conceptual artist.
(BTW, in case anyone is keeping track, I was going to maintain P2 but got caught up in Prolog, of all things, and now Python looks just as crude as everything else and I don't have the requisite love anymore to shoulder the work. I might make a Python 2 interpreter in Prolog though! That would be fun.)
JavaScript doesn't specify performance characteristics of objects and arrays, so even within an implementation, one object could be a hashtable, another a balanced tree.
The implementation doesn't matter to EGreg's point, which is that ES standardized that iterating over object keys is generally required to return them in insertion order.
In the past (from before this standardization) Chrome had in fact changed the object iteration order due to an optimization, and had to revert it after lots of complaining on their bug tracker.
(To be precise, the spec still requires array index properties to be returned ahead of other properties regardless of insertion order. The behavior that Chrome "reverted" to is this new one, so not exactly the same as the original behavior.)
This is an amazing contribution to the language. A mixture of speed and convenience, probably made by volunteers. As for people criticizing a change to what used to be a non-deterministic ordering of dict iteration: I don't know what to say to them, other than, are you serious? There are people out there who are working for us, they work for free, and they did some heavy lifting to give us this. They might read what you wrote and think, "Why bother? Maybe I should spend my weekends playing with my kids instead."
> As for people criticizing a change to what used to be a non-deterministic ordering of dict iteration: I don't know what to say to them, other than, are you serious? There are people out there who are working for us, they work for free, and they did some heavy lifting to give us this. They might read what you wrote and think, "Why bother? Maybe I should spend my weekends playing with my kids instead."
I don't agree with the complaints about ordered dicts, but this is a ludicrous response. Appreciation of volunteers' work on Python isn't diminished by criticism of language features or changes. In fact, criticism enhances it: if people have opinions about the project you work on, it's a good sign that it's important, significant work. I've contributed to OSS projects before, and the ones in use by more than just my friends and me were naturally the ones that felt the most meaningful.
As is widely known, more often than not "criticism" of open source software quickly devolves into hate and toxicity. Helpful criticism is great, but be careful with defending the "criticism culture" around FOSS; it's often angry, unhappy people who want it all for free on a golden platter, and yesterday.
Sure, hate and toxicity themselves should be called out. But that's not what's happening here, and not what GP comment was referring to. It doesn't make any sense to throw out the baby of legitimate criticism with the bathwater of toxicity.
It does matter. If people get hate for doing work on open source they will stop doing work on open source. Yes you can be critical of everything, but the hate that some people spew toward certain open projects is not only nauseating but actively damaging to the community.
Users of JSON, probably the most common data interchange format on the planet, frequently have implicit requirements about key ordering. It is highly convenient to be able to parse a JSON string into a native Python data structure, add a field, emit it back, and preserve the ordering.
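For example (field names made up):

import json

doc = json.loads('{"name": "widget", "price": 10}')
doc["quantity"] = 3  # new keys append after the existing ones
print(json.dumps(doc))
# {"name": "widget", "price": 10, "quantity": 3}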
From what I recall of the JSON standard itself, there's no guarantee about key ordering being significant. If you're diffing serialized output to compare two JSON objects you need to be serializing it in a consistent format, otherwise even whitespace is going to throw you off.
If a human is inspecting serialized JSON using pen and paper, the human is presumably clever enough to match up key for key regardless of ordering.
If the human is using a computer to compare two JSON payloads (as the use of a diffing algorithm suggests), the human and computer should be clever enough as a team to realize that they could just deserialize and reserialize each JSON payload such that the keys were lexicographically sorted and the data was pretty-printed in the exact same way before running it through the diffing algorithm. `jq -Sc` would do the trick.
Most diff programs don't have what you describe. And in a lot of cases you don't have the easy ability to "do stuff" before running the input through a diffing algorithm.
> Most diff programs don't have what you describe.
That’s why pipes were invented.
> And in a lot of cases you don't have the easy ability to "do stuff" before running the input through a diffing algorithm.
Well, in that case you’re not going to be able to meaningfully compare two JSON payloads because neither order nor white space have any semantic meaning in JSON. I’m really curious what you’re talking about though, since if you’re working on the shell you can easily use jq to do as I describe.
I tracked down a bug once where someone was hashing state that happened to include Python dicts serialized to JSON, which caused different hash values for the same entity. I know, I know, insert tears of blood emoji, and any engineer worth his salt wouldn't be so sloppy. But you don't always get to choose to work only with top talent that understands why you should design things elegantly as a baseline[1]. In these cases, removing implicit gotchas like "the same code has the same behavior" (ie iteration over the same dict) is valuable.
[1] Well you _can_ choose to do so, and I recently have, but it's not without its costs.
I can't speak for all "other" languages but Ruby made the change years ago from 1.8 to 1.9. Ruby calls them hashes but they're standard maps. Obviously Ruby and Python are very similar but I think the point stands.
Convert a deck of cards ([]Card?) to a map (map[Card]bool?) and back just to shuffle? That's unlikely to be faster or more idiomatic than a straightforward implementation of the Fisher-Yates shuffle[1]. Try writing the code to do it both ways and compare.
I think "idiomatic" would be using rand.Perm, which implements the Fisher%E2%80%93Yates shuffle. But aside from whether it's idiomatic, converting to a map isn't random enough.
With Golang 1.11, I get orderings like this. S is spades, H is hearts, D is diamonds, and C is clubs, because HN eats the Unicode characters I was actually using.
S9 SJ SK HQ D9 S4 S5 S10 HJ S8 H6 DA D2
D4 DQ C6 C8 CJ H3 H9 DK C3 CQ SA S2 S3
S6 SQ H4 H10 D8 D10 C4 CK S7 H7 D5 D6 D7
C7 H2 HK D3 DJ C2 C5 C9 HA H5 H8 CA C10
On average you would expect about one card in a shuffled deck to be (cyclically) followed in the deck by its succeeding card, and on about one out of 50 shuffles, that card would be followed by its successor. Here we have S4 S5, DA D2, SA S2 S3, and D5 D6 D7.
In particular it seems like there ought to be a less verbose way to express the equivalent of the Python list({card: True for card in deck}) in Golang.
IIRC it used to just start iteration at a pseudorandom index and then iterate normally. I looked at it couple years ago, don't know if they changed it.
It seems to do that, but I think it also tweaks the hash function for each newly created map, because in the code linked from https://news.ycombinator.com/item?id=22278753 I'm not getting rotations of the same iteration order when I generate two maps. There doesn't seem to be any randomness in mapiternext() itself.
You could imagine that the hash function itself might do an adequate job of randomizing the order of the cards, though, especially if salted with a per-map salt. SipHash, for example, would probably not have any detectable biases in the distribution of the permutations thus produced. But whatever hash function Golang is using for my structs has an easily visible bias, as described in that comment.
What does it say? That Python is willing to spend more computation time for simplicity, without waiting for the programmer to ask for it? That Python will make semantic changes in minor version bumps of the language?
The ordering property is a side effect of the new and more CPU/memory-efficient data structure. It would be very surprising to say no to a 2x performance jump just to avoid the (sometimes useful) ordering.
Java added LinkedHashMap a while ago. I tend to default to it except for small temporary use cases where order absolutely won't matter or have an outside impact...
LinkedHashMap is not Java's "standard map", though. If you tend to use it by default, that's really a personal idiosyncrasy, especially as LHM tends to be larger and slower than regular HashMaps due to having to maintain a doubly linked list.
Python has had one such in the standard library for a decade or so.
LHM is not slower for iteration (it's faster actually).
LHM indeed pays 2 references per node, but they are well worth it, as it gives deterministic ordering/iteration. I have witnessed numerous cases with HashMap that show up in production only due to iteration order (esp. after rehashing); the code is broken, but the cases did not reproduce during testing...
Now if the two extra references are an issue, consider that HashMap is quite space inefficient anyway, with a dedicated node per entry - that's the main price. The (simple) node memory footprint is 36 bytes on heaps under 32GB, i.e. with compressed pointers. The extra references add another 8 bytes for keeping an insertion or access order.
If the goal is a compact, low memory footprint, HashMap is not fit for purpose. Overall it's the jack of all trades, and it even got reworked (Java 8) to support tree-based collision resolution for keys that implement Comparable.
A couple of years back I wrote CompactHashMap (under CC0), which has an average cost of 10 bytes per entry and is extremely compact for small sizes, having only 2 fields of its own, so even small/empty maps are tiny. In microbenchmarks (the same used in OpenJDK) it handily (2x) beats java.util.HashMap on "Traverse key or value"; get/put have similar performance, and "Search Absent" is worse.
The point is: LHM should be the go-to hashmap for Java, since for it the node-based design is justified (unlike for java.util.HashMap).
Yep. Also, LinkedHashMap maintains ordering based on insertion order. So, iff you insert your data ordered how you want it, it behaves as an ordered map. If you want a true, self-reordering map, what you want is a TreeMap. The beauty of that one is that you have full control over defining a custom ordering function, because we're not always indexing a map by primitives or autoboxed types. I try to use it sparingly, though, as you're paying log(n) on pretty much all map operations.
While we're here, another tidbit that's often overlooked in the Java collections:
If you really care about iteration performance, your data has no nulls, your data is already ordered how you like it (or you don't care about ordering), you have >= 10 items, and you don't need random access, then ArrayDeque is gonna be your horse, because of how much better it co-locates its contents in memory and how much less overhead is required to maintain it during each operation, compared to all the other List implementations, including ArrayList and LinkedList.
Interestingly enough here it was kinda but kinda not the other way around: historically PHP used a closed-addressing hash map and threaded a doubly linked list through it to maintain the insertion order.
But the dict TFA talks about doesn't use a doubly linked list, or closed addressing; its ordering is a side effect of its implementation, not originally a core goal (memory saving and iteration speed were). It had probably been proposed by others before, but it came to wider attention after Raymond Hettinger (a core dev) proposed it a few years back[0]. PHP actually released it first[1], closely followed by pypy[2]. CPython only got around to it some time later[3]
People who bother to complain are those who actually care about your thing. People who do not care simply leave without ever telling you why. Your complainers are often your most dedicated and invested users.
Though in this specific case, my sense has been that your complainers are often your most dedicated and invested users of a version of the language that was officially retired 38 days ago.
Python is just bizarrely political. I suspect that it's now cursed for all time to have every future PEP become a battle in a proxy war over PEP 3000.
I feel like there could be some other testcases. I'm wholly in support of the change (regardless of benchmarks) but depending on the results I could see some arguments against.
I was under the impression python dict ordering was deterministic, just not in an order recognisable by humans, i.e. ordered by hashes. Was this not the case?
From the docs:
"CPython implementation detail: Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions."
So I would claim that equates to non-deterministic
I read that as: if two dicts are constructed in the same way and use the same python implementation then they'd be the same. If you change the way the dicts were generated or the underlying implementation then they're not the same.
To me, nondeterministic is when the input does not change but the output does. The docs essentially say to not rely on the order, not that the result is not reproducible.
CPython hashes are intentionally non-deterministic, so even the same exact Python install will produce different hash orderings on the same code, if you run it twice (although you can make it deterministic by using PYTHONHASHSEED).
That's generally what people mean when they say "nondeterministic" in the context of computing. Yeah, in philosophy it generally means something like "the future is not completely determined by the past," but in computing it means something closer to "the programmer cannot reasonably determine the behavior and thus should not depend on specific behavior."
In computing it means a given set of inputs lead to a given set of outputs. It has nothing to do with how difficult it is for a programmer to reason about. Deterministic builds, deterministic tests, etc.
From the origin to Python 3.2, iteration order was arbitrary but deterministic (in CPython; it was not guaranteed to be deterministic in other implementations)
From Python 3.3 to Python 3.5, iteration order was non-deterministic across runs (it was deterministic per process instance)
In Python 3.6 it's unspecified but deterministic and follows insertion order (in CPython)
In Python 3.7, Python 3.6's behaviour was specified and documented
I don't have any criticism if the performance of this new dict stays the same as before, and that's a huge if. Considering Python 3 is already a crawling language compared to peers like Java and PHP, I think the devs should focus more on that before handing out features.
> I don't have any criticism if the performance of this new dict stays the same as earlier and that's a huge if.
1. it's been there since 2016, the new dict was released as part of Python 3.6 (though the ordering was only an implementation detail until 3.7)
2. the new dict is much more compact (20~25%) and should have much faster iteration speed, those were actually the original goals, ordered iteration was a side-effect
3. the new dict should not benchmark significantly slower across the board, it wouldn't have been merged if it were a regression
Great decision IMO. I remember maintaining a bunch of Python code that we supported on both OSX and Windows. Out of all the platform-specific bugs we had (where it worked on one OS and not the other), one of the most common causes was code that relied on a certain key order. And we knew that relying on key order was bad, we're supposed to use things like OrderedDict, blah blah blah. It was still a really easy mistake to make, just like it's really easy to have memory safety bugs in C.
At some point, if a human error is common enough, it makes pragmatic sense to change the design so the error is impossible.
One particular annoyance of OrderedDict objects was initialising them with literals. You couldn't just do:
od = OrderedDict({
    "y": "first",
    "x": "second",
})
because, by the time the data got to OrderedDict's constructor, it had already been through a plain dict (which, before dicts preserved order, could scramble it). Instead you had to supply a list of pairs (or tuple of tuples, etc.):
od = OrderedDict([
    ("y", "first"),
    ("x", "second"),
])
For hierarchically nested dictionaries, these are quite a bit harder to read.
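Now that plain dicts preserve order, the literal form finally works as you'd expect; a small demonstration on 3.7+:

    from collections import OrderedDict

    # The inner dict literal keeps its order, so this round-trips:
    od = OrderedDict({"y": "first", "x": "second"})
    assert list(od) == ["y", "x"]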
FWIW they could have used a special-purpose dict guaranteeing order for that. I don't know if they ended up using normal dicts, but the proposal to keep kwarg ordering predates the new dicts, and the PEP even notes that it was unclear whether the new dicts would be merged.
Same with class attribute ordering: PEP 520 still talks about using OrderedDict. However, it was only merged after the new dicts, and even though ordering was not yet defined as part of Python-the-language, PEP 520 could use this internal property just fine. The OrderedDict-based version was stripped out, and the PEP was essentially accepted as a no-op (the ordering properties would just be officially documented).
The Go designers went the other way (as they often do):
> When iterating over a map with a range loop, the iteration order is not specified and is not guaranteed to be the same from one iteration to the next.
Actually it is not only "not guaranteed to be the same", the runtime actively makes sure that the iteration order is actually different so you don't even start to rely on it...
In the case of C# it is/was more a matter of picking the -correct- Dictionary.
'OrderedDictionary' has been around since .NET 2.0 (2005 or so), as has 'SortedDictionary<TKey, TValue>'.
But for some ungodly reason, there is no 'OrderedDictionary<TKey,TValue>' included in the framework that you can use. All you get normally is the non-generic one (where you have to type-check everything...) There are internal implementations used by the framework, but you'd have to do some hacky things to even get it to work.
However, in my 12 years of C# programming, I've only run into the need for this _once_, and it was to assist interfacing with a VB6 monstrosity.
It's a very small performance hit because it doesn't do the work to be uniformly random, just "not always the same". If you think about how you have to iterate through a hash table or similar data structure anyhow, it's either O(1) or O(log n) paid once per "range" on the data structure, which is dwarfed by the actual act of ranging on the data structure.
Go's philosophy is also definitely willing to pay that price to avoid a large class of known bugs that has hit all kinds of code bases. It is not about being the fastest language. As compiled languages go, it's solidly middle-tier, and not likely to go up much from there. (Among the C-style compiled languages, it's low-tier on performance, at around half the speed of C in general. However, there are enough compiled languages like Haskell that are still generally relatively slow that Go is mid-tier for compiled languages overall.)
It is a performance hit and I initially had the same reservations. Importantly, the implementation imposes a small overhead on the access time of iteration over a map, but not in a way which changes the scaling behavior. Other map operations are not implicated.
I think the decision was wise - it prevents user code from relying on ordered iteration behavior, which allows the go team to switch to different map implementations without fear of breaking user code.
If the order had been unspecified, but iteration was still ordered in practice, there'd undoubtedly be (incorrect) user code that relied on that behavior.
Amusingly, I now run into engineers who are under the misconception that map iteration is random, and proceed to use maps as an RNG. That's an unfortunate mistake, because Go's map iteration is quite non-uniform: it's shuffled just enough to appear random to the untrained eye, while staying performant.
> That seems like it would be a performance hit to actively make it different
Not really. I don't know the Go implementation, but you can get this behaviour by adding an arbitrary startup-defined seed to your hash function, and that does the trick.
It also gives the benefit of making hash table attacks harder.
Python kinda sorta did the same from Python 3.3 to 3.5: iteration order was not actively randomised, but the hash was seeded (for DoS resistance), so iteration order would only be coherent within a run, which in effect made it unreliable.
Python 3.6 changed the underlying implementation to one which incidentally made iteration order not rely on the hash function. As that seemed like a useful property, and one which risked causing compatibility issues with alternate implementations (developers would come to rely on CPython's iteration order and code would break on Jython or whatever), they decided to make it part of the spec for 3.7.
Ah, I love Go.
While it's mega cumbersome sometimes and lacks generics (WHY????), I find that I write pretty stable code in it that still looks good after years (^^)
I haven't done a ton of Python, but I can't really think of a situation where relying on a dict to be ordered is an easy mistake to make. Do you have an example?
I saw this in Perl a long time ago but it could just as easily have happened in Python. The dict (Perl hash) was a set of mappings for template replacement of "from" strings to "to" strings.
"FOO" => "bar" # Replace FOO with bar
The author had considered the case where one key (FOO) might be a left-substring of another (FOOBAR), and so reversed the output from keys() before iterating over the hash. This ensured that FOOBAR, which sorts later, would be substituted before FOO (else FOOBAR -> barBAR, oops).
The Perl example sounds like someone incorrectly assuming that the output from `keys` would be ordered, when it's explicitly not:
> Hash entries are returned in an apparently random order. The actual random order is specific to a given hash; the exact same series of operations on two hashes may result in a different order for each hash.
Applying a `sort` to the output of `keys` is second nature to me - I'm always aware that hash entries are unordered in Perl.
I definitely recall there being no fallout or problems when Ruby did it in, I think, 1.9. Semantically, I can't think of how there would be. Since previously the order was unpredictable and not guaranteed, it could be anything at all; always using insertion order is, after all, one possible value of "anything at all".
Only possible downside might be performance (or memory). I recall the ruby maintainers ensuring they had an implementation with very little if any performance/memory implication.
The most common situation I saw was where the entries from the dictionary were being iterated over and used to perform edits on a structure whose edits were nearly commutative.
For example, suppose you have a dictionary containing PriorityItem values. PriorityItems are sorted by priority, while the value only matters for equality (it affects __eq__ but not __lt__). If you have a dictionary whose values are PriorityItems and you say `sorted(dictionary.values())`, the order of the output can vary for equal-priority items. So if you wrote a test asserting that a process (which used a dictionary internally) produced a stable ordering of the items, it might consistently pass by accident on your machine and then fail elsewhere.
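A minimal sketch of the kind of class described (the PriorityItem name and shape here are illustrative, not from a real codebase):

    class PriorityItem:
        def __init__(self, priority, value):
            self.priority = priority
            self.value = value

        def __lt__(self, other):    # sort order looks at priority only
            return self.priority < other.priority

        def __eq__(self, other):    # equality also looks at the value
            return (self.priority, self.value) == (other.priority, other.value)

    d = {"a": PriorityItem(1, "x"), "b": PriorityItem(1, "y")}

    # sorted() is stable, so equal-priority items keep the order in which dict
    # iteration yielded them; before 3.6 that order was an accident of the hash
    # function, so this list could differ between machines or runs:
    print([item.value for item in sorted(d.values())])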
I'm glad to see other languages finally catching up to PHP. I'm joking (kind of) but after a lot of years of doing this, I've begun de-prioritizing pure abstractions and favoring the way that humans tend to do things on their own.
Technically this is along the lines of the worse-is-better philosophy. The single biggest cost in software development is friction. Performance, size, etc are all less important, because they become less important with each passing year as computers grow more powerful. And along those lines, I think that silly concepts like assigning letter names to drives in Windows, or having a paltry few registers in x86, made those architectures more accessible to the masses than the more generalized Unix and Mac platforms. Not that I completely agree with that, just, it's all I can come up with (other than cost and familiarity) for why people are so attached to worse-is-better paradigms.
Preserving insertion order has some small cost associated with it, but that's going to be dwarfed by the cost of bugs introduced when junior devs are surprised by unordered dictionaries. And the pedantic people can just use an unordered dictionary manually since they are already converting pure abstractions to whatever imperfect implementations are provided by languages anyway.
>Performance, size, etc are all less important, because they become less important with each passing year as computers grow more powerful.
That isn't as true as it once was. We are running up against fundamental limits and people are even starting to talk about the environmental impact of computing.
It should be noted that the ordering of the Python dict occurred as a side-effect of a new implementation with a better memory profile (and performance as good as or better than before).
PHP arrays may have been one of the "killer features" that lead to it taking hold of the web. It's so frictionless, powerful, and you can bend them to your will with ease.
"Ordered dict" is ambiguous, and from the title I thought that key order was meant. Reading the article, I see that it's actually insertion order. Which makes much more sense. Key order would have been a much more significant change.
A sorted map would be tree-based and require its keys to be orderable but not hashable. For instance Rust's HashMap has its key bound on Hash + Eq, while BTreeMap's is bound on Ord.
They're different data structures, with different use cases and different requirements.
> There's no ordering over _most_ hashable values since they span multiple types, so insertion ordering is the only sane way to do it.
Python 2 actually had total ordering of all values. The result was usually stupid but it was there.
Well, keys of sorted dictionaries don’t need to be hashable but do need an ordering, whether intrinsic or user-provided. A sorted map (usually implemented in terms of a binary search tree) is simply a separate data type with different requirements than a hash table based one.
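The same split exists in Python, for what it's worth: a dict only demands hashability of its keys, not orderability. (Point below is just an illustrative class.)

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y
        def __eq__(self, other):
            return (self.x, self.y) == (other.x, other.y)
        def __hash__(self):                  # hashable: usable as a dict key
            return hash((self.x, self.y))

    d = {Point(1, 2): "a", Point(3, 4): "b"}   # fine: hashing is all a dict needs
    # sorted(d) would raise TypeError: '<' not supported between instances of 'Point'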
Came to say the same.. insertion order of dict has been here for years for 90+% of Python users (CPython implementation) and a part of the language spec also for almost as many years.
I guess it’s good to spread awareness to HN readers who apparently were unaware, but the headline is very misleading.
90% seems optimistic. I don't think that many Python projects are running the most recent version in production. From personal experience, upgrading to 3.7 was especially a pain because of packages that used "async" for variable/kwarg names.
That’s a very good point, in that case even 50% would probably be optimistic.
We can probably attribute this article’s novelty to the slow adoption of Python 3. In that way it’s probably a good thing that it’s such a popular topic, even if it is old news.
I still see dict ordering as an implementation detail; not a technical one but a descriptive one. If you want to rely on insertion order, use collections.OrderedDict. It communicates your intention far better, and there should be no overhead.
OrderedDict is less efficient. It's best to replace it with a basic dict where possible to improve efficiency. There is one obscure difference, though: OrderedDict's equality comparison is order-sensitive (and it has reordering methods like move_to_end()), so it can't always be swapped out.
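For example:

    from collections import OrderedDict

    # Plain dict equality ignores order...
    assert {"a": 1, "b": 2} == {"b": 2, "a": 1}

    # ...but OrderedDict equality is order-sensitive:
    assert OrderedDict(a=1, b=2) != OrderedDict(b=2, a=1)

    # And OrderedDict has reordering methods that plain dict lacks:
    od = OrderedDict(a=1, b=2)
    od.move_to_end("a")
    assert list(od) == ["b", "a"]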
For the former, I realize that's how it's done now, but there's nothing forcing it to stay that way.
For the latter issue, as explained above, can't they just implement a replacement __eq__ only for OrderedDict, and still re-use the new dict implementation?
Similarly, any subtle difference can be shimmed on top of the new implementation inside Python, no?
The builtins have some optimizations from the compiler and interpreter that Python-defined classes don't benefit from. The dict type especially so, because it's used to implement so many things.
Cool, but, this just encourages people to rely on hashmaps being ordered by insertion. The nature of a hashmap is not one where you rely on the ordering. This is going to trip up newbies who then move to other languages.
Some comments on the post don't seem to understand the changes, for example, the C++ comment saying that std::map is ordered, so this isn't a strange change. The difference is that std::map is a red-black tree implementation, which has a traversal order, not a hashmap with insertion order. So there is already confusion.
I usually feel like a reasonably bright person, and then I go to a talk like this and suddenly remember how terribly much I don't (and will never) know. There are some very sharp people at play here.
This is a solid argument for why you should have more datatypes available to you. If your program relies on an insertion order, say so. If anyone ever moves to another implementation, they will thank you for it.
Ordered dictionaries already existed as a separate type. Much of the gain was in the use of dicts for core language features.
In particular, Python 3.6 guarantees that the order in which class attributes are defined is preserved, and that functions which take variadic keyword arguments receive them in the order in which they were passed. It permits other implementations to use types other than dict to accomplish that.
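The **kwargs half of that is easy to demonstrate (PEP 468 behaviour, guaranteed since 3.6):

    def show(**kwargs):
        # kwargs is a dict reflecting the call-site order of keyword arguments
        return list(kwargs)

    assert show(b=1, a=2) == ["b", "a"]
    assert show(a=2, b=1) == ["a", "b"]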
Dataclasses make use of ordering of dictionaries in a subtle way. If you write:
class Foo:
    bar: int
    baz: str
Then Foo.__annotations__ is a dictionary {'bar': int, 'baz': str}. The @dataclass decorator transforms that into standard boilerplate, and it needs to know the order to do that.
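A quick demonstration (3.7+, where dataclasses landed):

    from dataclasses import dataclass

    @dataclass
    class Foo:
        bar: int
        baz: str

    # __annotations__ preserves definition order, and @dataclass relies on it:
    print(Foo.__annotations__)   # {'bar': <class 'int'>, 'baz': <class 'str'>}
    print(Foo(1, "x"))           # Foo(bar=1, baz='x'), fields in declared order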
And this is all the more reason using a type would make sense. Curious why things use ordered? Look at all the places that declare they need ordered. :)
You don't know in advance when ordering might be useful, though.
Class __annotations__ was new in 3.6. For its original use there's no need for ordering. They were ordered, because dicts were ordered, but only as an implementation detail.
Dataclasses were added to 3.7, after 3.6's features made a nice syntax possible. If __annotations__ hadn't happened to be ordered already then I would guess it wouldn't have been made ordered just for dataclasses - dataclasses just wouldn't have existed.
Making everything ordered in one go opens up possibilities you haven't even thought of yet.
That argument cuts both ways. You don't know when the benefits of random will be there...
As soon as you do something that cares about order, state how it is derived. Sometimes, insertion order is right. Sometimes, not.
Don't get me wrong, alists (association lists) are nice. And order definitely matters to those, but it is part of their definition, and reinsertion changes the order in obvious ways. It's not even clear what reinsertion does to a plain "dict".
Now, I will concede this is overblown. Life will easily go on.
I fail to see how that's annoying. OrderedDict provides extra features for manipulating ordering, but they make it bigger and slower. dict became ordered as a consequence of implementation changes, and that was considered useful enough (or likely enough to trigger compatibility issues with third-party implementations) that the core team decided to standardise it rather than break it. (One of the advantages of the new implementation is iteration performance; shuffling iteration would probably have slowed it back down more than the old implementation.)
What's annoying is that it is not obvious from their names why the two exist. They are both ordered; one is also re-orderable, in that you can change its order. I would not guess that just by looking at them.
For a language that prides itself on being readable, that is an amusing quirk.
For some definition of "now". Python 3.7 was released on June 27, 2018, so "now" is "for the past year and a half". Python 3.6, which implemented the change originally, was released on December 23, 2016. By that measure, "now" actually means "for the past 3 years".
The whole Python 3 affair made me realise that a lot of people upgrade their tools like once every 5-10 years. Which is shocking, in an industry as fast as this. I’m not saying you should churn your whole build setup every 6 months (I think release speed for Python is a bit excessive at the moment, for example), but complaining in 2020 for a feature that was released almost two years ago, was in development for a year or so before that, and was widely publicised... I mean, c’mon.
This is a refreshing update for those of us that like having code which will always act in the same way across multiple invocations.
Now, if only Lua could follow the same path with their “tables” (“tables” is what Lua programmers call their form of Python’s “dictionaries” and Perl’s “hashes”).
I just spent eight hours earlier this week debugging Lua code which would run differently on different invocations of the same code.
The standard way to iterate in a table with Lua is like this:
for key, value in pairs(foo) do
One problem: the order we get elements from the table "foo" is undefined, and it can change between different invocations of the same Lua code, even if the elements were put in the table in the same order. To fix things so that we can iterate a table in a consistent manner, this is my fix (public domain [1], for those who want to copy and paste it):
function sorted_table_keys(t)
    local a = {}
    local b = 1
    for k,_ in pairs(t) do -- pairs() use OK; will sort
        a[b] = k
        b = b + 1
    end
    table.sort(a, function(y,z) return tostring(y) < tostring(z) end)
    return a
end
Then we iterate the table like this:
for _, key in ipairs(sorted_table_keys(foo)) do
    local value = foo[key]
(In Lua, two dashes indicate a comment.)
Note that this code will not always sort all tables in the same order: if we have a table with the keys 1 (as a number) and "1" (as a string), iteration order between those two keys is still undefined.
You can write an iterator yourself; there's no need to leave the pairs() idiom:
function sortpairs(t)
    local keys = { }
    for key in pairs(t) do
        table.insert(keys, key)
    end
    table.sort(keys, function (a, b)
        return tostring(a) < tostring(b)
    end)
    local i = 1
    return function (t)
        local key = keys[i]
        i = i + 1
        if key ~= nil then
            return key, t[key], i-1
        end
    end, t
end

t = {a=10, c=20, d=30, b=40, 50}
for k,v,i in sortpairs(t) do
    print(k, v, i)
end
What Lua is lacking here (and why the above iterator function needs 17 lines) is the ability to have "for" go through a list directly (without converting the list into values returned by an iterator function), which would let us quickly and easily sort lists that "for" can use. Something like:
d = {"foo": 2, "bar": 1, "zoo": 4}
for k in sorted(d.keys()):
print k
(I’m not advocating Python here, since Perl has a similar way of using “for” to go through lists which can also be easily sorted)
However, with Lua, “for” only accepts a numeric range, or an iterator function, so customizing “for” requires understanding function closures: Understanding how a function, when called multiple times, stores variables altered in previous invocations of the function, and understanding how to give those variables initial values (usually in the “function factory” function which creates the function we use).
In other words, “for”, in most modern high-level languages, can be one of:
1. for variable in [something that specifies a numeric range]
2. for variable in [iterator function]
3. for variable in [list]
But Lua only has “something that specifies a numeric range” and “iterator function”; it can not natively go through a list.
You can convert a list first and then feed it to a simple iterator. I don't fully understand what your exact real-code issues are, but I hope this snippet may help:
function vs(t)
    local i = 0
    return function (t)
        i = i + 1
        return t[i]
    end, t
end

function sorted(t, cmp)
    table.sort(t, cmp or function (a, b)
        return tostring(a) < tostring(b)
    end)
    return t
end

function keys(t)
    local keys = { }
    for key in pairs(t) do
        table.insert(keys, key)
    end
    return keys
end

t = {a=10, c=20, d=30, b=40, 50}
for k in vs(sorted(keys(t))) do
    print(k, t[k])
end
I.e. if “natively” means strictly “for in t” that generates values, then no, Lua can’t do that. But if “for in vs(t)” is okay, then that vs() is the solution.
That looks good, and I think putting these in a prominent place in the Lua documentation (along with a notice that the code is public domain) would help those of us who are used to the AWK/Perl/Python/PHP way of having "for" natively traverse a list, without needing a complicated list-to-iterator function that uses a closure (i.e. the iterator function remembers the value "i"; I'm spelling this out for the lurkers because code like this can be difficult to follow).
One honest question: is there any reason why the function factory (i.e. a function which returns a function), which converts a list (actually a table with ascending integer indexes) into an iterator Lua can use with "for", returns both the element and the entire table here? Here is the code I am asking about:
return function (t)
    i = i + 1
    return t[i]
end, t
I’m curious why we’re returning both the table element for the iterator and the entire table.
No reason to argue about which one to make "dict". In fact, it would be better to have both, because you're taking a performance hit (a significant one) by ordering the entries.
Python has had both ordered and unordered maps in the same language for 10 years or so (OrderedDict was added to the stdlib in 2009).
In 2016, the standard dict was moved to a new and more efficient implementation (more compact and faster to iterate). The new layout was naturally ordered, so as a side-effect of this change (and since the CPython devs did not explicitly scramble it), iteration on dicts became both deterministic and predictable: it would always follow insertion order.
This was considered both a boon (conserving iteration order is strictly more useful than not doing so) and a woe (it would cause compatibility issues with third-party implementations, or with further changes down the line, as users would inevitably start relying on the iteration order even while it was still unspecified).
So in the next major release they decided to resolve the issue by making this behaviour part of the language. It restricts future design space, but they can always add an unordered dict as part of the stdlib if that ever truly becomes useful so meh.
The new python dict implementation is benchmarked as faster in both microbenchmarks and in practice for almost all real-world workloads.
Note that this isn't a sorted map, like C++'s std::map (often called an "ordered map"), but an insertion-ordered map. C++ gets the name wrong: items aren't ordered by key comparison, but by insertion time.
Not sure why this is posted and upvoted to front page now. After all this is a major bullet point in py37 What’s New, and even py38 has been out for a while.
Anyway, I’ll keep using collections.OrderedDict (except for personal scripts) until py35 EOL.
I think this change is great, but this really only becomes news again once all major LTS are shipping with at least Python 3.7, right? No one can really use it in code they plan to distribute at the moment. Maybe I am underestimating the amount of Python code that is meant for internal or personal use only.
Many apps bundle Python along with their app, either within a container image or as a complete portable copy distributed with the binary. We've been building FROM python:3.8 for a few months.
You don't actually remove the entry, you just mark it as deleted.
Eventually if too many things are deleted you repack the array. Still amortized O(1). (No different than a hash table in general, which will need to recopy the underlying array when it grows.)
You can use memcpy to remove items from the middle of a memory segment. It's probably not a single memory segment either; I imagine it works more like a Golang slice.
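For the curious, here is a heavily simplified sketch of the two-array ("compact dict") idea being described. This is not CPython's actual code, just the shape of it: fixed capacity, no resizing or repacking, and plain linear probing, all for brevity.

    class CompactDict:
        FREE, DELETED = -1, -2

        def __init__(self):
            self._indices = [self.FREE] * 8   # sparse hash table of small ints
            self._entries = []                # dense, insertion-ordered (key, value)

        def _probe(self, key):
            mask = len(self._indices) - 1
            i = hash(key) & mask
            while True:
                idx = self._indices[i]
                if idx == self.FREE:
                    return i, None            # empty slot: key is absent
                if idx >= 0 and self._entries[idx][0] == key:
                    return i, idx             # found the key's entry
                i = (i + 1) & mask            # collision or tombstone: keep probing

        def __setitem__(self, key, value):
            slot, idx = self._probe(key)
            if idx is None:                   # new key: append to the dense list
                self._indices[slot] = len(self._entries)
                self._entries.append((key, value))
            else:                             # existing key: update in place,
                self._entries[idx] = (key, value)  # preserving its position

        def __getitem__(self, key):
            _, idx = self._probe(key)
            if idx is None:
                raise KeyError(key)
            return self._entries[idx][1]

        def __delitem__(self, key):
            slot, idx = self._probe(key)
            if idx is None:
                raise KeyError(key)
            self._indices[slot] = self.DELETED   # tombstone in the sparse table
            self._entries[idx] = None            # hole in the dense list

        def __iter__(self):
            # Iteration just walks the dense list: insertion order for free.
            return (e[0] for e in self._entries if e is not None)

    cd = CompactDict()
    cd["banana"] = 3
    cd["apple"] = 1
    del cd["banana"]
    cd["banana"] = 2
    print(list(cd))   # ['apple', 'banana'], i.e. reinsertion moves to the end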
Java has many types of Map, Set, and List implementations. And that's just in the core library. Here are some Map implementations that come with Java: TreeMap, a HashMap, A LinkedHashMap, ConcurrentHashMap, EnumMap. There are many more and they each have their own uses, features, and performance/memory characteristics.
I always wondered why languages like ruby, javascript, and python don't include a bit more choice on this front. It seems like sets are still somewhat of a novelty for javascript developers and people just wing it with half-assed solutions involving stupid O(N) contains operations.
The ordered dict for Python was originally an implementation detail, part of a more compact dict representation that was first proposed for Python in 2012 [1], implemented for PyPy in 2015 [2], and merged into CPython 3.6 in 2016 [3] [4].
Never mind, I thought this was a future change, but it's already live. I've been using 3.7.4, and can confirm that two dictionaries with equivalent key-value pairs, even if inserted in a different order, are considered equal.
I mean, I would hope not, if it's an implementation detail and not a change in the abstraction itself... it feels a bit like it is breaking the abstraction, though.
The definition of a dict hasn't changed. It's still defined as a set of key:value pairs. It just happens to also have a particular guarantee now that iterating over the keys will return them in insertion order.
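Concretely:

    a = {"x": 1, "y": 2}
    b = {"y": 2, "x": 1}
    assert a == b                 # equality is still order-blind
    assert list(a) != list(b)     # but iteration order now differs predictably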
Funny, I've never expected or needed dict to be ordered. I've known about collections.OrderedDict but never really found much use for it. A dictionary is a hashtable and the order generally isn't important for typical uses of a hashtable. So, what is it that people are putting in dicts where the order of insertion actually matters?
Unfortunately, the ordered dict doesn't provide index access. In many cases it's still necessary to maintain a parallel list + dict approach, even if the dictionary is ordered: the list for index access, and the dict for fast lookups and holding the data.
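Right; the closest you get without a side list is walking the iterator, which costs O(i) per access. A sketch:

    from itertools import islice

    d = {"a": 1, "b": 2, "c": 3}

    def key_at(d, i):
        # No O(1) positional access on a dict: this walks i items each call.
        return next(islice(d, i, None))

    assert key_at(d, 2) == "c"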
> If this is useful, I wonder if an ordered set is useful.
An ordered set is a unique list, and it is useful. However, it's not present in Python: dicts and sets have separate implementations, and sets were not moved over to the ordered implementation (the ordering was initially a side-effect of an implementation change, and it was not considered useful or advantageous enough for sets).
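Although dict's ordering does give you a cheap stand-in for the common de-duplication case:

    items = ["b", "a", "b", "c", "a"]
    # dict.fromkeys keeps first-seen order and drops duplicates:
    unique_in_order = list(dict.fromkeys(items))
    assert unique_in_order == ["b", "a", "c"]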
Imagine a trie implemented as a tree of dictionaries (ignore, for now, whether a tree of arrays might actually be better). Let's say you want to implement an autocomplete algorithm based on a dictionary of words from a corpus. The autocomplete algorithm is naive: if a user types in "abc", you recommend whichever word starting with "abc" has the most occurrences in the corpus.
With ordered dicts you can use the trie to compute your autocomplete very quickly with no additional data structures. There are lots of other trie operations which are more efficient with ordered dicts too. I’m sure someone can come up with less complicated examples but this was the first one I thought of
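A minimal sketch of that idea, assuming words are inserted most-frequent-first, and using a hypothetical END sentinel to mark complete words:

    END = object()   # hypothetical sentinel meaning "a word ends here"

    def build_trie(word_counts):
        root = {}
        # Insert in descending frequency, so each node's children (and END
        # markers) are stored most-frequent-first in insertion order.
        for word, _ in sorted(word_counts.items(), key=lambda kv: -kv[1]):
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = word
        return root

    def autocomplete(root, prefix):
        node = root
        for ch in prefix:
            if ch not in node:
                return None
            node = node[ch]
        # Follow the first-inserted key at each level: because insertion was
        # most-frequent-first, this path leads to the most frequent completion.
        while True:
            first = next(iter(node))
            if first is END:
                return node[first]
            node = node[first]

    trie = build_trie({"car": 5, "cat": 10, "ca": 8})
    assert autocomplete(trie, "ca") == "cat"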
Well, you can convert any sorted dictionary to an insertion-order dictionary by copying it over, but that does make the original example pretty clunky.
I guess a better use case is if you needed some sort of priority/FIFO logic along with indexing. Hard to think of something that doesn’t feel contrived, but I guess imagine having a queue where determining membership or updating fields of an element based on ID can be done in O(1) for any element?
I've used TreeMap in Java many times. I don't recall exactly why I used it, I think at least some of the cases were simply that it was convenient to be able to store an ordered list of two different types without having to go through the trouble of making a full blown class that holds those types.
But how does an ordered map (a self-balancing binary search tree) handle a hash-only key? I mean, you can compare lexicographic order for character strings or numbers, but not for an arbitrary class instance, right?
It'd be nice if using collections.OrderedDict could elicit a warning. Is there a tool like flake8/autoflake for flagging things that used to be Pythonic but no longer are?
I had firsthand experience with this otherwise useful feature masking an obvious bug. In one codebase there was a function along the lines of (for calling into MSSQL with its "peculiar" stored procedures):
Of course this is obviously wrong and on Python 3.6 this will break horribly on first use, while in Python 3.7 this worked as long as the keyword arguments were in the correct order.
On 3.6 it will work, it's just relying on an implementation detail (dicts are ordered in 3.6, it's just that it is considered an implementation detail, not a guarantee)
A bit of a side topic, but when Judge Alsup (himself a descendant of Boole, with a middle name of Haskell, and a programmer to boot) said "It is so ordered" (https://casetext.com/case/oracle-america-inc-v-google-8) in the Oracle v. Google case, I could only think that's what a computer scientist would say when quicksort was done.
I've written a lot of Python, but more Java. This is where I have a gripe with "batteries included." In Java, I'd have to think slightly about this, then use a LinkedHashMap. It's been in Java since 2002. It also has a Set flavor. Python just doesn't have as rich a collection of included data structures, and the APIs are more limited.
Nowhere because so far the core team has remained unconvinced. People have brought up replacing standard set implementation by a variant of dict's (even expressing surprise that that wasn't already the case) but no dice yet.
You call that a failing, and others call that a feature. I love the fact that I haven't had to think about ridiculous nit-pick level data-structure trade offs. In python you have a very basic set of data structures, and short of doing something ridiculously novel or exotic, that's all you need. It's part of the beauty of the language.
I agree, and I think it's troublesome that dict now means something more specialized than it did before and the language doesn't communicate this. The dict is dead, long live the dict.
I suspect it's the result of my Java goggles, because the distaste comes from a sense that this is analogous to the Map interface suddenly always meaning LinkedHashMap.
Well, JavaScript's Map preserves insertion order when iterating entries, and as of ES2015 (i.e. ES6), for-in also iterates the (string, non-numeric) properties of an object in insertion order.
Is the ordering there at least fixed, or can it vary from run to run even on the same data, or between implementations or versions of the same implementation?
EDIT: to clarify, I was asking about JavaScript there.
It is deterministic, but it's an odd rule-set. For example:
`var o = { 3: 'foo', 1: 'bar', b: 'baz', a: 'quux' }` Will yield:
`{1: "bar", 3: "foo", b: "baz", a: "quux"}`.
Numeric keys get sorted. String keys are insertion-order. If key order is a priority, a Map should be used instead. Often when JavaScript's objects are used and a particular sort order is required, an accompanying (sorted) array is used.
Go do `var o = { 3: 'abc' }`
and then
`o[1] = 'def';`
Now tell me the output of `o`. (Doing this in your console should suffice). Is that insertion order?
If the ordering is determined by a hash function and it's always the same it is likely a defect in the implementation, because keys can be easily chosen to cause massive performance degredation; essentially a DoS attack. That's why many dynamic languages these days use things like siphash with a random key.
I'll argue that having the order depend on the hash function, rather than insertion order, is the problem. It's bad to have unnecessary nondeterminism in software. It makes testing and reproducing bugs unnecessarily difficult.
These collections are often defined by default as unordered; it's just that people discovered over time that some of them iterated in a predictable order even when the language docs said otherwise.