Python Lists vs. Tuples (nedbatchelder.com)
79 points by joshbaptiste on Aug 19, 2016 | 122 comments



So, I've long argued the importance of what Ned terms the "cultural difference": the difference between dynamic homogeneous structures (lists/relational columns) and static heterogeneous structures (tuples/relational rows). There are important performance ramifications to the use of one or the other, and conflating the two (i.e. dynamic heterogeneous) is a strong design smell. Some languages "get it", others don't.

Python is not a language that "gets it", despite the article's examples. Python chooses the much-less-useful distinction between "immutable sequence" and "mutable sequence"; the lack of any other distinction is borne out by the standard library (as dozzie points out here [1]). Furthermore, the issue of immutable vs. mutable is totally orthogonal to that of dynamic/static and homogeneous/heterogeneous. There's no reason that "cultural" lists need be mutable, or "cultural" tuples need be immutable. So Python's choice doesn't make sense from this angle either.

For example, OCaml has both immutable lists and immutable tuples; its type system enforces exactly the "cultural difference". Similarly with C, which has both mutable lists (arrays) and mutable tuples (structs), both of which can be rendered immutable (const). Why Python chose lists (and only lists!) to have both immutable and mutable flavors, with otherwise identical semantics, is beyond me.

Erlang used to "get it", but they lost the thread when they came out with heterogeneous syntax support for a homogeneous structure (map). TLA+ totally conflates tuples with lists, but its type system is rich enough to enforce the difference (and as a bonus, allows for the elusive static homogeneous structure).

[1] https://news.ycombinator.com/item?id=12323849


But be careful not to obsess over it. I've seen people worry about the runtime complexity of collections in Python even when they hold only a handful of elements.


Oh, of course. In Python's case, I'd advocate to just get rid of tuples and be done with it. (Maybe replace it with a nice immutable adaptor class; I'm sure one already exists.) It's a dynamically typed interpreted language, neither performance nor type-correctness is its strong point.


> In Python's case, I'd advocate to just get rid of tuples and be done with it.

That would be a bad idea for at least two reasons: sometimes you want containers to be immutable, and mutable containers can't be used as dictionary keys.


You mean in Python or in computer science?

Because I'm pretty sure that it's possible to use mutable objects as keys in general. The users just have to observe restrictions: like don't insert the keys into the dictionary and then mutate them while they are in the dictionary and expect the dictionary to still work fine.

(Objects can also be used as keys even if they are mutated while in the dictionary, if identity-based equality is used for the dictionary; i.e. two keys are equal if they are actually one and the same object, and not just two similar objects.)
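
A minimal sketch of that idea in Python (the IdentityDict name and the id()-based scheme are mine, not a stdlib type; note it must hold on to the keys, since an id() can be recycled once the object is collected):

  class IdentityDict(object):
      """Keys compare by object identity, so mutating them is safe."""
      def __init__(self):
          self._data = {}  # id(key) -> (key, value)
      def __setitem__(self, key, value):
          self._data[id(key)] = (key, value)  # keep the key alive
      def __getitem__(self, key):
          return self._data[id(key)][1]

  d = IdentityDict()
  k = [1, 2, 3]
  d[k] = 'found'
  k.append(4)            # mutating the key doesn't break the lookup
  assert d[k] == 'found'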


> Because I'm pretty sure that it's possible to use mutable objects as keys in general. The users just have to observe restrictions: like don't insert the keys into the dictionary and then mutate them while they are in the dictionary and expect the dictionary to still work fine.

In other words, mutable objects don't work as keys, and pretending they do requires unenforced conventions to preserve the integrity of the dictionary abstraction.

> Objects can also be used as keys even if they are mutated while in the dictionary, if identity-based equality is used for the dictionary; i.e. two keys are equal if they are actually one and the same object, and not just two similar objects.

The fundamental problem is very simple: dictionary keys have to be values, not objects, and Python simply doesn't have compound values. There is no such thing as “the list [1,2,3]” in Python: you just have one or more objects whose current state happens to be the sequence 1,2,3. The very next moment, their states might be different sequences. Object identities are first-class values in Python, but identities are primitive and indecomposable - not compound!


This is a Python specific restriction. Ruby for example allows mutable keys in dictionaries.


So, which is the case?

(0) Object identities are used as keys? (e.g., two different list objects that contain the elements 1,2,3 are treated as different keys)

(1) Catastr... I mean, hilarious things happen if you mutate an object that's being used as a key?


I'm not sure if you are asking about Python, or Ruby.

Anyway, both languages use value-based hashing for dictionaries. I am not a Rubyist, but I guess you are just expected to not mutate the keys? Ruby's dictionary implementation also has a rehash method to deal with this case [1], if for some reason you are purposely mutating the keys I suppose...

[1] http://docs.ruby-lang.org/en/2.0.0/Hash.html#method-i-rehash


> I'm not sure if you are asking about Python, or Ruby.

Ruby, since you said it allows mutable keys in dictionaries.

> Anyway, both languages use value-based hashing for dictionaries.

Neither Python nor Ruby has compound values. Object identities are primitive values.

> but I guess you are just expected to not mutate the keys?

I guess C programmers are expected not to use memory after they free it.


But if Python got rid of tuples, it would be a radical change; in that case it could as well allow lists to be dictionary keys in the same stroke.


Yes, hence my suggestion to introduce an immutable adaptor class. Then all containers (sets, dictionaries, etc.) gain those abilities.
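
For concreteness, a minimal sketch of what such an adaptor might look like (the FrozenList name and the snapshot semantics are my assumptions, not an existing class):

  class FrozenList(object):
      """Hashable, read-only snapshot of a sequence."""
      def __init__(self, seq):
          self._items = tuple(seq)  # snapshot at construction time
      def __len__(self):
          return len(self._items)
      def __getitem__(self, i):
          return self._items[i]
      def __eq__(self, other):
          return isinstance(other, FrozenList) and self._items == other._items
      def __hash__(self):
          return hash(self._items)

  d = {FrozenList([1, 2, 3]): 'value'}  # usable as a dict key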


> hence my suggestion to introduce an immutable adaptor class

What does this gain you over just doing tuple(obj) (or the equivalent--tuple(d.items()) for a dictionary, for example) when you want to make it immutable for use as a dict key?


It can be implemented copy-on-write. It can have more compact storage if the runtime is smart. It can continue to efficiently support lookup (e.g. tuple(d.items()) would not).


> It can be implemented copy-on-write.

Which only makes sense if the underlying mutable object is mutated. But if the underlying mutable object is mutated, the adaptor instance, if it is immutable, can no longer have the same value as the underlying mutable object. So the "adaptor" is no longer an adaptor; it's just a separate immutable object with the value that the mutable object had originally. I.e., it's just the same as tuple(d.items()) or whatever.

If by "copy-on-write" you just mean that two different containers that both store the same immutable value, say a string, can both point to the same memory storing that value, and the mutable container's pointer only changes if a new value is stored at that key/index, Python already does that; for example, tuple(d.items()) does not create new storage for the items themselves, it points the tuple to the same objects that are contained in the dict at the time of the call to d.items().

> It can have more compact storage if the runtime is smart.

A tuple in Python does have more compact storage than a list, dict, or set.

> It can continue to efficiently support lookup

Same issue as above; this only matters if the underlying object is mutated, in which case an immutable "adaptor" can no longer have the same value, so lookup on the adaptor isn't the same as lookup on the underlying mutable object. Lookup on the adaptor is just lookup on the immutable value, which is easily made efficient.


It's worth noting that tuples aren't actually immutable in Python. Tuples have immutable length, but the values in those slots can be changed at runtime.


No, they definitely can't:

    >>> (1,2,3)[2]=5
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'tuple' object does not support item assignment
I wish they could. (Maybe Python 3 is different?)


You can't replace a value in a tuple, but you can modify it, e.g.,

  >>> x = (1, 2, [3])
  >>> x
  (1, 2, [3])
  >>> x[2].append(4)
  >>> x
  (1, 2, [3, 4])


...Or can you?

  import ctypes

  def tuple_setitem(t, index, item):
      obj = ctypes.py_object(t)
      item = ctypes.py_object(item)
      # PyTuple_SetItem refuses to operate on a tuple whose refcount
      # isn't exactly 1, so temporarily lie about it.  In CPython the
      # refcount is a Py_ssize_t stored at the start of the object,
      # i.e. at id(t).
      ref_count = ctypes.c_ssize_t.from_address(id(t))
      original_count = ref_count.value
      if original_count != 1:
          ref_count.value = 1
      ctypes.pythonapi.Py_IncRef(item)  # PyTuple_SetItem steals a reference
      ctypes.pythonapi.PyTuple_SetItem(obj, ctypes.c_ssize_t(index), item)
      ref_count.value = original_count

  >>> x = (1, 2, 3,)
  >>> tuple_setitem(x, 1, "foo")
  >>> x
  (1, 'foo', 3)
Disclaimer: for entertainment purposes only, do not try this at home!


You haven't changed anything about the tuple. The tuple holds a reference to a list and will always hold a reference to that list.

What you did was modify the underlying list, the tuple itself is immutable.


> You can't replace a value in a tuple, but you can modify it

If it has mutable items in it, yes. But in that case, the tuple can't be used as a dictionary key; Python checks not just that the tuple itself is hashable, but that every object in it is hashable.
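
For example:

  >>> {(1, 2, [3]): 'x'}
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: unhashable type: 'list'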


Tuples are immutable so that they can be used as keys in hash tables.


But a sibling comment to yours by shoyer argues that mutable objects can be elements of a tuple. A tuple whose elements are mutable is effectively mutable.

Can (1, 2, [3]) be used in a hash table, and what if the [3] is mutated to [3, 4]?


When Python tries to hash the tuple, it finds the list inside of it, which is unhashable, and throws an error.


I use them as the semantic guide when I write code.

Then again, I also use single and double quotes as a semantic guide to mean different things. I feel like that would probably make most programmers want to throw the computer at the wall, but so it goes.


I do the same! But I get frustrated that I can neither modify individual elements of my "cultural" tuples, nor is there an immutable (functional) "cultural" list (or dictionary, for that matter).


What? Of course: single quotes are for characters and double quotes are for strings.


C programmer identified.


What do you use single and double quotes for?


I can't speak for GP, but I like to use single quotes for symbolic strings ('go_left', 'parent', 'alpha-campus', some_map['fixed_key']) and double quotes for textual strings ("An error occurred", "package/__main__.py").


Sometimes this convention falls apart, e.g. when constructing HTML tags it is easier (and more readable) to write:

    return '<a href="{}">{}</a>'.format(url, text)
than

    return "<a href=\"{}\">{}</a>".format(url, text)
Especially for a lazy developer like me... :)


Why not just use single-quotes in the HTML?


Fair. But I would consider those to be unformatted text, not symbols.


Oh, I like that idea: compensate via a convention for Python's lack of a symbol data type. :)


If you mean Ruby's symbols, Python has a data type with that set of properties. It's called 'str'.

If you want the behaviour of Ruby strings, you might use a `bytes` or `bytearray` type, or you might use a list of chars.


I thought Ruby ensured there was only one symbol with given contents, so that equality testing for symbols reduces to pointer equality. I don't think Python strings behave that way.

But at any rate I wasn't thinking of performance characteristics but rather the stereotypical uses of symbols vs strings in languages that have both, so you are right that one can use Python strings in place of symbols, as SpaceManiac already pointed out above.


> I thought Ruby ensured there was only one symbol with given contents, so that equality testing for symbols reduces to pointer equality. I don't think Python strings behave that way.

In Python, string literals are reused, if that is what you meant.

    >>> a = 'abc'
    >>> b = 'abc'
    >>> a is b
    True


This isn't guaranteed by the language, even if CPython does it. What is guaranteed is intern('abc') is intern('abc').


String literals up to a certain size are reused.


Oh, cool, I didn't know that. I think in Ruby all symbols are reused, whether they are literals or constructed programmatically. (Unlike Common Lisp symbols, which can also be 'uninterned' if they are not literals.)


Yup I do the same.

some_dict = {'key': "value"}


> TLA+ totally conflates tuples with lists, but its type system is rich enough to enforce the difference (and as a bonus, allows for the elusive static homogeneous structure).

1. TLA+ doesn't conflate the two at all. On the contrary: it has neither. TLA+'s data is based on set theory, and it has syntactic sugar for mathematical sequences, defined in TLA+ as functions from a prefix of the natural numbers to some set. I don't know what either "lists" or "tuples" are mathematically (in terms of set theory), and neither does TLA+. Like ordinary formal math, TLA+ has only sets (and functions, which are theoretically also encoded as sets). Everything else can be defined in those terms. If you come up with a mathematical definition of either or both, you can use it in TLA+. If the definition is the same for both, they will be the same; if it isn't, they won't.

2. Like "ordinary" math, TLA+ has no type system at all. But you're right that any program property can be described in its simple logic. E.g., you can have a sequence whose all elements are natural numbers, except the prime indices, which are strings:

    ∀ n ∈ DOMAIN seq : seq[n] ∈ (IF IsPrime(n) THEN STRING ELSE Nat)
Of course, for simple cases of constant-length sequences, you can have:

    seq ∈ Nat × STRING × Real
or whatever.


> Why Python chose lists (and only lists!) to have both immutable and mutable flavors, with otherwise identical semantics, is beyond me.

It didn't. There's frozenset as well. There's been discussion of an immutable dict every so often, but the conclusion is generally that it's unnecessary.

The reason to have an immutable sequence is mostly to serve as dict keys. Same with an immutable set. If you use a mutable sequence as a dict key, you create ambiguity over whether you care about the memory location or the value of the sequence.



I didn't know about frozenset, thank you!


Can you say more about the important performance ramifications?


Dynamic homogeneous structures (lists) have uniform structure, but must be able to resize dynamically. The former helps performance (run-time type info can be hoisted out of the structure), the latter harms it (extra runtime machinery needed).

Static heterogeneous structures (tuples) have a non-uniform structure, but need not resize. The former harms performance (run-time type info must be kept for each element, though usually few); the latter helps it (most accesses can be compiled to simple pointer arithmetic).

On the other hand, dynamic heterogeneous structures (Python's lists) do not have uniform structure, and must be able to resize. So run-time type info must be preserved (adding memory/access overhead), and accesses can't be optimized at compile time (more runtime machinery). It is to avoid these issues that things like numpy and JavaScript's typed arrays exist.
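
The stdlib's array module is a cheap illustration of the homogeneous case, with the type info hoisted into the container:

  import array, sys

  xs = [0.0] * 1000
  # A list stores 1000 pointers, each to a separately boxed float object
  # (getsizeof doesn't even count the boxes).
  print(sys.getsizeof(xs))
  # array('d') stores 1000 raw C doubles inline -- but it can only
  # ever hold doubles.
  print(sys.getsizeof(array.array('d', xs)))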

There is almost never a reason to use a dynamic heterogeneous structure, so they are best avoided. A dynamic heterogeneous collection can neither support indexing of a priori known components (since it is dynamic), nor support iteration (since it is heterogeneous). (Note that a collection of objects of different types but with a shared interface is in fact a homogeneous collection of objects with that interface.)

(The final combination is static homogeneous, but these are a subset of both of the first two cases, and do not occur often. The only example I can think of is sets of statistics: all statistics have the same type, but there is a fixed number of them, known at compile time. They can be optimized very well.)


Python only has what you call "dynamic heterogeneous" structures because Python is a dynamic language. The ramifications of that are by no means limited to lists and tuples. It seems like you basically don't like dynamic languages.

Some of your comments make no sense even given the above, however. For example, Python lists and tuples certainly support iteration.


Erlang (one of my favorite languages) is dynamically typed, yet did not (until the recent addition of maps) have dynamic heterogeneous structures. You cannot iterate over or resize tuples, and the static analyzer discourages heterogeneous lists. Both are immutable (functional).

I apologize that I have hopped between use and implementation a bit in my above post. Yes, the implementations of Python lists and tuples support iteration, but it would be nonsensical to write code which iterated over a list or tuple containing heterogeneous data. (What could you possibly do identically to each element, if they had nothing in common?) Hence my assertion that implementing a dynamic heterogeneous structure is inefficient, if it is nonsensical to truly use it as such.


> it would be nonsensical to write code which iterated over a list or tuple containing heterogeneous data

I agree such a use case would be rare, but that doesn't mean such a use case cannot exist.


At that point it is an argument of taste, but my belief is that those rare cases are invariably code smells and can be rewritten more clearly without the necessary run-time type introspection (or inference from some other information). Better to restrict them entirely IMHO.


I hope it is obvious that Python values are members of a single static type: a massive sum type that combines a type tag with a value, summing together all of the possible types-in-the-Python-sense. That is to say, from the perspective of static typing, dynamic languages are uni-typed.

In languages with explicit static types and explicit Tagged Unions, it's not unreasonable to iterate across a homogeneous sequence of tagged unions, all of the same type. For example, Option<Int>. This is just a particular instance of the general pattern of iterating across a heterogeneous sequence, introspecting on whether a given value is of type Int or of type NoneType. Unless you want to allege that Null is a member of Integer (in a sane and healthy type system).

The point of this is that I'm skeptical that these are "invariably code smells". For example, our Option<Int> values might come from a SQL field that is a nullable integer. It seems entirely fair to iterate across them.
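
In Python terms, that case is iteration over a homogeneous sequence of Optional[int] (a sketch; the data is made up):

  # e.g. values read from a nullable INTEGER column
  rows = [3, None, 7, None, 42]
  total = sum(x for x in rows if x is not None)  # introspect, then use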


Maybe it is not clear from my other comments but we are in violent agreement :) Tagged unions (and even more general RTTI) have a common interface and therefore are homogeneous: a well-written iteration always accesses such elements first through the tagged-union (or RTTI) interface.

To me, code which acted on such elements without first consulting the common interface would be "smelly". e.g. code which assumes every other item is a string, or code which uses a list as an ad-hoc structure. There's almost always a better way to write such code.

(As an aside, I find the explicit use of RTTI a smell in and of itself, but one unrelated to containers. Code which today does two different things to some input, based on whether it's an integer or string, likely tomorrow will need to do three different things, two of which involve an integer. Tagged unions are almost always preferable.)


We are mostly in violent agreement, yes. :)

The nuance where I disagree with you was the point of my comment. I claim that there's no principled distinction between tagged unions and RTTI. (Except that of course usually implementations that refer to themselves with the phrase "tagged union" are a little more, uh, sophisticated, in exactly the kind of way that your comment about "likely tomorrow will need to do three different things.." alludes to eventually having a need for.)

Unless you can draw a principled distinction, then you can't on the one hand say "sequences of tagged unions are homogeneous" and on the other say "python sequences are potentially heterogeneous, and if you have a tuple of heterogeneous data and you iterate on it, that's invariably a code smell". Can you?

I mean, unless by "code smell" you mean "thing that's worth a double-look, because although it might be good code, it's easy for it to be evil code". In which case, we're in total agreement.


> it would be nonsensical to write code which iterated over a list or tuple containing heterogeneous data.

There are Python functions that take any type like repr() or id()


See my other comment; it does make sense to have a list of objects which share some interface. In the case you suggest, it is a very trivial and not-very-useful interface, but an interface nonetheless. You can do this with objects even in strictly-typed languages like OCaml or Mercury; code which iterates over the list is simply restricted to use only those methods which all elements have in common.

Of course in languages which don't support a universal "object" type (say OCaml or C) you cannot do this -- numbers and strings have no interface in common. Those languages take advantage of this by not storing type information in collections (e.g. OCaml's float arrays, or any array in C), thus resulting in more compact memory footprints. If they instead allowed heterogeneous collections, the memory footprint of such collections would double.


> There are Python functions that take any type like repr() or id()

Umm... I haven't had to use any of them on a list (or for that matter on an object) in code (except when I'm curious and poking at Python internals). Maybe I'm stuck in a web application role? Is it actually useful to, say, call repr() on a list of objects? Maybe call id() to see if there are references to the same object twice? I'm finding it hard to imagine cases where it would be needed to implement a feature that wouldn't be better served by alternate methods.


One could argue that for those, the semantic differences between the types are dropped. In OO lingo, they look at the properties of the common base class.


I feel like your reply here, and perhaps a few other comments, are confusing a few senses of the word "dynamic" (which, to be fair, is i.m.o. a reasonable confusion).

Python is "a dynamic language" mostly in the sense that it is "dynamically typed". (I'm sure you're aware, I'm not trying to patronize, just setting the stage.) That's one sense.

Tuples in Python are still dynamic in this sense, of course, because every damn value in Python is, for good or for ill. But there are at least two other senses in play here.

Compare C structs with, say, JS Objects. Both let you mutate the values they point to, but JS Objects let you add new property-names at will (or remove them). So one sense of "dynamic" is "resizeable".

We could also have a data type, e.g. linked lists in various "purely functional" languages, that can be resized (pushed and popped), but existing cells cannot be mutated. AIUI Clojure is full of this sort of thing.

Point #1 is that nothing in colanderman's comments argues against Python's dynamic typing.

My second and broader thought is that I agree with colanderman that it's weird that Python conflates these latter two senses of dynamism (resizeability and cell-mutability), and even more weird that it does so for one data type (the sequence) and more or less ignores it elsewhere (though e.g. frozenset does exist).


There's nothing stopping you from writing Python types that separate these different senses of "dynamic". You can write a sequence type that doesn't let you change the number of cells, but does let you mutate existing cells. Or you can write one that lets you add new cells, but doesn't let you mutate cells once they are added. You can even write them as C extensions if you want the same speed and memory efficiency as the built-in types.
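
E.g., a minimal sketch of the fixed-length-but-mutable combination (the FixedList name is mine):

  class FixedList(object):
      """A fixed number of cells, each individually mutable."""
      def __init__(self, seq):
          self._items = list(seq)
      def __len__(self):
          return len(self._items)
      def __getitem__(self, i):
          return self._items[i]
      def __setitem__(self, i, value):
          self._items[i] = value  # mutate a cell, never resize
      # deliberately no append(), insert(), or __delitem__()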


I get most of what you are saying, except you say that Python tuples must be able to resize. Python tuples can't resize, they are immutable. Lists are over-allocated to help with append operations, but tuples are not, because they have no append operation.


Yes, you are right about that. Edited.


"here's no reason that "cultural" lists need be mutable, or "cultural" tuples need be immutable."

er, yes there is. if a list is not mutable, doing list.append(something) in a loop would re-create the list on every iteration. the same reason you shouldn't add to a string in a loop, as strings are immutable.

to me, an immutable list is not a list.


Functional data structures are used a little differently from their imperative counterparts. Instead of saying

  list.append(x)
you would say something like

  list = list.withLast(x)
thus assigning the newly constructed list value back to the variable 'list'.

There is a small literature within computer science on efficient implementations of functional data structures. While some tricks are involved, I think it's safe to say it's a solved problem, thanks to people like Chris Okasaki and Phil Bagwell.


Have you used a functional language? (Lisp, OCaml, Erlang, etc.) Lists are immutable, yet adding an item to the front is O(1). That is achievable with even a basic singly-linked list implementation.


Of course, some other common operations on singly-linked lists, notably concatenation, are O(n).

There are better choices for general-purpose functional lists.


Lists aren't immutable in Lisp, though. More precisely, Lisp doesn't have lists: it has chains of mutable cons cells.


They're not technically immutable, but culturally Lisp programmers are not in the habit of mutating lists unless they know where and for what purpose the list was created. Mutating some list whose provenance you don't know -- maybe you got it by calling some library function that doesn't explicitly promise to return a freshly consed list -- is a bad idea, and experienced Lisp programmers know this instinctively.

So in practice you can often get away with pretending the lists you create are immutable, if that's what you want to do.


Culture isn't something I can take into account when reasoning about what a piece of code means. The actual semantics of the programming language in question is all I really have.


I doubt that. I think it would take too long to work out the meaning of every piece of code you see from first principles. I think any competent programmer has a large collection of unconscious heuristics they use to understand code. Learning to write code in such a way that the reader's heuristics will not mislead them is, I think, a big part of becoming a software engineer.

I was just trying to describe the state of practice in the Lisp world. I do agree that Lisp benefits from more functional data structures -- that's why I wrote FSet [0]! I would point out, though, that since Lisp lacks any kind of access control, there's nothing besides common sense and convention that prevents FSet client code from mucking about inside the guts of the FSet data structures. It would be a stupid thing to do, and everybody knows that, but the language doesn't prevent it. So the situation is not that different with FSet collections than it is with lists of unknown provenance.

Even in languages that nominally support access control, like C++ and Java, you can usually still get around it with casting or reflection. Most people know that they should play games like that only in very specific circumstances.

[0] https://github.com/slburson/fset


> I think it would take too long to work out the meaning of every piece of code you see from first principles.

Just as in mathematics, once you've proven something, you can reuse the result as often as you want. But “the community always does this” isn't a proof.

> I do agree that Lisp benefits from more functional data structures (...) there's nothing besides common sense and convention that prevents FSet client code from mucking about inside the guts of the FSet data structures.

Then Lisp doesn't have functional data structures. That's no surprise, Lisp has no more compound values than Python or Ruby do - that is, none.

> Even in languages that nominally support access control, like C++ and Java, you can usually still get around it with casting or reflection.

Standard ML actually has type abstraction, and, once you've made something abstract, there's no way clients can inspect it. No downcasts, no reflection, no anything - abstract means abstract.


> Then Lisp doesn't have functional data structures. That's no surprise, Lisp has no more compound values than Python or Ruby do - that is, none.

Ah, the arrogance of youth.


It isn't arrogance, it's just the semantics of the programming languages in question. Programming languages being such fundamental tools, you would think every programmer should understand them well, but that is far from being the case. It is tempting to blame programmers, but the real culprit is the subtlety and depth of the topic. Two great resources to get started are https://www.amazon.com/dp/0262201755/ and https://www.amazon.com/dp/0262690764/

Anyway. A value is something you can bind to a variable. Unfortunately, you can't bind the list [1,2,3] (or whatever syntax you might prefer) to a variable in either Python or Lisp, because those languages simply don't have lists (or any other kind of compound value).


I read back through your recent comments. I notice you get downvoted a lot. Does this bother you?


I don't particularly mind.


In my understanding, in LISP-59, the only list-accessing symbols are FIRST and REST. This makes me want to claim that in the one true Lisp, there are no mutable cells.

Actual useful implementations of Lisp, of course, may vary.

edit: s/list/Lisp/ in a particular place


Does it make sense to distinguish between “this 2” and “that 2”? Of course not! And indeed Lisp doesn't let you make that silly distinction.

Yet somehow it lets you distinguish between “this (list 1 2 3)” and “that (list 1 2 3)”. Distinguishing between copies of the same value isn't “useful”, it's a pain, because you need to jump to a metalanguage (say, Hoare logic) to get the values you actually want.


> Does it make sense to distinguish between “this 2” and “that 2”? Of course not! And indeed Lisp doesn't let you make that silly distinction.

It does, actually, if the integer is large enough:

  > (ash 1 100)
  1267650600228229401496703205376    ; 2 ^ 100
  > (eq (ash 1 100) (ash 1 100))
  NIL
This is why you should use 'eql', not 'eq', to compare integers. But again: Lisp programmers have to know that. We're back to culture and convention.

So much the worse for Lisp, I suppose you would say, and perhaps you have a point. Lisp is not Haskell.

I don't entirely disagree with you; it would be nice if Lisp had an immutable list type. (Like FSet seqs [0].) But again, in practice, it's not as much of a deficiency as you seem to think.

[0] https://github.com/slburson/fset


> It does, actually, if the integer is large enough: (snippet)

Wow, this is even more broken than I previously thought.

> Lisp is not Haskell.

Haskell is broken too, only in a different way: Since when does it make sense to have a value that's simultaneously a number, a string, a function and pretty much anything else imaginable? (`undefined`)

> But again, in practice, it's not as much of a deficiency as you seem to think.

It causes difficulties when you want to reason rigorously about your code.


> It causes difficulties when you want to reason rigorously about your code

That's what he said. In practice it doesn't matter.


> In practice it doesn't matter.

I reason rigorously about my code in practice.


I still haven't fully grasped what aspect of Haskell means you can't rigorously reason about your code.

I'm pretty sure you don't mean space leaks, and I'm pretty sure you don't mean global coherence of type classes, which are the two most popular complaints about reasoning about Haskell.

Other than that, can't you just say "this piece of code is at least as defined as the equivalent in a strict language" and leave it at that? What can go wrong?


> I still haven't fully grasped what aspect of Haskell means you can't rigorously reason about your code.

Oh, that particular criticism wasn't aimed at Haskell, but rather at languages without compound values. And I never said you can't reason rigorously about code written in such languages. You can, but it isn't very pleasant.

> I'm pretty sure you don't mean global coherence of type classes, which are the two most popular complaints about reasoning about Haskell.

Far more serious one: Haskell doesn't have a type whose value inhabitants are the integers and nothing else. I would like to say “even Lisp has that”, but nope, it doesn't either: https://news.ycombinator.com/item?id=12328095


> Haskell doesn't have a type whose value inhabitants are the integers and nothing else

Sure, I get that, but I don't get the practical consequences for reasoning. In Haskell, if you write

    myIntFunction !x = expression
then within expression x is an integer and nothing else, the same as if you'd written

    let myIntFunction x = expression
in OCaml. So I understand how it differs theoretically. I don't understand how it differs practically.

What's an example of something you wanted to prove in Haskell but couldn't, because it wasn't strict?


> Sure, I get that, but I don't get the practical consequences for reasoning.

(0) In a language without compound values, you need to manually prove that object states faithfully represent the compound values you care about.

(1) In a dynamically typed language, you need to manually prove that computations produce the classes of values that their continuations expect.

(2) In a lazy language, you need to manually prove that variables will never be substituted with nonterminating computations.

> What's an example of something you wanted to prove in Haskell but couldn't, because it wasn't strict?

All the usual type class laws, without the pesky addendum “assuming no free variables are ever substituted with bottom, globally”. (In all fairness, strictness alone doesn't fix the problem either. At least not when laws have free variables of function type.) Redefining what variables can be substituted with is redefining the language's operational semantics, which in turn is effectively creating a new programming language. You would have to prove all the usual things from scratch: that type checking is decidable, that the type system is sound, etc.

In practice, I can still prove the things I want, but the proofs themselves are less modular in a lazy language. The problem manifests itself when you build large computations from smaller ones:

(0) In a strict language, whether any of the subcomputations terminates is a local issue, and doesn't affect the other subcomputations.

(1) In a lazy language, failure of one computation to be productive may be observed as a termination problem in a different computation.


I don't.


> Wow, this is even more broken than I previously thought.

I don't agree [#]. Providing a predicate that tests for identity of implementation enables a useful class of optimizations. For example, if you know two sets are implemented by the same object, you don't need to compare their contents to know they're equal. The fact that the information can be misused doesn't make it useless. Non-leakiness of abstractions isn't an unalloyed good.

In fact, because all computations take time and space, no abstraction can be perfectly leak-free.

[#] Except if you want to make the point that the Lisp builtin function currently called 'eql' should have been called 'eq', and the function currently called 'eq' should have been called 'identical-objects-p' or some such long and explicit name. Formally this makes no difference, of course, but as a practical matter it makes erroneous usage of the latter much less likely and easier to spot.

> It causes difficulties when you want to reason rigorously about your code.

In practice, these are relatively minor. Consider for example my FSet collections library. One of the preconditions in the contract of each FSet operation is, of course, that the invariant required for correct operation holds on the object(s) on which the operation is invoked. Suppose we have proven, for each operation, that it correctly implements its contract: if the invariant holds on the argument(s), it holds on the return value.

So now we have an application, a complete program that uses FSet as a library. Among the many things we need to prove is that the preconditions of each FSet operation used in the application are satisfied. Because we've already proven that the FSet operations maintain the required invariant, we need to prove that nothing else in the program breaks that invariant. In a language with unbreakable access control, this would be trivial: no other code has access to the internals of these data structures. In Common Lisp, we have a little more work to do. We have to show that no other code invokes the FSet internal accessors by name; a few calls to 'grep' suffice for this. Then we have to show that there's no chance the code uses Lisp's reflective features to invoke these accessors indirectly: specifically, that no calls to 'intern' exist in the program that could return a symbol that names one of these accessors. This is probably easy. If the program calls 'intern' at all, it probably passes a constant string as the name of the package to intern into, so all we need to show is that that string is not equal to "FSET". If the package name is not constant, this starts to get a little harder, but (a) this is a rare case, and (b) if we can't easily prove that the possible package names come from a small, restricted set, the program probably has a security vulnerability and needs to be fixed anyway.

That's nonzero effort, I concede. But on the scale of all the things we have to do to verify our application, it's fairly trivial.

By way of contrast, let's look at what happens in languages where the default integer type has bounded precision. Now, to figure out the range of inputs over which some procedure satisfies its desired contract, we have to work backwards from each integer operation, looking at the range of values that would cause it to overflow, and figuring out what constraints that places on the inputs. This task can be quite difficult in complicated cases, and even in simple ones, the need to do it is easily overlooked (as shown by a famous example [0]).

I note that even your beloved SML doesn't mandate arbitrary-precision integers [1]. I can't even find, on its Web site, anything saying whether SML/NJ has them or not. Perhaps they are now considered a de facto requirement for a usable SML implementation -- in which case, you see, culture matters.

Look. Formal verifiability is a concern of mine too, and I understand where you're coming from. But this is not a black-and-white world, and practical considerations are also relevant. While language (mis)features do sometimes contribute to the difficulty of verification, most of that difficulty, I strongly suspect, comes from the complexity of the code itself. We could verify machine language programs if we wanted to, and though it would be harder, I can't believe it would be overwhelmingly harder.

[0] https://research.googleblog.com/2006/06/extra-extra-read-all...

[1] http://sml-family.org/Basis/integer.html


> I don't agree [#]. Providing a predicate that tests for identity of implementation enables a useful class of optimizations. For example, if you know two sets are implemented by the same object, you don't need to compare their contents to know they're equal. The fact that the information can be misused doesn't make it useless. Non-leakiness of abstractions isn't an unalloyed good.

You can have identity equality in Standard ML (resp. Haskell) too, it's just a matter of using the right types: a `foo` (resp. `Foo`) has structural equality, but a `foo ref` (resp. `IORef Foo`) has object identity equality. More precisely, equality is always structural, and identity equality is the right notion of structural equality for reference cells. So absolutely nothing is lost w.r.t. what you have in Lisp or Python. And you gain actual compound values.

> In practice, these are relatively minor. Consider for example my FSet collections library. One of the preconditions in the contract of each FSet operation is, of course, that the invariant required for correct operation holds on the object(s) on which the operation is invoked. Suppose we have proven, for each operation, that it correctly implements its contract: if the invariant holds on the argument(s), it holds on the return value.

Have you ever actually proved things about code? It's far from a “minor thing” when every operation has unchecked preconditions that you need to manually verify every time you call a function. The only reason I can prove things about my code at a reasonable scale is that I arrange things so that I have to manually prove relatively little, and the rest automatically follows from type safety and parametricity.

> (next paragraph)

You use the word “probably” a lot. An argument that uses the word “probably” can't convince me that a proposition is true in all of the cases.

> I note that even your beloved SML doesn't mandate arbitrary-precision integers [1].

Hand-rolling big integers using machine integers as the starting point is easy in Standard ML. The final step is very important: Using abstract types, I hide the representation of big integers from the rest of the program. As far as clients care, they are using big integers, not lists of machine integers, vectors of machine integers, or whatever. Unfortunately, this final step is impossible to perform in Lisp.

> I can't even find, on its Web site, anything saying whether SML/NJ has them or not.

I don't use SML/NJ, because it has a lot of broken extensions that are difficult to reason about, and it uses a questionable external language to automate builds, rather than SML itself. Instead I use Poly/ML and Moscow ML. Both have big integers.

> Perhaps they are now considered a de facto requirement for a usable SML implementation -- in which case, you see, culture matters.

Big integers are great help, but they aren't considered a de facto requirement. The only requirements are those in the Definition of Standard ML and the Standard ML Basis Library. As I said above, using abstract types, you can roll your own non-leaky big integer abstraction.

> While language (mis)features do sometimes contribute to the difficulty of verification, most of that difficulty, I strongly suspect, comes from the complexity of the code itself.

And the complexity of the code itself comes from the discrepancy between your problem domain and what you can readily express in your programming language.

> We could verify machine language programs if we wanted to, and though it would be harder, I can't believe it would be overwhelmingly harder.

Well, you are overwhelmingly wrong, unless you hand-wave away a significant chunk of your proof obligation.


I don't understand what this has to do with whether or not lists are mutable in lisp.

It's possible that you were misled by a silly typo in my comment, which I have just fixed.

With that in mind, could you clarify?


> I don't understand what this has to do with whether or not lists are mutable in lisp.

Lists aren't mutable in any language. A list is a sequence of elements. A different sequence is a different list. Some languages simply don't have lists.


no, how do they manage O(1) by making them immutable? the only way to add to a list would be to create a new one with the extra value added? I'd be very surprised if they had the same performance as a mutable list


Singly-linked lists are a fundamental computational structure. A singly-linked list is simply an item and a pointer to another singly-linked list. To add an item, simply create a new singly-linked list with that item and a pointer to the old list. Since you are not copying data, this is O(1).

Read more here: https://en.wikipedia.org/wiki/Linked_list#Singly_linked_list...

If this interests you, read also about tree structures such as red-black trees. These kinds of structures support all operations that mutable structures support in O(log n) time. Of course mutable structures can always be faster, but immutable structures are far better than linear time.
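
A toy version in Python, using tuples as cons cells (the names are mine):

  nil = None                 # the empty list

  def cons(head, tail):      # O(1): build one new cell, copy nothing
      return (head, tail)

  xs = cons(1, cons(2, cons(3, nil)))
  ys = cons(0, xs)           # a "new" list; xs is shared and untouched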


Singly-linked lists are a fundamental computational structure, and I'm not disagreeing with anything in the spirit of your comment, but I want to point out that linked lists are pretty much never the right implementation choice for an application programmer unless you're working in a language with no decent stdlib and you also care hugely about every byte of RAM/storage/etc. Which does happen, but is not very common, in practice.

For the benefit of the GP, what might be non-obvious about the way colanderman describes this, is that if you want lists that are cell-immutable but are resizeable with O(1) push() and pop(), you do your pushing and popping off the front of the list. So the pointer to the old list is still a valid pointer to its previous state. And at this point, yeah, go off and read the WP link. :)


A list is a finite sequence of elements. If you alter the sequence, you get a different list.


Great article!

One more important difference between tuple and list that's worth knowing about: syntax. Ned makes it seem like tuples have the syntax "(x, y)", but they actually have the syntax "x, y".

I find that a lot of Python programmers, even long-time ones, don't realize this important difference. It's what allows one to write code constructions like "x, y = 1, 2" or "return 1, 2".

One needs the parentheses only to "clarify" this syntax in certain cases; for example, as dict keys, the parens are necessary -- "x = {(1, 2): 2}". Although interestingly, for dict lookup, they are not -- "x[1, 2] == 2".

One way I remind myself of this rule is to realize that "x = 1," is the same as "x = (1,)", but "x = (1)" is actually the same as "x = 1". This is quite different from lists, which are syntactically defined by the "[]" brackets: you need them, or a "list()" call, whenever you make a list, making lists just that much more heavyweight as a data structure.
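
At the REPL:

  >>> x = 1,
  >>> x
  (1,)
  >>> x = (1)    # just a parenthesized int
  >>> x
  1
  >>> d = {(1, 2): 2}
  >>> d[1, 2]    # parens optional in the lookup
  2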


When I used to write Python, "x = 1," bit me way more than it should have. And I don't see why you wouldn't want to be explicit about your tuple construction every time.


I usually use x = tuple((1,)) so that non-Python programmers on the team know what they're dealing with.


It's much simpler to me. Tuples are used when you need a sequence but don't need to modify the contents, for instance returning multiple objects from a function call. Lists are used when you want to add things to it.

Namedtuples are wonderful! It's like a class with only attributes, except you don't have to define a new class and get slapped across the back of your head by more experienced programmers for creating a useless class. Access objects by index, by name, or by iteration. It's very versatile.
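
For example:

  >>> from collections import namedtuple
  >>> Point = namedtuple('Point', ['x', 'y'])
  >>> p = Point(x=1, y=2)
  >>> p.x        # by name
  1
  >>> p[1]       # by index
  2
  >>> list(p)    # by iteration
  [1, 2]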


If you like namedtuples, you'll love attrs. https://pypi.python.org/pypi/attrs/16.0.0


I like namedtuples, but attrs looks so wrong to me.


namedtuple is a reasonable way to define a new value type in python. unfortunately the very easy way to define a new type in python is to use `class`, which brings with it a bunch of behaviour that you often don't want or need: comparison by identity rather than comparison by equality, no default support for repr, no default support for ordering, mutability, support for inheritance, etc

that said, writing generic code that works transparently with both namedtuple values and objects can be a bit irritating: e.g. you might want to derive a new value/object from an existing value/object that updates one of the attributes of the value/object.
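
For namedtuples at least, _replace covers that:

  >>> from collections import namedtuple
  >>> Point = namedtuple('Point', 'x y')
  >>> Point(1, 2)._replace(y=5)  # new value with one attribute updated
  Point(x=1, y=5)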


First para: I think maybe you're confusing 'class' with inheritance from 'object'. A good tip is to do:

    class Foo(namedtuple('Foo', (...))):
        def __repr__(self):
            ...
Your 'Foo' gets ordering, hashing, comparison, because it derives from 'namedtuple'. It gets '__repr__' too, but in the example I'm overriding that.


I would also add

    __slots__ = ()
to your class definition so that instances of Foo keep the small memory footprint benefits of namedtuple.


> The Cultural Difference is about how lists and tuples are actually used: lists are used where you have a homogenous sequence of unknown length; tuples are used where you know the number of elements in advance because the position of the element is semantically significant.

Oh, if only it were so simple! Python itself doesn't understand what the heck tuples are, in that the built-in functions often treat tuples as simply "immutable lists". See isinstance() and issubclass() functions, which expect what's meant to be a list of classes, but you need to pass a tuple. *args is another such example, as the article mentions.

If what Python calls tuples were the thing with structure instead of order, they wouldn't be iterable.
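
For example (the exact wording of the error varies by version):

  >>> isinstance(3, (int, str))
  True
  >>> isinstance(3, [int, str])
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: isinstance() arg 2 must be a type or tuple of types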


There is an additional technical difference: Tuples are hashable and can be used as keys in a dictionary.

    grid[(1,1)] = True


The article mentions this.

Also you can write this as `grid[1, 1] = True`


Interesting article. I've always heard that about lists and tuples, but this article made me wonder how much less efficient with storage is a list.

The following trivial example shows that, at least for very small lists and tuples, except for the fixed overhead of 16 bytes for a tuple vs. 72 bytes for a list they're the same [edit: no it doesn't. I wasn't looking at the diff between a 1-element and 2-element tuple]:

Perhaps there's more to it?

  >>> from sys import getsizeof
  >>> list = [94]
  >>> getsizeof(list)
  80
  >>> list = [94, 37]
  >>> getsizeof(list)
  88
  >>> list = [94, 37, 21]
  >>> getsizeof(list)
  96
  >>> list = [94, 37, 21, 19]
  >>> getsizeof(list)
  104
  >>> tuple = (94)
  >>> getsizeof(tuple)
  24
  >>> tuple = (94, 37)
  >>> getsizeof(tuple)
  72
  >>> tuple = (94, 37, 21)
  >>> getsizeof(tuple)
  80
  >>> tuple = (94, 37, 21, 19)
  >>> getsizeof(tuple)
  88


Another (admittedly trivial) experiment with speed and storage: Iterating through a 10,000 element list was about 2% slower than iterating through a 10,000 element tuple, and the list was 16 bytes bigger than the tuple:

  #!/usr/bin/python

  from sys import getsizeof
  from random import sample
  from time import time

  l = sample(xrange(10000), 10000)
  t = tuple(l)

  print "The tuple's size is " + str(getsizeof(t)) + " and the list's size is " + str(getsizeof(l))

  s = time()
  z = 0
  for x in l:
    z += x

  print "Iterating through a 10,000 element list took " + str(time() - s) + " seconds."

  s = time()
  z = 0
  for x in t:
    z += x

  print "Iterating through a 10,000 element tuple took " + str(time() - s) + " seconds."


  doug@supermicro:~$ ./x
  The tuple's size is 80056 and the list's size is 80072 
  Iterating through a 10,000 element list took 0.00589299201965 seconds.
  Iterating through a 10,000 element tuple took 0.00538778305054 seconds.

  doug@supermicro:~$ ./x
  The tuple's size is 80056 and the list's size is 80072
  Iterating through a 10,000 element list took 0.00553607940674 seconds.
  Iterating through a 10,000 element tuple took 0.00536203384399 seconds.


This code seems inconclusive -- my first three runs showed that lists are the same, slower, AND faster! (updated for Python 3.5.2 on Windows)

  C:\Temp>python listtuple.py
  The tuple's size is 80048 and the list's size is 80064
  Iterating through a 10,000 element list took 0.0005006790161132812 seconds.
  Iterating through a 10,000 element tuple took 0.0005006790161132812 seconds.

  C:\Temp>python listtuple.py
  The tuple's size is 80048 and the list's size is 80064
  Iterating through a 10,000 element list took 0.0020074844360351562 seconds.
  Iterating through a 10,000 element tuple took 0.001031637191772461 seconds.

  C:\Temp>python listtuple.py
  The tuple's size is 80048 and the list's size is 80064
  Iterating through a 10,000 element list took 0.0 seconds.
  Iterating through a 10,000 element tuple took 0.001001596450805664 seconds.


Interesting! The previous example I posted was Python 2.7.12 on Ubuntu 16.04.01. Here is OSX 10.11.5 w/Python 2.7.10 (also inconclusive with respect to speed, but OSX was half the memory):

  mb:~ doug$ ./listtuple 
  The tuple's size is 40028 and the list's size is 40036
  Iterating through a 10,000 element list took 0.00113916397095 seconds.
  Iterating through a 10,000 element tuple took 0.00111603736877 seconds.
  mb:~ doug$ ./listtuple 
  The tuple's size is 40028 and the list's size is 40036
  Iterating through a 10,000 element list took 0.00126791000366 seconds.
  Iterating through a 10,000 element tuple took 0.00133585929871 seconds.
  mb:~ doug$ ./listtuple 
  The tuple's size is 40028 and the list's size is 40036
  Iterating through a 10,000 element list took 0.00114607810974 seconds.
  Iterating through a 10,000 element tuple took 0.00117492675781 seconds.


IIRC lists and tuples are implemented nearly identically in CPython. So you won't find much difference here.

Because Python is so dynamic, you can't make lists or tuples much better. However, in a static language (like Java or OCaml), you can greatly shrink the size of "cultural" lists by hoisting the (identical) run-time type info out of the elements and into the container. (OCaml does this with floats.) And "cultural" tuples typically can use much simpler implementations, since their sizes are fixed and known at compile time. (C structs are often optimized away entirely in optimized assembly.)


I'm really surprised by these numbers. There isn't really any speed advantage or significant storage advantage to using tuples over lists. That really cuts down on the rationale for using tuples over lists, except for the "cultural difference" as discussed (and the ability to use tuples as dictionary keys, which isn't something I think I've ever wanted to do).


(94) is not a tuple. (94,) is.


D'oh! Thank you, colanderman.

  >>> tuple = (94,)
  >>> getsizeof(tuple)
  64


When I taught Python, the way I explained it was that a tuple is basically one value, shrinkwrapped. It's one value so that you can put it anywhere that expects a value (such as a return value or a key of a dictionary). Also, it's one value so you cannot change it.

The way to look at a statement like (a,b) = (2,3) is that this is a normal assignment, where you imagine one value flowing through the equal sign, from right to left. It just happens to contain two values within its wrap.
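
Which is also why the classic swap idiom works: the right-hand side becomes one tuple value before anything is assigned:

  >>> a, b = 2, 3
  >>> a, b = b, a    # build (3, 2) first, then unpack it
  >>> a, b
  (3, 2)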


It's really too bad that functional data structures weren't much of a thing yet when Guido started developing Python. The language really cries out for them -- particularly with the "mutable initial value" problem and the 'set'/'frozenset' distinction.


Can you elaborate on how either of those, uh, "features" would be helped by functional "persistent" data structures?

In the case of the mutable initial value problem, that's just a choice of semantics (in my opinion, a very bad choice). Without having looked at the implementation, I would assume that if you have no backwards-compatibility concerns, a fix would take a few minutes.

In the case of sets/frozensets, most Python programmers want their sets to be mutable, in a syntax-sugar-y way that's incompatible with using functional persistent data structures.


Isn't it obvious? If the initial value is an empty functional list, set, or map (dict), there's no possibility of it being modified. The problem disappears.
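
For anyone who hasn't been bitten by it, the problem in question:

  >>> def remember(x, seen=[]):  # one list, created once at def time
  ...     seen.append(x)
  ...     return seen
  ...
  >>> remember(1)
  [1]
  >>> remember(2)                # same shared list as before
  [1, 2]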

> most Python programmers want their sets to be mutable

They're accustomed to mutable sets because those are all they've known. But ask a Clojure programmer how they feel about functional collections, and you'll be told they're fine. Functional collections are actually easier to use because you don't have to worry about unintended aliasing.

And functional behavior is not strange at all. No programming language I have ever used (and it's a long list) has used shared, mutable semantics for numbers. Consider, in C:

  int a = 3;
  int b = a;
  b++;
After this code sequence, 'a' is still 3, of course! Nobody with programming experience expects 'a' to be incremented at the same time 'b' is; it's not even something you stop to think about -- it's completely automatic. This is true even though I wrote 'b++' rather than 'b = b + 1' -- 'b++' could more easily be misread as a mutating operation on a shared integer object, like 's.add(x)'. However, there's no language I'm aware of in which that's what it means. (You can introduce aliasing intentionally in C++ with a reference, but this requires an explicit '&' in the code.)


The explanation is alright, but would have really benefited from a brief overview of tuple use in the C API.


The C API seems like a completely separate topic.


The C and Python APIs of CPython are really just different faces of the same thing. Due to the internal use of tuples, some of the Python behavior is governed by related C behavior.

For example:

> Another conflict between the Technical and the Cultural: there are places in Python itself where a tuple is used when a list makes more sense. When you define a function with *args, args is passed to you as a tuple, even though the position of the values isn't significant, at least as far as Python knows

This is the C behavior of tuple usage being exposed to the Python side. If one explores the C API, it becomes clear that an implementation detail of CPython is being exposed to Python.

> The Cultural Difference is about how lists and tuples are actually used: lists are used where you have a homogenous sequence of unknown length; tuples are used where you know the number of elements in advance because the position of the element is semantically significant.

Here the Culture has moved away from the Technical roots of Python, and now they're at odds. On the C side, tuples are not just used when the length is known ahead of time; tuples are not always immutable on the C side.

The context is important if you really want to understand.


I was asked this in an interview. This is essentially the answer I gave. :)



