How Python 3 Should Have Worked (2012) (aaronsw.com)
101 points by tchalla on Jan 4, 2014 | 76 comments



I agree with this 100%.

Python 3 is a disaster. The problem is they made the entire thing out to be a Big Deal, but they didn't really offer any compelling reason to upgrade. I mean, the unicode is... kinda better, and iterators are a bit improved, but couldn't those things have been point releases? 2.8? They basically said that python 3 was a new language, and then offered no significant reason you should use this new language. So we all kept using actual python, quirks and all. To me, as a python developer, Python 3 is a failed fork. Harsh but true.

IMO, if they were going to do that sort of thing, they should have had at least one killer feature. Like maybe if python 3 had been based off pypy they could be saying "Look! We're 5x faster! Want to upgrade now?". That would have been compelling. But their message was: "we cleaned up some stuff that most of you don't care about, and broke a bunch of things". Think about pitching that sort of upgrade to your boss. "Well it doesn't solve any of our problems, and it creates a ton of new ones, but it's the right thing to do because it makes some code slightly cleaner arguably! Convinced yet?"

If I were in charge of python, I would do this: announce python 4, have it be based on the pypy interpreter, and keep compatibility with python 2 the language while reforming the C extension APIs to be more future proof (for getting rid of the GIL and so on). (Or maybe get rid of them entirely and just have people use CFFI.)


"... they should have had at least one killer feature"

Or if Python 3 had included reliable (and updated) package and environment managers; and/or a default GUI framework (QT maybe) - out of the box.

The fragmentation between Python 2 and Python 3 is killing the language. Not to mention the community. Python needs a united front.


> a default GUI framework

Tkinter is Python's "default GUI framework". At least, that's what they continue to claim (https://wiki.python.org/moin/TkInter), and it's the GUI library I ran into first when I first learned Python.

How does it look and feel once you get started? Well, let's just say it's a bad sign if a GUI framework website has no screenshots. Even with increasingly hacky theming engines layered on top, it still is hard to get anything feeling close to native: http://tktable.sourceforge.net/tile/screenshots/macosx.html

Go ahead, you can start laughing...

(for anybody that is saddened by the above, there are thankfully binding libraries for Qt and Wx, both of which do get you fairly decent cross-platform widgets from within Python, and either of which would be better default GUI libraries in 2014.)


People have been making the argument that QT or Wx should be the default for a while, the problem is and remains A: licensing or B: complexity of shipping "batteries included" distributions. For A, PyQT's license might surprise you: [1]. Python can not distribute this with the rest of the essentially-BSD-licensed Python distribution. For B, for instance, none of the linux distributions particularly want "Python", which is often a base requirement, to pull in either of Wx or QT, both of which are quite sizable, and require their own stack of other things to come piling in too.

It's sad, but I'm not sure how to resolve the problem, and nobody else has been either in the past 10 years.

[1]: http://www.riverbankcomputing.com/software/pyqt/license


The solution is to stop shipping with batteries included and let people install via pip. Python could have an official opinion on what the best libraries are (might not be such a bad idea, actually), but they shouldn't bundle them unless they're very, very stable, tested and almost unchanging.

Requests might make a good addition to the standard library at some point. PyQT definitely shouldn't though.


There's a new alternative to pyqt, though its name escapes me now.


PySide (thanks lambda) appears to be LGPL; "better" than PyQT, but still not shippable in the core Python distro without changing the license of the core distro.

I think that's as "good" as a Qt binding can be, too; Qt itself is LGPL (or commercially licensed).


PySide, though PyQt still has certain advantages if you can live with the licensing (which is either GPLv3 or a £350 per developer commercial license).


> "... they should have had at least one killer feature"

I already commented on it and I agree. Python 2 is great and Python 3 is a little bit greater. But just not better enough to switch.

> Or if Python 3 had included reliable (and updated) package and environment managers;

I remember being at PyCon when distutils2 was announced. There was an aura of coolness, enthusiasm and hope in the air. Someone asked "but what about RPM packages?"; they laughed at him and Tarek hid behind the podium in a funny gesture. Years later I am still using the old default distutils package and building RPMs with it.


Unrelated:

You do know that PyPy also uses a GIL.

CFFI is for calling C functions from Python. It won't allow you to run C extensions written for CPython (something would have to implement the Python C API).


I do know that; although they've done some interesting work with STM recently and hopefully that works out.

I remember hearing about someone actually submitting a patch to CPython to remove the GIL about 10 years ago, but it was rejected because it made the language about 2-10x slower. IMO, that decision was INSANE. I have a 16 core machine right now and it can only really use 1/16th of it. Nobody uses Python for its speed, and people that need it are probably using C extension libraries like numpy, so if you were to cut the language's speed in half in the short term, I'm going to say that pretty much nobody would care, especially if it were to solve a problem that everybody hates. Ruby is already about 10x slower than Python on most tasks, and it's just not a big deal. GVR is optimizing for the wrong thing.


Have you tried multiprocessing? My experience is that people seriously overestimate how expensive processes are, especially on Unixy operating systems.
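
For illustration, a minimal sketch of what the parent means (the workload and numbers here are made up): spread CPU-bound work across worker processes with multiprocessing.Pool, so each chunk runs outside the parent interpreter's GIL.

    from multiprocessing import Pool

    def work(n):
        # CPU-bound toy workload; each call runs in its own worker process
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        pool = Pool()                           # defaults to one worker per core
        print(pool.map(work, [10 ** 6] * 16))   # chunks are distributed across the workers
        pool.close()
        pool.join()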


Not everyone's needs are the same. I'll not say that my concerns are better than anyone else's (of course), but they're certainly different than yours.

If Python were 10x slower than it is, I'd use Perl. When I started using Python for real (2004), I'd have stuck with Perl if Python had been 2x slower. (Note that I very much dislike Perl, and thought of Python as a nicer Perl when I first learned it.) I care about speed, and I also care about convenience. Python has a very nice balance of the two for me.

On the other hand, I only use threads in one script that I've ever written (and the GIL isn't a problem with it). So I just don't care about the GIL.

(Edit: re-worded second paragraph)


> Ruby is already about 10x slower than python on most of tasks

No it isn't - both are roughly in the same ballpark. Even before YARV it wasn't anywhere near an order of magnitude in the general case.


It's also really off-putting for beginners who try to learn the language. Python 3 is served as the main download when you search for it, yet when you search for tutorials, most of them are in Python 2 - and 80% of them DO NOT state whether they are for Python 2 or Python 3 (because most were made in the Python 2 days?).

So people try to learn to code with Python 3.x and get frustrated because the simplest things don't work.

They should seriously rethink the whole 3.x thing. I must say, introducing it probably did more harm than good for the future of this language.


This is the top result when you Google "python download":

http://www.python.org/getit/

It is the same as this page:

http://www.python.org/download/

That page gives fairly equal weight to the two versions (I guess that could have changed over time).

Edit: It might make sense to have a warning about matching the interpreter version up with the tutorial, but clear wording for it is not obvious to me.


People tend to click the first link far more often than the ones located below it.

Additionally, when a beginner sees this page, he sees two versions - one is 3 and one is 2 - and I think most people will choose the "newer" (newer = better?) version because they don't really know about the differences between the two.

The second part of the problem is the fragmentation of tutorials on other websites - which can't be fixed by changing the download page.


The reason for the "getit" page was, at least at one point, because the "download" page was blocked throughout China.


This strategy makes a lot of sense to me. I really like the "deprecated warnings" to "explicit failure" transitions; IMO this works really well, though I only have experience with it at a library level using semver. So I can't say much about point #1, except that this is essentially what you're doing when you test against the xlib-head branch.

Aaron implies that this approach was not taken with Python (non-Py guy here); could someone tell me what the reasoning was behind that? Too much legacy code in the Py2 code base, just wanting to start from a clean slate, or what?

EDIT: (to add another question :) if one could efficiently accomplish what Aaron is talking about, what would be the best way to go about it? @pak, would you really have to load 2 stdlibs, or is there really no (efficient) way around the syntax errors?


There are pieces of Python 3 that are syntax errors in Python 2. And there are Python 2-isms that are valid syntax in Python 3 but have a different interpretation. It's not as simple as importing certain features (which creates a sort of language version hybrid).

The idea to allow one project to switch between Python 2 and Python 3 for individual files is more interesting, but practically speaking would lead to sort of a mess.


>The idea to allow one project to switch between Python 2 and 3 for individual files

Yes, I believe key parts of the standard object model changed between the two (e.g. strings vs bytes, many of the magic methods and operators) making this nearly impossible. Every time objects would pass back and forth, they'd have to be converted, which is wasteful and bug-prone (and this is a whole mess of library code that the python3 guys probably did not want to write). You'd also need to load two different standard libraries, which would waste memory.

You only need to scan through the upgrade feature list to see how hard intercompatibility would have been. http://docs.python.org/3.0/whatsnew/3.0.html

Although I totally agree with Aaron that this would have allowed people to actually use Python 3 without fear, anything short of forcing the entire program and all of its modules to run in v3 mode as opposed to v2 mode would have been a disaster from a reliability and technical design standpoint. And that's closer to how things actually went down with 2to3, etc.


Thanks for that link; now I see the problems. Text data vs. Unicode alone would be an enormous overhaul, though the syntax changes don't seem that problematic.


They aren't. And there are automatic tools for converting between them (2to3 and 3to2), along with those __future__ imports that Aaron mentioned: doing "from __future__ import print_function, unicode_literals, absolute_import, division" would give you most of the Python 3 syntax changes in Python 2. The "everything expects bytes" to "everything expects text" change is the biggest hurdle for a lot of projects.
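
As a rough sketch, a Python 2 module that opts in to those changes up front might look like this (trivial example, not from the article):

    # Python 2 module opting in to most of the Python 3 syntax changes mentioned above
    from __future__ import print_function, unicode_literals, absolute_import, division

    print(type("text"), 1 / 2)   # <type 'unicode'> 0.5 -- a text literal and true division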


This link is posted from the [dead] winstonian:

http://python-future.org/


I think the strategy Aaron talks about makes a lot of sense. I especially like the idea of simply shipping future interpreters that can work with both 2.x and 3.x code. Seriously, it makes it even more dead-simple to get started with Python 3.

We can extend Aaron's ideas to even more radical ideas, for example, instead of allowing Python 3 and 2 code to be mixed on a per-file basis, allow it to be mixed on a per-function basis. In fact, allow running Python 3 code, and drop in "backwards incompatible" blocks inside of a function to let you program things that will be backwards-compatible. In other words, let people program in Python 3 as much as they want, but allow them a way to use libraries that only support Python 2 without making a mess. I'm not saying this will be easy at all, but it will definitely make Py3k adoption actually happen.

On the meta level, I'm really glad people are now discussing how to get the Python 3 rollout happening, because we really are dangerously close to having a "dead" language in Python if nothing changes.


This misses the point of why Python 3 was invented: Unicode.

Python 2's string handling is broken in the presence of unicode characters, often leading to subtle errors that wouldn't cause exceptions until far away from the place where the error was introduced, and oftentimes didn't produce exceptions at all, just wrong data. Strings were defined as sequences of bytes, and then provided a .decode method to convert them to a unicode object that stores them as a sequence of codepoints. The problem was that a large number of libraries (including all of Aaron's that I've looked at) used str as their internal string type, which meant they were storing a sequence of bytes in an arbitrary encoding but not storing the encoding along with it. If you pass such a library a string in a different encoding, it will happily store it, manipulate it, and concatenate it with other strings. If you pass such a library multiple strings in multiple encodings (like, for example, if you're pulling data from multiple webpages), you will get garbage data that can't be decoded in any codec.

Python 3 changes this so that str stores unicode codepoints and there's a separate 'bytes' type for uninterpreted bytes, and you are supposed to decode your bytes into strings at system boundaries. This is recommended software engineering practice for anyone who builds large systems that have to interact with foreign-language text; however, a large number of Python developers work in English-only environments where anything they receive will automatically be ASCII. They've never tried to track down subtly broken encoding issues; for them, the decode step is extra busywork that seems pointless.
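
A minimal sketch of that decode-at-the-boundary pattern in Python 3 (the socket functions and the UTF-8 codec are assumptions for illustration):

    # Python 3: bytes exist only at the edge of the system; str everywhere else
    def read_line(sock):
        raw = sock.recv(1024)          # bytes off the wire
        text = raw.decode("utf-8")     # choose the codec once, at the boundary
        return text.strip()            # from here on everything is str (codepoints)

    def write_line(sock, text):
        sock.sendall(text.encode("utf-8"))   # encode again on the way back out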

The reason the Python2->3 transition has been so painful is that it involves a whole language ecosystem fixing bugs in their software, but the bugs are subtle enough that the vast majority of people doing the work will never have encountered them.

You can't just use the "from __future__ import python3_unicode" support because this is a change to the semantics of an existing language feature. In Python2, a string is a sequence of bytes. In Python3, a string is a sequence of unicode codepoints. What happens when a Python3 program calls a Python2 library with a string object? Do you try to auto-convert the strings? You can't, really, because strings in Python2 don't specify their encoding; you have no way of knowing which codec the Python2 library meant, because chances are they didn't think about it.

The other major change in Python 3 - iterators everywhere - is similar, and it's a recognition that an increasingly large proportion of the programming ecosystem lives in a world where async operation is important and many concurrent activities may be happening at once. And I'm really glad to see Python willing to take on these challenges even with 5 years of short-term pain, because it shows a commitment to keeping Python relevant for the issues that 21st-century programmers will face. An increasing number of software platforms will have to deal with non-English text; an increasing number will need to handle concurrent, event-based environments. Without these changes Python would basically cede these areas to languages like Go or Javascript that have their unicode story straight and are well-adapted to async programming.


Ok but... you could do unicode in Python 2, it just wasn't ideal. The problem is that Python 3 doesn't actually solve most people's actual day-to-day problems.

Here are real problems with python:

* It's slow (excluding pypy)

* The C interface sucks (compared to something like Lua) and holds back language progress

* It can't handle multicore well outside of multiprocess hacks (which are sold as "the right way" -- bullshit. Sometimes threads are useful).

* Lambdas/closures are unnecessarily limited (I don't buy the whitespace/syntax argument -- look at how Boo works. You can do this just fine while keeping it pythonic).

* Explicit "self" is stupid and most people hate it. Javascript and Ruby are comparable languages, and neither of them need this while still having the exact same flexibility as python.

* (Down somewhere near the bottom:) strings should probably be unicode by default.

Python 3 doesn't solve any of the first five major problems, and the last problem can be worked around in python 2.

You've correctly identified problems with python 2, but I think you're incorrectly giving them more weight than they deserve. Most people just don't run into those issues, and don't care, and that's why python 3 is dead in the water -- because it doesn't solve the real pain points of python enough to make people want to upgrade.


I agree with your general point -- I haven't switched to Python 3 because it doesn't really solve anything I need solved at the moment. I would incur a significant time (=money) penalty for converting all my code to use it without a perceived benefit. I would have to get, say, a 30% speed increase or a 30% decrease in lines of code to get off my butt and start converting to Python 3.

I want to in principle, don't get me wrong, but I just don't have the time and money to do it, especially at the opportunity cost of not doing other product related stuff.

> * It's slow (excluding pypy)

Don't agree. Python is good enough for me. Point being "slow" and "fast" are just invitations for flame war without a specific benchmark or use case.

> * The C interface sucks (compared to something like Lua) and holds back language progress

As other post mentioned, try python cffi. That one is pretty good.
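
For anyone who hasn't tried it, a minimal ABI-mode sketch of what cffi looks like (assumes a Unix-like system where the C math library can be loaded as "m"):

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("double sqrt(double x);")   # declare the C signature we want to call
    libm = ffi.dlopen("m")               # load libm; the library name is platform-specific
    print(libm.sqrt(2.0))                # 1.4142135623730951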

> * It can't handle multicore well outside of multiprocess hacks (which are sold as "the right way" -- bullshit. Sometimes threads are useful).

Meh, this is often parroted. In what I do (network and server io stuff) threads work very well!

> * Explicit "self" is stupid and most people hate it. Javascript and Ruby are comparable languages, and neither of them need this while still having the exact same flexibility as python.

Completely disagree. This is a terrific feature. I hate implicit hidden defaults and assumptions. All the other languages have an implicit this/self; Python makes it explicit. That is a good thing in my book.


>I hate implicit hidden defaults and assumptions.

Your implicit hidden default is my handy abbreviation.

Imagine if we didn't have contractions in English.

For that previous sentence, is it really at all frightening that the "didn't" meant "did not" but we haven't (ha) written it out? Could it possibly mean anything else? Wouldn't English be all the more stilted and ugly if there were only one explicit and verbose way (a Python design principle) of negating verbs?

If I redefined the "self" as any other name Pythonistas would hate on me for being unidiomatic. Most editors highlight "self" in anticipation of what it means. In fact, I have never seen code where specifying another name would be justifiable. The only conclusion from this situation is that it is a de facto keyword and re-specifying it every. single. time. is a waste of programmatic breath.

And for this reason, every time I tab between languages and get my favorite "function takes 1 argument (2 given)" error, (by the way: a completely bewildering message for a programmer from every other language, rereading the method definition and wondering where the second argument is coming from [1]), I mutter curses at my terminal for Python relentlessly carrying this wart into its third decade.

Implicit and explicit are relative to expectations. Python moves the implicitness into the way arguments are passed to methods. It is not any more explicit when placed in the context of all other OOP languages, but rather an eccentric convention.

[1]: 800 results on SO. Read them and weep. http://stackoverflow.com/search?q=Python+takes+arguments+giv...
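
For anyone who hasn't hit it, the error comes from code like this (made-up example; the exact wording of the message varies by version):

    class Greeter(object):
        def greet(name):                 # the explicit 'self' parameter was forgotten
            print("Hello, %s" % name)

    Greeter().greet("world")
    # Python 2: TypeError: greet() takes exactly 1 argument (2 given)
    # -- the instance is passed implicitly as the first argument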


I'll agree with this entirely. As a newcomer to Python, I found the explicit "self" argument to be confusing. And now I just find it irritating. It breaks convention with every other OO language; it's additional meaningless noise in method definitions; and most importantly it violates the convention used by nearly every other programming language I'm aware of, which is that method definition argument lists and method call argument lists should match up.

When method call argument one is method definition argument two, things get confusing quickly. What's the benefit to readability? Python has plenty of other implicit rules (semantic whitespace being the most troublesome--though ironically its primary benefit of readability runs counter to the visual noise of the explicit self arg), and as pak points out, explicit "self" in the method def just moves the implicitness elsewhere in the code.


I like the explicit self. It makes the language much simpler, clearer and less "magic".


> It breaks convention with every other OO language

Lua, Go, and Rust all have explicit self.


Just for the record, you can do cool stuff that you couldn't do (as easily) if you had implicit `self`. Here's an example[1] that I'm moderately proud of.

[1] https://github.com/MediaCrush/PyCrush/blob/master/pycrush.py...
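
(I can't speak for the linked code, but as a generic illustration of the kind of thing explicit self enables: a method is just a function whose first parameter is the instance, so it can be written outside the class and attached afterwards.)

    def describe(self):
        # an ordinary function; nothing marks it as a method
        return "<Record %s>" % self.name

    class Record(object):
        def __init__(self, name):
            self.name = name

    Record.describe = describe              # attach the free function as a method
    print(Record("example").describe())     # <Record example>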


This just seems like flamewar material, much in the same vein as programmers (brogrammers?) that get red in the face over Python cramping their style by forcing structure via whitespace.


It's not flamewar material. GVR himself has acknowledged everything (edit in response to @EdwardDiego: most of what) I said. (http://neopythonic.blogspot.com/2008/10/why-explicit-self-ha...) The main contention holding back GVR and 3.1 from fixing this once and for all is that decorators currently need to manipulate the explicit self. From my perspective, it seems like Python decorators are [ab]used to reimplement basic features in other OOP systems, like class methods, and there are elegant ways of getting around this given a shift in certain concepts (like class << self in Ruby). It's odd to me that they were willing to break so much code with 3.0 (by "fixing" strings and Unicode) but a couple decorators held this one back.


> GVR himself has acknowledged everything I said

I hate to be picky, but he only acknowledged portions of what you said.

> but a couple decorators held this one back.

@classmethod and @staticmethod are somewhat important language features to not break.


>@classmethod and @staticmethod are somewhat important language features

And strings aren't? Whatever, I understand that there are tradeoffs and to people that only use Python, the aesthetic blemishes tend to matter less. To people that use it in the context of other languages, it sticks out like a bad paint job or a prominent stain. This, and many more opinions than I could express succinctly, are on the reddit thread for GVR's post.

http://www.reddit.com/r/programming/comments/79h9y/guido_van...

@redditrasberry: To make a bad analogy, it's like a stain on your carpet in your front hallway. If you live with it for long enough you won't even see it any more. But to visitors coming to your house it's the most obvious thing. And it's particularly noticeable because the rest of Python is so nice - it's like I'm visiting an art gallery and everything is beautiful and pristine, but there on the carpet at the entrance is this huge stain that nobody has ever cleaned up.


> And strings aren't?

Not sure how explicit self breaks strings? You've lost me on the way I'm afraid.

If we're quoting Reddit comments at each other:

> You may be used to a different kind of magic. Perhaps the magic of a variable called this appearing inside your method, or the magic in which an un-prefixed variable name somevar is sometimes a local but at other times a member variable this.somevar. So it may take a little time to get used to Python's style, but isn't it only because you are used to the other magic?

Explicit self is hardly a burden after about 10 minutes of learning Python, and there's really little objective argument to be made in favour of or against it as opposed to everything else.

You quoted JavaScript as 'not needing it' before, and dear God, JS is the worst of all languages to reference, given how the this keyword is a calling context, which every man and his framework takes liberties with.


Re: strings, I was referring to what I said already and you quoted only part of it:

> It's odd to me that they were willing to break so much code with 3.0 (by "fixing" strings and Unicode) but a couple decorators held this one back.


How much code did the new strings break, exactly? What code it may have broken most likely benefits from the new unicode treatment; it was the worst thing about Python strings.

Removing self would be a far more massive change - it wouldn't just impact certain strings, it would impact every Python class.


>> GVR himself has acknowledged everything I said

> I hate to be picky, but he only acknowledged portions of what you said.

I read it, and to me he did not acknowledge anything at all.


They said they weren't going to fix everything with Py3.


Please do not confuse dislike of semantic whitespace with a preference for no structure. I want my blocks to be explicit so the computer can know the difference between incorrect indentation and a block ending. I also do not want to make tabs versus spaces any more flamewar-inducing than it already is.


Don't get me wrong. I use python professionally. I like it. I know about CFFI. I'm just saying, it has its warts.


> I'm just saying, it has its warts.

That statement is too broad to be useful though. All languages have warts.


To be fair, he did list a number of specific warts above.


> Explicit "self" is stupid and most people hate it. Javascript and Ruby are comparable languages, and neither of them need this while still having the exact same flexibility as python.

I love explicit "self", are unfamiliar with these "most people" you refer to, and Javascript makes my eyes bleed. The "most people" community I'm familiar with runs from javascript like a burning building, with an honorary mention to its exquisitely horrendous behavior w.r.t. "this".


> Sometimes threads are useful).

When looking at a performance problem and given these two options:

1. Write the critical section in a faster language, in serial (i.e., rather than a dynamic interpreted script, maybe compiled bytecode, or maybe even native machine code).

2. Write the critical section multithreaded in the scripting language.

I would never in a million years think to use #2 first. I would always just move my CPU-bound code into a tiny C++ library and only worry about threading as a matter of last resort. You get so many huge leaky problems from going to multithreading (even if you got rid of the GIL you would be looking at variable synchronization, atomic timings, and cache coherency) that it is never worth it over just writing the same code section in native code and using a native call API, even the default way of writing your native code with Python.h involvement.


Your argument is essentially that one can write slow things in Python and fast things in C, and that this solves the majority of problems (let's even be generous and call it 98% of them that fit neatly into these two categories). The trouble here is that "the number of programs" is a large number, and 2% of a large number is still large, and leaves important classes of programs without a good solution.

One class of programs left out in the cold is the network server. Now a network server must respond to a large number of requests. From a basic software engineering perspective, a 16-core machine should be responding to AT LEAST 16 requests at once (much more if some of the requests are IO bound). So the network server needs some kind of parallel processing (whether threads, subprocesses, or whatever you want to suggest). Under your philosophy, programs that need threading (thus all network servers) should not be written in Python, but somebody should be stepping down to C. While it is probably true that a very small minority of network servers should not be written in Python, the broader claim is absurd; you should be able to write reasonably-performing network servers in Python with relative ease. It is, after all, a server-side language; "writing a server" should be very high on the list of "things you can do".

Now more broadly, the existence of greenlet, Twisted, gevent, and their popularity (we're talking top-100 packages here) speak to the fact that there are a LOT of python programmers who have threading-related requirements. Are they on crack? Now mix in the new standard library stuff like asyncio (3.4) and threading is clearly an important enough issue to get major attention from the core committers. Are they on crack?

Now you might operate in a world where every time you need threads is an isolated case and it's fairly simple to drop down to C. But there are a lot of people (in absolute terms; I don't know if they are in the majority) where when they want threads the right solution is to use threads.

The thing I hear from the core committers whenever the GIL comes up is "if we worked on the GIL, we would be taking lots of time away from more important things." But when you look at the things they work on instead--unicode, iterators, ordered dictionaries, argparse, etc.--plenty of people in this thread are insufficiently motivated to upgrade. Are ordered dictionaries really more important than GIL work? To me, the answer is clear. I would rather have some progress on the GIL problem than every single py3k feature combined.


So here's some perspective on the concurrency problem. I write network servers for a living - most are in C++ or Java, but I would love to be able to use Python.

There are a number of high-level approaches you can use to concurrency. Shared-nothing processes. Threads and locks. Callback-based events. Coroutines. Dependency graphs and data-flow programming.

They all suck, and they all suck in different ways. Processes have large context-switching overheads, and take up a lot of memory, and require that you serialize any data you want to communicate across them. Threads and locks make it very easy to corrupt memory if you forget a lock, very easy to deadlock if you don't have a clear convention for what order to take locks in, and end up being non-composable when you have libraries written under different such conventions. Callbacks require that you give up the usage of "semicolon" (or "newline") as a statement terminator; instead you have to break up your program into lots of little functions whenever you make a call that might block, and you have to manually manage state shared between these callbacks. Coroutines require explicit yield points in your code, and open up the possibility of a poorly-behaving coroutine monopolizing the CPU. Dependency graphs also require manual state management and lots of little functions, and often a lot of boilerplate to specify the graph.

Python has a "There should be one - and only one - obvious way to do things" philosophy, and with asyncio, Guido seems to have decided that the obvious way for Python is going to be coroutines. It's an interesting choice, and he's not alone in that - I recall Knuth writing that coroutines were an under-studied and under-utilized language concept that had many desirable properties. Coroutines free you from having to worry about your global mutable state potentially changing on every single expression, and they also give you the state-management and composition benefits that explicit callbacks lack.

There are parts of them that suck - like having to explicitly add "yield from" at any blocking suspension point, and having to propagate that "yield from" down the call stack if it is added to a synchronous call. But having written a bunch of threaded Java server and (desktop) GUI code, a lot of callback-based Javascript, and a lot of C++ in both callback and dependency-graph style, all of those models suck a whole lot as well.
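
A minimal sketch of that coroutine style with asyncio as it ships in 3.4 (the host and request are arbitrary; the point is only where the explicit "yield from" suspension points sit):

    import asyncio

    @asyncio.coroutine                      # pre-async/await coroutine style (Python 3.4)
    def fetch_status(host):
        reader, writer = yield from asyncio.open_connection(host, 80)
        writer.write(b"HEAD / HTTP/1.0\r\nHost: " + host.encode("ascii") + b"\r\n\r\n")
        status = yield from reader.readline()   # every blocking point is an explicit yield from
        writer.close()
        return status

    loop = asyncio.get_event_loop()
    print(loop.run_until_complete(fetch_status("example.com")))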


While your approach works most of the time, there are many cases where there isn't such a thing as a "critical section" or a "hot loop". Performance-relevant sections can be spread out across hundreds of places, and reimplementing all of them can be less efficient than a complete rewrite. Guessing the performance characteristics of complex programs (e.g. ones not just computational in nature) is hard, and therefore so is the language choice.


Well sure, I wouldn't go with #2 either, but the GIL blocks you from doing fast processing in your C thread while the rest of the program continues, because it locks the entire process. So whether you write native code or not, you're stuck on one processor.


No, even Python's own C code releases it when doing things like IO.


There're already huge discussions on explicit self, FFIs, and speed, so I'm not going to address those. I largely agree with you on both counts - I'm annoyed by them as well, but don't think that my opinion is going to change the Python maintainers'.

However, I'll point out that they're following a very sensible "stop the bleeding" maintenance strategy in targeting Unicode and async first. Whenever you need to upgrade a massive old codebase to a new way of doing things, your first priority needs to be to stop things from getting worse. That means getting everybody onto the new system, either via a shim or (worst case, but true in this case) by fiat, and then cleaning up the older code and introducing new cool stuff that's enabled by the new features.

The big problem with Python 2's Unicode handling was that it made things easy on people who handled only ASCII and then dumped the problem on top of people who wanted to handle Unicode and still rely on libraries from people who didn't know what Unicode was. As a result, the latter people just didn't use Python, because of the pain involved. Guido and the core maintainers (correctly, IMHO) identified this group of people as important to the future of Python, and so they want to make things viable for them even if it's inconvenient for a large section of Python's existing userbase.


Agree 100%. I would add the awful standard library (including its abysmal documentation). For instance the HTTP client API: it must be the most awful API I've ever seen, and urllib2 is no improvement at all.


You are right, this is exactly what's wrong with Python. I use Python as my "daily driver" in terms of programming but I have some pet peeves, which you enumerated for the most part.

Special emphasis on the multicore stuff. It's just dreadful. I tried to build a complex processing system on top of `multiprocessing` and friends but just had to give up because it doesn't work half the time for obscure reasons (here's a fun one: you can't call a function from a `multiprocessing.Pool` if the function had been defined after the pool was instantiated. wat.) I just drank the kool-aid and used Celery, which is awesome, but I shudder every time I think about how many sacrifices must've been made to make it work.
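
A hedged reconstruction of that Pool gotcha (behaviour as I understand it with fork-based workers on Unix; details vary by platform and version):

    from multiprocessing import Pool

    pool = Pool(2)          # the worker processes are forked right here...

    def square(x):          # ...so this later definition never reaches them
        return x * x

    # The workers look 'square' up by name in their own, pre-fork namespace and
    # can't find it; depending on the version this surfaces as an AttributeError
    # in the worker or as a map() call that never returns.
    print(pool.map(square, [1, 2, 3]))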


Insofar as C interfaces go, look at cffi.


I think it's a bit funny that you praise Lua's C API in one point just to bash explicit self in another.


These all sound like reason to use Lua over Python then (and especially LuaJIT for FFI).


The Unicode solution chosen by Python 3 required a major backwards incompatibility, but that wasn't the only possible solution to Python 2's Unicode problems. The biggest problem in Python 2 was simply that the implicit conversion between bytes and characters used the ascii codec and would fail if any non-ASCII characters were found. Python 3 chose to solve this problem by forcing the developer to specify a codec explicitly whenever this conversion happened. If instead they had simply changed the default codec from ascii to utf-8, many of the ubiquitous Unicode-related bugs in Python 2 code would have simply been fixed in a mostly backwards-compatible way. This wouldn't have fixed the problems for people who commonly use non-UTF-8 encodings, but that seems to be rare and getting rarer (in my experience; my understanding is that the biggest remaining exception is the use of UTF-16 on Windows).
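
A minimal illustration of that implicit-ascii failure mode in Python 2 (the strings are made up):

    # -*- coding: utf-8 -*-
    # Python 2
    name = u"Jose"          # unicode
    note = "naïve café"     # str: UTF-8 bytes as they appear in this source file

    print(name + u" " + note.decode("utf-8"))   # fine: the codec is stated explicitly
    print(name + " " + note)                    # implicit ascii decode of 'note'
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ...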


This! Exactly this. The stupid decision to use ascii as the default codec (and to disobey the system locale while doing so!), combined with silent Unicode coercion, doomed Python 2 to a lifetime of i18n pain. Just changing those two behaviors would have provided a much more evolutionary path forward without opening the Python 2 to 3 chasm.


This needs more upvotes. You're exactly right.


Let's assume that you're right, that the main reason for Python 3 was Unicode. I've dealt with Unicode both with and without language support (in Python and elsewhere), and this is a pretty important feature. I think that the world in general, and the Python world in particular, is better off with Unicode support in the built-in str class.

That said, other languages -- most recently and notably, Ruby -- have managed to make the Unicode transition without leaving much of the existing user base behind. I realize that Python is often used in larger and slower-moving organizations than Ruby, and also has a more conservative philosophy. Even so, perhaps they could have announced Python 2.8, identical to 2.7 in every way except that strings are now Unicode. Then, after 1-2 years of everyone getting their Unicode house in order, we could have moved onto 2.9 or 3.0, and/or used the "from future" syntax.

There is no perfect solution, but it seems to me that by breaking so many things at once, we're in a situation where everyone wants to upgrade, but no one sees the overwhelming benefit from doing so, and thus puts it off even further.


Personally I dislike how Python3 is infected with Unicode. It introduces quite some complexity (and probably inefficiency) when all you want to do is work with bytes. Even if I am actually dealing with UTF-8, I should have a choice to treat it transparently, and basic logic says that it should be easier not harder than working on code point level. In particular what I dislike most is the complete removal of formatting on bytes (% and .format()).

I also believe that dealing with unicode more transparently is a very reasonable strategy. There are not many places where you actually have to deal with code points. Text formatting and rendering is one of those few places. Having the file i/o and standard streams do encoding/decoding by default goes against that design.


It's exceedingly rare that you need a sequence of codepoints. As evidence, consider that the native string types in many languages with pervasive Unicode support, such as Java and C#, do not provide a codepoint sequence.

Programs generally need to deal with textual data at one of three levels:

1. Manipulate strings and substrings. e.g., "does string x contain substring y"? Byte sequences of utf8 data are fine for this. You may need to normalize the data first, but that's true of codepoint sequences as well.

2. Deal with a small number of specific ASCII codepoints--e.g., the filename separator character. Again, utf8 byte strings are fine for this.

3. Deal with glyphs, including glyphs composed of multiple combining characters. You need to iterate to assemble the glyphs, so a codepoint sequence offers no advantages over a byte sequence of utf8 with a codepoint iterator.

A sequence of bytes is just fine as a native string type. Store Unicode data as utf8-encoded byte strings. Provide easy iteration over codepoints in utf8 strings, transcoding for I/O, normalization functions, etc. Call it a day.


The problem isn't access to individual codepoints, it's mixing strings with different encodings without any way of tracking the encoding or knowing that you're doing something wrong.

UTF-8 everywhere works great when you can enforce it. On the level of an individual project, you can enforce it. On the level of a language ecosystem, you can't, and you need to. Otherwise you end up with some libraries that assume their internals are UTF-8 encoded strings, some libraries that assume they are ASCII strings, some libraries that assume the caller is handling encoding issues and will take care of ensuring that it's all UTF-8, some that make no assumptions at all and carry around the encoding everywhere, and some that just haven't thought about the problem and silently break when you use them with data that came from other libraries.

It's interesting that Go and Java - both languages with mature Unicode handling - still have a distinction between uninterpreted bytes and UTF-8 or UCS-2 text. In Go, you have separate []bytes and string types, even though Go strings are just UTF-8 byte sequences. In Java, you have byte[] and String. The problem is not a technical one of how to represent strings, it's a social one of how to get all the library authors on a language to agree on a convention of how to handle encoding.


> it's a social one of how to get all the library authors on a language to agree on a convention of how to handle encoding.

Ok, but the solution isn't to make everyone rewrite their libraries by holding the future hostage if they don't.

It could have been real simple: let python track encoding when it's known and throw an exception if they mix oldstr(unknown) with newstr(py3).

In py3 modules newstr is str, and in py2 modules oldstr is str. Now you only have to fix the bits where they mix and the programmer can always choose to fix it in the new code.


This is the first time I've ever seen someone write with an understanding of combining characters, glyphs, codepoints vs encoding of said codepoints - and yet arrive at this conclusion.

What's the largest codebase you tried a unicode-ification project on? It's a nightmare unless you keep de/encoding as close to the i/o operations as possible.

I can't understand how you've ever found it just as easy to do "string x contain substring y" on bytes vs uc strings. Any case-insensitive test will fail miserably unless you only ever see ASCII input. Then there's sorting and tokenization. Oh god, the sorting bugs...

Even measuring the length of a string is a miserable fail. And blind substitution of utf8 bytes horribly mangles the output, causing mysterious segfaults or silent corruption.

On a large codebase, programmers can't keep track of what encoding is being used in which parts of the code. Eg. Let's allow the users to specify input file encoding! But our OS does filenames in UTF16-LE. And the Web API is UTF-8... nasty stuff. It's far saner to use character strings everywhere except immediately after/before I/O operations.
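
A small illustration of the length and case problems (behaves the same with unicode/str in Python 2 and str/bytes in Python 3):

    # -*- coding: utf-8 -*-
    text = u"naïve"                 # character string
    data = text.encode("utf-8")     # the same text as UTF-8 bytes

    print(len(text))    # 5 characters
    print(len(data))    # 6 bytes: 'ï' takes two bytes in UTF-8
    print(text.upper()) # NAÏVE -- the case mapping knows about 'ï'
    print(data.upper()) # the two UTF-8 bytes of 'ï' are left untouched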


About Unicode: it might be right that this one could not easily have been provided in a compatible way. But what they did was worse. They made it unnecessarily difficult to move from Py2 to Py3 -- one reason was the literals. In Py2 it was common to use u"" for Unicode literals. Py3 did not accept that syntax at all, since every literal was Unicode -- logical, right? Wrong! Just stupid! Thus laying one more roadblock for Py3. Many Unicode programs had thousands of places with these literals (syntax error!) -- change every one, and even worse, once you change them, Py2 treats them differently, because in Py2 literals without u are not Unicode any more (and will raise a different error!).

In Py3.3 they at last recognized the problem and fixed at least this one... but very late! Many programmers might have turned their backs on Py3 already.
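
For reference, the literal in question across versions:

    s = u"hello"   # Python 2: unicode literal
                   # Python 3.0-3.2: SyntaxError
                   # Python 3.3+: accepted again (PEP 414), now just a str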


It's possible to just force everyone into unicode strings in a way that doesn't cause too much trouble. It's impossible to do this gradually via a future import, and it's impossible to work out all the potential scenarios where things would behave differently - in particular the performance characteristics can change, and pretty wildly at that - but as painful as it would be, this still sounds like a better approach than Py2->Py3.

Basically, all strings would internally be represented as unicode (OR as a byte array + an encoding, though that might be a little too ambitious), and each string would have 2 APIs it can be accessed by. Non-instance-related operations (such as making new strings) similarly have 2 APIs. Pre-unicode-switch code gets an API that emulates the old behaviour as best as possible, and post-unicode-switch code gets the full unicode API as it exists now for unicode strings.

Even if the encoding is flat out broken, if you put in a stream of chars, store it as <whatever>, and later query this thing, the breakage usually ends up undoing itself. Yes, this will fail spectacularly once you try to, for example, concatenate a string with broken encoding to a string with a different encoding, but presumably places that need to be internationalization-proof switched to unicode strings long ago, and places that aren't really aware of the importance of anything outside ASCII 32-127 will find that all their code magically "just works".

Similar things should be possible for the iterator business. Internally things are iterators, but if any P2 code ever touches them, they turn into a (memory-hogging, etc.) list. Yes, of course, this will flat out break if you attempt to pass an infinite iterator to code that isn't used to it, but these are all transitional pains, and the key point is: as long as you don't fork the pre-big-switch version (beyond security updates), sooner rather than later libraries will fix the bugs, or their community of users will just die out as someone else writes a new one to fill the void.

Painful? Yes. Very. We've seen this in programming land before. Java 5 introduced generics, and as a result any interaction with pretty much any library written before it resulted in a cavalcade of unsafe/raw warnings. It sucked. Huge communities stuck to 1.4 (IBM WebSphere notably stayed there for almost half a decade before moving on to 1.5).

But today? Libraries have upgraded or have been replaced. The chance you still run into pre-generics code is tiny. It took long, it sucked, yada yada, but for all intents and purposes that was Java's Python 2 -> Python 3, and everyone is on the "Python 3" side of it at this point - in roughly the same time span as Py2->Py3, which seems nowhere near complete.


> 3. In Python 2.c, warnings begun being issued when you tried to use the old way, explaining you needed to change or your code would stop working.

> 4. In Python 2.d, it actually did stop working.

I'm curious, what are some examples here? The `from __future__`s that I recall offhand are `print_function`, `division` (still need to be explicit) and the old `with_statement` (incompatible code broke as soon as this became default).

The other old way of doing things that I can think of offhand is `except Exception, e`, which has been replaced with `except Exception as e`, but not removed.
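
The two spellings, for reference (the snippet is Python 2, where both are accepted):

    try:
        1 / 0
    except ZeroDivisionError, e:        # old comma form -- a syntax error in Python 3
        print(e)

    try:
        1 / 0
    except ZeroDivisionError as e:      # 'as' form -- valid in Python 2.6+ and Python 3
        print(e)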


there is `unicode_literals` too.

For the full list, see http://docs.python.org/2/library/__future__.html

And there is one forgotten in that list: from __future__ import braces


So it looks like in Aaron's list, 2.a == 2.c and 2.b == 2.d, and the process has been completed only three times, and two of those times were just adding keywords.


quote from python.org: "New to Python or choosing between Python 2 and Python 3? Read Python 2 or Python 3." "Which version you ought to use is mostly dependent on what you want to get done."

No other language I have ever worked with speaks like this: "New to Ruby? You could use version 2.1, or why not try version 1.0, it still works ok."

Python.org treats versions 2 and 3 as completely different things; newbies to the language like myself don't see it as an update to Python 2.x because that's not how it's sold to us.


I think the only people that hate Python 3 are the ones that have worked with it for several years; newbies wouldn't even care a little.


This Aaron guy likes to talk a lot of shit, so it seems.



