No, 32-bit code is not faster with current CPUs. x86 apps are more register-starved than they are cache-bound. I don't know who started this urban myth about x64 applications somehow being slower than equivalent ones running under WoW64, but it needs to either be backed up with specifics -- what apps are slower, and why? -- or stop being propagated.
(Source: extensive firsthand experience with performance-sensitive C/C++ development for Windows.)
64-bit Firefox is slower for some workloads because the JavaScript heap is dramatically larger (due to the doubled pointer size). As it happens, large portions of Firefox are written in JavaScript...
I'm sure there are some use cases where it's faster, of course, but having your heap be 50-100% larger (yes, really that much, I've measured it) is pretty hard to overcome just by adding a few extra registers.
I expect this is why Java has support for using 32-bit object pointers in 64-bit mode. Features like that would probably help Firefox tremendously.
Heap growth moving from 32b to 64b is rarely on the order of 50-100% larger. 20-30% is much more common (100% would imply that your application stores no actual data, only pointers and sizes; I know that pointer-chasing is very much in style these days, but that's ridiculous. Even the 1:1 ratio implied by a 50% figure is an extreme outlier).
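To make the arithmetic behind those percentages concrete, here is a rough sketch. It assumes the only thing that changes between builds is that every pointer or size field doubles from 4 to 8 bytes; padding, alignment and allocator overhead are ignored. The function name and its parameters are made up for illustration.

```c
/* Fractional heap growth going from 32-bit to 64-bit, under the
   simplifying assumption that pointer-sized fields double and nothing
   else changes (no padding or allocator effects).
   ptr_bytes32: bytes of pointers/sizes in the 32-bit heap (hypothetical input)
   data_bytes:  bytes of everything else (hypothetical input) */
double heap_growth(double ptr_bytes32, double data_bytes)
{
    double heap32 = ptr_bytes32 + data_bytes;
    double heap64 = 2.0 * ptr_bytes32 + data_bytes;
    return (heap64 - heap32) / heap32;
}
```

Under this model heap_growth(1, 0) is 1.0 (a heap of nothing but pointers doubles), heap_growth(1, 1) is 0.5 (the 1:1 pointer-to-data ratio), and a 25% growth figure corresponds to pointers making up a quarter of the 32-bit heap, e.g. heap_growth(1, 3).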
Further, raw heap size typically has marginal impact on performance; locality of reference and size of working set are far more important. It doesn't matter if your raw heap usage grows by 30% if the majority of your accesses are to the same working set of data either way.
At the same time, a well-crafted interpreter or JIT (which people seem to think is important in browsers) benefits heavily from increasing the size of the register set. Even in more typical use cases, even without aggressive optimization, speedups of 10-20% are common in moving from x86 to x86_64.
You can tell me what you think the behavior is actually like, but I can point you to a HTML5 game that uses 50-100% more JS heap and runs slower in 64-bit builds of Firefox :)
You say a larger heap has marginal impact on performance and that locality and working set size are what matter. If larger pointers make the heap bigger, then the working set is going to grow as well, and locality will be worse because the larger pointers mean fewer objects fit into a cache line.
I'm not saying 50-100% growth doesn't happen, I'm saying that it's an outlier (based on complete system data from multiple platforms).
If it's really a common issue with Firefox's JS implementation for some reason, there's nothing to stop them from using a compressed pointer scheme for JS that uses 32b pointers and a known offset; it's a common technique. You give up one register to keep the offset around, but you've still netted 7 registers from the switch to the 64b ISA, and adding the offset is either a simple OR or can be folded into an LEA (nearly free or free).
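A minimal sketch of that kind of compressed-pointer scheme, in C. It assumes every object lives inside one contiguous region of at most 4 GiB; heap_t, cptr_t, compress and decompress are names invented for this example, not anything from a real JS engine. It uses plain addition for the base; the OR variant would additionally require the base to be 4 GiB-aligned. Null references are not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heap descriptor: one contiguous region, at most 4 GiB. */
typedef struct {
    unsigned char *base;  /* the "known offset" kept in a register */
    size_t size;
} heap_t;

/* Compressed reference: a 32-bit byte offset from heap_t.base. */
typedef uint32_t cptr_t;

static inline cptr_t compress(const heap_t *h, const void *p)
{
    size_t off = (size_t)((const unsigned char *)p - h->base);
    assert(off < h->size && off <= UINT32_MAX);
    return (cptr_t)off;
}

static inline void *decompress(const heap_t *h, cptr_t c)
{
    /* base + offset; on x86-64 this typically folds into the
       addressing mode of the load or into a single LEA. */
    return h->base + c;
}
```

Fields stored in heap objects are then 4 bytes wide regardless of the native pointer size, at the cost of one add per dereference.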
Let's not confuse "they haven't had a chance to tune the 64b JS engine, and the 32b one has been worked on for years" with "x86_64 is slower than x86_32".
Is your "complete system data from multiple platforms" from JS apps?
Let's also not confuse "a sufficiently smart compiler..." with real-world observed performance :)
BTW data layout transformations like compressed fields would also be useful on 32-bit. The vast majority of object graphs would be happy with 16 or even 8-bit identifiers, not to mention JS numbers which are all 64-bit floats even on 32-bit platforms.
It might be that it's simply Firefox's JavaScript interpreter which hasn't been optimized for 64-bit. As far as I know other JavaScript interpreters don't suffer from this problem, and with LuaJIT, which should be fairly similar to a JavaScript JIT compiler, at least I have the numbers to back it up: http://luajit.org/performance_x86.html
LuaJIT doesn't have increased memory usage. 32 or 64 bit, it always packs pointers into the low bits of NaN doubles. Addresses fit because current 64 bit chips actually use 48 bit addresses.
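For anyone who hasn't seen the trick, here is a generic sketch of NaN-boxing in C. This is not LuaJIT's actual encoding: the tag value, the type name and the helper names are all assumptions chosen for illustration. It also assumes addresses fit in the low 48 bits and that real numeric NaNs are canonical hardware NaNs, so they never carry the chosen tag.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t value_t;  /* every value is 8 bytes on 32- and 64-bit builds */

#define PTR_TAG   UINT64_C(0xFFFC)              /* hypothetical tag in the top 16 bits */
#define ADDR_MASK UINT64_C(0x0000FFFFFFFFFFFF)  /* low 48 bits hold the address */

/* Numbers are stored as their raw IEEE-754 bit pattern. */
value_t box_number(double d)
{
    value_t v;
    memcpy(&v, &d, sizeof v);
    return v;
}

double unbox_number(value_t v)
{
    double d;
    memcpy(&d, &v, sizeof d);
    return d;
}

/* Pointers are stored in the payload of a NaN bit pattern (exponent all ones). */
value_t box_pointer(void *p)
{
    return (PTR_TAG << 48) | ((uint64_t)(uintptr_t)p & ADDR_MASK);
}

int is_pointer(value_t v)
{
    return (v >> 48) == PTR_TAG;
}

void *unbox_pointer(value_t v)
{
    return (void *)(uintptr_t)(v & ADDR_MASK);
}
```

Because a value slot is 8 bytes either way, the size of a boxed value does not change between 32-bit and 64-bit builds.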
(LuaJIT also uses a limited memory arena for allocations but it's not inherent to the design; it could be changed but the current garbage collector can't handle that much data)
Specifics: web browsers (and specifically Firefox, but it's true for WebKit-based browsers too) tend to be slower on most web workloads when running 64-bit.
There are several reasons for this.
First of all, web browsers are very pointer-chasing heavy because of the nature of web specs: the DOM is all about tree traversals, and so is CSS layout. Further, you have to share style, font, etc data across objects to get anything resembling sane memory usage, which means chasing pointers to all that data. Given that many of the algorithms involved require traversing large chunks of the DOM tree and various ancillary data structures, the upshot is that most of the really processing-intensive parts of a browser layout engine (possibly excluding graphics) are in fact more cache-bound than register-starved.
The second reason is a bit of a chicken-and-egg problem. Since most of the users are running the 32-bit builds, the 32-bit JIT has had more optimization work on it than the 64-bit one. The obvious solution is to improve the 64-bit JIT, of course.
The third reason is that even in JITTed code it turns out that cache locality matters a lot because JS objects share lots of their state across multiple objects so getting to it involves pointer-chasing. An actual JSObject in SpiderMonkey is 4 pointers and all the other things you want out of an object you reach through those 4 pointers. This reduces memory usage enormously compared to putting the data inline in the objects, and actually improves cache locality in general, but exacerbates the difference between 32-bit and 64-bit.
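To put rough numbers on that last point, here is a small sketch in C. The field names are placeholders rather than SpiderMonkey's real ones, and the 64-byte cache line is an assumption (typical for current x86 parts).

```c
#include <stddef.h>

/* Hypothetical stand-in for a four-pointer object header.
   Field names are invented for illustration. */
struct obj_header {
    void *shape;
    void *type;
    void *slots;
    void *elements;
};

enum { CACHE_LINE = 64 };  /* assumed cache line size in bytes */

/* How many four-pointer headers fit in one cache line for a given
   pointer size in bytes. */
size_t headers_per_line(size_t pointer_size)
{
    return CACHE_LINE / (4 * pointer_size);
}
```

With 4-byte pointers the header is 16 bytes and four fit per line; with 8-byte pointers it is 32 bytes and only two fit, so a traversal touching the same number of objects pulls in roughly twice as many lines for the headers alone.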
You're right that for heavily numeric JITted code x86-64 beats the pants off x86-32. Unfortunately most JS out there is object-heavy pointer-chasing, not numeric code...
> x86 apps are more register-starved than they are cache-bound.
Considering it has 16 general-purpose registers (twice as many as x86), I wouldn't consider amd64 register-rich. It's, at best, less starved than x86 but no match for the MIPS processors from 1981.
Actual modern x86 implementations have hundreds of registers; what they are starved for is register names. This puts enormous pressure on the register-rename and reordering logic of the processor, but Intel and AMD have turned out to be quite good at delivering there, to the point that (certain numerical algorithms aside) the relative lack of register names is usually a non-issue.
More precisely, it's the compiler (or luckless .asm hacker) that suffers from register starvation. Register renaming can avoid pipeline blockage but if the compiler thinks it's out of registers, it'll use... wait for it... memory accesses. Now you've entered the wonderful world of load/store penalties and the like, and at least some of those cache lines that the Mozilla folks wanted to preserve are going to get used anyway.
It's really a pretty terrible idea for Mozilla to favor 32-bit code over 64-bit code these days. Not so much because of any feature or perf differences, but because it signals that they don't really have a clue what they are doing.
Speaking as a lucky asm hacker and frequent advisor to compiler codegen engineers: in the past few years re-order engines have gotten wide enough that it's really not too bad. It simply requires a different mindset; instead of attempting to hand-schedule individual instructions, one schedules small blocks of instructions chosen to minimize state that must be preserved across block boundaries (to minimize register usage between blocks) and lets the rename/reorder logic do its work.
There are some (mostly numeric -- hand-tuned FFTs come to mind) codes for which the lack of names really does cause difficulty, but they are relatively few and far between.
All that aside, yes, 8 GPR names is just absurd, and targeting x86_32 is ridiculous when x86_64 is so widely supported.
Well, pointer size is a concern and pretty much every RISC architecture suffered a slowdown when moving to 64-bit pointers. It's just that x86-64 added so many good things that it made up for that disadvantage.