Relatedly, it'd be cool if one of the AMD people involved in the OpenCL work (which was committed at the same time) could do a writeup of those optimizations.
The penalty is at most ~1 cycle of latency -- in practice I find it gets completely absorbed by the out-of-order execution engine. I've never measured any significant penalty from mixing float and int SSE operations in any code on any x86 microarchitecture.
Floating point bitwise operations exist too: xorps, andps, and so on.
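For illustration, here's a minimal hedged sketch with SSE intrinsics (the function name is made up, not from x264) of what using a float-domain bitwise op on integer data looks like -- the casts are free register reinterpretations, and the only potential cost is the domain-crossing latency discussed above:

    #include <emmintrin.h>  /* SSE2 */

    /* XOR 128 bits of integer data via the float-domain instruction
     * (typically xorps) instead of the integer one (pxor). */
    static __m128i xor_via_float_domain(__m128i a, __m128i b)
    {
        __m128 fa = _mm_castsi128_ps(a);
        __m128 fb = _mm_castsi128_ps(b);
        return _mm_castps_si128(_mm_xor_ps(fa, fb));
    }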
In all fairness, Intel's emulator is incredibly easy to use; it literally works like this:
sde -- ./myprogram myargs
instead of
./myprogram myargs
There's also probably a decent number of people at this point who have prerelease CPUs; they tend to breed quite explosively in the month or two before the official release.
I only pushed the code a few minutes ago, but binaries should probably be up at http://x264.nl/ relatively soonish (it's not my site though, so I wouldn't know exactly).
If you want to test without a physical Haswell, the Intel Software Development Emulator should work okay, albeit somewhat slowly. I'd post overall numbers for real Haswells, but Intel has apparently said we can't do that yet.
Regarding FMA, FMA3/4 are floating point only. Since x264 has just one floating point assembly function, only two FMA3/FMA4 instructions get used in all of x264 (not counting duplicates from different-architecture versions of the function). An FMA4 version has been included for a while; the new AVX2 version does include FMA3, but of course that won't run on AMD CPUs (yet).
XOP had some integer FMA instructions, but I generally didn't find them that useful (there's a few places I found they could be slotted in, though).
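For reference, here's a hedged sketch (not code from x264) of what a floating-point FMA3 intrinsic looks like versus the separate multiply and add it fuses:

    #include <immintrin.h>  /* AVX + FMA3; build with e.g. -mfma on gcc/clang */

    /* d = a*b + c as a single fused instruction (one of the vfmadd*ps forms). */
    static __m256 mul_add_fused(__m256 a, __m256 b, __m256 c)
    {
        return _mm256_fmadd_ps(a, b, c);
    }

    /* The unfused equivalent: a separate vmulps + vaddps. */
    static __m256 mul_add_separate(__m256 a, __m256 b, __m256 c)
    {
        return _mm256_add_ps(_mm256_mul_ps(a, b), c);
    }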
I've heard that there are C libraries for things like SSE2. I assume the same is true of AVX2. If so, why do you write so much of x264 in assembly? Do you find that there are significant gains versus C code that uses SIMD libraries? Have I been misled in thinking that C is nearly as fast as assembly 99% of the time?
Note: I'm not trying to question your engineering chops, just trying to correct my own misconceptions.
"C libraries for things like SSE2"? Do you mean math libraries that have SIMD implementations of various functions that are callable from C? This here is effectively writing those libraries; they don't exist until we write the code.
I'm not a SIMD expert, but it seems like this implements primitives similar to those available in assembly (and not in C). My question is basically whether the algorithms you're talking about could be implemented with these primitives. Although I guess no such library exists yet for AVX2.
Intrinsics aren't really C; they work in a C-like syntax, but you're still doing the exact same thing as assembly: you still have to write out every instruction you want to use, so you're not really saving any effort compared to just skipping the middleman.
In return, you are stuck with an extremely ugly syntax and a much less functional preprocessor, with the added bonus of a compiler that mangles your code.
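To make that concrete, here's a trivial hedged sketch (a made-up function, not x264 code): each intrinsic below maps more or less one-to-one to an instruction (movdqu, paddw, movdqu), so you're still writing the instruction stream yourself, just in C-flavored syntax:

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* Add two int16 arrays, 8 elements per iteration.
     * n is assumed to be a multiple of 8 for brevity. */
    static void add_s16(int16_t *dst, const int16_t *a, const int16_t *b, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
        }
    }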
With intrinsics, you don't have to think about register naming. You still might count registers to avoid spills (and check the assembly to make sure), but there is less of a mental context switch than writing straight assembly.
I almost never spend more than a few seconds considering register allocation/naming when writing assembly (partly because x264's abstraction layer lets macros swap their arguments, so you don't have to mentally track "what happens to be in xmm0 right now"). In rare cases it can get tricky when you start pushing up against the register cap, but that's exactly the case where the compiler tends to do terribly and where you'd want to do it yourself.
The pain of not having a proper macro assembler in C intrinsics is orders of magnitude worse than having to do my own register allocation in yasm, so for now, yasm is the lesser of two evils.
Is there any hope of a compiler ever coming close to the level of optimization you can get from hand-coded assembly language? The numbers in your table routinely exceeded 10x gains over straight C. What's the compiler doing that's taking so long? Is it not able to vectorize at all?
x264 actually turns off vectorization in the configure script, because it's caused crashes and bugs in the past on various platforms. But even if you enable it, it almost never triggers. Even the Intel compiler's autovectorization only triggers in a few functions, despite its reputation, and typically does a pretty mediocre job.
The problem has many parts:
1. Autovectorization in general is just extremely difficult, and even trivial code segments often get compiled very badly. It feels like the compiler is trying to fit the code into a few autovectorization special cases -- for example, a 16x16->16 multiply gets compiled into a 16x16->32 multiply, and then it laboriously extracts the low 16 bits, probably because nobody wrote code to explicitly handle the former variant (there's a sketch of this case after the list). A good autovectorizer would have to have a vast array of these sorts of special cases to "know what to do" in any particular situation, I'd imagine.
A lot of autovectorization work also seems to be aimed at floating point math (which typically doesn't need the same sort of tricks), which probably exacerbates the problem in x264's case.
2. The compiler doesn't know enough. It can't guarantee alignment, it doesn't know about the possible ranges of the input values or the relationship among them, it doesn't know the things the programmer knows.
3. SIMD algorithms are often wildly different from the original C. Much of the process of writing assembly is figuring out how to restructure, reorder, and transform algorithms to be more suitable for SIMD. The compiler can't really realistically do this; its job is to translate your C operations into machine operations, not rewrite your algorithm.
Part of this problem is that C is just not a great vector math language, but part of it is also that the optimal algorithm structure will depend on the capabilities of your SIMD instructions and their performance. For example, when the pmaddubsw instruction is available, it's faster to do x264's SATD using a horizontal transform first, but if not, it's faster to do it with a vertical transform first. The Atom CPU has pmaddubsw, but only has a 64-bit multiplier, making it too slow to utilize the horizontal version (so it gets the vertical version instead).
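Here's the hedged sketch of the 16x16->16 multiply case from point 1 (hypothetical function names; exactly how badly the scalar loop autovectorizes depends on the compiler and flags). Only the low 16 bits of each product are kept, which maps directly onto pmullw, yet an autovectorizer may widen to 32-bit products and then pack the results back down:

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* Plain C: a 16x16->16 multiply -- only the low 16 bits of each
     * product survive. n assumed to be a multiple of 8 for brevity. */
    static void mul_s16_c(int16_t *dst, const int16_t *a, const int16_t *b, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = (int16_t)(a[i] * b[i]);
    }

    /* What you'd want to end up with: one pmullw (_mm_mullo_epi16) per 8 elements. */
    static void mul_s16_sse2(int16_t *dst, const int16_t *a, const int16_t *b, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_mullo_epi16(va, vb));
        }
    }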
You can definitely finagle code into getting hit by the autovectorizer, especially with Intel's compiler, but it takes a lot of futzing to make it happy, and even when it is happy, it can be many times slower than proper assembly. Of course, it's not useless -- it can get you some relatively free performance improvements without writing actual SIMD code. But it's not a replacement.
(I guess DarkShikari's comment is nested too deeply for me to reply directly.)
In my (admittedly limited) experience [1], the compiler has actually done pretty decently at optimizing register allocation in intrinsic-heavy loops. I wrote out the assembly loop in [2] with manual allocation into all 16 XMMs and then noticed the compiler managed to optimize 1 of them out.
I'm going to check with CoreCodec (the folks helping us administer x264 LLC) and see what's going on here. If they're abusing the terms of the license, we'll make sure things get fixed. If not, we'll publish all the changes they've made -- and honestly, I would be shocked if they've done anything significant besides change the program name.
There are still a lot of cases where it's very hard for even a consumer-facing, US-based web startup to get traction in another country, for language and cultural reasons. Japan is a particularly notable case; it's often said that companies without an office in Japan never succeed there. Some examples:
Pixiv vs. DeviantArt
Mixi vs. Facebook
Nico Nico Douga vs. YouTube
China is an even more extreme case, though that also has problems of corruption, legal wrangling, censorship, and so on.
Facebook began gaining more traction in Japan once they set up a special office in Tokyo and let that team make tweaks to the product specifically for the Japanese market - AFAIK, this was a unique case for them at the time, and they didn't do anything besides basic localization (which they crowdsourced...) for other nations. One of the things they did to show that they understood Japanese culture was to add a blood-type field to user profiles, a superstitious statistic that's nevertheless prevalent there. Before that, it felt too foreign.
Anecdotal but I've had many more of my Japanese and Korean friends abroad join Facebook in the last year or two.
I agree that those helped, but the time when the people around me started using Facebook was right around when "The Social Network" came out in Japan.
I used to be on Mixi, and I and most people I know have pretty much stopped using it. I think they really jumped the shark when they started tying their services into other APIs like Twitter.
(Also, Mixi is not really a Facebook clone, per se, as they both launched in the same month. I would say it was just another SNS that was more geared towards the Japanese market.)
At one point they were the highest-selling piece of software (Mac or PC) in Japan, which they achieved via a Japanese distributor - not sure if that counts as an office in the country or not.
Candy is hardly a gendered product, and foreign interest in Japanese culture and food isn't either. Adding gendered advertising in such a way is unlikely to gain many new leads, but will certainly limit the audience dramatically.
For some reason, the "testimonials from random Japanese people" idea seems rather unsettling to me. It makes it feel exoticized, touristy, and fake.
The sparse Fourier transform is also probably not useful for any of that stuff, because accurate results are needed (not just getting one or two of the main tones). Transforms are not the bottleneck at all in video encoding, and even in audio encoding, split-radix FFTs and the like are very fast.
High-res Dirac video encoding is definitely possible in real-time; you just need a very optimized encoder, which doesn't really exist. Dirac's main speed cost comes from the overlapped-block motion compensation, not the transform.
How does this differ in practice from Etsy? I see you can embed it into another site, but are there any other advantages or differences in typical usage, like fee structure, searchability, and so forth?
It's a lot easier to get started with ShopLocket than with something like Etsy, and the ability to select a style and embed it on your own site is a big plus. We plan to add custom CSS for widgets in the coming months, which should make them even more appealing.
Also on Etsy you still have a "store", and it only makes sense for certain categories of products. We're all about putting your products in front of your audience and the people that matter, rather than you just being another product in a generic marketplace. We want to create the ideal "display case" — a platform as awesome as the products people are selling.