Looks like model number 06/66 or better in cpuinfo. Both AWS and Gcloud standard vm instances appear to be Skylake ;(
Honestly, base64 has always been pretty fast for me, even for large bitmaps >100kb. Compared to network latency it may be imperceptible to the end user. But it still makes a terrific reference paper!
Plus, if my knowledge is properly up to date, no AMD processor supports those instructions, not even any member of the recent and famously super-efficient Ryzen family.
Because they are commodity processors nonetheless. The distinction relevant for a research paper is between mainstream processors you can just buy and custom-made ones, not market penetration.
I still think "commodity" is not the correct word even in that context. Mass-market, or commercially available might be better. How can it be a commodity when there is only one manufacturer?
Just going off wikipedia (also other sources support the definition):
> In economics, a commodity is an economic good or service that has full or substantial fungibility: that is, the market treats instances of the good as equivalent or nearly so with no regard to who produced them.
Basically it's a common good that is sold and purchased, with no modifications required for this use case. This commodity is currently low in supply, but it is still a commodity.
It’s missing the fungibility where it is produced by multiple suppliers and can be treated equivalently by the end buyer. Large grade A eggs are a commodity as it doesn’t matter which farm produced them. A specific set of processors from Intel are not a commodity since there are no fungible equivalents. Widely available would probably be a better description in this case.
I would argue processors aren't commodities anymore (maybe back in the 486 days it was closer). Today you can't swap an Intel for an AMD processor in the same motherboard. They don't have the same performance characteristics from one to another. You could have two "3.5GHz 4-core processors" with wildly different real-world performance. IMO this is the opposite of "fungible".
I think hard drives, memory, power supplies are much closer to being fungible.
If base64 decoding speed makes a difference in your application, you should be considering why you are transmitting data in base64, a neat hack from the 1970's, which is non-human-readable, wastes 30% of the network bandwidth, kills compression, wastes RAM, is typically hard to random-access, and is generally a bad idea all round.
I agree. IMHO the biggest design failure and annoyance of JSON is that it has no good way to contain binary data.
The problems start with knowing the encoding (base64, for small sizes sometimes hex, for one or two bytes sometimes arrays, with all their ambiguities).
Then that you can't easily differentiate between a string and a byte array anymore (though you often shouldn't need to).
Then that it becomes noticeably bigger, which is a problem especially for large blobs.
Then that you always have to scan over this now larger chunk to find its end (instead of just skipping ahead).
Then that you have to encode/decode it, which might require additional annotations depending on the serialization library and can bring other annoyances.
The latter point can, in some languages, also lead to you accidentally using a base64 string as bytes or the other way around.
JSON being human readable is extremely nice. FWIW, I've found some binary formats like CBOR and MsgPack to be almost human-readable as raw binary once you are even slightly familiar with the format, or they convert almost instantly to readable JSON.
When I was developing a three component system, protos were not working for me. Too many schema changes and implementation differences to make the reduced binary size worth it. This was partly because the best protobuffer implementation in C (nanoPB) is just the hobby project of some dude, but mostly because coordinating schemas was annoying.
Protobuffs require sharing of a schema, then code generation for each language. Not ideal imo unless you really need the speed and the protos can be shared ahead of time.
Outside of Mongo, I haven’t seen BSON used anywhere.
CBOR however, is up and coming. Starting to look like it may be a first class citizen in AWS someday. On the IoT side they are already preferring it over JSON for Device Defender.
Base64 comes from the development of MIME in the early 1990s (see https://tools.ietf.org/html/rfc1341 section 5.2) though there are other similar encodings such as uuencode which predate it.
I agree with the spirit and most of the content of this comment, except for the hard to random access part. It's not really hard to random access base 64.
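To see why, the offset arithmetic is constant time. A toy C sketch under the assumption of unwrapped, unpadded base64 (the names here are made up for illustration, not from the paper):

```c
#include <stddef.h>

/* Toy sketch: map a decoded byte offset to a position in the encoded stream.
 * Every 3 decoded bytes correspond to exactly 4 encoded characters, so the
 * arithmetic is constant time. Assumes no line breaks and no padding. */
typedef struct {
    size_t encoded_offset; /* index of the 4-char group to start decoding at */
    size_t byte_in_group;  /* which of that group's 3 decoded bytes you want */
} b64_pos;

static b64_pos b64_seek(size_t decoded_offset) {
    b64_pos p;
    p.encoded_offset = 4 * (decoded_offset / 3);
    p.byte_in_group  = decoded_offset % 3;
    return p;
}
```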
Practically, base64 encode a video file and tell me how exactly you're going to allow the user to seek to any place they like within that video using only common libraries on common platforms... Theoretically easy, practically hard enough nobody does it.
If the data being base64 encoded will also then be compressed, it is better to base16 encode (hex) and then apply compression. Base16 is faster to encode and compresses better (probably compresses faster, didn't test that).
2537012 Nov 6 10:10 test.tar.hex.gz
2954608 Nov 6 10:03 test.tar.b64.gz
All the caveats, test with your data, etc.
What would be nice is an email/HTTP-resilient compressed format that could be stuck inside JSON strings and easily recovered.
That being true, it might be worthwhile for compression algorithms to detect base64 encoding (or any other base for that matter) and reinterpret it as base16 in order to improve compression rates.
Base64 is 48.8% larger after compression on English text, whereas it would only be 33.3% larger without compression. The reason is that compression finds and eliminates repeating patterns, but base64 can make the same input data look totally different depending on which of the 4 input alignments it has.
Most compression algos operate on single-byte chunks of data and base64 encoding messes up the original byte alignment making the input appear more random than it is.
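A toy C illustration of that alignment effect (the encoder is a simplified one that only handles whole 3-byte groups, not real library code): the same repeating bytes encode to visibly different character sequences depending on where they start.

```c
#include <stddef.h>
#include <stdio.h>

static const char tbl[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Minimal encoder for whole 3-byte groups only (no padding handling). */
static void b64(const unsigned char *in, size_t n, char *out) {
    size_t o = 0;
    for (size_t i = 0; i + 3 <= n; i += 3) {
        unsigned v = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = tbl[(v >> 6) & 63];
        out[o++] = tbl[v & 63];
    }
    out[o] = '\0';
}

int main(void) {
    const unsigned char msg[] = "foobarfoobarfoobarfoobar";
    char out[64];
    /* The same repeating bytes, starting at offsets 0, 1 and 2, produce
     * three completely different encoded strings, which is what defeats
     * the pattern matching in a generic compressor. */
    for (int off = 0; off < 3; off++) {
        b64(msg + off, 21, out);
        printf("offset %d: %s\n", off, out);
    }
    return 0;
}
```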
I don't follow. An image compressor is guaranteed not to be given Base64 input.
A general-purpose compression scheme can easily detect when its input is Base64, so I don't see why Base64 should be particularly hard to compress, in principle at least.
Why is that an achievement? Isn't it the case that during a memory copy the CPU is basically idling because memory is so much slower than the CPU? Isn't it hundreds of instructions per memory access? So instead of having it wait, it could also do computations (just standard instructions). Or is base64 computationally so heavy that it cannot fit into the "gaps"? I certainly have not tried, just thinking from the high-level view that CPUs are incredibly fast and main memory is incredibly slow in comparison. And of course assuming that the data does not fit into any cache.
There are tables in the paper comparing its speed to other implementations. This is more than an order of magnitude faster than the implementation used in Chrome, and for small data more than twice as fast as their previous AVX2 based implementation (https://arxiv.org/abs/1704.00605). The paper even notes that in some cases, for large inputs, it's decoding base64 faster than you could memcpy the data for the simple reason that it needs only write 6/8ths of the bytes. In their tests, it's only when the data fits in the tiny L1 cache that memcpy is consistently somewhat faster (44 GB/s vs 42 GB/s). So this is not only fast when you have to wait for a slow memory access, but when the loads hit other caches.
Or are you asking in more of a rhetorical sense? Like, why hasn't this been achieved before, that it's obvious or something? AVX-512 is only some 3 years old and base64 is only one of many problems that could benefit from vectorized processing.
You're mostly right. It's often not that hard to completely saturate the memory bus.
However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.
Also the data you'll need to process is often probably already hot in the CPU caches. Disk/network I/O can be DMA'd directly into L3 cache.
So I think this decoder is a useful contribution. But also somewhat no-brainer once you have those AVX-512 instructions available.
> However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.
Base64 uses lookup tables and the bit manipulations required are standard shifts and 'and', which are basic, fast instructions on any CPU. That seems exactly what they do here with an efficient use of AVX512 to make it fast(er).
Because table lookups don't vectorize. You could try to use vectorized gather (VPGATHERDD [0]), but so far in my experience it's been just as slow as using scalar code. (Then again, even gather instructions operate just on 32-bit data, so it wouldn't help anyways.)
So to perform multiple operations in parallel per core (SIMD), you'll have to write some code to perform the "lookup" transformation instead.
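For the curious, here is a scalar C sketch (made up for illustration, not the paper's code) of the range classification that stands in for the table; in SIMD form each range test becomes a vector compare plus blend applied to all lanes at once.

```c
#include <stdint.h>

/* Scalar sketch of the "computed lookup" used in place of a 256-entry table:
 * classify the character by range and add a constant. In SIMD, each branch
 * below becomes a compare + blend selecting the right offset per lane.
 * Returns 0xFF for characters outside the base64 alphabet. */
static uint8_t b64_value(uint8_t c) {
    if (c >= 'A' && c <= 'Z') return (uint8_t)(c - 'A');      /*  0..25 */
    if (c >= 'a' && c <= 'z') return (uint8_t)(c - 'a' + 26); /* 26..51 */
    if (c >= '0' && c <= '9') return (uint8_t)(c - '0' + 52); /* 52..61 */
    if (c == '+') return 62;
    if (c == '/') return 63;
    return 0xFF;
}
```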
I think you've got it all wrong: the memcpy comparison is given as a baseline.
It just shows that even if you do nothing (just copy the memory from place A to place B), you have some baseline CPU usage per GB. So this is kind of saying: with this algorithm, you have very minimal CPU usage.
This algorithm is not doing excessive memory access and idling the CPU because of memory bandwidth; it is just doing very minimal CPU work.
While it is correct that memory is orders of magnitude slower than modern CPUs, this is also why CPUs sport multiple levels of caches.
memcpy operations (and base64 encoding/decoding) are highly "local" operations. When the segment of memory they operate on has been placed in the CPU's L1 or L2 cache, the operation is much faster and will not need to access main memory for each byte/word. Only when a memory barrier is encountered will the CPU need to flush/refresh to/from main memory.
You are thinking of random memory access latency, which is very slow. Reading memory in spans that are contiguous is very fast in comparison because of prefetching. Writing is the same to a certain degree because of the same efficiency with TLBs but is not as fast as prefetched reads.
Not yet usable as-is. Normally AVX-512 instructions are subject to downclocking. Here they tried a very recent CPU which is no longer subject to downclocking. They didn't add code to test for the CPU revision that stopped downclocking, and they didn't benchmark on those older CPUs. We can only guess that the older AVX2 code is faster there, but not by how much.
Speaking of benchmarking against older CPUs: the paper falls into the "short communication" category, so we didn't want to discuss all possible hardware configurations.
> Only the non portable AVX2 based libraries are faster
The paper in OP is about 2 times as fast as their previous AVX2 code. I think they are superior to Turbobase64 here (though with more stringent hardware requirements).
Even 5% sounds sketchy indeed. However, AVX2 (or Haswell New Instructions) hasn't quite been on every Intel processor this decade, rather only since the 4th generation (late 2013), and only in Core-branded processors.
I think the 40 GB/s result is for data that doesn't fit in L1 cache. If you check the results graph, memcpy is ahead by a big margin for data that fits in L1 cache; beyond that they are almost head to head.
iOS 13 allows the Shortcuts app to encode/decode information as Base64 as part of the scripting language.
People are using this to do clever things like encoding a watermark image to overlay on a picture, because the Shortcuts app does not allow file attachments as part of the scripting language.
Now show me a memory-speed SIMD uint8_t histogram, and I'll be happy. Computing a histogram is slow and it doesn't really vectorize. Yet it's required by many important algorithms like data compression.
There is a new AVX-512 instruction (_mm512_conflict_epi32) that is supposed to solve this, but it can't make histogram construction faster than the scalar functions.
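For what it's worth, the usual scalar workaround is to keep several partial histograms so runs of identical bytes don't serialize on one counter, then merge at the end. A rough sketch (illustrative only, not claiming memory speed):

```c
#include <stddef.h>
#include <stdint.h>

/* Rough sketch: four partial histograms reduce the serialization caused by
 * back-to-back increments of the same counter; they are merged at the end.
 * The work per byte is still scalar, which is the point being made above. */
void histogram_u8(const uint8_t *in, size_t n, uint32_t out[256]) {
    uint32_t h[4][256] = {{0}};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        h[0][in[i + 0]]++;
        h[1][in[i + 1]]++;
        h[2][in[i + 2]]++;
        h[3][in[i + 3]]++;
    }
    for (; i < n; i++)
        h[0][in[i]]++;
    for (int v = 0; v < 256; v++)
        out[v] = h[0][v] + h[1][v] + h[2][v] + h[3][v];
}
```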
Unfortunately using AVX512 instructions only gets a speedup in very specific situations and for many real world use cases it actually underperforms due to oddities of scaling and switching delays. Profiling for more than a few milliseconds is one place you see phantom gains, so take care not to be deceived.
Edit: not saying this isn't a true benefit here, just that claims of speed when using AVX512 need to be treated with fair scepticism for actual use cases.
I didn't realise that. My original post was just to warn that AVX512 benchmarks can be highly misleading. Everyone has troubles measuring AVX512 performance:
He is aware, but sidestepped these issues. So this code is only recommended on the newest Cannon Lake processors, but we really want to know which method is best for which CPU. What about AMD Rome, e.g.?
Question: with this kind of optimization, how does the program run different functions on different CPUs? Like the ones that don't support AVX512? Some kind of runtime dispatch based on CPU ID?
For instance, our SSE/AVX2 algorithms have been included in a great, mature library written and maintained by Alfred Klomp: https://github.com/aklomp/base64 (the library also includes vectorized code for ARM CPUs).
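A minimal sketch of the kind of runtime dispatch being asked about above, assuming a reasonably recent GCC or Clang for the CPU-detection builtins; the decode_* functions here are placeholder stubs for illustration, not the library's actual API:

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder stubs standing in for the real scalar/AVX2/AVX-512 kernels. */
static size_t decode_scalar(uint8_t *dst, const char *src, size_t len) { (void)dst; (void)src; return len; }
static size_t decode_avx2(uint8_t *dst, const char *src, size_t len)   { return decode_scalar(dst, src, len); }
static size_t decode_avx512(uint8_t *dst, const char *src, size_t len) { return decode_scalar(dst, src, len); }

typedef size_t (*decode_fn)(uint8_t *, const char *, size_t);

/* Pick the best implementation once, based on what the CPU reports. */
static decode_fn pick_decoder(void) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512vbmi")) return decode_avx512;
    if (__builtin_cpu_supports("avx2"))       return decode_avx2;
    return decode_scalar;
}

size_t base64_decode(uint8_t *dst, const char *src, size_t len) {
    static decode_fn fn; /* resolved lazily on first call; init not thread-safe, fine for a sketch */
    if (!fn) fn = pick_decoder();
    return fn(dst, src, len);
}
```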
Many people don't realize this but today's memory is dramatically underpowered for today's CPU. Consider the following: you can, theoretically, get ~80GB/sec of memory bandwidth from a modern Intel CPU (assuming you use all channels, and there are no interrupts, etc). That's 20 billion floats or int32s per second. The same CPU can do ~2.3 fp32 TFLOPS. That's 115 ops for every fp32. And it's getting worse as more and more cores are put on the same memory bus.
As a sanity check, consider that the chip has 18 cores and runs at ~4.2GHz. To achieve the theoretical throughput each core has to do ~30 single-precision flops per cycle. Which sounds about right: HEDT Intel chips can (theoretically) do 32 of those per cycle.
This, of course, ignores whatever throttling Intel would need to apply to stay within TDP, so it's a purely theoretical back-of-the-envelope estimate.
This should change soon, as EPYCs can do 204GB/s of memory throughput (plus tons of PCIe 4.0 lanes). They also aren't CPU-tied: all of the SKUs get all of the lanes.
It's a NUMA though. You'll have to write software specifically for NUMA to get anywhere close to that number. To make matters worse, EPYC also doubles the number of cores. To be fair, though, it also has a substantial amount of cache, but cache is not a panacea if you need something in the main RAM. And if you're churning through gigabytes of stuff per second, you'll be needing that very, very often.
Yes, NUMA and large numbers of core will pose completely new challenges for software optimization. Applications will have to become aware of the memory architecture of the underlying machine. They will have to make explicit assignments of memory allocations and threads to NUMA nodes based on their domain specific needs. In some cases, even duplicating data structures may be the right call. This is going to challenge how most developers write fast software.
The other thing is that even seemingly trivial things like spawning and synchronizing with lots of threads will be much more complex on CPUs with many cores. At some point, naively looping over all threads is going to be too slow. I think the limit is going to be around 64 cores. Past that, you should actually parallelize your worker-thread management to stay efficient. There is precedent for this in HPC, e.g. MPI implementations.
As an anecdote I have a Phoenix LiveView project where I've discovered that loading, converting and sending an (uncacheable) image as base64 via the websocket connection feels quicker and smoother (I didn't benchmark it) than only updating the src path and letting the browser load it (with http1, I would have to compare with http2).
Well nothing's stopping you from making a living doing this kind of programming. I think the tech landscape is vibrant enough to have fun with it in many different ways!
Not only is it just a standard approach, it even misses a relatively common optimization for base64 decoding: instead of computing `(lut[a] << 6) | lut[b]` etc., one can precompute `lut6[x] = lut[x] << 6` and compute `lut6[a] | lut[b]` to avoid shifting. This optimization is famously used by Nick Galbreath's MODP_B64 decoder, which is used by Chromium [1] and turns out to be the most performant non-SIMD decoder according to Lemire et al. [2]
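A rough C sketch of the general idea (not the actual MODP_B64 source, which also handles invalid input and padding): one pre-shifted table per character position, so decoding a 4-character group is four loads and three ORs, with no shifts in the hot loop.

```c
#include <stdint.h>

/* Pre-shifted decode tables: one per character position in the 4-char group.
 * Entries for characters outside the alphabet are left at 0 here; real
 * decoders use sentinel bits to detect invalid input. */
static uint32_t lut18[256], lut12[256], lut6[256], lut0[256];

static void init_tables(void) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (int i = 0; i < 64; i++) {
        unsigned char c = (unsigned char)alphabet[i];
        lut18[c] = (uint32_t)i << 18;
        lut12[c] = (uint32_t)i << 12;
        lut6[c]  = (uint32_t)i << 6;
        lut0[c]  = (uint32_t)i;
    }
}

/* Decode one 4-character group into 3 output bytes: loads and ORs only. */
static void decode_group(const unsigned char in[4], unsigned char out[3]) {
    uint32_t v = lut18[in[0]] | lut12[in[1]] | lut6[in[2]] | lut0[in[3]];
    out[0] = (unsigned char)(v >> 16);
    out[1] = (unsigned char)(v >> 8);
    out[2] = (unsigned char)v;
}
```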