I/O is no longer the bottleneck (benhoyt.com)
377 points by benhoyt on Nov 26, 2022 | 326 comments



Yes, sequential I/O bandwidth is closing the gap to memory. [1] The I/O pattern to watch out for, and the biggest reason why e.g. databases do careful caching to memory, is that _random_ I/O is still dreadfully slow. I/O bandwidth is brilliant, but latency is still disappointing compared to memory. Not to mention, in typical Cloud workloads, IOPS are far more expensive than memory.

[1]: https://github.com/sirupsen/napkin-math


> _random_ I/O is still dreadfully slow. I/O bandwidth is brilliant, but latency is still disappointing compared to memory.

The wondrous thing about modern CPU architectures (e.g. Zen 3), though, is all the PCIe lanes you get with them. If you really need high random IOPS, you can now cram 24 four-lane NVMe disks into a commodity server (with PCIe M.2 splitter cards) and saturate the link bandwidth on all of them. Throw them all in a RAID0, stick a filesystem on them with the appropriate stripe width, and you'll get something that's only about 3x higher latency for cold(!) random reads than a read from RAM.

(My company provides a data-analytics SaaS product; this is what our pool of [shared multitenant, high concurrency] DB read-replicas look like.)


I thought NVMe flash latency was measured in tens of microseconds. 3x RAM would be a fraction of a microsecond, right?


Under ideal conditions, yes. But the 3x difference I see in practice is less about NVMe being just that good, and more about operations against (main) memory getting bottlenecked under high all-cores concurrent access with no cross-workload memory locality to enable any useful cache reuse. And also about memory accesses only being “in play” when a worker thread isn’t context-switched out, while PCIe-triggered NVMe DMA can proceed while the thread has yielded for some other reason.

In other words, when measured E2E in the context of a larger work-step (one large enough to be interrupted by a context-switch), the mean, amortized difference between the two types of fetch becomes <3x.

Top of my wishlist for future architectures is “more, lower-width memory channels” — i.e. increased intra-CPU NUMAification. Maybe something CXL.mem will roughly simulate — kind of a move from circuit-switched memory to packet-switched memory, as it were.


How do you figure these things out, do you have special software to look at this?


I think the person you're replying to is confusing IOPS with latency. If you add enough parallelism, then NAND flash random-read IOPS will eventually reach DRAM performance.

But it's not going to be easy - for a sense of scale I just tested a 7950X at stock speeds with stock JEDEC DDR5 timings. I filled an 8GB block of memory with numbers, then used a deterministic random seed to pick 4kB pages at random, computing their sum and eventually reporting it (to avoid overly clever dead-code elimination, and to make sure the data is fully read).
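
For illustration, the access pattern sketched in Python/numpy (not the code behind the numbers above; interpreter overhead per 4kB page will swamp the true memory latency, so it only shows the methodology):

    import time
    import numpy as np

    GIB = 2**30
    data = np.random.randint(0, 1 << 30, size=8 * GIB // 8, dtype=np.int64)  # ~8GB of integers

    rng = np.random.default_rng(42)        # deterministic seed
    page_elems = 4096 // 8                 # a 4kB page = 512 int64s
    pages = rng.integers(0, len(data) // page_elems, size=1_000_000)

    start = time.perf_counter()
    total = 0
    for p in pages:                        # QD1: one dependent read at a time
        off = p * page_elems
        total += int(data[off:off + page_elems].sum())
    elapsed = time.perf_counter() - start
    print(total, f"{len(pages) / elapsed / 1e6:.2f} M page-reads/s")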

With an SSD-friendly 4K page size that resulted in 2.8 million iops of QD1 random read. By comparison, a web search for Intel's Optane P5800X QD1 results shows 0.11 million iops, and that's the fastest random-read SSD there is at those queue depths, AFAIK.

If you add parallelism, then ddr5 reaches 11.6 million iops at QD16 (via 16 threads), fast SSDs reach around 1 million, the optane reaches 1.5 million. An Epyc Genoa server chip has 6 times as many DDR5 memory channels as this client system does; and I'm not sure how well that scales, but 60 million 4kb random read iops sounds reasonable, I assume. Intel's memory controllers are supposedly even better (at least for clients). Turning on XMP and PBO improves results by 15-20%; and even tighter secondary/tertiary timings are likely possible.

I don't think you're going to reach those numbers, not even with 24 fast NVMe drives.

And then there's the fact that I picked the ssd-friendly 4kb size; 64-byte random reads reach 260 million iops - that's not quite as much bandwidth as @ 4kb, but the scaling is pretty decent. Good luck reaching those kind of numbers on SSDs, let alone the kind of numbers a 12-channel server might reach...

We're getting close enough that the loss in performance at highly parallel workloads is perhaps acceptable enough for some applications. But it's still going to be a serious engineering challenge to even get there, and you're only going to come close under ideal (for the NAND) circumstances - lower parallelism or smaller pages and it's pretty much hopeless to arrive at even the same order of magnitude.


I measured ~1.2M IOPS (random reads, 4kiB) from 3xNVMe in a software RAID configuration on a commodity server running Ubuntu Linux in 2021. Using Samsung SSDs, not Optane.

If that scaled, it would be 9.6M IOPS from 24xNVMe.


Which is quite respectable, but still a far cry from the presumable 60M+ iops the server would have using DRAM (if it scaled linearly, which I doubt, it would hit 70M). Also, DRAM gets quite close to those numbers with only around 2 times as many threads as DRAM channels, but that NVMe setup will likely need parallelism of at least 100 to reach that - maybe much more.

Still, a mere factor 7 isn't a _huge_ difference. Plenty of use cases for that, especially since NAND has other advantages like cost/GB, capacity, and persistence.

But it's also not like this is going to replace dram very quickly. Iops is one thing, but latency is another, and there dram is still much faster; like close to 1000 times faster.


at this point cost would become the bottleneck. compare 24x1TB NVMe drives to 24TB of DDR5


That's an entirely different dimension - you can reach these throughput numbers on DDR5 likely even with merely 16GB. And a massive 12-channel 6TB socket solution will likely have slightly less than 6 times the random-read bandwidth. Capacity and bandwidth aren't strongly related here.


I’m running a FreeNAS box on an i3-8100. Right now I’m converting the NAS and my desktop to server chassis and putting them in a rack. Once I get a 10Gb UniFi switch and NICs off eBay, I’m debating running my desktop and servers diskless using iSCSI backed by RAID0 NVMe drives.


Whatever floats your boat, but iSCSI is limited to 1500 MTU (9k? Are you sure you can boot with 9k enabled?) and while you can have 10Gbit throughput that doesn't mean you will always get it, eg 100 IO operations would generate 100 packets and it doesn't matter if each was 1500B or only 100B.

And you wouldn't see the speed improvement on RAID0 NVMe drives except for extremely rare fully sequential operations lasting at least tens of seconds.

You also can try it just by running a VM with iSCSI boot on your current desktop.


Been a long time since anything iSCSI related didn't handle 9k, for boot or otherwise.

But I look at it this way. You need 40gbit networking for a single pci3 nvme ( and newer drives can saturate that, or close )

And because you're throttling throughput you'll see much more frequent, longer, queuing delays, on the back of a network stack that ( unless you're using rdma ) is already 5x-10x slower than nvme.

It'll be fast enough for lots of things, especially home/lab use, and it'll be amazing if you're upgrading from sata spinning disk.. but 10gbit is slow by modern storage standards.

Of course, that's not the only consideration. Shared storage and iscsi in particular can be extremely convenient! And sometimes offers storage functionality that clients don't have ( snapshots, compression, replication )


> Been a long time since anything iSCSI related didn't handle 9k, for boot or otherwise.

Don't have anything on hand to check whether the boot firmware even allows setting 9k, but I haven't touched iSCSI boot for a long time, so I'll take your word for it.

> But I look at it this way. You need 40gbit networking ... is already 5x-10x slower than nvme.

This one.

> It'll be fast enough for lots of things, especially home/lab use

Yep, in OP's case I would consider just leaving the OS on the local [fast enough] drive and using iSCSI (if for some reason NFS/SMB doesn't fit) for any additional storage. It would be fast enough for almost everything, while completely eliminating any iSCSI boot shenanigans /me shudders in Broadcom flashbacks.

Another neat thing about iSCSI is that you can re/connect it to any device on the network in a couple of minutes (the first time; even faster later), sometimes it comes in really handy.


> Whatever floats your boat, but iSCSI is limited to 1500 MTU (9k? Are you sure you can boot with 9k enabled?) and while you can have 10Gbit throughput that doesn't mean you will always get it, eg 100 IO operations would generate 100 packets and it doesn't matter if each was 1500B or only 100B.

Ugh, iSCSI does have queueing so you can have many operations in flight, and one operation doesn't really translate to one packet in the first place; the kernel will happily pack a few smaller operations to the TCP socket into one packet when there is load.

The single queue is the problem here, but a dumb admin trick is to just bring up more than one IP on the server and connect to all of them via multipath.


> the kernel will happily pack a few smaller operations to the TCP socket into one packet when there is load.

And here comes the latency! shining.jpg

It wouldn't be a problem for desktop use of course[0], especially considering that 90% of operations are just read requests.

My example is crude and was more to highlight that iSCSI, by virtue of running over Ethernet, inherently has a limit on how many concurrent operations can be in flight at any moment. It's not a problem for an HDD-packed SAN (the HDDs would impose an upper limit themselves, because spinning rust is spinning) but for NVMe (especially with a single target) it could diminish the benefits of such fast storage.

> The single queue is the problem here, but a dumb admin trick is to just bring up more than one IP on the server and connect to all of them via multipath

Even on a single physical link? Could work if the load is queue bound...

[0] hell, even on a 1Gb link you could run multiple VMs just fine, it's just when you start to move hundreds of GBs...


>> the kernel will happily pack a few smaller operations to the TCP socket into one packet when there is load.

>And here comes the latency! shining.jpg

Not really. If you get data faster than you can send packets (link full), there wouldn't be that much extra latency from that (at most one packet length, which at 10Gbit speeds is very short), and it would be more than offset by the savings.

Then again I'd guess that's mostly academic, as I'd imagine not very many iSCSI operations are small enough to matter. Most apps read more than a byte at a time after all; hell, you literally can't read less than a block from a block device, which is at least 512 bytes.

>> The single queue is the problem here, but a dumb admin trick is to just bring up more than one IP on the server and connect to all of them via multipath

> Even on a single physical link? Could work if the load is queue bound...

You can also use it to use multiple NICs without bonding/teaming, although it is easier to have them in separate networks. IIRC Linux had some funny business where, if you didn't configure it correctly, for traffic in the same network it would pick the "first available" NIC to send it, and it needed a /proc setting to change that.

To elaborate, the default setting for /proc/sys/net/ipv4/conf/<interface>/arp_ignore (and arp_announce) is 0, which means

> arp_ignore: 0 - (default): reply for any local target IP address, configured on any interface

> arp_announce: 0 - (default): Use any local address, configured on any interface

IIRC to do what I said required

    net.ipv4.conf.all.arp_ignore=1
    net.ipv4.conf.all.arp_announce=2
which basically changed that to "only send/respond to ARPs on NICs where the actual address exists, not just ones with an address in the same network" and fixed the problem.


> I'd guess that's mostly academic

It is; that mattered on 1Gbit links with multiple clients, i.e. any disk operations in VMs while vMotion was running on the same links - you could see how everything started to crawl (and returned to normal after vMotion completed). For 10Gbit you need way, way more load for it to matter.

> You can also use it to use multiple NICs without bonding/teaming

You MUST (as in RFC) use multiple links without bonding, and I learned not to use LACP the hard way (yeah, reading the docs beforehand is for pussies).

After the second attempt I understood the implication (multiple NICs in the same IP network), but this is usually a self-inflicted wound. You don't even need physically separate networks (VLANs); using separate IP networks works fine, and it's up to the initiator to use RR/LB on them.

> it would pick "first available" NIC to send it

Yep, the usual magic of making things easier for average folks. In the same vein - you need to disable Proxy ARP in any modern non-flat network or you will get shenanigans that would drive you mad.


I’m out of SATA ports and I have 2 M.2 slots available. When I can test with VM in my current desktop I will.


That's a lot of effort to put a silent piece of silicon a few metres away from the machine.

iSCSI is gonna eat some of your CPU (you're changing "send a request to the disk controller and wait" into "do a bunch of work to create a packet, send it over the network, and get it back") if you don't have a card with offload. It also might not be fast enough to get the most out of NVMe, especially in RAID0.

And, uh, just don't keep anything important there...


It’s an i3 with 2 M.2 slots available. Enough for the home. SATA becomes the limit.


As a dev who operates fairly far away from hardware usually, is that similar to what the PS5 is doing?


Depending on how your madvise is set up, it's often the case that sequential disk reads are memory reads. You're typically only paying the price for touching the first page in a sequential run, that or subsequent page faults come at a big discount.

If you read 1,000,000 random bytes (~1 MB) scattered across a huge file (let's say you're fetching from some humongous on-disk hash table), it will to a first order be about as slow as reading 4 GB sequentially, because both incur the same number of page faults (1,000,000 faults at 4 kB a page is ~4 GB of pages touched). There are ways of speeding this up, but only so much.

Although, I/O is like an onion of caching layers, so in practice this may or may not hold up depending on previous access patterns of the file, lunar cycles, whether venus is in retrograde.


`madvise(2)` doesn't matter _that_ much in my experience with [1] on modern Linux kernels. SSD just can't read _quite_ as quickly as memory in my testing. Sure, SSD will be able to re-read a lot into ram, analogous to how memory reading will be able to rapidly prefetch into L1.

I get ~30 GiB/s for threaded sequential memory reads, but ~4 GiB/s for SSD. However, I think the SSD number is single-threaded and not even with io_uring—so I need to regenerate those numbers. It's possible it could be 2-4x better.

[1]: https://github.com/sirupsen/napkin-math


I think the effects of madvise primarily crop up in extremely I/O-saturated scenarios, which are rare. Reads primarily incur latency, with a good SSD it's hard to actually run into IOPS limitations and you're not likely to run out of RAM for caching either in this scenario. MADV_RANDOM is usually a pessimization, MADV_SEQUENTIAL may help if you are truly reading sequentially, but may also worsen performance as pages don't linger as long.
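
For reference, a minimal sketch of setting those hints from Python (3.8+ on Linux; "bigfile.bin" is a placeholder path):

    import mmap, os

    fd = os.open("bigfile.bin", os.O_RDONLY)      # placeholder file
    mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)    # length 0 = map the whole file

    mm.madvise(mmap.MADV_SEQUENTIAL)   # hint: aggressive readahead is fine
    # ... sequential scan ...
    mm.madvise(mmap.MADV_RANDOM)       # hint: skip readahead (often a pessimization, as noted)
    # ... random lookups ...

    mm.close()
    os.close(fd)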

But as I mentioned, there's caching upon caching, and also protocol level optimizations, and hardware-level considerations (physical block size may be quite large but is generally unknown).

It's nearly impossible to benchmark this stuff in a meaningful way. Or rather, it's nearly impossible to know what you are benchmarking, as there are a lot of nontrivially stateful parts all the way down that have real impact on your performance.

There are so many moving parts I think the only meaningful disk benchmarks consider whatever application you want to make go faster. Do the change. Is it faster? Great. Is it not? Well at least you learned.


> I get ~30 GiB/s for threaded sequential memory reads, but ~4 GiB/s for SSD. However, I think the SSD number is single-threaded and not even with io_uring—so I need to regenerate those numbers. It's possible it could be 2-4x better.

Assuming you ran the experiments on an NVMe SSD attached to PCIe 3.0 x4, where the theoretical maximum is around 1 GB/s per lane, I am not sure how you expect to go faster than 4 GiB/s. Isn't that already the theoretical maximum of what you can achieve?


PCIe 4.0 SSDs are pretty common nowadays and are basically limited to what PCIe 4.0 x4 can do (around 7 GB/s net throughput).


I don't think they're that common. You would need a fairly recent motherboard and CPU that both support PCIe 4.0.

And I'm pretty sure that parent comment doesn't own such a machine because otherwise I'd expect 7-8GB/s figure to be reported in the first place.


I really doubt they’re that common. They only became available on motherboards fairly recently, and are quite expensive.

I’d guess that they’re a small minority of devices at the moment.


PCIe 5.0 has just recently started showing up on consumer motherboards.

4.0 might not be common, but surprisingly it is now the previous generation!


You might be very right about that! It's been a while since I did the SSD benchmarks. Glad to hear it's most likely entirely accurate at 4 GiB/s then!


How'd you measure the maximum memory bandwidth? In Algorithmica's benchmark, the max bandwidth was observed to be about 42 GBPS: https://en.algorithmica.org/hpc/cpu-cache/sharing/

I'm not sure how they calculated the theoretical limit of 42.4 GBPS, but they have multiple measurements higher than 30 GBPS.


> Yes, sequential I/O bandwidth is closing the gap to memory.

Hilariously, meanwhile RAM has become significantly slower relative to CPU performance, i.e. you spend a disproportionate amount of time reading from and writing to memory; so even though RAM is faster than it used to be, the CPU is faster still.

Which means I/O remains a bottleneck...


Random I/O with NVME is slower than sequential I/O still, but the gap between the two has been narrowed considerably and is quite high in historical/comparative absolute terms. To get close to peak random I/O limits, you need to dispatch a lot of commands in parallel—that’s an I/O idiom that doesn’t really exist in high level languages, and I think that’s where a lot of the challenge is.


The problem is that a lot of workloads using random I/O have dependencies between the different I/O operations. A database traversing a tree-like index cannot issue the next read until the previous one has finished. You are limited by latency, which for NVMe is still orders of magnitude worse than memory.


> A database traversing a tree-like index cannot issue the next read until the previous one has finished.

This applies to a single point query in a single tree.

The latency is reduced by overlap in obvious ways as soon as you have (1) a range query because it can read multiple subtrees in parallel, or (2) a query that reads multiple indexes in parallel, or (3) multiple queries from the application to the database in parallel.

This is why it's useful to design applications to make multiple queries in parallel. Web applications are a great example of this. Most applications where I/O performance matters at all have some natural way to parallelise queries.
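
A sketch of that in Python, where db.point_query stands in for whatever thread-safe lookup call your client library provides; the only point is to have N lookups in flight so the storage layer can overlap them:

    from concurrent.futures import ThreadPoolExecutor

    def lookup_many(db, keys, workers=16):
        # N independent point queries in flight at once; each worker blocks on
        # its own read while the others keep the I/O subsystem busy.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(db.point_query, keys))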

Less obviously, the interior blocks of a B-tree are a relatively small part of a B-tree. I.e. most of the space is in used leaf blocks. If the database's cache strategy gives preference to interior nodes, and even more preference to nodes closer to the root of a tree, often several interior layers of the tree can fit entirely in RAM and the effect is to reduce the latency of tree lookups further once the cache is warmed up.

Then even in large databases (a few TB), the latency of a single point query is reduced to one or two read IOPS (because the leaf page which contains the query result is calculated from in-memory data). The application-visible query time is very similar to the I/O subsystem's timing characteristics, and a few MQPS (= "million queries per second") are achievable. Not many database engines achieve this, because they were designed in an era when I/O was much slower, but the I/O architecture does support it.

Source: Wrote a performance-optimised database engine for blockchain archive data, which is extremely random access (because of hashing), in the multiple terabytes range, and the application is bottlenecked on how many queries per second it can achieve. It's like the ideal case for working on random-access I/O performance :-)


Those are rarely the slow ones though. Lots of software simply has not been written to keep IO queues full. They read a bit of data, process it, read the next bit and so on. On a single thread. This makes all kinds of IO (including network) way slower than it has to be.

For example a tree-index can be parallelized by walking down different branches. On top of that one can issue a prefetch for the next node (on each branch) while processing the current ones.


Yup, a lot of software is (was?) written with assumptions that mattered with spinning rust. And even if the author didn't intend to, serial code generates serial, dependent IO.


I thought for SSDs it didn't matter whether data was adjacent on disk?


Well, yes and no.

With spinning rust you have to wait for the sector you want to read to rotate underneath the read head. For a fast 10,000 RPM drive, a single rotation takes 6 milliseconds. This means that for random access the average latency is going to be 3 milliseconds - and even that's ignoring the need to move the read head between different tracks! Sequential data doesn't suffer from this, because it'll be passing underneath the read head in the exact order you want - you can even take the track switching time into account to make this even better.

SSDs have a different problem. Due to the way NAND is physically constructed it is only possible to read a single page at a time, and accessing a single page has a latency of tens of microseconds. This immediately places a lower limit on the random read access time. However, SSDs allow you to send read commands which span many pages, allowing the SSD to reorder the reads in the most optimal way and do multiple reads in parallel. This means that you only have to pay the random access penalty once - not to mention that you have to issue way fewer commands to the SSD.

SSDs try to make this somewhat better by having a very deep command queue: you can issue literally thousands of random reads at once, and the SSD will reorder them for faster execution. Unfortunately this doesn't gain you a lot if your random reads have dependencies, such as when traversing a tree structure, and you are still wasting a lot of effort reading entire pages when you only need a few bytes.
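
For independent reads, keeping the queue full can be as simple as this sketch (Python; os.pread releases the GIL during the syscall, so plain threads are enough to get many requests outstanding, and io_uring would be the lower-overhead way to do the same):

    import os
    from concurrent.futures import ThreadPoolExecutor

    def read_pages(path, offsets, page=4096, workers=32):
        # Many independent 4kB reads in flight at once keep the SSD's command
        # queue full; dependent reads (e.g. tree traversal) can't be batched this way.
        fd = os.open(path, os.O_RDONLY)
        try:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(lambda off: os.pread(fd, page, off), offsets))
        finally:
            os.close(fd)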


Interesting, thanks! So it sounds like it's not so much "random" I/O that's slow, but rather "unbatched" I/O or something like that?

Curious to hear your thoughts on this thread if you have time to share: https://news.ycombinator.com/item?id=33752870


> Unfortunately this doesn't gain you a lot if your random reads have dependencies, such as when traversing a tree structure,

So, does this mean B-trees suffer? What would be the most optimal layout for database storage where only SSDs matter?

I'm working on one that is WAL-only and scans everything on each operation (for now!) and want to see what I can do to improve the situation.


You really need an NVMe interface to the SSD, though. SATA3 is the bottleneck for SATA SSDs.


It is not just about the IO itself but all the processing which the database (or OS) needs to do for cache management, which in OLTP cases can be very significant. If you just need a few bytes from that 8K/16K page you have to read, there is little way around a 2 orders of magnitude difference.

What we would really benefit from is storage which is efficient at small (CPU cache line) sized IO.


I question the methodology.

To measure this I would have N processes reading the file from disk with the max number of parallel heads (typically 16 I think). These would go straight into memory. It's possible you could do this with one process and the kernel will split up the block read into 16 parallel reads as well, needs investigation.

Then I would use the rest of the compute for number crunching as fast as possible using as many available cores as possible: for this problem, I think that would basically boil down to a map reduce. Possibly a lock-free concurrent hashmap could be competitive.

Now, run these in parallel and measure the real time from start to finish of both. Also gets the total CPU time spent for reference.

I'm pretty sure the author's results are polluted: while they are processing data the kernel is caching the next block. Also, it's not really fair to compare single-threaded disk IO to a single process: one of the reasons for IO being a bottleneck is that it has concurrency constraints. Nevertheless, I would be interested in both the single-threaded and concurrent results.


I agree, I think there is often a faulty assumption by many developers doing benchmarking that their underlying environment doesn't matter. Often I see performance results posted with little mention of the details of the environment. At least here they posted they were using a high-end SSD, but note they just say they're on a "Dell XPS 13", as if there aren't multiple variants of that model produced every year for the last 5 or 6 years.

You're probably also right their multiple test runs resulted in the OS caching data, and a lot of the test runs may have just been testing in-memory performance instead of raw storage I/O performance.


Fair call about "Dell XPS 13". I've now clarified in the article (it's a recent 2022 Dell XPS 13 Plus).

Regarding OS caching: I'm trying to avoid this by clearing caches with the "sysctl vm.drop_caches=3" command. Note that I show both cached and uncached numbers.


It would still be good to call out the exact processor model, as the lowest end i5-1240P has half the L3 cache as the highest i7-1280P.

Also I don't think it was clear you were running "sysctl vm.drop_caches=3" between benchmarking runs of your optimizations. Your table seemed to indicate those were generic initial read/write benchmarks from either dd or hdparm. The site you linked to also had comments on it stating dd is not very good for benchmarking, suggesting fio & a different site[1].

1. https://linuxreviews.org/HOWTO_Test_Disk_I/O_Performance


Thank you. I'd be pretty annoyed in this interview. Surely my potential employer would be more interested in having me apply my twenty years of real-world experiences to what I learned in CS240.


Not sure why you'd be annoyed? What I'm trying to gauge in my interviews is their real-world experience (not their academic CS knowledge). That's part of the reason I find my line of questioning helpful: I don't penalize them for a "wrong" answer, but more how they think about it. Then we can discuss parallelization, map-reduce techniques, profiling and measurement, and so on.


Annoyed because of the framing, that is: the answer isn't "wrong" in the first place.


Interviews usually have extremely limited time, especially if some interesting “tangent” comes up, which often tells the true depth of their knowledge. Maybe the concurrency would be one possible tangent.

Regardless, the concurrent approach would be 90% the same as the single thread approach, leaving it for a good “after the fact” question, assuming the candidate still has time.


> I haven’t shown an optimized Python version because it’s hard to optimize Python much further! (I got the time down from 8.4 to 7.5 seconds). It’s as fast as it is because the core operations are happening in C code – that’s why it so often doesn’t matter that “Python is slow”.

An obvious optimization would be to utilize all available CPU cores by using the MapReduce pattern with multiple threads.

I believe that'd be necessary for a fair conclusion anyway, as you can't claim that I/O isn't the bottleneck, without utilizing all of the available CPU and memory resources.


> An obvious optimization would be to utilize all available CPU cores by using the MapReduce pattern with multiple threads.

Nope, the GIL will make that useless. You need to actually implement the tight loops in C/C++ and call that with batches of data to get benefits from threading.

An obvious, but more expensive optimization would be to use a process pool. Make sure that all the objects you pass around are serializable.

Python makes optimization much harder than it should be. I hope the GIL gets the hammer at some point, but that seems to be a huge task.


For this problem the multiple-process version would be quite simple in Python or any other language. It's a classic single program, multiple data (SPMD) task. You split the file into N chunks, then run N instances of the original program on them (a Map). You then need to collate the results, which requires a second program, but that step is similar to the sorting step in the original and so would be negligible wrt wall time (a quick Reduce).

For large files you should get almost embarrassing parallelism.
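
A rough sketch of that split-and-merge approach (here done inside one program with multiprocessing rather than N separate program runs; it assumes the file is much larger than the number of chunks, and boundaries are realigned to line starts so every line is counted exactly once):

    import collections, multiprocessing, os, sys

    def count_chunk(args):
        path, start, end = args
        counts = collections.Counter()
        with open(path, "rb") as f:
            if start:
                f.seek(start - 1)
                f.readline()          # advance to the first line starting at or after `start`
            while f.tell() < end:     # own every line that *starts* inside [start, end)
                line = f.readline()
                if not line:
                    break
                counts.update(line.lower().split())
        return counts

    def count_words(path, workers=os.cpu_count() or 4):
        size = os.path.getsize(path)
        bounds = [size * i // workers for i in range(workers + 1)]
        with multiprocessing.Pool(workers) as pool:
            parts = pool.map(count_chunk, [(path, bounds[i], bounds[i + 1]) for i in range(workers)])
        total = collections.Counter()
        for part in parts:            # the cheap "reduce" step
            total.update(part)
        return total

    if __name__ == "__main__":
        for word, n in count_words(sys.argv[1]).most_common():
            print(word.decode(), n)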


Oh I think a few simd instructions could reduce processing to near zero without going crazy with multi-threaded architectures.

Remember that fizzbuzz on HN that hit GB/s? Mostly SIMD. Zero multi-threaded IIRC.


The GIL won't prevent you from parallelizing I/O will it?


If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.

If I/O wasn't the bottleneck, I guess you can parallelize reading, but what are you gaining?

If you're writing to files, most of the time the parallelism will be hard to implement correctly. SQLite doesn't support parallel writes for example.


>If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.

I think your SSD/network link/database might be able to work in parallel even when Python can't. Details:

Suppose I am scraping a website using a breadth-first approach. I have a long queue of pages to scrape. A single-threaded scraper looks like: pop the next page in the queue, block until the web server returns that page, repeat. A multi-threaded scraper looks like: thread wakes up, pops the next page in the queue, sleeps until the web server returns that page, repeat. With the multi-threaded scraper I can initiate additional downloads while the thread sleeps.
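
A minimal sketch of that multi-threaded scraper (urls is a hypothetical list of pages to fetch; the GIL is released while each thread waits on the socket, so the downloads overlap):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()

    def scrape_all(urls, workers=16):
        # 16 requests in flight; the network and the server work in parallel
        # even though only one Python thread runs bytecode at any instant.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, urls))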

My assumption here is that the download over the network is at some level being performed by making a system call (how could it not be?) And once you have multiple system calls going, they can be as parallel as the OS permits them to be; the OS doesn't have to worry about the GIL. And also the server should be able to serve requests in parallel (assuming for the sake of argument that the server doesn't suffer from the GIL).

Same essential argument applies to the database. Suppose I'm communicating with the database using IPC. The database isn't written in Python and doesn't suffer from the GIL. Multiple Python threads can be sleeping on the database while the database processes their requests, possibly in parallel if the db supports that.

I think this argument could even work for the SSD if the kernel is able to batch your requests in a way that takes advantage of the hardware, according to this person: https://news.ycombinator.com/item?id=33752411

Very curious to hear your thoughts here. Essentially my argument is that the SSD/network link/database could be a "bottleneck" in terms of latency without being the bottleneck in terms of throughput (i.e. it has unused parallel capacity even though it's operating at maximum speed).


You're right, my comment only applies when bandwidth is the bottleneck. In Python, that web scraper could probably do even better with asyncio in a single OS thread.


> If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.

Of course it will. Nearly every serious DB will work on multiple requests in parallel, and unless the DB itself is on something slow you will get data faster from 2 parallel requests than from serializing them.

NVMe SSDs in particular can easily outpace what a single thread can read; just run fio with single vs. parallel threads to see that.

> If you're writing to files, most of the time the parallism will be hard to implement correctly. SQLite doesn't support parallel writes for example.

That's just one random example. If all you do is "read data, parse, write data" in some batch job you can have massive parallelism. Sharding is also an easy way to fill up the IO.


Haven't tried it, but SQLite supports some type of concurrent modification now

https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...


I’ve definitely parallelized http requests for significant improvement in python before.


You can use Processpool but at that point you’re way into re-architecting.


Not much more than the switch to MapReduce and threads. Actually the interface is exactly the same if you use executors from concurrent.futures.
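
For example, a sketch where the worker function is a hypothetical stand-in for the per-chunk word count; only the executor class changes:

    import collections
    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

    def count_words(chunk):
        # hypothetical map step: count words in one chunk of text
        return collections.Counter(chunk.lower().split())

    def count_all(chunks, use_processes=True):
        Executor = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
        with Executor() as pool:                 # identical interface either way
            total = collections.Counter()
            for part in pool.map(count_words, chunks):
                total.update(part)
            return total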


inter-process communication has its own overhead though.


You can usually counteract that by sending large enough batches to the processes.


> Nope, the GIL will make that useless.

In Python yes. I missed that. The Go implementation would still benefit from multiple threads, wouldn't it?


Yes I think so.


A more obvious optimisation (to me) would be to leverage the native functions and avoid creating a list of ~80 million strings in memory.

On my machine, the base script pretty reliably takes ~10s:

    Reading   : 0.1935129165649414
    Processing: 9.955206871032715
    Sorting   : 0.0067043304443359375
    Outputting: 0.01335597038269043
    TOTAL     : 10.168780088424683
Switching content to a no-op (`content = sys.stdin`) and feeding `Counter` from a native iterators pipeline:

    from itertools import chain   # plus the script's existing "import collections, sys"
    counts = collections.Counter(chain.from_iterable(map(str.split, map(str.lower, content))))
is a pretty reliable 10% gain:

    Reading   : 1.1920928955078125e-06
    Processing: 8.863707780838013
    Sorting   : 0.004117012023925781
    Outputting: 0.012418985366821289
    TOTAL     : 8.880244970321655
As far as I can tell, the bottleneck is about half the preprocessing (lowercasing and splitting) and half filling the Counter.

You won't get a 10x gain out of that though.


That is of course an easy solution but I would argue that this is just throwing more resources at the problem. Not a very impressive optimization.

Emery Berger has a great talk [1] where he argues that it is mostly pointless to optimize python code, if your program is slow, you should look for a properly optimized library to do that work for you.

1: https://www.youtube.com/watch?v=vVUnCXKuNOg


> That is of course an easy solution but I would argue that this is just throwing more resources at the problem. Not a very impressive optimization.

You could say the same about the existing implementation as that reads the whole file into memory instead of processing it in chunks.


Or better yet, write a compute shader since hashing is an embarrassingly parallel operation.

That said, the OP's article is correct in that straightforward idiomatic implementations of this algorithm are very much compute bound. The corollary is that eng work put into optimizing compute usage often won't be wasted for programs processing disk data (or even network data with modern 10Gb fiber connections).


EBS costs crazy crazy amounts for reasonable iops

We pay 7k per month for RDS that can barely do 2k IOPS... meanwhile a machine at Hetzner does 2 million IOPS for 250 euro per month (not to mention it also has 4x more cores and 5x more RAM).

So, even though I/O is no longer the bottleneck physically, it still is a considerable issue and design challenge in the cloud.


Well yes it's a total ripoff

I installed a DB server for a customer around 2 years ago, in a DC near him, with 16 cores, 48GB RAM and ~6TB -> 12 SSDs, vdevs mirrored in pairs and a stripe over the mirrored vdevs (kind of a RAID10, but ZFS), zstd compression (1GB could be compressed down to ~200MB, so 5 times less reading/writing, and in theory ~30TB of pure DB data, 20TB realistic; remember, never fill a zpool over 72%), record size 16kB (PostgreSQL). After 3 months the machine was paid off (compared to the "cloud" price) and the performance was around 10-12 times higher.

Called the customer about two months ago and he said the DB server is still too fast and maybe he wants another one that uses less power... ;)


The cloud costs really are between "unreasonable" and "very unreasonable". The only time it gets cheaper is if the workload were so spiky we could've turned most of it off for 2/3 of the day, but most of what we have is many smaller projects that don't even have enough traffic to need scaling, and the big ones, well, can't scale database size down off-peak...

Over last ~6 years we did "is it worth going to cloud" calculation few times and it was always ridiculously more expensive.


Virtualized storage systems increase latency (hurting QD1 IOPS). Naively built non-parallel apps tend to rely on QD1 IOPS performance, so they run very slowly on a cloud platform compared to a dev machine with direct-attached NVMe.


>Naively built non-parallel apps tend to rely on QD1 IOPS performance

PostgreSQL is pretty much parallel but i know what you mean...the beehive ;)


> remember never fill a zpool over 72%

Could you please explain where this number comes from?


Yeah, that's a myth now. It's not current advice.


72% is my rule of thumb for write-heavy production stuff (my absolute limit would be 75%), but it depends on record size, raidz level, whether you have mostly write or mostly read workloads, how big your files are, how many snapshots you have, whether you have a dedicated ZIL device, and much more. For a home NAS (movies etc.) you can easily go up to 85%... if it's a "~WORM" workload maybe 90%... but resilvering can then be a thing of days (weeks?), depending on the raidz level or mirror etc.

>Yeah, that's a myth now. It's not current advice.

It's not, and you know it. Keep it under 72%, believe me, if you want a performant ZFS (especially if you delete files and have many snapshots... check the YT video linked at the end).

>>Keep pool space under 80% utilization to maintain pool performance. Currently, pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Full pools might cause a performance penalty, but no other issues. If the primary workload is immutable files (write once, never remove), then you can keep a pool in the 95-96% utilization range. Keep in mind that even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer.

https://web.archive.org/web/20150905142644/http://www.solari...

And under no circumstances go over 90%:

https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

>An introduction to the implementation of ZFS - Kirk McKusick

https://www.youtube.com/watch?v=TQe-nnJPNF8


Agree! And at the end of the day, we are optimizing for cost. Although, the EBS portion of that 7k RDS bill is going to be tiny, right?


Well, it's tiny if you keep it at 2k and make sure you don't touch disk much; god forbid someone makes a query that requires a temp table. A query you wouldn't even notice on a bare-metal machine brings the whole RDS setup down: it can't even read the write-ahead log, can't replicate, etc. It's like watching a slow-motion train wreck from 2 queries per second.


the state of benchmarking by normal IT people is tragic. If one checks out his 'optimization problem statement' article [1] they can find:

>ASCII: it’s okay to only support ASCII

>Threading: it should run in a single thread on a single machine

>Stdlib: only use the language’s standard library functions.

This is truly 1978 all over again. No flame graphs, no hardware counters, no bottleneck analysis. Using these 'optimizations' for job interviews is questionable at best.

[1] https://benhoyt.com/writings/count-words/


I'm happy to be a "normal IT person". :-)

If you look further at the count-words article you linked, I do have profiling graphs and bottleneck analysis.

Note that the interview questions I ask are open-ended, not trying to trick or trap people into giving the "wrong" answer. I like to have more of a discussion to see how they think about the problem, what data structures they'd use, how they'd profile, and so on.


Sorry for the late reply. I am glad that you are raising these topics regularly, tbh. If more people stop write-only coding and ask themselves the questions you have asked, it will already be a big step forward. If they ask what profiling tools they should use, it will be great. I did have a look at your article, and I cannot recommend limiting yourself to the tools you have used.

I come from gamedev low level coding and performance analysis so I understand that my point of view is not normal xD


Still better than obscure question from page 20 of Leetcode.


Interesting! This made me wonder -- would this kind of optimization be recognized and rewarded in colossal scale organizations?

I've seen comments about Google multiple times here where people say you won't be getting promotions unless you're shipping new things -- maintaining the old won't do it.

But if you get to something core enough, it seems like the numbers would be pretty tangible and easy to point to during perf review time?

"Found a smoother way to sort numbers that reduced the "whirrrrrr" noise our disks made. It turns out this reduces disk failure rates by 1%, arrested nanoscale structural damage to the buildings our servers are in, allowed a reduction in necessary PPE, elongaded depreciation offsets and other things -- this one line of code has saved Google a billion dollars. That's why my compensation should be increased to include allowing me to fall limply into the arms of another and be carried, drooling, into the office, where others will dress me"

In this hypothetical scenario, would a Googler be told "Your request has been approved, it may take one or two payment periods before your new benefits break into your apartment" or "No, you need to ship another chat program before you're eligible for that."?


Yeah, when I was there I saw plenty of <1% optimizations saving REDACTED gobs of money, and people were rewarded for it. I don't think it's applicable to most teams though.

Imagine a foo/bar/widget app that only serves 20B people (obvious exaggeration to illustrate the point) and is only needed up to a few hundred times per day. You can handle that sort of traffic on a laptop on my home router and still have enough chutzpah left to stream Netflix. I mean, you are Google, and you need to do something better than that [0], but the hardware for your project is going to be negligible compared to other concerns unless you're doing FHE or video transcoding or something extraordinarily expensive.

Walk that backward to, how many teams have 20B users or are doing extraordinarily expensive things? I don't have any clue, but when you look at public examples of cheap things that never got much traction and probably had a suite of engineers [1], I'd imagine it's not everyone in any case. You're probably mostly looking at people with enough seniority to be able to choose to work on core code affecting most services.

[0] https://www.youtube.com/watch?v=3t6L-FlfeaI

[1] https://killedbygoogle.com/


Google seems to have a problem of "you won't get a promotion/raise if you don't work on something new", so they are not interested in services that just work and provide a tidy little constant revenue.


To some extent it's true on a macro scale, but it's a trope to say this is exclusively the case. I would say it is easier to paint the picture for promotion by working on new things but not by a large amount. The vast majority of folks at Google work on maintenance, as would be expected.


> Found a smoother way to sort numbers that reduced the "whirrrrrr" noise our disks made. It turns out this reduces disk failure rates by 1%, arrested nanoscale structural damage to the buildings our servers are in, allowed a reduction in necessary PPE, elongaded depreciation offsets and other things -- this one line of code has saved Google a billion dollars

Hah! I mean, if you can truly prove a business benefit by improving performance, I’m sure that you’d have a good shot at a promotion. Thing is it’s actually quite difficult to do so, and in the likely chance you cannot it just looks like you squandered a bunch of time for no reason.


Metrics are religiously collected and any sizable performance improvement will have a clear impact on one metric or another no?


Pay increases and promotions are for the most part a little more formulaic and upper-bounded than what you describe, but generally speaking if you can prove you saved the company $XM you will be rewarded (from firsthand experience, this is also true at Meta).


"Promotion? You must be joking. Your little stunt caused a revenue spike that pushed us over the threshold in $country, prompting $regulatory-agency to look into starting antitrust proceedings. Mitigating this will require an army of lawyers, that will cost approximately a third of the $1B you 'saved' us. Additionally, we will now have to create another throwaway chat app, for which we'll allocate another third of a billion in the budget. The final third... will go to executive bonuses, obviously.

You are hereby placed on a Performance Improvement Plan, starting tomorrow. On the off chance you'll come out of the other end still employed, keep in mind that your manager isn't being stupid by forbidding such 'optimizations', they're just following orders."


They'd likely have become one of the 10k getting fired for "underperforming". Company management in basically all businesses doesn't like know-it-alls that do the opposite of what they are told, regardless of the outcome.


Google's core services wouldn't work so reliably if they didn't value optimization and engineering. I don't work there, but I'm pretty sure that the SREs and the developers behind Search and Maps don't get fired based on how many products they launched.


Is there a valid source for this number? All the articles I see seem to quote each other.


You don’t know what you’re talking about.


[former Googler]

Yes, I occasionally saw people get highlighted for making optimizations like "this saves 1% in [some important service]". When you're running millions of machines, 1% is a lot of machines. However, it's also likely the case that the easy 1%s have already been found...


> This made me wonder -- would this kind of optimization be recognized and rewarded in colossal scale organizations?

It depends if this kind of optimization is valuable to the organization. Often times it's not. Spending money and time to save money and time is often viewed as less efficient than generating more revenue.


I was recently working on parsing 100K CSV files and inserting them into a database. The files have a non-column-oriented header and other idiosyncrasies so they can't be directly imported easily. They're stored on an HDD so my first instinct was to focus on I/O: read the whole file into memory as an async operation so that there are fewer larger IOs to help the HDD and so that other parser tasks can do work while waiting for the read to complete. I used a pretty featureful C# CSV parsing library which did pretty well on benchmarks [0] (CsvHelper) so I wasn't really worried about that part.

But that intuition was completely wrong. The 100K CSV files only add up to about 2GB. Despite being many small files, reading through them all is pretty fast the first time, even on Windows, and then they're in the cache and you can ripgrep through them all almost instantaneously. The pretty-fast parser library is fast because it uses runtime code generation for the specific object type that is being deserialized. The overhead of allocating a bunch of complex parser and type-converter objects, doing reflection on the parsed types, and generating code for a parser means that for parsing lots of tiny files it's really slow.

I had to stop worrying about it because 2 minutes is fast enough for a batch import process but it bothers me still.

Edit: CsvHelper doesn't have APIs to reuse parser objects. I tested patching in a ConcurrentDictionary to cache the generated code and it massively sped up the import. But again it was fast enough and I couldn't let myself get nerd sniped.

Edit2: the import process would run in production on a server with low average load, 256GB RAM, and ZFS with zstd compression. So the CSV files will live permanently in the page cache and ZFS ARC. The import will probably run a few dozen times a day to catch changes. IO is really not going to be the problem. In fact, it would probably speed things up to switch to synchronous reads and remove all the async overhead. Oh well.

[0]: https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-pa...


My immediate thought was are you measuring throughput or latency?

The latency of reading from disk is indeed very slow compared to CPU instructions.

A 3GHz processor runs 3 billion cycles a second (3,000,000,000), and some instructions take 1 cycle. You get 3 cycles per nanosecond. An SSD or spinning disk access costs many multiples of cycles.

Read 1 MB sequentially from SSD*: 1,000,000 ns

That's a lot of time that could be spent doing additions or looping.

https://gist.github.com/jboner/2841832


That is true, but assuming file I/O is blocking you also have to pay for a context switch to take advantage of that.

But I guess you could avoid that using eg. io_uring.


And you get 2-4 instructions per cycle on average.


I agree - "IO bandwidth is no longer a BW bottleneck" would be a better title.


I encountered this myself yesterday when attempting to performance test WebSockets in JavaScript: https://github.com/prettydiff/share-file-systems/blob/master...

The parsing challenge is complex enough that it will always be faster to extract the data from the network than it is to process it. As a result, excess data must be stored until it can be evaluated, or else it must be dropped; therefore the primary processing limitation is memory access, not CPU speed executing instructions. JavaScript is a garbage-collected language, so you are at the mercy of the language, and it doesn't really matter how you write the code: if the message input frequency is high enough and the messages are large enough, memory will always be the bottleneck, not the network or the application code.

In terms of numbers this is provable. When testing WebSocket performance on my old desktop with DDR3 memory I was sending messages (without a queue or any kind of safety consideration) at about 180,000 messages per second. In my laptop with DDR4 memory the same test indicated a message send speed at about 420,000 messages per second. The CPU in the old desktop is faster and more powerful than the CPU in the laptop.


NVMe storage really is very fast for sequential reads, but I'd respectfully suggest that for simple tasks a Dell laptop with 1.6GB/s read speed should be bottlenecked by IO if the compute is optimised. For example simdjson can parse JSON at over 7GB/s. https://simdjson.org/


SSD is pretty fast; but my app is actually trying to do more than 100_000 read-modify-write cycles per second and that still requires careful thought about the database and schema we're using.

CPU and RAM are pretty fast. I do a live-coding interview question and I ask people to do a naive implementation first, then later I ask about possible optimizations. A third to a half of candidates want to do fewer RAM accesses, and oh boy is that the wrong avenue for this problem - especially when they just wrote their solution in Python and you could get a 10x-20x speedup by rewriting in C/C++/Go/Rust/etc.

Network is IO too. Network is pretty fast, datacenter-to-datacenter, but end users can still have their experience improved with better encoding and protocol; and outbound bandwidth bills can be improved by that too.


Well, it depends on how far your datacenters are - in the end you're still limited by the laws of physics (speed of light). So 'fast' may actually be several ms, which might be a lot, or not a lot, depending on the problem you're trying to solve.


Wouldn’t memory allocation still be IO of a different resource? We’re still getting slowed down reading and writing bits to a storage device. Perhaps it’s not the hard drive but the claimed blocker here doesn’t appear to be CPU.


Reading and writing to main memory is not usually called “IO” in this context.


That discussion reminds me of Intel Optane. The current distinction between hard disks and RAM isn’t a necessity dictated by some law of nature. Yet it’s the box most people think within (for good reason).


Not really, no. Allocation involves reading/writing to memory, but that's not why it's slow. It's slow because allocation involves a context switch into the kernel. And an allocation algorithm itself isn't trivial.


user space code will not use the kernel and there will be no context switch. it will call a function such as malloc, which is a user-space function. malloc will then go on to interact with the memory allocation sub-system, which may occasionally need to ask the os for more memory via a system call, if the allocator is any good.


Yeah malloc is user space code, but it will do a syscall like sbrk to actually get memory from the kernel.

The default malloc in glibc does not pad the values given to sbrk, so you have to do a syscall for every 4k chunk of memory (the pagesize). So unless you do lots of very small (<<4k) allocations, you call sbrk pretty often.

You will also page fault when you access the new pages, and this traps into kernel code again.

So yeah, you are technically correct that some allocations may be fast because the memory is already available and mapped. Allocations, on average, are still slow because it involves context switches to the kernel (potentially multiple).

TLDR: you make it sound like a syscall within malloc is rare, but many/most allocations will trigger a syscall.


No, for a start some allocators don't even use sbrk (all mmap), and glibc's ptmalloc will use mmap for large allocations, ie you can allocate multiple megabytes with a single syscall with ptmalloc. Granted, if memory has been mmaped, you will pay a page fault because of the zero page optimization when you first write to it.

Furthermore, memory which is freed is often not returned to the os, either for fragmentation (you've used sbrk..) , or performance reasons (minimize syscalls), and put in a free list instead. The next call to malloc then will not require a syscall, if it can be satisfied with existing freed blocks.


It is rare because one of the very first things anyone does if they're concerned about allocations is replace glibc malloc.


Indeed - for that particular problem a big cost is the "IO" from main memory "into the CPU cache". Ben is careful to qualify it as "disk" IO, but even this is somewhat vague. (as is "CPU cache" vs L3/L2/L1 - Ben's interview problem is highly L2 sensitive, I think, and increasing variety in that size will make results harder to interpret.)

On a modern gen4 NVMe, I routinely get 7 GiB/s. gen5 is supposed to double that (as soon as manufacturers get "enough" money out of gen4, given PCIe4's extremely short life compared to gen3's extremely long one).

There was a time not long ago (maybe still) where highly scaled up, many core (40+) Intel CPUs could not match that getting from DIMMs into L3 for just 1 core (as per his interview problem). So, we are indeed moving into an era where "IO" from the primary persistent device is indeed no worse than IO from DIMMs, at least in bandwidth terms. DIMMs still have much better latency and the latency-BW ambiguity has been observed elsethread.

EDIT: I should clarify, to connect my text with your comment, that the real cost of (1-core, uncontended) allocation is also more "populating/mapping the cache" with copies, not just "allocation" in itself.


A few ballpark numbers I encountered:

Sequentially reading a file on a spinny laptop disk was about 80-100 MB/s. On an SSD that went up to 400-500 MB/s for me.

That's the sequential case! What about random access? I tried an experiment where I memory mapped a large file and started updating bytes at random. I could get the rate down to kilobytes/sec.

Even though we've all heard that SSDs don't pay as much as a penalty for random access as spinny disks, it's still a huge penalty. Sequential spinny disk access is faster than SSD random access.


> I tried an experiment where I memory mapped a large file and started updating bytes at random. I could get the rate down to kilobytes/sec.

Memory-mapped IO means you're only giving the SSD one request to work on at a time, because a thread can only page fault on one page at a time. An SSD can only reach its peak random IO throughput if you give it lots of requests to work on in parallel. Additionally, your test was probably doing small writes with all volatile caching disallowed, forcing the (presumably consumer rather than enterprise) SSD to perform read-modify-write cycles not just of the 4kB virtual memory pages the OS works with, but also the larger native flash memory page size (commonly 16kB). If you'd been testing only read performance, or permitted a normal degree of write caching, you would have seen far higher performance.
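
For contrast, a rough C sketch (compile with -pthread; the file name and counts are made up, error handling omitted, and this is not a real benchmark) of the kind of access pattern that does let the drive work on many requests at once, with each thread keeping a random 4 KiB pread in flight:

  /* Sketch only: random 4 KiB reads from several threads so the drive sees a
     queue depth greater than 1. "large.bin" and the counts are placeholders. */
  #include <fcntl.h>
  #include <pthread.h>
  #include <stdlib.h>
  #include <unistd.h>
  
  #define THREADS 8
  #define READS_PER_THREAD 10000
  #define BLOCK 4096
  
  static int fd;
  static off_t blocks;   /* number of 4 KiB blocks in the file */
  
  static void *worker(void *arg) {
      char buf[BLOCK];
      unsigned seed = (unsigned)(long)arg;
      for (int i = 0; i < READS_PER_THREAD; i++) {
          off_t off = ((off_t)rand_r(&seed) % blocks) * BLOCK;
          pread(fd, buf, BLOCK, off);   /* one outstanding request per thread */
      }
      return NULL;
  }
  
  int main(void) {
      fd = open("large.bin", O_RDONLY);          /* assumed to be many GB */
      blocks = lseek(fd, 0, SEEK_END) / BLOCK;
  
      pthread_t t[THREADS];
      for (long i = 0; i < THREADS; i++)
          pthread_create(&t[i], NULL, worker, (void *)i);
      for (int i = 0; i < THREADS; i++)
          pthread_join(t[i], NULL);
      close(fd);
      return 0;
  }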


True. And main memory access can easily become slower than SSD sequential access if you do random byte accesses and your working set is larger than the CPU's caches or TLBs.


> Sequential spinny disk access is faster than SSD random access.

It is, but on both kind of drives you'll want to dispatch at least a couple of requests at once to get better performance. In the memory-mapped case, that means using multiple threads.

In addition, you might also want to call madvise(MADV_RANDOM) on the mapping.
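
A minimal sketch of that setup (hypothetical file name, no error handling); the worker threads would then fault pages in from the mapping:

  /* Sketch: map the file and hint random access so the kernel skips readahead.
     "large.bin" is a placeholder; error handling omitted. */
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>
  
  int main(void) {
      int fd = open("large.bin", O_RDONLY);
      struct stat st;
      fstat(fd, &st);
  
      unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      madvise(p, st.st_size, MADV_RANDOM);   /* don't read ahead around each fault */
  
      /* ... touch pages from several worker threads here, so the SSD sees
         more than one outstanding request at a time ... */
  
      munmap(p, st.st_size);
      close(fd);
      return 0;
  }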


I didn't think there'd be that much discussion on my point (seq hdd > rand sdd).

> In the memory-mapped case, that means using multiple threads.

My gut tells me I'd lose more to contention/false-sharing than I'd gain through multithreading - but I haven't done the experiment.


> Sequential spinny disk access is faster than SSD random access.

No it’s not. At least not with modern SSDs or NVMe storage.

Even at 100 MB/s, a spinning disk in sequential mode is doing 100 x 1024 / 4 = 25,600 IOPS (assuming a standard 4K per operation).

Even consumer grade NVMe hardware gets 5-10x of that for random workloads.


> Even consumer grade NVMe hardware gets 5-10x of that for random workloads.

Cool, lots of IOPS!

But like I said, I got it down to kilobytes/sec.


Because you did it in the most inefficient way.

This [0] comment is totally on point.

Also note that a consumer SSD can be made with even a single flash chip. More performant ones are made of a bunch of chips internally (essentially a RAID0 with some magic) so they can do parallel operations if the data resides on different flash blocks. Still, if your thread is only doing one operation at a time with blocks smaller than the flash rewrite block size, you will hit write amplification anyway.

I think if you do the same test but without a memory mapped file (ie let the OS and disk subsystem do their thing) you will get much more speed.

[0] https://news.ycombinator.com/item?id=33751973


On what hardware and how much parallelization? The max IOPS numbers are only when you’re saturating the command queue.

It’s a throughput number, not a single operation and completion followed by the next one.


it is true that spinning rust is slower than most ssds including nvme ssds even in sequential mode

however, a spinning disk doing a sequential access is not doing 25600 iops

if the sequential access lasts 10 seconds it is doing 0.1 iops


As the algorithm used in the example is straightforward, I figured that using UNIX command-line tools might be an even simpler way to implement it. Here is what I came up with:

  time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | sort --buffer-size=50M | uniq -c | sort -hr > /dev/null
On my machine this turned out to be ~5 times slower than the provided Python implementation. Nearly all of the time is spent in the first invocation of `sort`. Further increasing the buffer size doesn't make a significant difference. I also played around with the number of threads `sort` uses, but didn't see any improvement there either.

I'm quite puzzled why `sort` is so much slower, especially as it does sorting in parallel utilizing multiple CPU cores, while the Python implementation is single-threaded.

Does somebody have an explanation for that?


Dangit, I'm supposed to be doing yardwork today so you hijacked my procrastination motivation haha!

Edit: I had no idea that awk was so fast, and I suspect that only parallelization would beat it. But I agree with the others that the main bottleneck is the `sort | uniq` for results1.txt

  # https://stackoverflow.com/a/27986512 # count word occurrences
  # https://unix.stackexchange.com/a/205854 # trim surrounding whitespace
  # https://linuxhint.com/awk_trim_whitespace/ # trim leading or trailing whitespace
  
  time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | sort --buffer-size=50M | uniq -c | sort -hr > results1.txt
  real 0m13.852s
  user 0m13.836s
  sys 0m0.229s
  
  time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | awk '{count[$1]++} END {for (word in count) print count[word], word}' | sort -hr > results2.txt
  real 0m1.425s
  user 0m2.243s
  sys 0m0.061s
  
  diff results1.txt results2.txt
  109,39133c109,39133
  # many whitespace differences due to how `uniq -c` left-pads first column with space
  
  diff <(cat results1.txt | awk '{$1=$1};1') <(cat results2.txt | awk '{$1=$1};1')
  # bash-only due to <() inline file, no differences after trimming surrounding whitespace
  
  cat results1.txt | awk '{ sub(/^[ \t]+/, ""); print }' | diff - results2.txt
  # sh-compatible, no differences after trimming leading whitespace of results1.txt
  
  # 13.836 / 2.243 = ~6x speedup with awk


I don't understand how you're getting <2s for that awk result. I'm testing on slightly older hardware, for example I get 4.6s and 11.9s for the optimized and simple go versions taken from the git repo.

But then I also get:

  # time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | awk '{count[$1]++} END {for (word in count) print count[word], word}' | sort -hr > results2.txt
  real 0m23.174s
  user 0m23.309s
  sys 0m1.234s

So my result is 10x slower than yours.

What are you running this on and where do I get one?


Oh gosh, sorry to burst your bubble but it's a 2011 Mac Mini :-P

  2.3 GHz Intel Core i5
  8 GB 1333 MHz DDR3
  Intel HD Graphics 3000 512 MB
  macOS High Sierra 10.13.6 (17G14042)
  512 GB PLEXTOR PX-512M5Pro SSD (Get Info says I installed it July 2, 2011 but it might be a clone of another drive)
<rant>

I really like it, but will probably have to sell it because it has various software failures, like sometimes one of my displays won't turn on or goes black and I have to restart. That bug seems to be fixed on newer macOSs like the one on an Intel MacBook Pro I use for work, but Apple artificially sunsets their hardware by preventing newer versions of macOS from being installed and not back-porting bug fixes to previous macOSs. Since pretty much all computers today are Turing-complete, that feels.. disingenuous.

Computers haven't gotten appreciably faster for roughly 15 years since R&D funding shifted to mobile in 2007 and Moore's Law ended. All that matters today is whether we are using an SSD and how wide the memory bus is, since speed there hasn't changed much either, just latency. And Apple's not the only one treading water. PCs often suffer from mismatched hardware, so maybe an Intel i9 gets installed on a logic board with a memory bus too slow to recruit it. I built a gaming PC a few years back and I may have inadvertently underpowered it by putting most of the budget into the RTX 2070. Since video cards can't do the everyday workloads we're discussing, I mostly consider them a waste of time and mourn what might have been had CPUs kept improving instead.

Apple's Arm M1 is a logical progression off of Intel, but I can't really endorse it, since they chose a relatively complex architecture where a big dumb array of cores would have been more scalable. If some indie brand comes along and builds one of the 1000+ core CPUs I've blabbered on about, I can't say that I'll have much sympathy for the current big players.

Due to all of that, I perceived computers in 2010 as being roughly 1000 times slower than they could/should be had they kept up with Moore's Law, and computers in 2020 as being roughly 1000000 times slower (the ratio of GPU to CPU FLOPs for example). It doesn't help that stuff like Spotlight and Safari eagerly take 100+% CPU or that basically all PCs are bogged down with either spyware or the daemons that supposedly find and remove spyware (thank you M$). Or that we don't have the network computing that Sparc had in the 1990s, where all of the computers on the LAN were available for additional cores seamlessly. Just slow on top of slow on top of slow under surveillance capitalism yay!

</rant>


It's because that first invocation of sort is sorting the entire input (413MB), not just the unique words (less than a MB). The sort is probably O(NlogN), but that's a big N. Counting by inserting into a hash table is much faster, at O(N).


`sort | uniq` is really slow for this, as it has to sort the entire input first. I use `huniq` which is way faster for this. I'm sure there are many similar options.

https://github.com/koraa/huniq


It looks like you're sorting the whole file, while the Python implementation sorts only unique values.


Related posts from my past experience from about 10 years ago:

* Splitting long lines is slow[1]

* Can Parallel::ForkManager speed up a seemingly IO bound task?[2]

In both cases, Perl is the language used (with a little C thrown in for [1]), but they are in a similar vein to the topic of this post. In [1], I show that the slowness in processing large files line by line is not due to I/O, but due to the amount of work done by code. In [2], a seemingly I/O bound task is sped up by throwing more CPU at it.

[1]: https://www.nu42.com/2013/02/splitting-long-lines-is-slow.ht...

[2]: https://www.nu42.com/2012/04/can-parallelforkmanager-speed-u...


I/O also means network to me. Often the target (a database, or a device generating telemetry) is 10+ ms away. That round-trip time is bottlenecked by physics (the speed of light). A side benefit of SQLite is that it's on the local file system / in memory.


"sorting with O(n^2) is no longer a bottleneck as we have fast processors" /s


That makes no sense. The brutal math of a polynomial like n^2 is always going to be poor enough to notice compared to lower-order times.


Often but not always. Cool trick: any bounded limit is always O(1)!

Pick a small enough bound and an O(n^2) algorithm behaves better than an O(n log n). This is why insertion sort is used for sorting lengths less than ~64, for example.


Pick a small enough bound and certain O(n^2) algorithms will behave better than certain O(n log n) algorithms.

Big O notation doesn't take into account constant factors of overhead or plain old once-per-run overhead.


Sorry it was indeed a typo


The title I/O is _no longer_ the bottleneck seems to suggest disk speed has caught up, while in reality the slowness is due to poor implementation (slow Python or Go with lots of allocations).

The real problem to me is that languages are too high-level and hiding temporary allocations too much. If you had to write this in C, you would naturally avoid unnecessary allocations, cause alloc / free in the hot loop looks bad.

Presumably soon enough it's very unlikely you find any new word (actually it's 10 passes over the same text) and most keys exist in the hashmap, so it would be doing a lookup and incrementing a counter, which should not require allocations.

Edit: OK, I've run OP's optimized C version [1] and indeed, it only hits 270MB/s. So, OP's point remains valid. Perf tells me that 23% of all cache refs are misses, so I wonder if it can be optimized to group counters of common words together.

[1] https://benhoyt.com/writings/count-words/


I don't know if I can agree. Python is certainly problematic for the reasons you outlined, but Go is much closer to C in terms of performance. Go is also compiled, and as such you have similar opportunity to optimize and measure allocs. In my experience it's much harder to tune Python, and Go is often easier to tune than C due to the great ergonomics of the compiler. You can also drop down to assembly!

But really, I disagree because I've frequently saturated massive IOPS. I/O is still the bottleneck. The article pretty much immediately excludes network I/O, which is in many cases more common than disk I/O. Even so, tiny single-threaded programs reading words one-at-a-time are obviously not going to be I/O constrained with modern disks. For these types of programs, I/O hasn't been a bottleneck in a long, long time, and I'd actually be surprised to hear candidates suggest otherwise.


Let’s call spades spades: C spanks Go in every measure. Go is nowhere near C speed.

It’s certainly more performant than any dynamically typed scripting language: JavaScript, Python, Ruby, etc but it’s probably closer to C#.


Go’s problem is that in practice there are interfaces (fat pointers) everywhere. I’m about to start optimizing one of the services my team owns, and my first step is going to be looking for all the bad performance issues that come out of this.


yeah, a friend of mine wrote a logfile parser in 01996 in c, and when i rewrote it in perl (also in 01996) it was about one tenth the size and ten times slower

whether i/o is the bottleneck depends on what you're doing and on which computer, and that's been true for at least 50 years


May I ask what's the rationale for writing 01996 rather than 1996? I've seen this before but I haven't seen an explanation of the advantage of it.



But it looks like an octal literal.


Haha, I thought it was some architecture I’ve never heard of.


I mean, Go is closer to C than Python is. But Go is still a garbage collected language with a runtime. It's not going to come close to the limits of performance that C offers


For a script like this you'd be surprised. If you don't do too much allocation, the garbage collector need not run. In fact it's common to disable it at the start. Now you can allocate much faster than C because you never clean up after yourself!

A real world example is esbuild, the author implemented it both Rust and Go initially. The Go version was faster and the code simpler. Which is why it's implemented in Go.


> A real world example is esbuild, the author implemented it both Rust and Go initially. The Go version was faster and the code simpler. Which is why it's implemented in Go.

But why is swc faster than esbuild then? The code isn't even considerably more complex.


Because different programs, implemented differently, run at different speeds...

I'm saying the performance of Go can sometimes be surprisingly fast. Not that it's magic.


So you're saying they wrote exactly the same program in Go and Rust for the comparison, changing the syntax only? Well then it's no surprise the Go version was faster.

Don't write Rust as if it was Go. That doesn't say anything meaningful about either Go or Rust.


Actually he wrote the Rust version first, so you're wrong to jump to conclusions.

I'm not trying to say Go is faster than Rust, it's usually slower. But there are always exceptions to the rule. The Go code, on the other hand, is usually simpler and quicker to write. For that reason I'd prefer Go if the problem lends itself to a garbage collected language.


So what did you mean by this?

> Because different programs, implemented differently, run at different speeds...

We're talking about two programs with exactly the same purpose - ingest TypeScript and output JavaScript. It's a pretty clear-cut comparison, IMHO.

> The Go code, on the other hand, is usually simpler and quicker to write

I'm writing Go code at work, and Rust code mostly for fun (but used it at work too). I'd say this has changed significantly in the last 2 years. Now with rust-analyzer and much improved compiler output, writing Rust is very simple and quick too. I guess getting into Rust can be a little harder if you've only ever used GCed languages before, but it's not that hard to learn - and once you do it's super-effective. And the type inference of Rust is a huge reason why I'm using it - while Go has none.

Another thing to consider - usually code in Go is much more about writing algorithms yourself instead of using library functionality (this is changing slowly thanks to the new support for generics, but most code hasn't caught up yet and there aren't good libs using it so far). The resulting Go code can be quite convoluted and contain very hidden bugs. People also usually don't bother implementing a proper search/sorting algorithm for the sake of simplicity/speed of development - which you'd get automatically if you used a library function - so the code is less efficient. My Go code is usually 2-3x longer than the equivalent in TypeScript or Rust.

Go is great, I like it. Rust is great too. I recommend you to do what the esbuild author did - test it and choose for yourself, don't bother too much about others' opinion.


> We're talking about two programs with exactly the same purpose - ingest TypeScript and output JavaScript. It's a pretty clear-cut comparison, IMHO.

There are an infinite number of ways to design two programs for that task, with different trade-offs. You can't draw conclusions about which language is faster based on two different implementations by different people.

> Go is great, I like it. Rust is great too. I recommend you to do what the esbuild author did - test it and choose for yourself, don't bother too much about others' opinion.

I'm actually writing Rust code the last two years. It's been a while since I've used Go. But I'd rather use Go if the problem allows for a garbage collector. It's just simpler than managing it manually in Rust with the borrow checker and its rules. This is my opinion, nobody else's.


The GP didn't say that, maybe they wrote Go as if it were Rust, for all we know.


You can find the comments from the ESBuild author here on HN:

https://news.ycombinator.com/item?id=22336284


Rust compiler is much faster today than it was in 2020, btw.

> This is a side project and it has to be fun for me to work on it.

I respect this 100% - but then we shouldn't assume Go is better than Rust just based on that esbuild used it instead of Rust.


You could LD_PRELOAD a free() that doesn't do anything in C too, or #define it away, to really make it no-cost.


You'd need to change malloc too so it doesn't do extra work. You could do it. Go obviously isn't going to be faster than well written C. But sometimes you can be surprised.


Having played with this a previous time it was posted and looking at the profiler results, the main difference between the Go and C isn't due to any memory allocations but rather the lack of in-place update and slightly higher cost from being general-purpose in Go's hash table. (This is also probably why Rust, without the GC overhead, clusters with Go rather than C.)


I'm fairly certain that if one wrote C (or C++) programs with the same safety targets that Rust has out-of-the-box the performance is close, if not (for practical purposes) identical.

Of course, to do much useful (and performant) in Rust one often has to break out `unsafe`, which eliminates some of the out-of-the-box guarantees for safety--and in some cases makes one wonder if it's worth all the overhead instead of just using C or C++.


> the same safety targets

Rust's selling point is that the safety targets' costs are dev/compile-time ones. There should not be a difference unless the C/C++ code requires some UB or extremely manual memory-management trickery, which it doesn't; and Go offers basically the same memory safety guarantees as Rust in this regard and is (slightly) faster.

In this case it's really almost entirely about the speed of the hash table.


I was referring to the overhead of development time. Rust is more complex than C, at least, and even after having gained proficiency coding in Rust still takes more time than writing an equivalent C program.

And "zero-cost" is misleading. There are definitely performance impacts from the implicit (and unadvertised explicit) bounds checking some of Rust's features come with. Writing a C program to an equivalent level of safety would have similar performance impacts. Hence, for as close to the same safety as possible, Rust and C should be almost identical in terms of performance.


That's probably the reason why Go got many people switching from their scripting languages and didn't end up as the C killer it was sold as in the beginning.


That's exactly my reason to use it: fast prototyping and ability to replace bash scripts (as they become too unwieldy to maintain and expand from one point and on).

I've also successfully made an MVP with Golang which I then proceeded to rewrite in Rust in almost only one go and almost without blockers along the way.

Golang is pretty good but it still lacks important things like algebraic data types, and they're hugely important for fearless refactoring and correctness.


I never remember it being sold as a C killer, maybe as a Java killer?


It was sold as C killer in the beginning.

And then a few years later there was an article saying the Go engineers were surprised when they saw C/C++ coders weren't switching to Go; rather, Python/Ruby coders were "upgrading" to Go.


golang.org circa 2010:

> go is ... fast

> Go compilers produce fast code fast. Typical builds take a fraction of a second yet the resulting programs run nearly as quickly as comparable C or C++ code.

https://web.archive.org/web/20100217123645/http://golang.org...

That seems to me like they were trying to say "If you want C/C++ performance but nicer/easier syntax, you can use Go", which turned out to be not that true in the end.

Edit: the old "Language Design FAQ" also goes further in detail on how the envision (the first version of the) language: https://web.archive.org/web/20100211104313/http://golang.org...


I believe Go was created to replace python at Google.


It was more sold as a C++ killer than a C killer.


The Go code is reasonably efficient at avoiding allocations, though it's hard to avoid some. Without thinking too hard, it's also going to be hard to avoid those same ones in C, and some possible improvements would apply to both (e.g. defining a maximum word length).

"Lowercase word count" is a surprisingly difficult case in this regard, because you need to check and potentially transform each character individually, and also store a normalized form of each word. Probably some smart SIMD lowercase function could help here but I don't think any language is going to offer that out of the box. It's also defined in a way I think detaches a bit much from real-world issues - it's handling arbitrary bytes but also only ASCII. If it had to handle UTF-8 it would be very different; but also if it could make assumptions that only a few control characters were relevant.


> but I don't think any language is going to offer that out of the box.

That's what compilers are for. I tried to improve the C version to make it friendlier to the compiler. Clang does a decent job:

https://godbolt.org/z/o35edavPn

I'm getting 1.325s (321MB/s) instead of 1.506s (282MB/s) on a 100 concatenated bibles. That's still not a 10x improvement though; the problem is cache locality in the hash map.


Note: just concatenating the bibles keeps your hash map artificially small (EDIT: relative to more organic natural-language vocabulary statistics)... which matters because, as you correctly note, the big deal is whether you can fit the histogram in the L2 cache, as noted elsethread. This really matters if you go parallel, where N CPUs * N L2 caches can speed things up a lot -- until your histograms blow out the CPU-private L2 cache sizes. https://github.com/c-blake/adix/blob/master/tests/wf.nim (or a port to your favorite lang instead of Nim) might make it easy to play with these ideas (and see at least one way to avoid almost all "allocation" - under some interpretations).

A better way to "scale up" is to concatenate various other things from Project Gutenberg: https://www.gutenberg.org/ At least then you have "organic" statistics on the hash.


> I don't think any language is going to offer that out of the box.

C# offers that out of the box, and the solution is much simpler there.

Pass StringComparer.OrdinalIgnoreCase or similar (InvariantCultureIgnoreCase, CurrentCultureIgnoreCase) to the constructor of the hash map, and the hash map will become case-agnostic. No need to transform strings.


The one possible issue is that by not transforming the string you're going to run the possibly less efficient CI comparison a lot: because the corpus is text and duplicated, by my reckoning there are ~32000 unique "words" out of 82 million input "words". That's a lot of conflicts.

Though the values should mostly be quite short, so a vectorised comparison might not even trigger as it wouldn't have the time to "stride": only one word of the top 10 even exceeds 4 letters ("shall", a hair under a million in my corpus).


The C# standard library will not run many case-insensitive comparisons. That comparison object doesn’t just compare two objects for equality, it also has another interface method which computes a hash of a single object.

Here’s implementation of the hash function used by that StringComparer.OrdinalIgnoreCase: https://source.dot.net/#System.Private.CoreLib/src/libraries... As you see, it has a fast path for ASCII-only input strings.


> The C# standard library will not run many case-insensitive comparisons. That comparison object doesn’t just compare two objects for equality, it also has another interface method which computes a hash of a single object.

Which doesn't matter because I'm talking about identical strings, so they will hash the same by definition, and they will have to be compared.

So the question is how fast the CI hash and equality operate compared to the CS ones.

And I asked about comparison because I assumed that would be the costlier of the two operations, relative to its CS brethren.


> how fast the CI hash and equality operate compared to the CS ones.

If the string is ASCII like in the OP’s use case, I think the difference is not huge.

CS comparison looks more optimized, they have an inner loop which compares 12 bytes as 3 64-bit values: https://source.dot.net/#System.Private.CoreLib/src/libraries...

CI comparer doesn’t do that, it loads individual UTF-16 elements: https://source.dot.net/#System.Private.CoreLib/src/libraries... But still, it’s very simple code which does sequential memory access.

> And I asked about comparison because I assumed that would be the costlier of the two operations, relative to its CS brethren.

I think the bottleneck is random memory loads from the hash table.

Hashing and comparison do sequential RAM access. The prefetcher in the CPU will do its job, you’ll get 2 memory loads every cycle, for short strings going to be extremely fast. If that hashtable doesn’t fit in L3 cache, the main memory latency is much slower than comparing strings of 10-20 characters, no matter case sensitive or not.


But all of those optimizations can also be done, and more efficiently, by transforming while reading. Even if they're as fast as they can be, they're not as fast as a memcmp. The C# approach isn't buying you any performance here.


Yeah, it’s possible to change case while loading, but I’m not sure that’s gonna be much faster.

But I’m sure it's gonna be much harder.

For non-ASCII strings, converting case may change their length in bytes. You don’t even know in advance how much memory you need to transform 2GB of input text (or a 1MB buffer if streaming). And if streaming, you need to be careful to keep code points together: with a naïve approach you're gonna crash with a runtime exception when you split a single codepoint between chunks.

English words are 99.9% ASCII, but that remaining 0.1% like “naïve” is not. The C# standard library is doing the right thing for this use case. Specifically, for 99.9% of words the CI comparer will use the faster ASCII-only code to compare or hash, and only do the expensive shenanigans for small count of non-ASCII words.

Note how C# makes the implementation much simpler. A single parameter passed to the constructor of the Dictionary<string,Something> makes it implement case-insensitivity automagically.


> Probably some smart SIMD lowercase function could help here but I don't think any language is going to offer that out of the box.

No, but at least you have direct access to the intrinsics in C. To get vectorization in Go, you have to implement it in C and link that into your Go program.


> To get vectorization in Go, you have to implement it in C and link that into your Go program.

Go has an assembler which does not require implementing anything in C (though the assembler uses a more C-style syntax), nor critically does it require using CGo linkage. It's used to implement many hot paths in the stdlib.


Typically you would just implement vectorization in assembly in the Go case.


The file is only being read once according to the blog and then it is in memory. This isn't an I/O intensive application at all.

I have written a small script in Python that does something similar. I have a word list with 1000 words and I check the presence of the words. Here is the thing: for every word I go through the entire file, so let's say I scan the file one thousand times. In fact I did something more complicated and ended up going over the original file 6000 times, and yet it still took only three seconds. If all these scans had to reread the file it would take forever.


This sounds like a very inefficient method. The Python stdlib module 'collections' includes a Counter class which can do this for you. Then all you need to do is copy the dict key/value pairs that match your word list into a new one.


he has figures for both cached and uncached performance

if all your scans had to reread the file from nvme it would take five times as long if we extrapolate from those figures

not forever


the optimized version does it byte by byte


Of course disk speed has caught up. Single core clock speed hasn't changed in over 10 years now? Disk now uses SSD and has been chugging along with improvements.

I'm curious what a multi thread(multi process for python due to GIL?) comparison would be. Obviously people aren't doing this by default though, so the author's point still stands.


> Single core clock speed hasn't changed in over 10 years now?

This is not true. I thought the same thing, and you are right with regard to baseline clock speed. But the performance is still increasing. I just got a new PC at work with a 2021 high-end CPU, and it has 200% of the single-core performance of the 2015 mid-range CPU in my private PC:

https://www.cpubenchmark.net/compare/4597vs2599/Intel-i9-129...


Well, CPU got a lot better at benchmarks, that is true. Caches got bigger, predictions got better. Specialized instructions were added. IPC improvement kinda slowed down after Sandy Bridge, at least for Intel.

Also, the comment you're quoting is talking about clock speed, and the link you provided literally shows the same base clock speed - 3.2 GHz. Intel progressively pushed turbo speed higher, but that's the speed you could have achieved yourself by overclocking.


Does the CPU constantly hold the turbo speed under a single threaded workload?


Depends. The cpu attempts to hold the maximum possible speed on any cores that are in use.

On my water cooled and specifically tweaked desktop- yes. It’ll hold max boost indefinitely, even with all threads. (getting to about 80c after 10 mins). Single-thread max is faster, and it’ll hold that as well.

My laptop will pull power within 15 seconds and be down to base clocks in a couple mins. Unless I set it down outside and it’s very cold.


Most un-tweaked chips are going to be below 25 watts with a single core loaded, and lots of laptops can cool that without any problems.


It depends on motherboard and cooling. 6700K, for example, is constantly running at 4.2Ghz or 4.5Ghz (winter clocks). Constantly while thermals allow it... Non-overclocking motherboards allow it to boost for 2 minutes, after that, it's iffy.


Are you simply glossing over the fact that CPU 1 has a turbo of 5.2 GHz and the other 3.6 GHz?


Yes, I did not go into details where the differences are coming from. The base clockspeeds are the same, which is obviously the number that GP noticed not changing over the last 10 years.

Other things changed. The turbo clockspeed is one of them.


Also, the old one is 65W TDP and the new one is 241W TDP.


> Single core clock speed hasn't changed in over 10 years now?

Technically we have gone from a 3.3-3.9 GHz base frequency, 4.1-4.2 GHz single-core boosts, and a 4.7 GHz heavy OC in Haswell to

4.3 and 3.0 GHz base clocks for efficiency/perf cores, with boosts to 5.4 or so, a stable 5.8 GHz all-core frequency with a heavy OC, and even pushing 6-something GHz single-core with a good setup.

But then again, this is on the more extreme ends. Mainstream laptops have remained relatively stable.

re Python MP:

On my 12700k, I saw almost linear scaling (per core, not thread) with multiprocessing when loading stuff from the disk for ML applications and processing. This was with the datasets library.


It's not just about clock speed. CPI has been steadily improving across processor generations, even when the clock speed doesn't meaningfully change.


Sorry do you mean IPC (instructions per clock) or is there another term that I’m not aware of?


Cycles per instruction.


"Cycles per instruction" sounds more like a measurement of instruction latency rather than instruction throughput, and I don't think instruction latency has improved as much in the past decade as instruction throughput: the most significant microarchitecture trend has been toward wider CPUs with more execution ports, wider instruction decoders, and larger reorder buffers.


Yah, that was my thought exactly. It's probably not just the language choice, but the chosen algorithm. I expect a word-frequency counting algorithm to run at close to the speed of sequential main-memory reads. So the bottleneck in real situations is likely how fast the I/O subsystem can get it into RAM.

(A quick glance at the C version tells me all I need to know, with the scanf, strdup, etc. This is sorta like those C vs Java benchmarks which boil down to the fact that the default (glibc/etc) C malloc() isn't really optimized for perf, and what you're really benchmarking is how terrible it is for frequent tiny allocations. I think there _IS_ a difference between code written by someone who lives C in a system-programming context, and code by people who dabble in it from a higher-level language.)

PS: Using a hash to track a word's frequency likely works OK-ish, but the need to keep it small while still avoiding collisions would cause problems if the algorithm were parallelized in a long-running system (rather than batching, where you would just run each thread against its own copy of the frequency tables, and then merge them at the end). But I'm still fairly certain that "you're doing it wrong" if the overall algorithm can't run at a significant fraction of main-memory bandwidth.


Don't forget about naive implementations in languages like C++, where using cin/cout can cause your program to run slower than Python when doing file I/O.


Yeah I’ve worked on non trivial javascript codebases where allocations and gc mattered.


I’d like to hear more about this.


Not your OP, but I've got stories!

PhotoStructure is a non-trivial app written in TypeScript for both frontend and backend tasks.

Coming from Scala, I initially used a lot of Array.map, Array.foreach, and, to handle things like Some/None/Either:

function map<T, U>(maybe: T | undefined, f: (t: T) => U)

According to the V8 memory inspector, all those tiny fat-arrow functions can hammer the GC (presumably due to stack allocations and pulling in local context). Replacing large array iteration with for loops was a big win.

Also, handling large files with a streaming parser when possible, instead of reading them entirely into memory, another win.

Buffer concatenation may be faster by pushing read chunks onto an array, and joining the lot at the end, rather than incrementally appending with .concat.

When memoizing functions, if they themselves return functions, watch out for fat arrows if you don't need the caller's context (their unexpectedly retained variables may prevent GC).

But the first step should always be to profile your app. You can't assert improvement without a baseline.


Networking is I/O: API calls, database access, etc. - it's not just disk access. The article is deriving a generalised statement based on a very specific use case.


Interesting observation, but I think the author crosses a bridge too far here.

> If you’re processing “big data”, disk I/O probably isn’t the bottleneck.

If it fits on a single machine, it is by definition not big data. When you're dealing with really big data, it's likely coming from another machine, or more likely a cluster of them. Networks can also be pretty fast, but there will still be some delay associated with that plus the I/O (which might well be on spinning rust instead of flash) on the other end. Big data requires parallelism to cover those latencies. Requires. It might be true that I/O is no longer likely to be the bottleneck for a single-threaded program, but leave "big data" out of it because in that world I/O really is still a - if not the - major limiter.


the author is evidently pretty unskilled and unaware of it in a lot of ways, this being one of them


I am working on a project [0] to generate 1 billion rows in SQLite in under a minute, and I have inserted 100M rows in 33 seconds. First, I generate the rows and insert them into an in-memory database, then flush them to disk at the end. Flushing to disk takes only 2 seconds, so 99% of the time is being spent generating and adding rows to the in-memory B-tree.

For Python optimisation, have you tried PyPy? I ran my same code (zero changes) using PyPy, and I got 3.5x better speed.

I published my findings here [1].

[0] - https://github.com/avinassh/fast-sqlite3-inserts

[1] - https://avi.im/blag/2021/fast-sqlite-inserts/


Thank you for this (albeit a different project). My immediate question when I read the last part of the page was 'What about PyPy!?' PyPy might be different for the original post's execution, but I still assume some significant speedups?


> PyPy might be different for the original post's execution, but I still assume some significant speedups?

Yes, I believe so as well. They have done many CPU optimisations, so it is likely to be faster


I would have thought that allocations in managed languages like Go/Python would have been the "fast" part of the processing. Isn't it technically the GC that's slowing you down, and not the allocation per se? For one-shot input/output programs like these I guess you could tune the GC to kick in with less frequency.

You also note that reading a file sequentially from disk is very fast, which it is, but there is no guarantee that the file's contents are actually sequential on disk (fragmentation), right? We'd have to see how the file was written, and I guess at worst you'd be reading sequential chunks of a hand-wavy 4KB or something depending on the file system and what not. I'm sure others can fill in the details.

Just nit-picking here.


Files aren't stored sequentially on an SSD anyway; they are scattered all over the place physically on different blocks just due to the way SSDs work. This doesn't hurt their sequential read performance at all since they have no seek time, and they already keep a lookup table from the logical blocks the OS sees to physical locations within themselves.

However, one thing I found out a few years ago is that old data can be slow to read, as a lot of error correction kicks in. Additionally, a lot of fragmentation at the operating-system level in Windows has quite a bit of overhead; it can seriously degrade performance, down to about 50MB/s sequential reads. In practice, defragmentation/rewriting of certain high-write files may be necessary on SSDs because Windows read performance degrades at high levels of fragmentation.


> You also note that reading a file sequentially from disk is very fast, which it is, but there is no guarantee that the file's contents are actually sequential on disk (fragmentation), right?

Correct. And there are actually two layers of fragmentation to worry about: the traditional filesystem-level fragmentation of a file being split across many separate chunks of the drive's logical block address space (which can be fixed by a defrag operation), and fragmentation hidden within the SSD's flash translation layer as a consequence of the file contents being written or updated at different times.

The latter can often have a much smaller effect than you might expect for what sounds like it could be a pathological corner case: https://images.anandtech.com/graphs/graph16136/sustained-sr.... shows typically only a 2-3x difference due to artificially induced fragmentation at the FTL level. But if the OS is also having to issue many smaller read commands to reassemble a file, throughput will be severely affected unless the OS is able to issue lots of requests in parallel (which would depend on being able to locate many file extents from each read of the filesystem's B+ tree or whatever, and the OS actually sending those read requests to the drive in a large batch).


I/O is still often the bottleneck. My laptop can handle 11 GB/s through RAM (and has no NVMe, so under 1 GB/s through the hard drive), less with unpredictable I/O patterns (like a hash map), and 7600 GB/s through the CPU. Unless the thing you're doing is particularly expensive per byte of data, you're going to be limited at a minimum by RAM I/O, and maybe by disk I/O.

FWIW, all my recent performance wins have either been by reducing RAM I/O or restructuring work to reduce contention in the memory controller, even at the cost of adding significantly more work to the CPU.


If you ignore latency sure. Optane is still the fastest storage by quite a bit. Flash has yet to catch up and might never do so.

Tons of files and random writes can bring even an enterprise flash ssd to its knees but Optane keeps on trucking


Processing cache hot data has never been the bottleneck.

Running some search on a file on your 486-with-8-megs-of-RAM running Linux, where the file was in the operating system's cache, was dependent on the performance of the program, and the overhead of reading data from the cache through syscalls.

You can't handwave away the performance of the program with the argument that it will hide behind I/O even if that is true for cache-cold run because cache-hot performance is important. People run multiple different kinds of processing passes on the same data.


"Optimised" code according to the author. What can we do to optimise further?

- Read file in one thread pool, streaming the chunks to...

- ...another thread pool, tokenise, count, sort the chunks and send them to ...

- ... merge in another thread pool. (basically map-reduce).

- please stop malloc'ing for each token

- prealloc map for found tokens (better to just allocate room for 200k words).

- SIMD would optimise your inner-loop quite a lot. However, there are optimised libraries for this, so you don't have to write this yourself.

- `word = append(word, c)` <= this is very slow

Why is there no profiling? Why don't you check how the compiler interpreted your code and benchmark the subparts?

In addition, there are at least some errors in your optimised program:

- you can't lower-case by subtracting like you do. Non-ASCII characters would fail.

- also, you can't tokenise by comparing with c <= ' '. There are many characters which would break a string. See this exercise: https://campus.datacamp.com/courses/introduction-to-natural-...


> thread pool

May reduce wall-clock time but increase total compute time (and so also power). It's less an optimization than a tradeoff.

> please stop malloc'ing for each token

It doesn't, only when it gets put in the map. (And while the particular allocation could be smaller, something guaranteed to represent a specific arbitrary-length string has to be put in the map, which is going to malloc.)

> prealloc map for found tokens (better to just allocate room for 200k words).

Has no meaningful effect on performance.

> SIMD would optimise your inner-loop quite a lot.

No, as pointed out elsethread, it's a measurable boost but nowhere near the 10x you need to make the main claim (I/O not the bottleneck) be wrong. Not even 2x.

> `word = append(word, c)` <= this is very slow

Has no meaningful effect on performance.

Perhaps you should read the whole post.


> It doesn't, only when it gets put in the map. (And while the particular allocation could be smaller, something guaranteed to represent a specific arbitrary-length string has to be put in the map, which is going to malloc.)

You could use one big buffer for all your words. Arguably that's bump allocation but it's much simpler than malloc.
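
Something like this toy arena is what's being described (sizes made up, no overflow or lifetime handling) - each stored key is carved out of one big buffer instead of being individually malloc'd:

  /* Toy bump/arena allocator sketch: word keys are copied into one big buffer.
     Nothing is ever freed individually; the whole arena is dropped at the end. */
  #include <string.h>
  
  static char arena[1 << 26];        /* 64 MiB, assumed enough for all unique words */
  static size_t arena_used;
  
  static const char *intern(const char *word, size_t len) {
      char *p = arena + arena_used;  /* no overflow check in this sketch */
      memcpy(p, word, len);
      p[len] = '\0';
      arena_used += len + 1;
      return p;                      /* stable pointer, usable as a hash-table key */
  }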


I also tried this and didn't see much improvement. Go already uses a fast allocator for small sizes so they are likely to all end up similar memory regions regardless. A bump allocator reduces the GC pressure a tiny bit compared to that, but that's not significant.

The real alloc win would likely be some kind of small-string optimization, which Go (specifically, the requirements of its precise GC) makes difficult. This is probably my biggest performance frustration with Go, 16 bytes for a string and especially 24 for a slice is so much waste when often 99% of your data is smaller than that.


I/O is no longer the bottleneck for silly interview questions for the most part. But for real programs it can still be an issue.


I like that the author started with measuring and thinking about bandwidth, which makes sense for streaming through a big file, so I'd have continued that way towards a different design & conclusion.

Continuing with standard python (pydata) and ok hw:

- 1 cheap ssd: 1-2 GB/s

- 8 core (3 GHz) x 8 SIMD: 1-3 TFLOPS?

- 1 pci card: 10+ GB/s

- 1 cheapo GPU: 1-3 TFLOPS?

($$$: cross-fancy-multi-GPU bandwidth: 1 TB/s)

For streaming like word count, the floating-point operation (a proxy for actual ops) to read ratio is unclear, and the above supports 1000:1. Where the author is reaching the roofline on either is a fun detour, so I'll switch to what I'd expect of pydata Python.

It's fun to do something like running regexes on logs with cudf one-liners (the GPU port of pandas) and figuring out the bottleneck. 1 GB/s sounds low; I'd expect the compute to be more like 20+ GB/s for in-memory, so they'd need to chain 10+ SSDs to achieve that, and good chance the PCI card would still be fine. At 2-5x more compute, the PCI card would probably become the new bottleneck.


Haven't gone through the code, but measurement methodology seems wrong to me.

> As you can see, the disk I/O in the simple Go version takes only 14% of the running time. In the optimized version, we’ve sped up both reading and processing, and the disk I/O takes only 7% of the total.

1. If I/O wasn't a bottleneck, shouldn't we optimize only reading to have comparable benchmarks?

2. Imagine the program was running 100 sec (14% I/O), so 14 seconds are spent on I/O. Now we optimize processing and the total time becomes 70 seconds. If I/O wasn't a bottleneck and we hadn't optimized I/O, disk I/O should become 20% of total execution time (14/70), not 7%.

Disk I/O:

> Go simple (0.499), Go optimized (0.154)

clearly, I/O access was optimized ~3x while total execution was only optimized ~1.6x. This is not a good way of measuring things if you want to say I/O is not a bottleneck.

I agree though things are getting faster.


Does someone have a comparison between common server ssds and consumer ssds? I wonder if the speed is equal or not


Hard drives are still about 150MB/s.

SATA SSDs are limited to 550MB/s.

PCI-E 3.0 SSDs more like 3500 MB/s.

PCI-E 4.0 SSDs are 7000MB/s.

All of these are at consumer-level pricing; you can get 2TB of PCIe 4.0 from Western Digital for £130 at the moment, usually about £180. The issue is sustained writes more than reads for consumer versus enterprise drives, where the speed drops off due to a lack of SLC caching, a lack of cooling, and TLC/QLC which is slower for sustained writing.

The example given is very much a consumer level device and not a particularly quick one by today's standards. You can also expect much faster reads cached than that on a DDR5 system I suspect.


It should be noted that those SSD speeds are all protocol limits rather than NAND flash limits. ~7GB/s is literally the maximum speed PCIE4 can provide, likewise ~3.5GB/s for PCIE3 and ~500MB/s for SATA3.


> the maximum speed PCIE4

4-lane PCIe, as most NVMe drives are. I haven't seen drives with more lanes though...


Ah, right. Lanes. I should have mentioned those as well, thanks for catching that.


For me the main difference between HDD and NVMe is not really the throughput but the access time. When you manipulate small files, it's a lot more important.


Today's large hard drives are over 250 MB/s in sequential operations.


Consumer SSDs are trash for sustained writes since they get their top speeds from cache and use slower NAND. Enterprise SSDs tend to have better write endurance, and faster NAND. I have a small Ceph cluster and as an example when I first bought SSDs for it I tried consumer Samsung 870 Evo's. They performed worse than spinning rust.


Consumer SSDs don't really use slower NAND; the QLC vs TLC ratio might be higher for the consumer SSD market than the enterprise SSD market, but a consumer TLC drive is using the same speed of NAND as an enterprise TLC drive (and likewise for QLC drives).

Enterprise SSDs only really have significantly higher endurance if you're looking at the top market segments where a drive is configured with much more spare area than consumer drives (ie. where a 1TiB drive has 800GB usable capacity rather than 960GB or 1000GB). Most of the discrepancy in write endurance ratings between mainstream consumer and mainstream enterprise drives comes from their respective write endurance ratings being calculated according to different criteria, and from consumer SSDs being given low-ball endurance ratings so that they don't cannibalize sales of enterprise drives.

Your poor Ceph performance with Samsung consumer SATA SSDs wasn't due to the NAND, but to the lack of power loss protection on the consumer SSDs leading to poor sync write performance.


If what you say had been true, then we wouldn't be able to saturate the PCIe 3.0 x4 bus with a consumer NVMe SSD, which we absolutely can. The biggest difference is in the durability, as mentioned in the comment below.


Read speeds are similar between consumer and enterprise SSDs; they use the same flash and there's overlap with high-end consumer SSDs using the same controllers as entry-level and mid-range enterprise SSDs.

The main difference is in write performance: consumer SSDs use SLC caching to provide high burst write performance, while server SSDs usually don't and are optimized instead for consistent, sustainable write performance (for write streams of many GB).

Server SSDs also usually have power loss protection capacitors allowing them to safely buffer writes in RAM even when the host requests writes to be flushed to stable storage; consumer drives have to choose between lying to the host and buffering writes dangerously, or having abysmal write performance if the host is not okay with even a little bit of volatile write caching.


I/O is sometimes the bottleneck. On Windows, any workload with lots of small file operations bottlenecks on NTFS and Defender. It makes some applications like git, which run beautifully on Linux, need to take countermeasures to operate well on Windows.


Storage and compute separation is key to scaling data workloads. Here, scaling could be w.r.t. volume/shape of data, number of concurrent jobs on the same dataset, complexity of each job, etc. In such an architecture, network access is unavoidable. And if you have multiple jobs competing for access to the same dataset concurrently, your sequential access can turn into semi-random access. You also have concerns about utilization of resources while being scalable w.r.t. arbitrary bursty, contentious workloads. These are the things that make managing I/O resources complex.


Having dealt with several data parsers, I would like to somehow estimate how much electricity is burned globally just on lazy implementations. E.g. in .NET, a non-devirtualized `Stream.ReadByte` is often one of the hottest methods in a profiler. It and related methods can easily be responsible for a double-digit share of CPU when processing data. I mean, it's not I/O but just pure overhead that disappears with custom buffering, where reading a single byte is as cheap as it should be.


The correct way to describe his experiment should be:

Of course I/O is still the slowest part, but it is now fast enough that most programmers are not able to fully utilize it.


> Some candidates say that sorting is going to be the bottleneck, because it’s O(N log N) rather than the input processing, which is O(N). However, it’s easy to forget we’re dealing with two different N’s: the total number of words in the file, and the number of unique words.

I don't see how that changes anything. There's a reason we use Big O rather than other notations. Their answer would still be correct.


input processing is o(n), sorting the unique words is o(m log m)

whether o(m log m) is bigger or smaller than o(n) depends on the relationship between n and m

in the case used by another poster in this thread, concatenating some large number of copies of the bible, m is constant, so o(m log m) is o(1)

in another possible case where the corpus is generated uniformly randomly over all possible words of, say, 20 letters, though in theory o(m log m) is o(1), over a practical range it would be o(n log n)

more typically m is proportional to some power of n depending on the language; for english the power is 'typically between 0.4 and 0.6'

as it turns out o(√n log √n) is less than o(n)
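
spelling that out (a sketch, assuming the exponent is about 0.5):

  m \approx k\,n^{0.5}
  \;\Rightarrow\;
  O(m \log m) = O\!\left(\sqrt{n}\,\log\sqrt{n}\right) = O\!\left(\sqrt{n}\,\log n\right)

and since (√n log n)/n → 0 as n grows, the sort of the unique words is dominated by the o(n) input pass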

a totally different issue is that big-o notation doesn't always narrow down where the bottleneck is even if you calculate it correctly because n is never arbitrarily large and so the bottleneck depends on the constant factors


> a totally different issue is that big-o notation doesn't always narrow down where the bottleneck is even if you calculate it correctly because n is never arbitrarily large and so the bottleneck depends on the constant factors

Yeah I think this is part of my point, in that theoretical best, average and worst case scenarios can be useful in isolation but rarely tell the whole story.

In the worst case, every word is unique, the number of unique words equals the total word count, and it's a standard sorting problem.


No. Big O tells us how the compute time will grow asymptotically as the input size increases for a given algorithm. It does not say anything about the performance of a specific program on a specific input.

The author’s point is that if we have two different inputs A and B, and A is sufficiently smaller than B, then the runtime of B will dominate the program. This would be true even if we have to operate on A with a very slow algorithm.

For example suppose you have some extreme scenario where you have to do three coloring on A and then find the minimum value of B. The runtime would be O(2^A) + O(B). Just plug in numbers and you can see that if B > 2^A then B takes longer even though that algorithm is much faster. If you suppose B is always much larger than 2^A, then B is your bottleneck.


It really depends on the number of unique words in the input file. If it's all unique words, sorting can become the bottleneck.


Article itself is fine but the "conclusion" is loony. You can't draw conclusions about big data from toy experiments with small data.


On a related note, John Ousterhout (in the RAMCloud project) was basically betting that the latency of accessing RAM on another computer on a fast local network will eventually become competitive to local RAM access.

https://ramcloud.atlassian.net/wiki/spaces/RAM/overview


Nonsense. Latency matters. NVMe latency is measured in microseconds, while DRAM latency is measured in nanoseconds.

Sequential processing is not that common.


latency matters but sequential processing is still most processing

i hope you are doing well


As it often goes with these types of interview questions, there's a lot of context missing. What is the goal? Do we want readable code? Do we want fast code? Are we constrained somehow? It seems here the author doesn't really know, but kudos to them for examining their assumptions.

As a side note, a trie would be a neat data structure to use in a solution for this toy problem.
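
For what it's worth, a minimal sketch of that idea in Go (ASCII-only, lowercased words assumed; a real solution would handle arbitrary input and might still lose to a plain hash map):

    package main

    import "fmt"

    // node is one level of a byte-wise trie; counts live on terminal nodes.
    type node struct {
        children [26]*node
        count    int
    }

    func (n *node) insert(word string) {
        cur := n
        for i := 0; i < len(word); i++ {
            c := word[i] - 'a' // assumes lowercased ASCII words
            if cur.children[c] == nil {
                cur.children[c] = &node{}
            }
            cur = cur.children[c]
        }
        cur.count++
    }

    func main() {
        root := &node{}
        for _, w := range []string{"the", "quick", "the", "fox"} {
            root.insert(w)
        }
        // walk "the" to read back its count
        n := root
        for _, c := range []byte("the") {
            n = n.children[c-'a']
        }
        fmt.Println("the:", n.count)
    }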


In the last few years I’ve been quietly moving database “rows” from databases to the disk.

Back in the day accessing data from MySQL was actually slower than current SSD speeds. And now you can get all sorts of benefits for free: hard link deduplication, versioning, live backup, easy usage of GNU tools...

I don’t discuss this with certain types of colleagues, but the results are excellent.


Hmm. I just fought with a MariaDB installation that, when configured to write to disk immediately, became rather slow. 7-8 INSERTs could easily take 3 seconds; unfortunately the internal logic of the system didn't really lend itself to a single INSERT of multiple rows.

Once I reconfigured innodb_flush_log_at_trx_commit to 2, the UI started being lightning fast.
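
For what it's worth, you can often get the same effect without restructuring into one multi-row INSERT by wrapping the statements in a single transaction, since with innodb_flush_log_at_trx_commit=1 it's the per-COMMIT fsync that hurts. A rough Go sketch, with a made-up DSN and table name:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/go-sql-driver/mysql" // driver choice is just for illustration
    )

    func main() {
        db, err := sql.Open("mysql", "user:pass@/app") // placeholder DSN
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // With innodb_flush_log_at_trx_commit=1, every COMMIT waits for an fsync;
        // wrapping the 7-8 INSERTs in one transaction pays that cost once.
        tx, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        for _, v := range []string{"a", "b", "c"} {
            if _, err := tx.Exec("INSERT INTO events (payload) VALUES (?)", v); err != nil {
                tx.Rollback()
                log.Fatal(err)
            }
        }
        if err := tx.Commit(); err != nil {
            log.Fatal(err)
        }
    }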


I'm not surprised. I've seen bulk MySQL reads in Python be CPU-bound. The interesting followup was that parallelizing reads in subprocesses wasn't going to help much because I'd get killed by CPU again when serializing/deserializing between processes. I was capped at a ~2x speedup.


The Samsung PM9A1 is the OEM version of the 980 Pro, a top-tier PCIe 4.0 NVMe SSD. What about an older SATA SSD (one without a DRAM buffer or HMB), or a 5400 RPM hard drive? Also, as others have pointed out, random I/O will tank performance, especially simultaneous read/write operations to the same media.


Honestly I find this article too vague. If you process large amounts of data you rarely do so with orderly reads and writes; even databases optimized for fast disks (see RocksDB) have the disk as a bottleneck, even on the most recently developed hardware.


Is there any hardware accelerator / coprocessor for the PC that will read a file into RAM autonomously, mainframe-style, and notify the OS when the file is fully loaded into memory (bypassing the CPU entirely)?

Leaving the CPU to bother with other things during that time.


Do we need a coprocessor for that? Any protocol using DMA already does exactly this through interrupts or other message passing systems.

DMA capable controllers are everywhere, I don't think you'll find any storage controllers in your modern computer that don't do this.

Of course DMA only operates on byte ranges and not on files, but adding file system parsers to drives and disk controllers sounds like a way to introduce awful bugs and corruption. Assuming the OS keeps the necessary file system structures cached in RAM, plain old DMA should be good enough here.


> Of course DMA only operates on byte ranges and not on files, but adding file system parsers to drives and disk controllers sounds like a way to introduce awful bugs and corruption.

There's an interesting intermediate solution in the form of SSDs that provide a key-value interface instead of a linear block device. That gives the SSD more explicit knowledge about which chunks of data should be kept contiguous and can be expected to have the same lifetime.


>"Any protocol using DMA already does exactly this through interrupts or other message passing systems."

Interesting, I'm only familiar with the classic interrupts approach. What are some of the other common message passing systems used in DMA?


NVMe has memory-mapped doorbell registers associated with each submission or completion queue, which are ring buffers typically allocated in host RAM and accessed by the SSD using DMA. Generating interrupts when a new entry is posted to a completion queue is optional. The host system can instead simply poll the completion queue to check whether any new IO operations have been completed. Depending on how busy the drive is vs how busy the CPU is, forgoing interrupts in favor of polling may reduce latency (fewer context switches when handling a new completion), but can also burn a lot of CPU time.


This is really interesting. I wasn't aware that interrupts were handled differently on NVMe disks. The falling back to polling during busy times is similar to the way network device drivers do NAPI then?


Polling is optional. The Linux driver defaults to using interrupts only, but the nvme driver can be passed an option to allocate one or more queues with interrupts disabled to be used for polled IO. (You can also configure it to use separate queues for reads and writes, instead of just trying to allocate one queue per CPU core.)


Isn't this what a DMA controller does?


That is how I/O works already.

The problem is dealing with an asynchronous filesystem API provided by the kernel.


You mean DMA?


This is already how stuff works.

There are even OS APIs for this - DirectStorage on Windows.


The most common performance problems I've encountered in our projects are: 1) lack of indexes resulting in extensive table scans 2) I/O calls in a loop without batching.

I don't know if it counts as "I/O bottlenecks" or not.
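
The batching point, as a hedged Go sketch (the users table, columns, and DSN are made up): one query with an IN list instead of one round trip per id.

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "strings"

        _ "github.com/go-sql-driver/mysql" // driver choice is just for illustration
    )

    // fetchNames loads many rows in one round trip instead of one query per id.
    func fetchNames(db *sql.DB, ids []int) (map[int]string, error) {
        placeholders := strings.TrimSuffix(strings.Repeat("?,", len(ids)), ",")
        args := make([]interface{}, len(ids))
        for i, id := range ids {
            args[i] = id
        }
        rows, err := db.Query(
            "SELECT id, name FROM users WHERE id IN ("+placeholders+")", args...)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        out := make(map[int]string, len(ids))
        for rows.Next() {
            var id int
            var name string
            if err := rows.Scan(&id, &name); err != nil {
                return nil, err
            }
            out[id] = name
        }
        return out, rows.Err()
    }

    func main() {
        db, err := sql.Open("mysql", "user:pass@/app") // placeholder DSN
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        names, err := fetchNames(db, []int{1, 2, 3})
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(names)
    }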


> converting to lowercase

in regards to accuracy, uppercase is the better option:

https://stackoverflow.com/a/65433126


there doesn't seem to be any reasoning or evidence in that post supporting "uppercase is the better option", just that uppercase produces a larger number of word classes, which might be correct or incorrect

tchrist explains in that thread why neither uppercase nor lowercase is the best option:

> Mapping to lowercase doesn’t work for Unicode data, only for ASCII. You should be mapping to Unicode foldcase here, not lowercase. Otherwise yours is a Sisyphean task, since lowercase of Σίσυφος is σίσυφος, while lowercase of its uppercase, ΣΊΣΥΦΟΣ, is the correct σίσυφοσ, which is indeed the foldcase of all of those. Do you now understand why Unicode has a separate map? The casemappings are too complex for blindly mapping to anything not designed for that explicit purpose, and hence the presence of a 4th casemap in the Unicode casing tables: uppercase, titlecase, lowercase, foldcase.

of course 'σίσυφοσ' is not correct as a written word but if you were to encounter it then you should clearly consider it equivalent to 'σίσυφος'
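
a small go illustration of the difference (my sketch, using strings.EqualFold as the case-folded comparison; it's just to show the behaviour, not to claim go's simple fold handles every language):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        name := "Σίσυφος"
        upper := strings.ToUpper(name)      // "ΣΊΣΥΦΟΣ"
        roundTrip := strings.ToLower(upper) // "σίσυφοσ" (non-final sigma at the end)

        // lowercasing the original vs lowercasing its uppercase disagree...
        fmt.Println(strings.ToLower(name) == roundTrip) // false

        // ...but a case-folded comparison treats all of these as equivalent.
        fmt.Println(strings.EqualFold(name, upper))     // true
        fmt.Println(strings.EqualFold(name, roundTrip)) // true
    }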


> there doesn't seem to be any reasoning or evidence in that post supporting "uppercase is the better option", just that uppercase produces a larger number of word classes, which might be correct or incorrect

this sentence appears to be nonsense. the code doesn't check "word classes", it case-folds two characters and compares them.


character classes then


it doesn't check character classes either. It literally takes two characters, then uppercases both and compares, then lowercases both and compares. I have no idea where you are getting that it has anything to do with word or character classes, it doesn't.


by 'word class' i meant 'a set of words that are considered equivalent by whatever your equivalency relation is'

similarly for 'character class'

cf. https://en.wikipedia.org/wiki/Equivalence_class

what i thought the linked program did was that it counted how many of those there were

now on looking at it further i can see that it doesn't seem to be doing that but i don't have any idea what it does do

however, it definitely doesn't take into account the information you would need to learn anything about which candidate equivalency relation is better, which is something you'd need to examine at at least a word level, considering examples like größte, Σίσυφος, and the notoriously fatal sıkışınca/sikişinca pair


> doesn't take into account the information you would need to learn anything about which candidate equivalency relation is better

OK, no one said it did that. It's purely comparing characters, which is and always was what I said it was doing. And somehow it took 5 comments before you even decided to actually read the answer. Maybe next time you should start by actually reviewing and understanding what you are commenting on, before making multiple comments.


you cited it to support your proposition, 'in regards to accuracy, uppercase is the better option'

i reviewed it sufficiently to see that it's irrelevant to the question of whether that's true or not, and to pull the actually right answer out of the thread, and quote it above


> you cited it to support your proposition, 'in regards to accuracy, uppercase is the better option'

which is true

> i reviewed it sufficiently

good joke


Further evidence is the fact that optimized SIMD JSON and UTF-8 libraries exist. If I/O were the bottleneck, there wouldn't be a need to parse JSON using SIMD.


Compared to L1 cache reference, it certainly still is.


Wouldn’t it be nice if we could specify the allocators in GC languages? Like, why not expose a bump allocator arena to Python with a manual release?


I don't know if C++ counts as a "GC language" per se, but you can write custom allocators for STL containers (eg here: https://betterprogramming.pub/c-memory-pool-and-small-object...).


Usually we mainly run jobs on NFS or similar network disks, where I/O time would be more significant. Would be nice to run those tests on AWS.


I have a hunch that rewriting the program in C/C++ or Rust would change this significantly.


Writing this in Perl will have significantly different results. Can’t speak for Go but text processing in Python is slow.


Not necessarily, considering that he wrote that the Go version tries to avoid memory allocations.


Does anyone have a good resource for reasoning about how to avoid allocations in JavaScript?


The problem here, as with most interview problems, is that it is wholly dissociated from its context; memory constraints, disk I/O, and file size are non-trivial considerations. Shame on people for stare-at-one's-shoes thinking.


People can make reasonable assumptions for interview problems. In this case, assume the code runs on a modern developer laptop, with an SSD and lots of RAM, using some modern OS, is a fine assumption.

If someone asks you a physics question about a car being dropped into the ocean, generally you don't have to worry about the make and model of the car.


But if you read the article, you'd see that you'd run afoul of this interviewer by doing that, because he has utterly convinced himself that I/O is no longer the bottleneck, and coming to the same conclusion is your unspecified task in the interview.


If you have to handle millions of I/Os or files, or GB/s to TB/s of bandwidth, just use DAOS.io storage. Awesomely fast, scale-out, and self-healing.


Can we just give it more resources? /s


Does any of this really matter?


It would be nice to see this benchmark in Node.js with streams


Loading 1GB json files is still slow.


That's usually not because of IO but because of dumb memory management in the JSON parser implementation. Some JSON parsers build a huge tree out of individually allocated items for each JSON property, or in higher level languages, build an object/property tree out of the JSON structure. That's guaranteed to be slow for large trees.


Not if you use a SAX-like (or "evented") parser.
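
For example, a rough Go sketch of the streaming approach using encoding/json's token API (file name made up): memory stays flat because no object/property tree is ever built.

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("big.json") // hypothetical 1 GB input
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Decoder.Token streams one token at a time instead of
        // materializing the whole document in memory.
        dec := json.NewDecoder(f)
        var tokens int
        for {
            if _, err := dec.Token(); err != nil { // io.EOF (or a parse error) ends the loop
                break
            }
            tokens++
        }
        fmt.Println("tokens:", tokens)
    }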


If you specify the type of the data, not just the size, then yep: it wasn't really about the number of bytes read, but what the CPU had to do to process it.


I don't want to be that guy but having 1GB json files to begin with seems like the bigger problem.


It's true, I/O is no longer the bottleneck.

The bottleneck now is interviewers who think they have the knowledge and expertise, but do not. However their authoritative position ends up distorting everything, and then said person blogs about it and causes even more problems.


"Don't be snarky." - https://news.ycombinator.com/newsguidelines.html

If you know more than someone else, that's great - but in that case please share some of what you know, so the rest of us can learn. Just putting another person down doesn't help.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...


OMG what did I miss, please expand. What is being distorted and what problems are caused?


Most people don't build their programs for local NVMe drives, but for the cloud. Try benchmarking in a regular AWS/GCP instance using the default SSD and you will see that what the interviewees said is true, as they are likely not building it to run on "your" laptop.


Holy god what a burn! I absolutely agree though.


It's still very bizarre to me to see people completely write off spinning platter drives as 'old tech'. They're still used everywhere! (At least in my world)

We have all of my team's data on an array of 18 TB drives in a 100 TB RAID 10 setup, and a NAS at home doing the same, etc. Even some of our OS drives at work are 7200 RPM drives - and we're a computational lab. Why is everyone so intent that these effectively no longer exist? The cost of a decent amount of space on NVMe drives is just far too astronomical. We're not all millionaires.


Spinning rust is still by far the cheapest way to store bulk data, and it is fast enough for infrequent access. It's perfectly fine for archival purposes. Large disks are $0.02 / GB while large SSDs are closer to $0.07 / GB.

On the other hand, ignoring my media collection, both my personal computer and server only need a few hundred GB of storage. SSDs are cheap enough that they are a no-brainer: they are physically a lot smaller, quieter, and more resilient. The orders-of-magnitude better IO speed doesn't hurt either, even if most of it is wasted on me.

1TB NVMe drives are now less than $75. I could get a 1TB HDD for $40 or a 3TB one for $75, but why bother?


Even 2TB SSDs can be bought for < $130 now. Plus, consumer motherboards usually have two M.2 slots, so you can have 4TB of ultra-fast storage in a normal consumer-level computer without PCIe expansion cards. That's overkill for most people; most consumers probably don't even need that much storage space.


until they start shooting video or finetuning stable diffusion or something

video and archived neural net snapshots are pretty sequential


I don't think you need an NVMe SSD just to play video; for that kind of usage, a SATA SSD is good enough. And neural net snapshots...? Do most people even do AI?


yeah, that's what i'm saying about video, raw capacity matters more than bandwidth, and below the 100ms level, random seek latency hardly matters at all for video

right now most people don't use neural net snapshots but 25 years ago most people didn't use 3-d rendering, encryption, or video codecs either


I don't see anywhere in this article where spinning disks are written off.

It is sensible to consider the I/O speed of things other than spinning disks. Especially when they are exceptionally commonplace in consumer devices and developer workstations.


They're not used everywhere; most desktops and servers are now based on SSDs or equivalent.

Your use case of running a desktop with 100TB is niche; for $100 you can get a very fast 1TB NVMe drive nowadays, which is fine for 99.99% of the population.


>a 100TB raid10 setup, and a NAS at home doing the same

How do you deal with these onerous constraints? Do you have a system for deciding how many copies of archive.org to keep locally?


I know you're being facetious, but here's some fun stats from archive.org.

A few highlights from the Petabox storage system:

No Air Conditioning, instead use excess heat to help heat the building.

Raw Numbers as of December 2021:

4 data centers, 745 nodes, 28,000 spinning disks

Wayback Machine: 57 PetaBytes

Books/Music/Video Collections: 42 PetaBytes

Unique data: 99 PetaBytes

Total used storage: 212 PetaBytes

Considering this data is from a year ago, it's got to be substantially larger now.

Link: https://archive.org/web/petabox.php


I'm not sure what exactly you mean - but we typically store DNA sequencing data from people that want us to analyze it, databases of proteins and scientific articles, ML models, etc. Proper considerations of how to store the data and compress it correctly help a lot, but unfortunately academics are pretty terrible at choosing the right data structures for the job, and often several copies are needed for mirroring the raw data, then the processed versions are updated.


Spinning disks are fine for some workloads. However, I've definitely seen teams engineering around slower spinning disks. NVMe isn't egregiously expensive, and if teams need to spend too much time thinking about disk speed, it's probably cheaper in the long run to switch.


I think you know the answer to your question, which is that normal end users who build PCs basically never use them. Power users with NAS and huge porn collections and servers, of course they still do.


Most people are conscious that they're making a performance trade off when they decide to store data in one.


And there is still demand; otherwise they wouldn't still be produced.


Picking something counterintuitive as an interview question is not a great idea. It defeats the purpose - it becomes harder to tell whether the candidate is going with conventional wisdom because that's what they think the answer is, or because they think that's the answer the interviewer expects.

i.e. you could get sharp candidates who know the correct technical answer but intentionally give the wrong one, because they rightly conclude that, statistically, the odds are the interviewer is on the conventional-wisdom wavelength.

Could be good if you're specifically looking for someone good at mind games, I guess.


And then compound that by your counterintuitive assertion being generally wrong in most useful cases. He is either interviewing very junior candidates or frustrating a lot of people.



