I/O is no longer the bottleneck (benhoyt.com)
377 points by benhoyt on Nov 26, 2022 | 326 comments



Yes, sequential I/O bandwidth is closing the gap to memory. [1] The I/O pattern to watch out for, and the biggest reason why e.g. databases do careful caching to memory, is that _random_ I/O is still dreadfully slow. I/O bandwidth is brilliant, but latency is still disappointing compared to memory. Not to mention, in typical Cloud workloads, IOPS are far more expensive than memory.

[1]: https://github.com/sirupsen/napkin-math


> _random_ I/O is still dreadfully slow. I/O bandwidth is brilliant, but latency is still disappointing compared to memory.

The wondrous thing about modern CPU architectures (e.g. Zen 3), though, is all the PCIe lanes you get with them. If you really need high random IOPS, you can now cram 24 four-lane NVMe disks into a commodity server (with PCIe M.2 splitter cards) and saturate the link bandwidth on all of them. Throw them all in a RAID0, stick a filesystem on them with the appropriate stripe width, and you'll get something that's only about 3x higher latency for cold(!) random reads than a read from RAM.

(My company provides a data-analytics SaaS product; this is what our pool of [shared multitenant, high concurrency] DB read-replicas look like.)


I thought NVMe flash latency was measured in tens of microseconds. 3x RAM would be a fraction of a microsecond, right?


Under ideal conditions, yes. But the 3x difference I see in practice is less about NVMe being just that good, and more about operations against (main) memory getting bottlenecked under high all-cores concurrent access with no cross-workload memory locality to enable any useful cache reuse. And also about memory accesses only being “in play” when a worker thread isn’t context-switched out, while PCIe-triggered NVMe DMA can proceed while the thread has yielded for some other reason.

In other words, when measured E2E in the context of a larger work-step (one large enough to be interrupted by a context-switch), the mean, amortized difference between the two types of fetch becomes <3x.

Top of my wishlist for future architectures is “more, lower-width memory channels” — i.e. increased intra-CPU NUMAification. Maybe something CXL.mem will roughly simulate — kind of a move from circuit-switched memory to packet-switched memory, as it were.


How do you figure these things out, do you have special software to look at this?


I think the person you're replying to is confusing IOPS with latency. If you add enough parallelism, then NAND flash random-read IOPS will eventually reach DRAM performance.

But it's not going to be easy - for a sense of scale I just tested a 7950X at stock speeds with stock JEDEC DDR5 timings. I filled an 8GB block of memory with numbers, then used a deterministic random seed to pick 4kB pages at random, computing their sum and eventually reporting it (to avoid overly clever dead-code elimination, and to make sure the data is fully read).
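
For illustration, the access pattern sketched in Python/numpy (not the code behind the numbers above; interpreter overhead per 4kB page will swamp the true memory latency, so it only shows the methodology):

    import time
    import numpy as np

    GIB = 2**30
    data = np.random.randint(0, 1 << 30, size=8 * GIB // 8, dtype=np.int64)  # ~8GB of integers

    rng = np.random.default_rng(42)        # deterministic seed
    page_elems = 4096 // 8                 # a 4kB page = 512 int64s
    pages = rng.integers(0, len(data) // page_elems, size=1_000_000)

    start = time.perf_counter()
    total = 0
    for p in pages:                        # QD1: one dependent read at a time
        off = p * page_elems
        total += int(data[off:off + page_elems].sum())
    elapsed = time.perf_counter() - start
    print(total, f"{len(pages) / elapsed / 1e6:.2f} M page-reads/s")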

With an SSD-friendly 4K page size that resulted in 2.8 million iops of QD1 random read. By comparison, a web search for Intel's Optane P5800X QD1 results shows 0.11 million iops, and that's the fastest random-read SSD there is at those queue depths, AFAIK.

If you add parallelism, then ddr5 reaches 11.6 million iops at QD16 (via 16 threads), fast SSDs reach around 1 million, the optane reaches 1.5 million. An Epyc Genoa server chip has 6 times as many DDR5 memory channels as this client system does; and I'm not sure how well that scales, but 60 million 4kb random read iops sounds reasonable, I assume. Intel's memory controllers are supposedly even better (at least for clients). Turning on XMP and PBO improves results by 15-20%; and even tighter secondary/tertiary timings are likely possible.

I don't think you're going to reach those numbers, not even with 24 fast NVMe drives.

And then there's the fact that I picked the ssd-friendly 4kb size; 64-byte random reads reach 260 million iops - that's not quite as much bandwidth as @ 4kb, but the scaling is pretty decent. Good luck reaching those kind of numbers on SSDs, let alone the kind of numbers a 12-channel server might reach...

We're getting close enough that the loss in performance at highly parallel workloads is perhaps acceptable enough for some applications. But it's still going to be a serious engineering challenge to even get there, and you're only going to come close under ideal (for the NAND) circumstances - lower parallelism or smaller pages and it's pretty much hopeless to arrive at even the same order of magnitude.


I measured ~1.2M IOPS (random reads, 4kiB) from 3xNVMe in a software RAID configuration on a commodity server running Ubuntu Linux in 2021. Using Samsung SSDs, not Optane.

If that scaled, it would be 9.6M IOPS from 24xNVMe.


Which is quite respectable, but still a far cry from the presumable 60M+ iops the server would have using DRAM (if it scaled linearly, which I doubt, it would hit 70M). Also, DRAM gets quite close to those numbers with only around 2 times as many threads as DRAM channels, but that NVMe setup will likely need parallelism of at least 100 to reach that - maybe much more.

Still, a mere factor 7 isn't a _huge_ difference. Plenty of use cases for that, especially since NAND has other advantages like cost/GB, capacity, and persistence.

But it's also not like this is going to replace dram very quickly. Iops is one thing, but latency is another, and there dram is still much faster; like close to 1000 times faster.


at this point cost would become the bottleneck. compare 24x1TB NVMe drives to 24TB of DDR5


That's an entirely different dimension - you can reach these throughput numbers on DDR5 likely even with merely 16GB. And a massive 12-channel 6TB socket solution will likely have slightly less than 6 times the random-read bandwidth. Capacity and bandwidth aren't strongly related here.


I’m running a FreeNAS box on an i3-8100. Right now I’m converting the NAS and my desktop to server chassis and putting them in a rack. Once I get a 10Gb UniFi switch and NICs off eBay, I’m debating running my desktop and servers diskless using iSCSI backed by RAID0 NVMe drives.


Whatever floats your boat, but iSCSI is limited to 1500 MTU (9k? Are you sure you can boot with 9k enabled?) and while you can have 10Gbit throughput that doesn't mean you will always get it, eg 100 IO operations would generate 100 packets and it doesn't matter if each was 1500B or only 100B.

And you wouldn't see the speed improvement on RAID0 NVMe drives except for extremely rare fully sequential operations lasting at least tens of seconds.

You also can try it just by running a VM with iSCSI boot on your current desktop.


Been a long time since anything iSCSI related didn't handle 9k, for boot or otherwise.

But I look at it this way. You need 40gbit networking for a single pci3 nvme ( and newer drives can saturate that, or close )

And because you're throttling throughput you'll see much more frequent, longer, queuing delays, on the back of a network stack that ( unless you're using rdma ) is already 5x-10x slower than nvme.

It'll be fast enough for lots of things, especially home/lab use, and it'll be amazing if you're upgrading from sata spinning disk.. but 10gbit is slow by modern storage standards.

Of course, that's not the only consideration. Shared storage and iscsi in particular can be extremely convenient! And sometimes offers storage functionality that clients don't have ( snapshots, compression, replication )


> Been a long time since anything iSCSI related didn't handle 9k, for boot or otherwise.

Don't have anything on hand to check whether the boot firmware even allows setting 9k, but I haven't touched iSCSI boot for a long time, so I'll take your word for it.

> But I look at it this way. You need 40gbit networking ... is already 5x-10x slower than nvme.

This one.

> It'll be fast enough for lots of things, especially home/lab use

Yep, in OP's case I would consider just leaving the OS on the local [fast enough] drive and using iSCSI (if for some reason NFS/SMB doesn't fit) for any additional storage. It would be fast enough for almost everything, while completely eliminating any iSCSI boot shenanigans /me shudders in Broadcom flashbacks.

Another neat thing about iSCSI is that you can re/connect it to any device on the network in a couple of minutes (the first time; even faster later), sometimes it comes in really handy.


> Whatever floats your boat, but iSCSI is limited to 1500 MTU (9k? Are you sure you can boot with 9k enabled?) and while you can have 10Gbit throughput that doesn't mean you will always get it, eg 100 IO operations would generate 100 packets and it doesn't matter if each was 1500B or only 100B.

Ugh, iSCSI does have queueing so you can have many operations in flight, and one operation doesn't really translate to one packet in the first place; the kernel will happily pack a few smaller operations to the TCP socket into one packet when there is load.

The single queue is the problem here, but a dumb admin trick is to just bring up more than one IP on the server and connect to all of them via multipath.


> the kernel will happily pack a few smaller operations to the TCP socket into one packet when there is load.

And here comes the latency! shining.jpg

It wouldn't be a problem for desktop use of course[0], especially considering that 90% of operations are just read requests.

My example is crude and was more to highlight that iSCSI, by virtue of running over Ethernet, inherently has a limit on how many concurrent operations can be in flight at any moment. It's not a problem for an HDD-packed SAN (the HDDs would impose an upper limit themselves, because spinning rust is spinning) but for NVMe (especially with a single target) it could diminish the benefits of such fast storage.

> The single queue is the problem here, but a dumb admin trick is to just bring up more than one IP on the server and connect to all of them via multipath

Even on a single physical link? Could work if the load is queue bound...

[0] hell, even on a 1Gb link you could run multiple VMs just fine, it's just when you start to move hundreds of GBs...


>> the kernel will happily pack a few smaller operations to the TCP socket into one packet when there is load.

>And here comes the latency! shining.jpg

Not really. If you get data faster than you can send packets (link full), there wouldn't be that much extra latency from that (at most one packet length, which at 10Gbit speeds is very short), and it would be more than offset by the savings.

Then again I'd guess that's mostly academic, as I'd imagine not very many iSCSI operations are small enough to matter. Most apps read more than a byte at a time after all; hell, you literally can't read less than a block from a block device, which is at least 512 bytes.

>> The single queue is the problem here, but a dumb admin trick is to just bring up more than one IP on the server and connect to all of them via multipath

> Even on a single physical link? Could work if the load is queue bound...

You can also use it to use multiple NICs without bonding/teaming, although it is easier to have them in separate networks. IIRC Linux had some funny business where, if you didn't configure it correctly, for traffic in the same network it would pick the "first available" NIC to send it, and it needed a /proc setting to change that.

To elaborate, the default setting for /proc/sys/net/ipv4/conf/<interface>/arp_ignore (and arp_announce) is 0, which means

> arp_ignore: 0 - (default): reply for any local target IP address, configured on any interface

> arp_announce: 0 - (default): Use any local address, configured on any interface

IIRC to do what I said required

    net.ipv4.conf.all.arp_ignore=1
    net.ipv4.conf.all.arp_announce=2
which basically changed that to "only send/respond to ARPs on NICs where the actual address exists, not just ones with an address in the same network" and fixed the problem.


> I'd guess that's mostly academic

It is; that mattered on 1Gbit links with multiple clients, i.e. any disk operations in VMs while vMotion was running on the same links - you could see how everything started to crawl (and returned to normal after vMotion completed). For 10Gbit you need way, way more load for it to matter.

> You can also use it to use multiple NICs without bonding/teaming

You MUST (as in RFC) use multiple links without bonding, and I learned not to use LACP the hard way (yeah, reading the docs beforehand is for pussies).

After the second attempt I understood the implication (multiple NICs in the same IP network), but this is usually a self-inflicted wound. You don't even need physically separate networks (VLANs); using separate IP networks works fine, and it's up to the initiator to use RR/LB on them.

> it would pick "first available" NIC to send it

Yep, the usual magic of making things easier for average folks. In the same vein - you need to disable Proxy ARP in any modern non-flat network or you will get shenanigans that would drive you mad.


I’m out of SATA ports and I have 2 M.2 slots available. When I can test with VM in my current desktop I will.


That's a lot of effort to put a silent piece of silicon a few metres away from the machine.

iSCSI is gonna eat some of your CPU (you're changing "send a request to the disk controller and wait" into "do a bunch of work to create a packet, send it over the network, and get it back") if you don't have a card with offload. It also might not be fast enough to get the most out of NVMe, especially in RAID0.

And, uh, just don't keep anything important there...


It’s an i3 with 2 M.2 slots available. Enough for the home. SATA becomes the limit.


As a dev who operates fairly far away from hardware usually, is that similar to what the PS5 is doing?


Depending on how your madvise is set up, it's often the case that sequential disk reads are memory reads. You're typically only paying the price for touching the first page in a sequential run, that or subsequent page faults come at a big discount.

If you read 1,000,000 random bytes (~1 MB) scattered across a huge file (let's say you're fetching from some humongous on-disk hash table), it will to a first order be about as slow as reading 4 GB sequentially, because both incur the same number of page faults (1,000,000 faults at 4 kB a page is ~4 GB of pages touched). There are ways of speeding this up, but only so much.

Although, I/O is like an onion of caching layers, so in practice this may or may not hold up depending on previous access patterns of the file, lunar cycles, whether venus is in retrograde.


`madvise(2)` doesn't matter _that_ much in my experience with [1] on modern Linux kernels. SSD just can't read _quite_ as quickly as memory in my testing. Sure, SSD will be able to re-read a lot into ram, analogous to how memory reading will be able to rapidly prefetch into L1.

I get ~30 GiB/s for threaded sequential memory reads, but ~4 GiB/s for SSD. However, I think the SSD number is single-threaded and not even with io_uring—so I need to regenerate those numbers. It's possible it could be 2-4x better.

[1]: https://github.com/sirupsen/napkin-math


I think the effects of madvise primarily crop up in extremely I/O-saturated scenarios, which are rare. Reads primarily incur latency, with a good SSD it's hard to actually run into IOPS limitations and you're not likely to run out of RAM for caching either in this scenario. MADV_RANDOM is usually a pessimization, MADV_SEQUENTIAL may help if you are truly reading sequentially, but may also worsen performance as pages don't linger as long.
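
For reference, a minimal sketch of setting those hints from Python (3.8+ on Linux; "bigfile.bin" is a placeholder path):

    import mmap, os

    fd = os.open("bigfile.bin", os.O_RDONLY)      # placeholder file
    mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)    # length 0 = map the whole file

    mm.madvise(mmap.MADV_SEQUENTIAL)   # hint: aggressive readahead is fine
    # ... sequential scan ...
    mm.madvise(mmap.MADV_RANDOM)       # hint: skip readahead (often a pessimization, as noted)
    # ... random lookups ...

    mm.close()
    os.close(fd)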

But as I mentioned, there's caching upon caching, and also protocol level optimizations, and hardware-level considerations (physical block size may be quite large but is generally unknown).

It's nearly impossible to benchmark this stuff in a meaningful way. Or rather, it's nearly impossible to know what you are benchmarking, as there are a lot of nontrivially stateful parts all the way down that have real impact on your performance.

There are so many moving parts I think the only meaningful disk benchmarks consider whatever application you want to make go faster. Do the change. Is it faster? Great. Is it not? Well at least you learned.


> I get ~30 GiB/s for threaded sequential memory reads, but ~4 GiB/s for SSD. However, I think the SSD number is single-threaded and not even with io_uring—so I need to regenerate those numbers. It's possible it could be 2-4x better.

Assuming you ran the experiments on an NVMe SSD attached to PCIe 3.0 x4, where the theoretical maximum is around 1 GB/s per lane, I am not sure how you expect to go faster than 4 GiB/s. Isn't that already the theoretical maximum of what you can achieve?


PCIe 4.0 SSDs are pretty common nowadays and are basically limited to what PCIe 4.0 x4 can do (around 7 GB/s net throughput).


I don't think they're that common. You would need a fairly recent motherboard and CPU that both support PCIe 4.0.

And I'm pretty sure that parent comment doesn't own such a machine because otherwise I'd expect 7-8GB/s figure to be reported in the first place.


I really doubt they’re that common. They only became available on motherboards fairly recently, and are quite expensive.

I’d guess that they’re a small minority of devices at the moment.


PCIe 5.0 has just recently started showing up on consumer motherboards.

4.0 might not be common, but surprisingly it is now the previous generation!


You might be very right about that! It's been a while since I did the SSD benchmarks. Glad to hear it's most likely entirely accurate at 4 GiB/s then!


How'd you measure the maximum memory bandwidth? In Algorithmica's benchmark, the max bandwidth was observed to be about 42 GBPS: https://en.algorithmica.org/hpc/cpu-cache/sharing/

I'm not sure how they calculated the theoretical limit of 42.4 GBPS, but they have multiple measurements higher than 30 GBPS.


> Yes, sequential I/O bandwidth is closing the gap to memory.

Hilariously, meanwhile RAM has become significantly slower relative to CPU performance, i.e. you spend a disproportionate amount of time reading from and writing to memory; so even though RAM is faster than it used to be, the CPU is faster still.

Which means I/O remains a bottleneck...


Random I/O with NVME is slower than sequential I/O still, but the gap between the two has been narrowed considerably and is quite high in historical/comparative absolute terms. To get close to peak random I/O limits, you need to dispatch a lot of commands in parallel—that’s an I/O idiom that doesn’t really exist in high level languages, and I think that’s where a lot of the challenge is.


The problem is that a lot of workloads using random I/O have dependencies between the different I/O operations. A database traversing a tree-like index cannot issue the next read until the previous one has finished. You are limited by latency, which for NVMe is still orders of magnitude worse than memory.


> A database traversing a tree-like index cannot issue the next read until the previous one has finished.

This applies to a single point query in a single tree.

The latency is reduced by overlap in obvious ways as soon as you have (1) a range query because it can read multiple subtrees in parallel, or (2) a query that reads multiple indexes in parallel, or (3) multiple queries from the application to the database in parallel.

This is why it's useful to design applications to make multiple queries in parallel. Web applications are a great example of this. Most applications where I/O performance matters at all have some natural way to parallelise queries.
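
A sketch of that in Python, where db.point_query stands in for whatever thread-safe lookup call your client library provides; the only point is to have N lookups in flight so the storage layer can overlap them:

    from concurrent.futures import ThreadPoolExecutor

    def lookup_many(db, keys, workers=16):
        # N independent point queries in flight at once; each worker blocks on
        # its own read while the others keep the I/O subsystem busy.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(db.point_query, keys))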

Less obviously, the interior blocks of a B-tree are a relatively small part of a B-tree. I.e. most of the space is in used leaf blocks. If the database's cache strategy gives preference to interior nodes, and even more preference to nodes closer to the root of a tree, often several interior layers of the tree can fit entirely in RAM and the effect is to reduce the latency of tree lookups further once the cache is warmed up.

Then even in large databases (a few TB), the latency of a single point query is reduced to one or two read IOPS (because the leaf page which contains the query result is calculated from in-memory data). The application-visible query time is very similar to the I/O subsystem's timing characteristics, and a few MQPS (= "million queries per second") are achievable. Not many database engines achieve this, because they were designed in an era when I/O was much slower, but the I/O architecture does support it.

Source: Wrote a performance-optimised database engine for blockchain archive data, which is extremely random access (because of hashing), in the multiple terabytes range, and the application is bottlenecked on how many queries per second it can achieve. It's like the ideal case for working on random-access I/O performance :-)


Those are rarely the slow ones though. Lots of software simply has not been written to keep IO queues full. They read a bit of data, process it, read the next bit and so on. On a single thread. This makes all kinds of IO (including network) way slower than it has to be.

For example a tree-index can be parallelized by walking down different branches. On top of that one can issue a prefetch for the next node (on each branch) while processing the current ones.


Yup, a lot of software is (was?) written with assumptions that mattered with spinning rust. And even if the author didn't intend to, serial code generates serial, dependent IO.


I thought for SSDs it didn't matter whether data was adjacent on disk?


Well, yes and no.

With spinning rust you have to wait for the sector you want to read to rotate underneath the read head. For a fast 10,000 RPM drive, a single rotation takes 6 milliseconds. This means that for random access the average latency is going to be 3 milliseconds - and even that's ignoring the need to move the read head between different tracks! Sequential data doesn't suffer from this, because it'll be passing underneath the read head in the exact order you want - you can even take the track switching time into account to make this even better.

SSDs have a different problem. Due to the way NAND is physically constructed it is only possible to read a single page at a time, and accessing a single page has a latency of tens of microseconds. This immediately places a lower limit on the random read access time. However, SSDs allow you to send read commands which span many pages, allowing the SSD to reorder the reads in the most optimal way and do multiple reads in parallel. This means that you only have to pay the random access penalty once - not to mention that you have to issue way fewer commands to the SSD.

SSDs try to make this somewhat better by having a very deep command queue: you can issue literally thousands of random reads at once, and the SSD will reorder them for faster execution. Unfortunately this doesn't gain you a lot if your random reads have dependencies, such as when traversing a tree structure, and you are still wasting a lot of effort reading entire pages when you only need a few bytes.
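
For independent reads, keeping the queue full can be as simple as this sketch (Python; os.pread releases the GIL during the syscall, so plain threads are enough to get many requests outstanding, and io_uring would be the lower-overhead way to do the same):

    import os
    from concurrent.futures import ThreadPoolExecutor

    def read_pages(path, offsets, page=4096, workers=32):
        # Many independent 4kB reads in flight at once keep the SSD's command
        # queue full; dependent reads (e.g. tree traversal) can't be batched this way.
        fd = os.open(path, os.O_RDONLY)
        try:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(lambda off: os.pread(fd, page, off), offsets))
        finally:
            os.close(fd)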


Interesting, thanks! So it sounds like it's not so much "random" I/O that's slow, but rather "unbatched" I/O or something like that?

Curious to hear your thoughts on this thread if you have time to share: https://news.ycombinator.com/item?id=33752870


> Unfortunately this doesn't gain you a lot if your random reads have dependencies, such as when traversing a tree structure,

So, does this mean B-trees suffer? What would be the most optimal layout for database storage where only SSDs matter?

I'm working on one that is WAL-only and scans everything on each operation (for now!) and want to see what I can do to improve the situation.


You really need an NVMe interface to the SSD, though. SATA3 is the bottleneck for SATA SSDs.


It is not just about the IO itself but all the processing which the database (or OS) needs to do for cache management, which in OLTP cases can be very significant. If you just need a few bytes from that 8K/16K page you have to read, there is little way around a 2 orders of magnitude difference.

What we would really benefit from is storage which is efficient at small (CPU cache line) sized IO.


I question the methodology.

To measure this I would have N processes reading the file from disk with the max number of parallel heads (typically 16 I think). These would go straight into memory. It's possible you could do this with one process and the kernel will split up the block read into 16 parallel reads as well, needs investigation.

Then I would use the rest of the compute for number crunching as fast as possible using as many available cores as possible: for this problem, I think that would basically boil down to a map reduce. Possibly a lock-free concurrent hashmap could be competitive.

Now, run these in parallel and measure the real time from start to finish of both. Also gets the total CPU time spent for reference.

I'm pretty sure the author's results are polluted: while they are processing data the kernel is caching the next block. Also, it's not really fair to compare single-threaded disk IO to a single process: one of the reasons for IO being a bottleneck is that it has concurrency constraints. Nevertheless, I would be interested in both the single-threaded and concurrent results.


I agree, I think there is often a faulty assumption by many developers doing benchmarking that their underlying environment doesn't matter. Often I see performance results posted with little mention of the details of the environment. At least here they posted they were using a high-end SSD, but note they just say they're on a "Dell XPS 13", as if there aren't multiple variants of that model produced every year for the last 5 or 6 years.

You're probably also right their multiple test runs resulted in the OS caching data, and a lot of the test runs may have just been testing in-memory performance instead of raw storage I/O performance.


Fair call about "Dell XPS 13". I've now clarified in the article (it's a recent 2022 Dell XPS 13 Plus).

Regarding OS caching: I'm trying to avoid this by clearing caches with the "sysctl vm.drop_caches=3" command. Note that I show both cached and uncached numbers.


It would still be good to call out the exact processor model, as the lowest end i5-1240P has half the L3 cache as the highest i7-1280P.

Also I don't think it was clear you were running "sysctl vm.drop_caches=3" between benchmarking runs of your optimizations. Your table seemed to indicate those were generic initial read/write benchmarks from either dd or hdparm. The site you linked to also had comments on it stating dd is not very good for benchmarking, suggesting fio & a different site[1].

1. https://linuxreviews.org/HOWTO_Test_Disk_I/O_Performance


Thank you. I'd be pretty annoyed in this interview. Surely my potential employer would be more interested in having me apply my twenty years of real-world experiences to what I learned in CS240.


Not sure why you'd be annoyed? What I'm trying to gauge in my interviews is their real-world experience (not their academic CS knowledge). That's part of the reason I find my line of questioning helpful: I don't penalize them for a "wrong" answer, but more how they think about it. Then we can discuss parallelization, map-reduce techniques, profiling and measurement, and so on.


Annoyed because of the framing, that is: the answer isn't "wrong" in the first place.


Interviews usually have extremely limited time, especially if some interesting “tangent” comes up, which often tells the true depth of their knowledge. Maybe the concurrency would be one possible tangent.

Regardless, the concurrent approach would be 90% the same as the single thread approach, leaving it for a good “after the fact” question, assuming the candidate still has time.


> I haven’t shown an optimized Python version because it’s hard to optimize Python much further! (I got the time down from 8.4 to 7.5 seconds). It’s as fast as it is because the core operations are happening in C code – that’s why it so often doesn’t matter that “Python is slow”.

An obvious optimization would be to utilize all available CPU cores by using the MapReduce pattern with multiple threads.

I believe that'd be necessary for a fair conclusion anyway, as you can't claim that I/O isn't the bottleneck, without utilizing all of the available CPU and memory resources.


> An obvious optimization would be to utilize all available CPU cores by using the MapReduce pattern with multiple threads.

Nope, the GIL will make that useless. You need to actually implement the tight loops in C/C++ and call that with batches of data to get benefits from threading.

An obvious, but more expensive optimization would be to use a process pool. Make sure that all the objects you pass around are serializable.

Python makes optimization much harder than it should be. I hope the GIL gets the hammer at some point, but that seems to be a huge task.


For this problem the multiple-process version would be quite simple in Python or any other language. It's a classic single program, multiple data (SPMD) task. You split the file into N chunks, then run N instances of the original program on them (a Map). You then need to collate the results, which requires a second program, but that step is similar to the sorting step in the original and so would be negligible wrt wall time (a quick Reduce).

For large files you should get almost embarrassing parallelism.
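
A rough sketch of that split-and-merge approach (here done inside one program with multiprocessing rather than N separate program runs; it assumes the file is much larger than the number of chunks, and boundaries are realigned to line starts so every line is counted exactly once):

    import collections, multiprocessing, os, sys

    def count_chunk(args):
        path, start, end = args
        counts = collections.Counter()
        with open(path, "rb") as f:
            if start:
                f.seek(start - 1)
                f.readline()          # advance to the first line starting at or after `start`
            while f.tell() < end:     # own every line that *starts* inside [start, end)
                line = f.readline()
                if not line:
                    break
                counts.update(line.lower().split())
        return counts

    def count_words(path, workers=os.cpu_count() or 4):
        size = os.path.getsize(path)
        bounds = [size * i // workers for i in range(workers + 1)]
        with multiprocessing.Pool(workers) as pool:
            parts = pool.map(count_chunk, [(path, bounds[i], bounds[i + 1]) for i in range(workers)])
        total = collections.Counter()
        for part in parts:            # the cheap "reduce" step
            total.update(part)
        return total

    if __name__ == "__main__":
        for word, n in count_words(sys.argv[1]).most_common():
            print(word.decode(), n)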


Oh I think a few simd instructions could reduce processing to near zero without going crazy with multi-threaded architectures.

Remember that fizzbuzz on HN that hit GB/s? Mostly SIMD. Zero multi-threaded IIRC.


The GIL won't prevent you from parallelizing I/O will it?


If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.

If I/O wasn't the bottleneck, I guess you can parallelize reading, but what are you gaining?

If you're writing to files, most of the time the parallelism will be hard to implement correctly. SQLite doesn't support parallel writes for example.


>If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.

I think your SSD/network link/database might be able to work in parallel even when Python can't. Details:

Suppose I am scraping a website using a breadth-first approach. I have a long queue of pages to scrape. A single-threaded scraper looks like: pop the next page in the queue, block until the web server returns that page, repeat. A multi-threaded scraper looks like: thread wakes up, pops the next page in the queue, sleeps until the web server returns that page, repeat. With the multi-threaded scraper I can initiate additional downloads while the thread sleeps.
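
A minimal sketch of that multi-threaded scraper (urls is a hypothetical list of pages to fetch; the GIL is released while each thread waits on the socket, so the downloads overlap):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()

    def scrape_all(urls, workers=16):
        # 16 requests in flight; the network and the server work in parallel
        # even though only one Python thread runs bytecode at any instant.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, urls))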

My assumption here is that the download over the network is at some level being performed by making a system call (how could it not be?) And once you have multiple system calls going, they can be as parallel as the OS permits them to be; the OS doesn't have to worry about the GIL. And also the server should be able to serve requests in parallel (assuming for the sake of argument that the server doesn't suffer from the GIL).

Same essential argument applies to the database. Suppose I'm communicating with the database using IPC. The database isn't written in Python and doesn't suffer from the GIL. Multiple Python threads can be sleeping on the database while the database processes their requests, possibly in parallel if the db supports that.

I think this argument could even work for the SSD if the kernel is able to batch your requests in a way that takes advantage of the hardware, according to this person: https://news.ycombinator.com/item?id=33752411

Very curious to hear your thoughts here. Essentially my argument is that the SSD/network link/database could be a "bottleneck" in terms of latency without being the bottleneck in terms of throughput (i.e. it has unused parallel capacity even though it's operating at maximum speed).


You're right, my comment only applies when bandwidth is the bottleneck. In Python, that web scraper could probably do even better with asyncio in a single OS thread.


> If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.

Of course it will. Nearly every serious DB will work on multiple requests in parallel, and unless the DB itself is on something slow you will get data faster from 2 parallel requests than from serializing them.

NVMe SSDs in particular can easily outpace what a single thread can read; just run fio with single vs. parallel threads to see that.

> If you're writing to files, most of the time the parallism will be hard to implement correctly. SQLite doesn't support parallel writes for example.

That's just one random example. If all you do is "read data, parse, write data" in some batch job you can have massive parallelism. Sharding is also an easy way to fill up the IO.


Haven't tried it, but SQLite supports some type of concurrent modification now

https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...


I’ve definitely parallelized http requests for significant improvement in python before.


You can use Processpool but at that point you’re way into re-architecting.


Not much more than the switch to MapReduce and threads. Actually the interface is exactly the same if you use executors from concurrent.futures.
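
For example, a sketch where the worker function is a hypothetical stand-in for the per-chunk word count; only the executor class changes:

    import collections
    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

    def count_words(chunk):
        # hypothetical map step: count words in one chunk of text
        return collections.Counter(chunk.lower().split())

    def count_all(chunks, use_processes=True):
        Executor = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
        with Executor() as pool:                 # identical interface either way
            total = collections.Counter()
            for part in pool.map(count_words, chunks):
                total.update(part)
            return total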


inter-process communication has its own overhead though.


You can usually counteract that by sending large enough batches to the processes.


> Nope, the GIL will make that useless.

In Python yes. I missed that. The Go implementation would still benefit from multiple threads, wouldn't it?


Yes I think so.


A more obvious optimisation (to me) would be to leverage the native functions and avoid creating a list of ~80 million strings in memory.

On my machine, the base script pretty reliably takes ~10s:

    Reading   : 0.1935129165649414
    Processing: 9.955206871032715
    Sorting   : 0.0067043304443359375
    Outputting: 0.01335597038269043
    TOTAL     : 10.168780088424683
Switching content to a no-op (`content = sys.stdin`) and feeding `Counter` from a native iterators pipeline:

    from itertools import chain   # plus the script's existing "import collections, sys"
    counts = collections.Counter(chain.from_iterable(map(str.split, map(str.lower, content))))
is a pretty reliable 10% gain:

    Reading   : 1.1920928955078125e-06
    Processing: 8.863707780838013
    Sorting   : 0.004117012023925781
    Outputting: 0.012418985366821289
    TOTAL     : 8.880244970321655
As far as I can tell, the bottleneck is about half the preprocessing (lowercasing and splitting) and half filling the Counter.

You won't get a 10x gain out of that though.


That is of course an easy solution but I would argue that this is just throwing more resources at the problem. Not a very impressive optimization.

Emery Berger has a great talk [1] where he argues that it is mostly pointless to optimize python code, if your program is slow, you should look for a properly optimized library to do that work for you.

1: https://www.youtube.com/watch?v=vVUnCXKuNOg


> That is of course an easy solution but I would argue that this is just throwing more resources at the problem. Not a very impressive optimization.

You could say the same about the existing implementation as that reads the whole file into memory instead of processing it in chunks.


Or better yet, write a compute shader since hashing is an embarrassingly parallel operation.

That said, the OP's article is correct in that straightforward idiomatic implementations of this algorithm are very much compute bound. The corollary is that eng work put into optimizing compute usage often won't be wasted for programs processing disk data (or even network data with modern 10Gb fiber connections).


EBS costs crazy crazy amounts for reasonable iops

We pay 7k per month for RDS that can barely do 2k IOPS... meanwhile a machine at Hetzner does 2 million IOPS for 250 euro per month (not to mention it also has 4x more cores and 5x more RAM).

So, even though I/O is no longer the bottleneck physically, it still is a considerable issue and design challenge in the cloud.


Well yes it's a total ripoff

I installed a DB server for a customer around 2 years ago, in a DC near him, with 16 cores, 48GB RAM and ~6TB -> 12 SSDs, vdevs mirrored in pairs and a stripe over the mirrored vdevs (kind of a RAID10, but ZFS), zstd compression (1GB could be compressed down to ~200MB, so 5 times less reading/writing, and in theory ~30TB of pure DB data, 20TB realistic; remember, never fill a zpool over 72%), record size 16kB (PostgreSQL). After 3 months the machine was paid off (compared to the "cloud" price) and the performance was around 10-12 times higher.

Called the customer about two months ago and he said the DB server is still too fast and maybe he wants another one that uses less power... ;)


The cloud costs really are between "unreasonable" and "very unreasonable". The only time it gets cheaper is if the workload were so spiky we could've turned most of it off for 2/3 of the day, but most of what we have is many smaller projects that don't even have enough traffic to need scaling, and the big ones, well, can't scale database size down off-peak...

Over last ~6 years we did "is it worth going to cloud" calculation few times and it was always ridiculously more expensive.


Virtualized storage systems increase latency (hurting QD1 IOPS). Naively built non-parallel apps tend to rely on QD1 IOPS performance, so they run very slowly on a cloud platform compared to a dev machine with direct-attached NVMe.


>Naively built non-parallel apps tend to rely on QD1 IOPS performance

PostgreSQL is pretty much parallel but i know what you mean...the beehive ;)


> remember never fill a zpool over 72%

Could you please explain where this number comes from?


Yeah, that's a myth now. It's not current advice.


72% is my rule of thumb for write-heavy production stuff (my absolute limit would be 75%), but it depends on record size, raidz level, whether you have mostly write or mostly read workloads, how big your files are, how many snapshots you have, whether you have a dedicated ZIL device, and much more. For a home NAS (movies etc.) you can easily go up to 85%... if it's a "~WORM" workload maybe 90%... but resilvering can then be a thing of days (weeks?), depending on the raidz level or mirror etc.

>Yeah, that's a myth now. It's not current advice.

It's not, and you know it. Keep it under 72%, believe me, if you want a performant ZFS (especially if you delete files and have many snapshots... check the YT video linked at the end).

>>Keep pool space under 80% utilization to maintain pool performance. Currently, pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Full pools might cause a performance penalty, but no other issues. If the primary workload is immutable files (write once, never remove), then you can keep a pool in the 95-96% utilization range. Keep in mind that even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer.

https://web.archive.org/web/20150905142644/http://www.solari...

And under no circumstances go over 90%:

https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

>An introduction to the implementation of ZFS - Kirk McKusick

https://www.youtube.com/watch?v=TQe-nnJPNF8


Agree! And at the end of the day, we are optimizing for cost. Although, the EBS portion of that 7k RDS bill is going to be tiny, right?


Well, it's tiny if you keep it at 2k and make sure you don't touch disk much; god forbid someone makes a query that requires a temp table. A query you wouldn't even notice on a bare-metal machine brings the whole RDS setup down: it can't even read the write-ahead log, can't replicate, etc. It's like watching a slow-motion train wreck from 2 queries per second.


the state of benchmarking by normal IT people is tragic. If one checks out his 'optimization problem statement' article [1] they can find:

>ASCII: it’s okay to only support ASCII

>Threading: it should run in a single thread on a single machine

>Stdlib: only use the language’s standard library functions.

This is truly 1978 all over again. No flame graphs, no hardware counters, no bottleneck analysis. Using these 'optimizations' for job interviews is questionable at best.

[1] https://benhoyt.com/writings/count-words/


I'm happy to be a "normal IT person". :-)

If you look further at the count-words article you linked, I do have profiling graphs and bottleneck analysis.

Note that the interview questions I ask are open-ended, not trying to trick or trap people into giving the "wrong" answer. I like to have more of a discussion to see how they think about the problem, what data structures they'd use, how they'd profile, and so on.


Sorry for the late reply. I am glad that you are raising these topics regularly, tbh. If more people stop write-only coding and ask themselves the questions you have asked, it will already be a big step forward. If they ask what profiling tools they should use, it will be great. I did have a look at your article, and I cannot recommend limiting yourself to the tools you have used.

I come from gamedev low level coding and performance analysis so I understand that my point of view is not normal xD


Still better than obscure question from page 20 of Leetcode.


Interesting! This made me wonder -- would this kind of optimization be recognized and rewarded in colossal scale organizations?

I've seen comments about Google multiple times here where people say you won't be getting promotions unless you're shipping new things -- maintaining the old won't do it.

But if you get to something core enough, it seems like the numbers would be pretty tangible and easy to point to during perf review time?

"Found a smoother way to sort numbers that reduced the "whirrrrrr" noise our disks made. It turns out this reduces disk failure rates by 1%, arrested nanoscale structural damage to the buildings our servers are in, allowed a reduction in necessary PPE, elongaded depreciation offsets and other things -- this one line of code has saved Google a billion dollars. That's why my compensation should be increased to include allowing me to fall limply into the arms of another and be carried, drooling, into the office, where others will dress me"

In this hypothetical scenario, would a Googler be told "Your request has been approved, it may take one or two payment periods before your new benefits break into your apartment" or "No, you need to ship another chat program before you're eligible for that."?


Yeah, when I was there I saw plenty of <1% optimizations saving REDACTED gobs of money, and people were rewarded for it. I don't think it's applicable to most teams though.

Imagine a foo/bar/widget app that only serves 20B people (obvious exaggeration to illustrate the point) and is only needed up to a few hundred times per day. You can handle that sort of traffic on a laptop on my home router and still have enough chutzpah left to stream Netflix. I mean, you are Google, and you need to do something better than that [0], but the hardware for your project is going to be negligible compared to other concerns unless you're doing FHE or video transcoding or something extraordinarily expensive.

Walk that backward to, how many teams have 20B users or are doing extraordinarily expensive things? I don't have any clue, but when you look at public examples of cheap things that never got much traction and probably had a suite of engineers [1], I'd imagine it's not everyone in any case. You're probably mostly looking at people with enough seniority to be able to choose to work on core code affecting most services.

[0] https://www.youtube.com/watch?v=3t6L-FlfeaI

[1] https://killedbygoogle.com/


Google seems to have a problem of "you won't get a promotion/raise if you don't work on something new", so they are not interested in services that just work and provide a tidy little constant revenue.


To some extent it's true on a macro scale, but it's a trope to say this is exclusively the case. I would say it is easier to paint the picture for promotion by working on new things but not by a large amount. The vast majority of folks at Google work on maintenance, as would be expected.


> Found a smoother way to sort numbers that reduced the "whirrrrrr" noise our disks made. It turns out this reduces disk failure rates by 1%, arrested nanoscale structural damage to the buildings our servers are in, allowed a reduction in necessary PPE, elongaded depreciation offsets and other things -- this one line of code has saved Google a billion dollars

Hah! I mean, if you can truly prove a business benefit by improving performance, I’m sure that you’d have a good shot at a promotion. Thing is it’s actually quite difficult to do so, and in the likely chance you cannot it just looks like you squandered a bunch of time for no reason.


Metrics are religiously collected and any sizable performance improvement will have a clear impact on one metric or another no?


Pay increases and promotions are for the most part a little more formulaic and upper-bounded than what you describe, but generally speaking if you can prove you saved the company $XM you will be rewarded (from firsthand experience, this is also true at Meta).


"Promotion? You must be joking. Your little stunt caused a revenue spike that pushed us over the threshold in $country, prompting $regulatory-agency to look into starting antitrust proceedings. Mitigating this will require an army of lawyers, that will cost approximately a third of the $1B you 'saved' us. Additionally, we will now have to create another throwaway chat app, for which we'll allocate another third of a billion in the budget. The final third... will go to executive bonuses, obviously.

You are hereby placed on a Performance Improvement Plan, starting tomorrow. On the off chance you'll come out of the other end still employed, keep in mind that your manager isn't being stupid by forbidding such 'optimizations', they're just following orders."


They'd likely have become one of the 10k getting fired for "underperforming". Company management in basically all businesses doesn't like know-it-alls that do the opposite of what they are told, regardless of the outcome.


Google's core services wouldn't work so reliably if they didn't value optimization and engineering. I don't work there, but I'm pretty sure that the SREs and the developers behind Search and Maps don't get fired based on how many products they launched.


Is there a valid source for this number? All the articles I see seem to quote each other.


You don’t know what you’re talking about.


[former Googler]

Yes, I occasionally saw people get highlighted for making optimizations like "this saves 1% in [some important service]". When you're running millions of machines, 1% is a lot of machines. However, it's also likely the case that the easy 1%s have already been found...


> This made me wonder -- would this kind of optimization be recognized and rewarded in colossal scale organizations?

It depends if this kind of optimization is valuable to the organization. Often times it's not. Spending money and time to save money and time is often viewed as less efficient than generating more revenue.


I was recently working on parsing 100K CSV files and inserting them into a database. The files have a non-column-oriented header and other idiosyncrasies so they can't be directly imported easily. They're stored on an HDD so my first instinct was to focus on I/O: read the whole file into memory as an async operation so that there are fewer larger IOs to help the HDD and so that other parser tasks can do work while waiting for the read to complete. I used a pretty featureful C# CSV parsing library which did pretty well on benchmarks [0] (CsvHelper) so I wasn't really worried about that part.

But that intuition was completely wrong. The 100K CSV files only add up to about 2GB. Despite being many small files, reading through them all is pretty fast the first time, even on Windows, and then they're in the cache and you can ripgrep through them all almost instantaneously. The pretty-fast parser library is fast because it uses runtime code generation for the specific object type that is being deserialized. The overhead of allocating a bunch of complex parser and type-converter objects, doing reflection on the parsed types, and generating code for a parser means that for parsing lots of tiny files it's really slow.

I had to stop worrying about it because 2 minutes is fast enough for a batch import process but it bothers me still.

Edit: CsvHelper doesn't have APIs to reuse parser objects. I tested patching in a ConcurrentDictionary to cache the generated code and it massively sped up the import. But again it was fast enough and I couldn't let myself get nerd sniped.

Edit2: the import process would run in production on a server with low average load, 256GB RAM, and ZFS with zstd compression. So the CSV files will live permanently in the page cache and ZFS ARC. The import will probably run a few dozen times a day to catch changes. IO is really not going to be the problem. In fact, it would probably speed things up to switch to synchronous reads and remove all the async overhead. Oh well.

[0]: https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-pa...


My immediate thought was are you measuring throughput or latency?

The latency of reading from disk is indeed very slow compared to CPU instructions.

A 3GHz processor runs 3 billion cycles a second (3,000,000,000), and some instructions take 1 cycle. You get 3 cycles per nanosecond. An SSD or spinning disk access costs many multiples of cycles.

Read 1 MB sequentially from SSD*: 1,000,000 ns

That's a lot of time that could be spent doing additions or looping.

https://gist.github.com/jboner/2841832


That is true, but assuming file I/O is blocking you also have to pay for a context switch to take advantage of that.

But I guess you could avoid that using eg. io_uring.


And you get 2-4 instructions per cycle on average.


I agree - "IO bandwidth is no longer a BW bottleneck" would be a better title.


I encountered this myself yesterday when attempting to performance test WebSockets in JavaScript: https://github.com/prettydiff/share-file-systems/blob/master...

The parsing challenge is complex enough that it will always be faster to extract the data from the network than it is to process it. As a result, excess data must be stored until it can be evaluated, or else it must be dropped; therefore the primary processing limitation is memory access, not CPU speed executing instructions. JavaScript is a garbage-collected language, so you are at the mercy of the language, and it doesn't really matter how you write the code: if the message input frequency is high enough and the messages are large enough, memory will always be the bottleneck, not the network or the application code.

In terms of numbers this is provable. When testing WebSocket performance on my old desktop with DDR3 memory I was sending messages (without a queue or any kind of safety consideration) at about 180,000 messages per second. In my laptop with DDR4 memory the same test indicated a message send speed at about 420,000 messages per second. The CPU in the old desktop is faster and more powerful than the CPU in the laptop.


NVMe storage really is very fast for sequential reads, but I'd respectfully suggest that for simple tasks a Dell laptop with 1.6GB/s read speed should be bottlenecked by IO if the compute is optimised. For example simdjson can parse JSON at over 7GB/s. https://simdjson.org/


SSD is pretty fast; but my app is actually trying to do more than 100_000 read-modify-write cycles per second and that still requires careful thought about the database and schema we're using.

CPU and RAM are pretty fast. I do a live-coding interview question and I ask people to do a naive implementation first, then later I ask about possible optimizations. A third to a half of candidates want to do fewer RAM accesses, and oh boy is that the wrong avenue for this problem - especially when they just wrote their solution in Python and you could get a 10x-20x speedup by rewriting in C/C++/Go/Rust/etc.

Network is IO too. Network is pretty fast, datacenter-to-datacenter, but end users can still have their experience improved with better encoding and protocol; and outbound bandwidth bills can be improved by that too.


Well, it depends on how far your datacenters are - in the end you're still limited by the laws of physics (speed of light). So 'fast' may actually be several ms, which might be a lot, or not a lot, depending on the problem you're trying to solve.


Wouldn’t memory allocation still be IO of a different resource? We’re still getting slowed down reading and writing bits to a storage device. Perhaps it’s not the hard drive but the claimed blocker here doesn’t appear to be CPU.


Reading and writing to main memory is not usually called “IO” in this context.


That discussion reminds me of Intel Optane. The current distinction between hard disks and RAM isn’t a necessity dictated by some law of nature. Yet it’s the box most people think within (for good reason).


Not really, no. Allocation involves reading/writing to memory, but that's not why it's slow. It's slow because allocation involves a context switch into the kernel. And an allocation algorithm itself isn't trivial.


user space code will not use the kernel and there will be no context switch. it will call a function such as malloc, which is a user-space function. malloc will then go on to interact with the memory allocation sub-system, which may occasionally need to ask the os for more memory via a system call, if the allocator is any good.


Yeah malloc is user space code, but it will do a syscall like sbrk to actually get memory from the kernel.

The default malloc in glibc does not pad the values given to sbrk, so you have to do a syscall for every 4k chunk of memory (the pagesize). So unless you do lots of very small (<<4k) allocations, you call sbrk pretty often.

You will also page fault when you access the new pages, and this traps into kernel code again.

So yeah, you are technically correct that some allocations may be fast because the memory is already available and mapped. Allocations, on average, are still slow because it involves context switches to the kernel (potentially multiple).

TLDR: you make it sound like a syscall within malloc is rare, but many/most allocations will trigger a syscall.


No, for a start some allocators don't even use sbrk (all mmap), and glibc's ptmalloc will use mmap for large allocations, ie you can allocate multiple megabytes with a single syscall with ptmalloc. Granted, if memory has been mmaped, you will pay a page fault because of the zero page optimization when you first write to it.

Furthermore, memory which is freed is often not returned to the os, either for fragmentation (you've used sbrk..) , or performance reasons (minimize syscalls), and put in a free list instead. The next call to malloc then will not require a syscall, if it can be satisfied with existing freed blocks.


It is rare because one of the very first things anyone does if they're concerned about allocations is replace glibc malloc.


Indeed - for that particular problem a big cost is the "IO" from main memory "into the CPU cache". Ben is careful to qualify it as "disk" IO, but even this is somewhat vague. (as is "CPU cache" vs L3/L2/L1 - Ben's interview problem is highly L2 sensitive, I think, and increasing variety in that size will make results harder to interpret.)

On a modern gen4 NVMe, I routinely get 7 GiB/s. gen5 is supposed to double that (as soon as manufacturers get "enough" money out of gen4, given PCIe4's extremely short life compared to gen3's extremely long one).

There was a time not long ago (maybe still) where highly scaled up, many core (40+) Intel CPUs could not match that getting from DIMMs into L3 for just 1 core (as per his interview problem). So, we are indeed moving into an era where "IO" from the primary persistent device is indeed no worse than IO from DIMMs, at least in bandwidth terms. DIMMs still have much better latency and the latency-BW ambiguity has been observed elsethread.

EDIT: I should clarify, to connect my text with your comment, that the real cost of (1-core, uncontended) allocation is also more "populating/mapping the cache" with copies, not just "allocation" in itself.


A few ballpark numbers I encountered:

Sequentially reading a file on a spinny laptop disk was about 80-100 MB/s. On an SSD that went up to 400-500 MB/s for me.

That's the sequential case! What about random access? I tried an experiment where I memory mapped a large file and started updating bytes at random. I could get the rate down to kilobytes/sec.

Even though we've all heard that SSDs don't pay as much as a penalty for random access as spinny disks, it's still a huge penalty. Sequential spinny disk access is faster than SSD random access.


> I tried an experiment where I memory mapped a large file and started updating bytes at random. I could get the rate down to kilobytes/sec.

Memory-mapped IO means you're only giving the SSD one request to work on at a time, because a thread can only page fault on one page at a time. An SSD can only reach its peak random IO throughput if you give it lots of requests to work on in parallel. Additionally, your test was probably doing small writes with all volatile caching disallowed, forcing the (presumably consumer rather than enterprise) SSD to perform read-modify-write cycles not just of the 4kB virtual memory pages the OS works with, but also the larger native flash memory page size (commonly 16kB). If you'd been testing only read performance, or permitted a normal degree of write caching, you would have seen far higher performance.
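
For contrast, a rough C sketch (compile with -pthread; the file name and counts are made up, error handling omitted, and this is not a real benchmark) of the kind of access pattern that does let the drive work on many requests at once, with each thread keeping a random 4 KiB pread in flight:

  /* Sketch only: random 4 KiB reads from several threads so the drive sees a
     queue depth greater than 1. "large.bin" and the counts are placeholders. */
  #include <fcntl.h>
  #include <pthread.h>
  #include <stdlib.h>
  #include <unistd.h>
  
  #define THREADS 8
  #define READS_PER_THREAD 10000
  #define BLOCK 4096
  
  static int fd;
  static off_t blocks;   /* number of 4 KiB blocks in the file */
  
  static void *worker(void *arg) {
      char buf[BLOCK];
      unsigned seed = (unsigned)(long)arg;
      for (int i = 0; i < READS_PER_THREAD; i++) {
          off_t off = ((off_t)rand_r(&seed) % blocks) * BLOCK;
          pread(fd, buf, BLOCK, off);   /* one outstanding request per thread */
      }
      return NULL;
  }
  
  int main(void) {
      fd = open("large.bin", O_RDONLY);          /* assumed to be many GB */
      blocks = lseek(fd, 0, SEEK_END) / BLOCK;
  
      pthread_t t[THREADS];
      for (long i = 0; i < THREADS; i++)
          pthread_create(&t[i], NULL, worker, (void *)i);
      for (int i = 0; i < THREADS; i++)
          pthread_join(t[i], NULL);
      close(fd);
      return 0;
  }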


True. And main memory access can easily become slower than SSD sequential access if you do random byte accesses and your working set is larger than the CPU's caches or TLBs.


> Sequential spinny disk access is faster than SSD random access.

It is, but on both kind of drives you'll want to dispatch at least a couple of requests at once to get better performance. In the memory-mapped case, that means using multiple threads.

In addition, you might also want to call madvise(MADV_RANDOM) on the mapping.
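
A minimal sketch of that setup (hypothetical file name, no error handling); the worker threads would then fault pages in from the mapping:

  /* Sketch: map the file and hint random access so the kernel skips readahead.
     "large.bin" is a placeholder; error handling omitted. */
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>
  
  int main(void) {
      int fd = open("large.bin", O_RDONLY);
      struct stat st;
      fstat(fd, &st);
  
      unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      madvise(p, st.st_size, MADV_RANDOM);   /* don't read ahead around each fault */
  
      /* ... touch pages from several worker threads here, so the SSD sees
         more than one outstanding request at a time ... */
  
      munmap(p, st.st_size);
      close(fd);
      return 0;
  }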


I didn't think there'd be that much discussion on my point (seq hdd > rand sdd).

> In the memory-mapped case, that means using multiple threads.

My gut tells me I'd lose more to contention/false-sharing than I'd gain through multithreading - but I haven't done the experiment.


> Sequential spinny disk access is faster than SSD random access.

No it’s not. At least not with modern SSDs or NVMe storage.

Even at 100 MB/s, a spinning disk in sequential mode is doing 100 x 1024 / 4 = 25,600 IOPS (assuming a standard 4K per operation).

Even consumer grade NVMe hardware gets 5-10x of that for random workloads.


> Even consumer grade NVMe hardware gets 5-10x of that for random workloads.

Cool, lots of IOPS!

But like I said, I got it down to kilobytes/sec.


Because you did it in the most inefficient way.

This [0] comment is totally on point.

Also note that a consumer SSD can be made with even a single flash chip. More performant ones are made of a bunch of chips internally (essentially a RAID0 with some magic) so they can do parallel operations if the data resides on different flash blocks. Still, if your thread is only doing one operation at a time with blocks smaller than the flash rewrite block size, you will hit write amplification anyway.

I think if you do the same test but without a memory mapped file (ie let the OS and disk subsystem do their thing) you will get much more speed.

[0] https://news.ycombinator.com/item?id=33751973


On what hardware and how much parallelization? The max IOPS numbers are only when you’re saturating the command queue.

It’s a throughput number, not a single operation and completion followed by the next one.


it is true that spinning rust is slower than most ssds including nvme ssds even in sequential mode

however, a spinning disk doing a sequential access is not doing 25600 iops

if the sequential access lasts 10 seconds it is doing 0.1 iops


As the algorithm used in the example is straightforward, I figured that using UNIX command-line tools might be an even simpler way to implement it. Here is what I came up with:

  time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | sort --buffer-size=50M | uniq -c | sort -hr > /dev/null
On my machine this turned out to be ~5 times slower than the provided Python implementation. Nearly all of the time is spent in the first invocation of `sort`. Further increasing the buffer size doesn't make a significant difference. I also played around with the number of threads `sort` uses, but didn't see any improvement there either.

I'm quite puzzled why `sort` is so much slower, especially as it does sorting in parallel utilizing multiple CPU cores, while the Python implementation is single-threaded.

Does somebody have an explanation for that?


Dangit, I'm supposed to be doing yardwork today so you hijacked my procrastination motivation haha!

Edit: I had no idea that awk was so fast, and I suspect that only parallelization would beat it. But I agree with the others that the main bottleneck is the `sort | uniq` for results1.txt

  # https://stackoverflow.com/a/27986512 # count word occurrences
  # https://unix.stackexchange.com/a/205854 # trim surrounding whitespace
  # https://linuxhint.com/awk_trim_whitespace/ # trim leading or trailing whitespace
  
  time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | sort --buffer-size=50M | uniq -c | sort -hr > results1.txt
  real 0m13.852s
  user 0m13.836s
  sys 0m0.229s
  
  time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | awk '{count[$1]++} END {for (word in count) print count[word], word}' | sort -hr > results2.txt
  real 0m1.425s
  user 0m2.243s
  sys 0m0.061s
  
  diff results1.txt results2.txt
  109,39133c109,39133
  # many whitespace differences due to how `uniq -c` left-pads first column with space
  
  diff <(cat results1.txt | awk '{$1=$1};1') <(cat results2.txt | awk '{$1=$1};1')
  # bash-only due to <() inline file, no differences after trimming surrounding whitespace
  
  cat results1.txt | awk '{ sub(/^[ \t]+/, ""); print }' | diff - results2.txt
  # sh-compatible, no differences after trimming leading whitespace of results1.txt
  
  # 13.836 / 2.243 = ~6x speedup with awk


I don't understand how you're getting <2s for that awk result. I'm testing on slightly older hardware, for example I get 4.6s and 11.9s for the optimized and simple go versions taken from the git repo.

But then I also get:

  # time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | awk '{count[$1]++} END {for (word in count) print count[word], word}' | sort -hr > results2.txt
  real 0m23.174s
  user 0m23.309s
  sys 0m1.234s

So my result is 10x slower than yours.

What are you running this on and where do I get one?


Oh gosh, sorry to burst your bubble but it's a 2011 Mac Mini :-P

  2.3 GHz Intel Core i5
  8 GB 1333 MHz DDR3
  Intel HD Graphics 3000 512 MB
  macOS High Sierra 10.13.6 (17G14042)
  512 GB PLEXTOR PX-512M5Pro SSD (Get Info says I installed it July 2, 2011 but it might be a clone of another drive)
<rant>

I really like it, but will probably have to sell it because it has various software failures, like sometimes one of my displays won't turn on or goes black and I have to restart. That bug seems to be fixed on newer macOSs like the one on an Intel MacBook Pro I use for work, but Apple artificially sunsets their hardware by preventing newer versions of macOS from being installed and not back-porting bug fixes to previous macOSs. Since pretty much all computers today are Turing-complete, that feels.. disingenuous.

Computers haven't gotten appreciably faster for roughly 15 years since R&D funding shifted to mobile in 2007 and Moore's Law ended. All that matters today is whether we are using an SSD and how wide the memory bus is, since speed there hasn't changed much either, just latency. And Apple's not the only one treading water. PCs often suffer from mismatched hardware, so maybe an Intel i9 gets installed on a logic board with a memory bus too slow to recruit it. I built a gaming PC a few years back and I may have inadvertently underpowered it by putting most of the budget into the RTX 2070. Since video cards can't do the everyday workloads we're discussing, I mostly consider them a waste of time and mourn what might have been had CPUs kept improving instead.

Apple's Arm M1 is a logical progression off of Intel, but I can't really endorse it, since they chose a relatively complex architecture where a big dumb array of cores would have been more scalable. If some indie brand comes along and builds one of the 1000+ core CPUs I've blabbered on about, I can't say that I'll have much sympathy for the current big players.

Due to all of that, I perceived computers in 2010 as being roughly 1000 times slower than they could/should be had they kept up with Moore's Law, and computers in 2020 as being roughly 1000000 times slower (the ratio of GPU to CPU FLOPs for example). It doesn't help that stuff like Spotlight and Safari eagerly take 100+% CPU or that basically all PCs are bogged down with either spyware or the daemons that supposedly find and remove spyware (thank you M$). Or that we don't have the network computing that Sparc had in the 1990s, where all of the computers on the LAN were available for additional cores seamlessly. Just slow on top of slow on top of slow under surveillance capitalism yay!

</rant>


It's because that first invocation of sort is sorting the entire input (413MB), not just the unique words (less than a MB). The sort is probably O(NlogN), but that's a big N. Counting by inserting into a hash table is much faster, at O(N).


`sort | uniq` is really slow for this, as it has to sort the entire input first. I use `huniq` which is way faster for this. I'm sure there are many similar options.

https://github.com/koraa/huniq


It looks like you're sorting the whole file, while the Python implementation sorts only unique values.


Related posts from my past experience from about 10 years ago:

* Splitting long lines is slow[1]

* Can Parallel::ForkManager speed up a seemingly IO bound task?[2]

In both cases, Perl is the language used (with a little C thrown in for [1]), but they are in a similar vein to the topic of this post. In [1], I show that the slowness in processing large files line by line is not due to I/O, but due to the amount of work done by code. In [2], a seemingly I/O bound task is sped up by throwing more CPU at it.

[1]: https://www.nu42.com/2013/02/splitting-long-lines-is-slow.ht...

[2]: https://www.nu42.com/2012/04/can-parallelforkmanager-speed-u...


I/O also means network to me. Often the target (a database, or a device generating telemetry) is 10+ ms away. That round-trip time is bottlenecked by physics (the speed of light). A side benefit of SQLite is that it's on the local file system / in memory.


"sorting with O(n^2) is no longer a bottleneck as we have fast processors" /s


That makes no sense. The brutal math of a polynomial like n^2 is always going to be poor enough to notice compared to lower-order times.


Often but not always. Cool trick: any bounded limit is always O(1)!

Pick a small enough bound and an O(n^2) algorithm behaves better than an O(n log n). This is why insertion sort is used for sorting lengths less than ~64, for example.


Pick a small enough bound and certain O(n^2) algorithms will behave better than certain O(n log n) algorithms.

Big O notation doesn't take into account constant factors of overhead or plain old once-per-run overhead.


Sorry it was indeed a typo


The title I/O is _no longer_ the bottleneck seems to suggest disk speed has caught up, while in reality the slowness is due to poor implementation (slow Python or Go with lots of allocations).

The real problem to me is that languages are too high-level and hiding temporary allocations too much. If you had to write this in C, you would naturally avoid unnecessary allocations, cause alloc / free in the hot loop looks bad.

Presumably soon enough it's very unlikely you find any new word (actually it's 10 passes over the same text) and most keys exist in the hashmap, so it would be doing a lookup and incrementing a counter, which should not require allocations.

Edit: OK, I've run OP's optimized C version [1] and indeed, it only hits 270MB/s. So, OP's point remains valid. Perf tells me that 23% of all cache refs are misses, so I wonder if it can be optimized to group counters of common words together.

[1] https://benhoyt.com/writings/count-words/


I don't know if I can agree. Python is certainly problematic for the reasons you outlined, but Go is much closer to C in terms of performance. Go is also compiled, and as such you have similar opportunity to optimize and measure allocs. In my experience it's much harder to tune Python, and Go is often easier to tune than C due to the great ergonomics of the compiler. You can also drop down to assembly!

But really, I disagree because I've frequently saturated massive IOPS. I/O is still the bottleneck. The article pretty much immediately excludes network I/O, which is in many cases more common than disk I/O. Even so, tiny single-threaded programs reading words one-at-a-time are obviously not going to be I/O constrained with modern disks. For these types of programs, I/O hasn't been a bottleneck in a long, long time, and I'd actually be surprised to hear candidates suggest otherwise.


Let’s call spades spades: C spanks Go in every measure. Go is nowhere near C speed.

It’s certainly more performant than any dynamically typed scripting language: JavaScript, Python, Ruby, etc but it’s probably closer to C#.


Go’s problem is that in practice there are interfaces (fat pointers) everywhere. I’m about to start optimizing one of the services my team owns, and my first step is going to be looking for all the bad performance issues that come out of this.


yeah, a friend of mine wrote a logfile parser in 01996 in c, and when i rewrote it in perl (also in 01996) it was about one tenth the size and ten times slower

whether i/o is the bottleneck depends on what you're doing and on which computer, and that's been true for at least 50 years


May I ask what's the rationale for writing 01996 rather than 1996? I've seen this before but I haven't seen an explanation of the advantage of it.



But it looks like an octal literal.


Haha, I thought it was some architecture I’ve never heard of.


I mean, Go is closer to C than Python is. But Go is still a garbage collected language with a runtime. It's not going to come close to the limits of performance that C offers


For a script like this you'd be surprised. If you don't do too much allocation, the garbage collector need not run. In fact it's common to disable it at the start. Now you can allocate much faster than C because you never clean up after yourself!

A real world example is esbuild, the author implemented it both Rust and Go initially. The Go version was faster and the code simpler. Which is why it's implemented in Go.


> A real world example is esbuild, the author implemented it both Rust and Go initially. The Go version was faster and the code simpler. Which is why it's implemented in Go.

But why is swc faster than esbuild then? The code isn't even considerably more complex.


Because different programs, implemented differently, run at different speeds...

I'm saying the performance of Go can sometimes be surprisingly fast. Not that it's magic.


So you're saying they wrote exactly the same program in Go and Rust for the comparison, changing the syntax only? Well then it's no surprise the Go version was faster.

Don't write Rust as if it was Go. That doesn't say anything meaningful about either Go or Rust.


Actually he wrote the Rust version first, so you're wrong to jump to conclusions.

I'm not trying to say Go is faster than Rust, it's usually slower. But there are always exceptions to the rule. The Go code, on the other hand, is usually simpler and quicker to write. For that reason I'd prefer Go if the problem lends itself to a garbage collected language.


So what did you mean by this?

> Because different programs, implemented differently, run at different speeds...

We're talking about two programs with exactly the same purpose - ingest TypeScript and output JavaScript. It's a pretty clear-cut comparison, IMHO.

> The Go code, on the other hand, is usually simpler and quicker to write

I'm writing Go code at work, and Rust code mostly for fun (but used it at work too). I'd say this has changed significantly in the last 2 years. Now with rust-analyzer and much improved compiler output, writing Rust is very simple and quick too. I guess getting into Rust can be a little harder if you've only ever used GCed languages before, but it's not that hard to learn - and once you do it's super-effective. And the type inference of Rust is a huge reason why I'm using it - while Go has none.

Another thing to consider - usually code in Go is much more about writing algorithms yourself instead of using library functionality (this is changing slowly thanks to the new support for generics, but most code hasn't caught up yet and there aren't good libs using it so far). The resulting Go code can be quite convoluted and contain very hidden bugs. People also usually don't bother implementing a proper search/sorting algorithm for the sake of simplicity/speed of development - which you'd get automatically if you used a library function - so the code is less efficient. My Go code is usually 2-3x longer than the equivalent in TypeScript or Rust.

Go is great, I like it. Rust is great too. I recommend you to do what the esbuild author did - test it and choose for yourself, don't bother too much about others' opinion.


> We're talking about two programs with exactly the same purpose - ingest TypeScript and output JavaScript. It's a pretty clear-cut comparison, IMHO.

There are an infinite number of ways to design two programs for that task, with different trade-offs. You can't draw conclusions about which language is faster based on two different implementations by different people.

> Go is great, I like it. Rust is great too. I recommend you to do what the esbuild author did - test it and choose for yourself, don't bother too much about others' opinion.

I'm actually writing Rust code the last two years. It's been a while since I've used Go. But I'd rather use Go if the problem allows for a garbage collector. It's just simpler than managing it manually in Rust with the borrow checker and its rules. This is my opinion, nobody else's.


The GP didn't say that, maybe they wrote Go as if it were Rust, for all we know.


You can find the comments from the ESBuild author here on HN:

https://news.ycombinator.com/item?id=22336284


Rust compiler is much faster today than it was in 2020, btw.

> This is a side project and it has to be fun for me to work on it.

I respect this 100% - but then we shouldn't assume Go is better than Rust just based on that esbuild used it instead of Rust.


You could LD_PRELOAD a free() that doesn't do anything in C too, or #define it away, to really make it no-cost.


You'd need to change malloc too so it doesn't do extra work. You could do it. Go obviously isn't going to be faster than well written C. But sometimes you can be surprised.


Having played with this a previous time it was posted and looking at the profiler results, the main difference between the Go and C isn't due to any memory allocations but rather the lack of in-place update and slightly higher cost from being general-purpose in Go's hash table. (This is also probably why Rust, without the GC overhead, clusters with Go rather than C.)


I'm fairly certain that if one wrote C (or C++) programs with the same safety targets that Rust has out-of-the-box the performance is close, if not (for practical purposes) identical.

Of course, to do much useful (and performant) in Rust one often has to break out `unsafe`, which eliminates some of the out-of-the-box guarantees for safety--and in some cases makes one wonder if it's worth all the overhead instead of just using C or C++.


> the same safety targets

Rust's selling point is that the safety targets' costs are dev/compile-time ones. There should not be a difference unless the C/C++ code requires some UB or extremely manual memory-management trickery, which it doesn't; and Go offers basically the same memory safety guarantees as Rust in this regard and is (slightly) faster.

In this case it's really almost entirely about the speed of the hash table.


I was referring to the overhead of development time. Rust is more complex than C, at least, and even after having gained proficiency coding in Rust still takes more time than writing an equivalent C program.

And "zero-cost" is misleading. There are definitely performance impacts from the implicit (and unadvertised explicit) bounds checking some of Rust's features come with. Writing a C program to an equivalent level of safety would have similar performance impacts. Hence, for as close to the same safety as possible, Rust and C should be almost identical in terms of performance.


That's probably the reason why Go got many people switching from their scripting languages and didn't end up as the C killer it was sold as in the beginning.


That's exactly my reason to use it: fast prototyping and ability to replace bash scripts (as they become too unwieldy to maintain and expand from one point and on).

I've also successfully made an MVP with Golang which I then proceeded to rewrite in Rust in almost only one go and almost without blockers along the way.

Golang is pretty good but it still lacks important things like algebraic data types, and they're hugely important for fearless refactoring and correctness.


I never remember it being sold as a C killer, maybe as a Java killer?


It was sold as C killer in the beginning.

And then a few years later there was an article saying the Go engineers were surprised when they saw C/C++ coders weren't switching to Go; rather, Python/Ruby coders were "upgrading" to Go.


golang.org circa 2010:

> go is ... fast

> Go compilers produce fast code fast. Typical builds take a fraction of a second yet the resulting programs run nearly as quickly as comparable C or C++ code.

https://web.archive.org/web/20100217123645/http://golang.org...

That seems to me like they were trying to say "If you want C/C++ performance but nicer/easier syntax, you can use Go", which turned out to be not that true in the end.

Edit: the old "Language Design FAQ" also goes further in detail on how the envision (the first version of the) language: https://web.archive.org/web/20100211104313/http://golang.org...


I believe Go was created to replace python at Google.


It was more sold as a C++ killer than a C killer.


The Go code is reasonably efficient at avoiding allocations, though it's hard to avoid some. Without thinking too hard, it's also going to be hard to avoid those same ones in C, and some possible improvements would apply to both (e.g. defining a maximum word length).

"Lowercase word count" is a surprisingly difficult case in this regard, because you need to check and potentially transform each character individually, and also store a normalized form of each word. Probably some smart SIMD lowercase function could help here but I don't think any language is going to offer that out of the box. It's also defined in a way I think detaches a bit much from real-world issues - it's handling arbitrary bytes but also only ASCII. If it had to handle UTF-8 it would be very different; but also if it could make assumptions that only a few control characters were relevant.


> but I don't think any language is going to offer that out of the box.

That's what compilers are for. I tried to improve the C version to make it friendlier to the compiler. Clang does a decent job:

https://godbolt.org/z/o35edavPn

I'm getting 1.325s (321MB/s) instead of 1.506s (282MB/s) on a 100 concatenated bibles. That's still not a 10x improvement though; the problem is cache locality in the hash map.


Note: just concatenating the bibles keeps your hash map artificially small (EDIT: relative to more organic natural-language vocabulary statistics)... which matters because, as you correctly note, the big deal is whether you can fit the histogram in the L2 cache, as noted elsethread. This really matters if you go parallel, where N CPUs * N L2 caches can speed things up a lot -- until your histograms blow out the CPU-private L2 cache sizes. https://github.com/c-blake/adix/blob/master/tests/wf.nim (or a port to your favorite lang instead of Nim) might make it easy to play with these ideas (and see at least one way to avoid almost all "allocation" - under some interpretations).

A better way to "scale up" is to concatenate various other things from Project Gutenberg: https://www.gutenberg.org/ At least then you have "organic" statistics on the hash.


> I don't think any language is going to offer that out of the box.

C# offers that out of the box, and the solution is much simpler there.

Pass StringComparer.OrdinalIgnoreCase or similar (InvariantCultureIgnoreCase, CurrentCultureIgnoreCase) to the constructor of the hash map, and the hash map will become case-agnostic. No need to transform strings.


The one possible issue is that by not transforming the string you're going to run the possibly less efficient CI comparison a lot: because the corpus is text and duplicated, by my reckoning there are ~32000 unique "words" out of 82 million input "words". That's a lot of conflicts.

Though the values should mostly be quite short, so a vectorised comparison might not even trigger as it wouldn't have the time to "stride": only one word of the top 10 even exceeds 4 letters ("shall", a hair under a million in my corpus).


The C# standard library will not run many case-insensitive comparisons. That comparison object doesn’t just compare two objects for equality, it also has another interface method which computes a hash of a single object.

Here’s implementation of the hash function used by that StringComparer.OrdinalIgnoreCase: https://source.dot.net/#System.Private.CoreLib/src/libraries... As you see, it has a fast path for ASCII-only input strings.


> The C# standard library will not run many case-insensitive comparisons. That comparison object doesn’t just compare two objects for equality, it also has another interface method which computes a hash of a single object.

Which doesn't matter because I'm talking about identical strings, so they will hash the same by definition, and they will have to be compared.

So the question is how fast the CI hash and equality operate compared to the CS ones.

And I asked about comparison because I assumed that would be the costlier of the two operations, relative to its CS brethren.


> how fast the CI hash and equality operate compared to the CS ones.

If the string is ASCII like in the OP’s use case, I think the difference is not huge.

CS comparison looks more optimized, they have an inner loop which compares 12 bytes as 3 64-bit values: https://source.dot.net/#System.Private.CoreLib/src/libraries...

CI comparer doesn’t do that, it loads individual UTF-16 elements: https://source.dot.net/#System.Private.CoreLib/src/libraries... But still, it’s very simple code which does sequential memory access.

> And I asked about comparison because I assumed that would be the costlier of the two operations, relative to its CS brethren.

I think the bottleneck is random memory loads from the hash table.

Hashing and comparison do sequential RAM access. The prefetcher in the CPU will do its job, you’ll get 2 memory loads every cycle, for short strings going to be extremely fast. If that hashtable doesn’t fit in L3 cache, the main memory latency is much slower than comparing strings of 10-20 characters, no matter case sensitive or not.


But all of those optimizations can also be done, and more efficiently, by transforming while reading. Even if they're as fast as they can be, they're not as fast as a memcmp. The C# approach isn't buying you any performance here.


Yeah, it’s possible to change case while loading, but I’m not sure that’s gonna be much faster.

But I’m sure it's gonna be much harder.

For non-ASCII strings, converting case may change their length in bytes. You don’t even know in advance how much memory you need to transform 2GB of input text (or a 1MB buffer if streaming). And if streaming, you need to be careful to keep code points together: with a naïve approach you're gonna crash with a runtime exception when you split a single codepoint between chunks.

English words are 99.9% ASCII, but that remaining 0.1% like “naïve” is not. The C# standard library is doing the right thing for this use case. Specifically, for 99.9% of words the CI comparer will use the faster ASCII-only code to compare or hash, and only do the expensive shenanigans for small count of non-ASCII words.

Note how C# makes the implementation much simpler. A single parameter passed to the constructor of the Dictionary<string,Something> makes it implement case-insensitivity automagically.


> Probably some smart SIMD lowercase function could help here but I don't think any language is going to offer that out of the box.

No, but at least you have direct access to the intrinsics in C. To get vectorization in Go, you have to implement it in C and link that into your Go program.


> To get vectorization in Go, you have to implement it in C and link that into your Go program.

Go has an assembler which does not require implementing anything in C (though the assembler uses a more C-style syntax), nor critically does it require using CGo linkage. It's used to implement many hot paths in the stdlib.


Typically you would just implement vectorization in assembly in the Go case.


The file is only being read once according to the blog and then it is in memory. This isn't an I/O intensive application at all.

I have written a small script in Python that does something similar. I have a word list with 1000 words and I check the presence of the words. Here is the thing: for every word I go through the entire file, so let's say I scan the file one thousand times. In fact I did something more complicated and ended up going over the original file 6000 times, and yet it still took only three seconds. If all these scans had to reread the file it would take forever.


This sounds like a very inefficient method. The Python stdlib module 'collections' includes a Counter class which can do this for you. Then all you need to do is copy the dict key/value pairs that match your word list into a new one.


he has figures for both cached and uncached performance

if all your scans had to reread the file from nvme it would take five times as long if we extrapolate from those figures

not forever


the optimized version does it byte by byte


Of course disk speed has caught up. Single core clock speed hasn't changed in over 10 years now? Disk now uses SSD and has been chugging along with improvements.

I'm curious what a multi thread(multi process for python due to GIL?) comparison would be. Obviously people aren't doing this by default though, so the author's point still stands.


> Single core clock speed hasn't changed in over 10 years now?

This is not true. I thought the same thing, and you are right with regard to baseline clock speed. But the performance is still increasing. I just got a new PC at work with a 2021 high-end CPU, and it has 200% of the single-core performance of the 2015 mid-range CPU in my private PC:

https://www.cpubenchmark.net/compare/4597vs2599/Intel-i9-129...


Well, CPU got a lot better at benchmarks, that is true. Caches got bigger, predictions got better. Specialized instructions were added. IPC improvement kinda slowed down after Sandy Bridge, at least for Intel.

Also, the comment you're quoting is talking about clock speed, and the link you provided literally shows the same base clock speed - 3.2 GHz. Intel progressively pushed turbo speed higher, but that's the speed you could have achieved yourself by overclocking.


Does the CPU constantly hold the turbo speed under a single threaded workload?


Depends. The cpu attempts to hold the maximum possible speed on any cores that are in use.

On my water cooled and specifically tweaked desktop- yes. It’ll hold max boost indefinitely, even with all threads. (getting to about 80c after 10 mins). Single-thread max is faster, and it’ll hold that as well.

My laptop will pull power within 15 seconds and be down to base clocks in a couple mins. Unless I set it down outside and it’s very cold.


Most un-tweaked chips are going to be below 25 watts with a single core loaded, and lots of laptops can cool that without any problems.


It depends on motherboard and cooling. 6700K, for example, is constantly running at 4.2Ghz or 4.5Ghz (winter clocks). Constantly while thermals allow it... Non-overclocking motherboards allow it to boost for 2 minutes, after that, it's iffy.


Are you simply glossing over the fact that CPU 1 has a turbo of 5.2 GHz and the other 3.6 GHz?


Yes, I did not go into details where the differences are coming from. The base clockspeeds are the same, which is obviously the number that GP noticed not changing over the last 10 years.

Other things changed. The turbo clockspeed is one of them.


Also, the old one is 65W TDP and the new one is 241W TDP.


> Single core clock speed hasn't changed in over 10 years now?

Technically we have gone from a 3.3-3.9 GHz base frequency, 4.1-4.2 GHz single-core boosts, and a 4.7 GHz heavy OC in Haswell to

4.3 and 3.0 GHz base clocks for efficiency/perf cores, with boosts to 5.4 or so, a stable 5.8 GHz all-core frequency with a heavy OC, and even pushing 6-something GHz single-core with a good setup.

But then again, this is on the more extreme ends. Mainstream laptops have remained relatively stable.

re Python MP:

On my 12700k, I saw almost linear scaling (per core, not thread) with multiprocessing when loading stuff from the disk for ML applications and processing. This was with the datasets library.


It's not just about clock speed. CPI has been steadily improving across processor generations, even when the clock speed doesn't meaningfully change.


Sorry do you mean IPC (instructions per clock) or is there another term that I’m not aware of?


Cycles per instruction.


"Cycles per instruction" sounds more like a measurement of instruction latency rather than instruction throughput, and I don't think instruction latency has improved as much in the past decade as instruction throughput: the most significant microarchitecture trend has been toward wider CPUs with more execution ports, wider instruction decoders, and larger reorder buffers.


Yah, that was my thought exactly. It's probably not just the language choice, but the chosen algorithm. I expect a word-frequency counting algorithm to run at close to the speed of sequential main-memory reads. So the bottleneck in real situations is likely how fast the I/O subsystem can get it into RAM.

(A quick glance at the C version tells me all I need to know, with the scanf, strdup, etc. This is sorta like those C vs Java benchmarks which boil down to the fact that the default (glibc/etc) C malloc() isn't really optimized for perf, and what you're really benchmarking is how terrible it is for frequent tiny allocations. I think there _IS_ a difference between code written by someone who lives C in a system-programming context, and code by people who dabble in it from a higher-level language.)

PS: Using a hash to track a word's frequency likely works OK-ish, but the need to keep it small while still avoiding collisions would cause problems if the algorithm were parallelized in a long-running system (rather than batching, where you would just run each thread against its own copy of the frequency tables, and then merge them at the end). But I'm still fairly certain that "you're doing it wrong" if the overall algorithm can't run at a significant fraction of main-memory bandwidth.


Don't forget about naive implementations in languages like C++, where using cin/cout can cause your program to run slower than Python when doing file I/O.


Yeah I’ve worked on non trivial javascript codebases where allocations and gc mattered.


I’d like to hear more about this.


Not your OP, but I've got stories!

PhotoStructure is a non-trivial app written in TypeScript for both frontend and backend tasks.

Coming from Scala, I initially used a lot of Array.map, Array.foreach, and, to handle things like Some/None/Either:

function map<T, U>(maybe: T | undefined, f: (t: T) => U)

According to the V8 memory inspector, all those tiny fat-arrow functions can hammer the GC (presumably due to stack allocations and pulling in local context). Replacing large array iteration with for loops was a big win.

Also, handling large files with a streaming parser when possible, instead of reading them entirely into memory, another win.

Buffer concatenation may be faster by pushing read chunks onto an array, and joining the lot at the end, rather than incrementally appending with .concat.

When memoizing functions, if they themselves return functions, watch out for fat arrows if you don't need the caller's context (their unexpectedly retained variables may prevent GC).

But the first step should always be to profile your app. You can't assert improvement without a baseline.


Networking is I/O: API calls, database access, etc. - it's not just disk access. The article is deriving a generalised statement based on a very specific use case.


Interesting observation, but I think the author crosses a bridge too far here.

> If you’re processing “big data”, disk I/O probably isn’t the bottleneck.

If it fits on a single machine, it is by definition not big data. When you're dealing with really big data, it's likely coming from another machine, or more likely a cluster of them. Networks can also be pretty fast, but there will still be some delay associated with that plus the I/O (which might well be on spinning rust instead of flash) on the other end. Big data requires parallelism to cover those latencies. Requires. It might be true that I/O is no longer likely to be the bottleneck for a single-threaded program, but leave "big data" out of it because in that world I/O really is still a - if not the - major limiter.


the author is evidently pretty unskilled and unaware of it in a lot of ways, this being one of them


I am working on a project [0] to generate 1 billion rows in SQLite in under a minute, and I have inserted 100M rows in 33 seconds. First, I generate the rows and insert them into an in-memory database, then flush them to disk at the end. Flushing to disk takes only 2 seconds, so 99% of the time is being spent generating and adding rows to the in-memory B-tree.

For Python optimisation, have you tried PyPy? I ran my same code (zero changes) using PyPy, and I got 3.5x better speed.

I published my findings here [1].

[0] - https://github.com/avinassh/fast-sqlite3-inserts

[1] - https://avi.im/blag/2021/fast-sqlite-inserts/


Thank you for this (albeit a different project). My immediate question when I read the last part of the page was 'What about PyPy!?' PyPy might be different for the original post's execution, but I still assume some significant speedups?


> PyPy might be different for the original post's execution, but I still assume some significant speedups?

Yes, I believe so as well. They have done many CPU optimisations, so it is likely to be faster


I would have thought that allocations in managed languages like Go/Python would have been the "fast" part of the processing. Isn't it technically the GC that's slowing you down, and not the allocation per se? For one-shot input/output programs like these I guess you could tune the GC to kick in with less frequency.

You also note that reading a file sequentially from disk is very fast, which it is, but there is no guarantee that the file's contents are actually sequential on disk (fragmentation), right? We'd have to see how the file was written, and I guess at worst you'd be reading sequential chunks of a hand-wavy 4KB or something depending on the file system and what not. I'm sure others can fill in the details.

Just nit-picking here.


Files aren't stored sequentially on an SSD anyway; they are scattered all over the place physically on different blocks just due to the way SSDs work. This doesn't hurt their sequential read performance at all since they have no seek time, and they already keep a lookup table from the logical blocks the OS sees to physical locations within themselves.

However, one thing I found out a few years ago is that old data can be slow to read, as a lot of error correction kicks in. Additionally, a lot of fragmentation at the operating-system level in Windows has quite a bit of overhead; it can seriously degrade performance, down to about 50MB/s sequential reads. In practice, defragmentation/rewriting of certain high-write files may be necessary on SSDs because Windows read performance degrades at high levels of fragmentation.


> You also note that reading a file sequentially from disk is very fast, which it is, but there is no guarantee that the file's contents are actually sequential on disk (fragmentation), right?

Correct. And there are actually two layers of fragmentation to worry about: the traditional filesystem-level fragmentation of a file being split across many separate chunks of the drive's logical block address space (which can be fixed by a defrag operation), and fragmentation hidden within the SSD's flash translation layer as a consequence of the file contents being written or updated at different times.

The latter can often have a much smaller effect than you might expect for what sounds like it could be a pathological corner case: https://images.anandtech.com/graphs/graph16136/sustained-sr.... shows typically only a 2-3x difference due to artificially induced fragmentation at the FTL level. But if the OS is also having to issue many smaller read commands to reassemble a file, throughput will be severely affected unless the OS is able to issue lots of requests in parallel (which would depend on being able to locate many file extents from each read of the filesystem's B+ tree or whatever, and the OS actually sending those read requests to the drive in a large batch).


I/O is still often the bottleneck. My laptop can handle 11 GB/s through RAM (and has no NVMe, so under 1 GB/s through the hard drive), less with unpredictable I/O patterns (like a hash map), and 7600 GB/s through the CPU. Unless the thing you're doing is particularly expensive per byte of data, you're going to be limited at a minimum by RAM I/O, and maybe by disk I/O.

FWIW, all my recent performance wins have either been by reducing RAM I/O or restructuring work to reduce contention in the memory controller, even at the cost of adding significantly more work to the CPU.


If you ignore latency sure. Optane is still the fastest storage by quite a bit. Flash has yet to catch up and might never do so.

Tons of files and random writes can bring even an enterprise flash ssd to its knees but Optane keeps on trucking


Processing cache hot data has never been the bottleneck.

Running some search on a file on your 486-with-8-megs-of-RAM running Linux, where the file was in the operating system's cache, was dependent on the performance of the program, and the overhead of reading data from the cache through syscalls.

You can't handwave away the performance of the program with the argument that it will hide behind I/O even if that is true for cache-cold run because cache-hot performance is important. People run multiple different kinds of processing passes on the same data.


"Optimised" code according to the author. What can we do to optimise further?

- Read file in one thread pool, streaming the chunks to...

- ...another thread pool, tokenise, count, sort the chunks and send them to ...

- ... merge in another thread pool. (basically map-reduce).

- please stop malloc'ing for each token

- prealloc map for found tokens (better to just allocate room for 200k words).

- SIMD would optimise your inner-loop quite a lot. However, there are optimised libraries for this, so you don't have to write this yourself.

- `word = append(word, c)` <= this is very slow

Why is there no profiling? Why don't you check how the compiler interpreted your code and benchmark the subparts?

In addition, there are at least some errors in your optimised program:

- you can't lower-case by subtracting like you do. Non-ASCII characters would fail.

- also, you can't tokenise by comparing with c <= ' '. There are many characters which would break a string. See this exercise: https://campus.datacamp.com/courses/introduction-to-natural-...


> thread pool

May reduce wall-clock time but increase total compute time (and so also power). It's less an optimization than a tradeoff.

> please stop malloc'ing for each token

It doesn't, only when it gets put in the map. (And while the particular allocation could be smaller, something guaranteed to represent a specific arbitrary-length string has to be put in the map, which is going to malloc.)

> prealloc map for found tokens (better to just allocate room for 200k words).

Has no meaningful effect on performance.

> SIMD would optimise your inner-loop quite a lot.

No, as pointed out elsethread, it's a measurable boost but nowhere near the 10x you need to make the main claim (I/O not the bottleneck) be wrong. Not even 2x.

> `word = append(word, c)` <= this is very slow

Has no meaningful effect on performance.

Perhaps you should read the whole post.


> It doesn't, only when it gets put in the map. (And while the particular allocation could be smaller, something guaranteed to represent a specific arbitrary-length string has to be put in the map, which is going to malloc.)

You could use one big buffer for all your words. Arguably that's bump allocation but it's much simpler than malloc.
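
Something like this toy arena is what's being described (sizes made up, no overflow or lifetime handling) - each stored key is carved out of one big buffer instead of being individually malloc'd:

  /* Toy bump/arena allocator sketch: word keys are copied into one big buffer.
     Nothing is ever freed individually; the whole arena is dropped at the end. */
  #include <string.h>
  
  static char arena[1 << 26];        /* 64 MiB, assumed enough for all unique words */
  static size_t arena_used;
  
  static const char *intern(const char *word, size_t len) {
      char *p = arena + arena_used;  /* no overflow check in this sketch */
      memcpy(p, word, len);
      p[len] = '\0';
      arena_used += len + 1;
      return p;                      /* stable pointer, usable as a hash-table key */
  }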


I also tried this and didn't see much improvement. Go already uses a fast allocator for small sizes so they are likely to all end up similar memory regions regardless. A bump allocator reduces the GC pressure a tiny bit compared to that, but that's not significant.

The real alloc win would likely be some kind of small-string optimization, which Go (specifically, the requirements of its precise GC) makes difficult. This is probably my biggest performance frustration with Go, 16 bytes for a string and especially 24 for a slice is so much waste when often 99% of your data is smaller than that.


I/O is no longer the bottleneck for silly interview questions for the most part. But for real programs it can still be an issue.


I like that the author started with measuring and thinking about bandwidth, which makes sense for streaming through a big file, so I'd have continued that way towards a different design & conclusion.

Continuing with standard python (pydata) and ok hw:

- 1 cheap ssd: 1-2 GB/s

- 8 core (3 GHz) x 8 SIMD: 1-3 TFLOPS?

- 1 pci card: 10+ GB/s

- 1 cheapo GPU: 1-3 TFLOPS?

($$$: cross-fancy-multi-GPU bandwidth: 1 TB/s)

For streaming like word count, the floating-point operation (a proxy for actual ops) to read ratio is unclear, and the above supports 1000:1. Where the author is reaching the roofline on either is a fun detour, so I'll switch to what I'd expect of pydata Python.

It's fun to do something like running regexes on logs with cudf one-liners (the GPU port of pandas) and figuring out the bottleneck. 1 GB/s sounds low; I'd expect the compute to be more like 20+ GB/s for in-memory, so they'd need to chain 10+ SSDs to achieve that, and good chance the PCI card would still be fine. At 2-5x more compute, the PCI card would probably become the new bottleneck.


Haven't gone through the code, but measurement methodology seems wrong to me.

> As you can see, the disk I/O in the simple Go version takes only 14% of the running time. In the optimized version, we’ve sped up both reading and processing, and the disk I/O takes only 7% of the total.

1. If I/O wasn't a bottleneck, shouldn't we optimize only reading to have comparable benchmarks?

2. Imagine the program was running 100 sec (14% I/O), so 14 seconds are spent on I/O. Now we optimize processing and the total time becomes 70 seconds. If I/O wasn't a bottleneck and we hadn't optimized I/O, disk I/O should become 20% of total execution time (14/70), not 7%.

Disk I/O:

> Go simple (0.499), Go optimized (0.154)

clearly, I/O access was optimized ~3x while total execution was only optimized ~1.6x. This is not a good way of measuring things if you want to say I/O is not a bottleneck.

I agree though things are getting faster.


Does someone have a comparison between common server ssds and consumer ssds? I wonder if the speed is equal or not


Hard drives are still about 150MB/s.

SATA SSDs are limited to 550MB/s.

PCI-E 3.0 SSDs more like 3500 MB/s.

PCI-E 4.0 SSDs are 7000MB/s.

All of these are at consumer-level pricing; you can get 2TB of PCIe 4.0 from Western Digital for £130 at the moment, usually about £180. The issue is sustained writes more than reads for consumer versus enterprise drives, where the speed drops off due to a lack of SLC caching, a lack of cooling, and TLC/QLC which is slower for sustained writing.

The example given is very much a consumer level device and not a particularly quick one by today's standards. You can also expect much faster reads cached than that on a DDR5 system I suspect.


It should be noted that those SSD speeds are all protocol limits rather than NAND flash limits. ~7GB/s is literally the maximum speed PCIE4 can provide, likewise ~3.5GB/s for PCIE3 and ~500MB/s for SATA3.


> the maximum speed PCIE4

4-lane PCIe, as most NVMe drives are. I haven't seen drives with more lanes though...


Ah, right. Lanes. I should have mentioned those as well, thanks for catching that.


For me the main difference between HDD and NVMe is not really the throughput but the access time. When you manipulate small files, it's a lot more important.


Today's large hard drives are over 250 MB/s in sequential operations.


Consumer SSDs are trash for sustained writes since they get their top speeds from cache and use slower NAND. Enterprise SSDs tend to have better write endurance, and faster NAND. I have a small Ceph cluster and as an example when I first bought SSDs for it I tried consumer Samsung 870 Evo's. They performed worse than spinning rust.


Consumer SSDs don't really use slower NAND; the QLC vs TLC ratio might be higher for the consumer SSD market than the enterprise SSD market, but a consumer TLC drive is using the same speed of NAND as an enterprise TLC drive (and likewise for QLC drives).

Enterprise SSDs only really have significantly higher endurance if you're looking at the top market segments where a drive is configured with much more spare area than consumer drives (ie. where a 1TiB drive has 800GB usable capacity rather than 960GB or 1000GB). Most of the discrepancy in write endurance ratings between mainstream consumer and mainstream enterprise drives comes from their respective write endurance ratings being calculated according to different criteria, and from consumer SSDs being given low-ball endurance ratings so that they don't cannibalize sales of enterprise drives.

Your poor Ceph performance with Samsung consumer SATA SSDs wasn't due to the NAND, but to the lack of power loss protection on the consumer SSDs leading to poor sync write performance.


If what you say had been true, then we wouldn't be able to saturate the PCIe 3.0 x4 bus with a consumer NVMe SSD, which we absolutely can. The biggest difference is in the durability, as mentioned in the comment below.


Read speeds are similar between consumer and enterprise SSDs; they use the same flash and there's overlap with high-end consumer SSDs using the same controllers as entry-level and mid-range enterprise SSDs.

The main difference is in write performance: consumer SSDs use SLC caching to provide high burst write performance, while server SSDs usually don't and are optimized instead for consistent, sustainable write performance (for write streams of many GB).

Server SSDs also usually have power loss protection capacitors allowing them to safely buffer writes in RAM even when the host requests writes to be flushed to stable storage; consumer drives have to choose between lying to the host and buffering writes dangerously, or having abysmal write performance if the host is not okay with even a little bit of volatile write caching.


I/O is sometimes the bottleneck. On Windows, any workload with lots of small file operations bottlenecks on NTFS and Defender. It makes some applications like git, which run beautifully on Linux, need to take countermeasures to operate well on Windows.


Storage and compute separation is key to scaling data workloads. Here, scaling could be w.r.t. volume/shape of data, number of concurrent jobs on the same dataset, complexity of each job, etc. In such an architecture, network access is unavoidable. And if you have multiple jobs competing for access to the same dataset concurrently, your sequential access can turn into semi-random access. You also have concerns about utilization of resources while being scalable w.r.t. arbitrary bursty, contentious workloads. These are the things that make managing I/O resources complex.


Having dealt with several data parsers, I would like to somehow estimate how much electricity is burned globally just on lazy implementations. E.g. in .NET, a non-devirtualized `Stream.ReadByte` is often one of the hottest methods in a profiler. It and related methods can easily be responsible for a double-digit share of CPU when processing data. I mean, it's not I/O but just pure overhead that disappears with custom buffering, where reading a single byte is as cheap as it should be.


The correct way to describe his experiment should be:

Of course I/O is still the slowest part, but it is now fast enough that most programmers are not able to fully utilize it.


> Some candidates say that sorting is going to be the bottleneck, because it’s O(N log N) rather than the input processing, which is O(N). However, it’s easy to forget we’re dealing with two different N’s: the total number of words in the file, and the number of unique words.

I don't see how that changes anything. There's a reason we use Big O rather than other notations. Their answer would still be correct.


input processing is o(n), sorting the unique words is o(m log m)

whether o(m log m) is bigger or smaller than o(n) depends on the relationship between n and m

in the case used by another poster in this thread, concatenating some large number of copies of the bible, m is constant, so o(m log m) is o(1)

in another possible case where the corpus is generated uniformly randomly over all possible words of, say, 20 letters, though in theory o(m log m) is o(1), over a practical range it would be o(n log n)

more typically m is proportional to some power of n depending on the language; for english the power is 'typically between 0.4 and 0.6'

as it turns out o(√n log √n) is less than o(n)
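
spelling that out (a sketch, assuming the exponent is about 0.5):

  m \approx k\,n^{0.5}
  \;\Rightarrow\;
  O(m \log m) = O\!\left(\sqrt{n}\,\log\sqrt{n}\right) = O\!\left(\sqrt{n}\,\log n\right)

and since (√n log n)/n → 0 as n grows, the sort of the unique words is dominated by the o(n) input pass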

a totally different issue is that big-o notation doesn't always narrow down where the bottleneck is even if you calculate it correctly because n is never arbitrarily large and so the bottleneck depends on the constant factors


> a totally different issue is that big-o notation doesn't always narrow down where the bottleneck is even if you calculate it correctly because n is never arbitrarily large and so the bottleneck depends on the constant factors

Yeah I think this is part of my point, in that theoretical best, average and worst case scenarios can be useful in isolation but rarely tell the whole story.

In the worst case, every word is unique, the number of unique words equals the total word count, and it's a standard sorting problem.


No. Big O tells us how the compute time will grow asymptotically as the input size increases for a given algorithm. It does not say anything about the performance of a specific program on a specific input.

The author’s point is that if we have two different inputs A and B, and A is sufficiently smaller than B, then the runtime of B will dominate the program. This would be true even if we have to operate on A with a very slow algorithm.

For example suppose you have some extreme scenario where you have to do three coloring on A and then find the minimum value of B. The runtime would be O(2^A) + O(B). Just plug in numbers and you can see that if B > 2^A then B takes longer even though that algorithm is much faster. If you suppose B is always much larger than 2^A, then B is your bottleneck.


It really depends on the number of unique words in the input file. If it's all unique words, sorting can become the bottleneck.


Article itself is fine but the "conclusion" is loony. You can't draw conclusions about big data from toy experiments with small data.


On a related note, John Ousterhout (in the RAMCloud project) was basically betting that the latency of accessing RAM on another computer on a fast local network will eventually become competitive to local RAM access.

https://ramcloud.atlassian.net/wiki/spaces/RAM/overview


Nonsense. Latency matters. NVMe latency is measured in microseconds, while DRAM latency is measured in nanoseconds.

Sequential processing is not that common.


latency matters but sequential processing is still most processing

i hope you are doing well


As it often goes with these types of interview questions, there's a lot of context missing. What is the goal? Do we want readable code? Do we want fast code? Are we constrained somehow? It seems here the author doesn't really know, but kudos to them for examining their assumptions.

As a side note, a trie would be a neat data structure to use in a solution for this toy problem.
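
For what it's worth, a minimal sketch of that idea in Go (ASCII-only, lowercased words assumed; a real solution would handle arbitrary input and might still lose to a plain hash map):

    package main

    import "fmt"

    // node is one level of a byte-wise trie; counts live on terminal nodes.
    type node struct {
        children [26]*node
        count    int
    }

    func (n *node) insert(word string) {
        cur := n
        for i := 0; i < len(word); i++ {
            c := word[i] - 'a' // assumes lowercased ASCII words
            if cur.children[c] == nil {
                cur.children[c] = &node{}
            }
            cur = cur.children[c]
        }
        cur.count++
    }

    func main() {
        root := &node{}
        for _, w := range []string{"the", "quick", "the", "fox"} {
            root.insert(w)
        }
        // walk "the" to read back its count
        n := root
        for _, c := range []byte("the") {
            n = n.children[c-'a']
        }
        fmt.Println("the:", n.count)
    }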


In the last few years I’ve been quietly moving database “rows” from databases to the disk.

Back in the day accessing data from MySQL was actually slower than current SSD speeds. And now you can get all sorts of benefits for free: hard link deduplication, versioning, live backup, easy usage of GNU tools...

I don’t discuss this with certain types of colleagues, but the results are excellent.


Hmm. I just fought with a MariaDB installation that, when configured to write to disk immediately, became rather slow. 7-8 INSERTs could easily take 3 seconds; unfortunately the internal logic of the system didn't really lend itself to a single INSERT of multiple rows.

Once I reconfigured innodb_flush_log_at_trx_commit to 2, the UI started being lightning fast.
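
For what it's worth, you can often get the same effect without restructuring into one multi-row INSERT by wrapping the statements in a single transaction, since with innodb_flush_log_at_trx_commit=1 it's the per-COMMIT fsync that hurts. A rough Go sketch, with a made-up DSN and table name:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/go-sql-driver/mysql" // driver choice is just for illustration
    )

    func main() {
        db, err := sql.Open("mysql", "user:pass@/app") // placeholder DSN
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // With innodb_flush_log_at_trx_commit=1, every COMMIT waits for an fsync;
        // wrapping the 7-8 INSERTs in one transaction pays that cost once.
        tx, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        for _, v := range []string{"a", "b", "c"} {
            if _, err := tx.Exec("INSERT INTO events (payload) VALUES (?)", v); err != nil {
                tx.Rollback()
                log.Fatal(err)
            }
        }
        if err := tx.Commit(); err != nil {
            log.Fatal(err)
        }
    }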


I'm not surprised. I've seen bulk MySQL reads in Python be CPU-bound. The interesting followup was that parallelizing reads in subprocesses wasn't going to help much because I'd get killed by CPU again when serializing/deserializing between processes. I was capped at a ~2x speedup.


The Samsung PM9A1 is the OEM version of the 980 Pro, a top-tier PCIe 4.0 NVMe SSD. What about an older SATA SSD (one without a DRAM buffer or HMB), or a 5400 RPM hard drive? Also, as others have pointed out, random I/O will tank performance, especially simultaneous read/write operations to the same media.


Honestly I find this article too vague. If you process large amounts of data you rarely do so with orderly reads and writes; even databases optimized for fast disks (see RocksDB) have the disk as a bottleneck, even on the most recently developed hardware.


Is there any hardware accelerator / coprocessor for the PC that will read a file into RAM autonomously, mainframe-style, and notify the OS when the file is fully loaded into memory (bypassing the CPU entirely)?

Leaving the CPU to bother with other things during that time.


Do we need a coprocessor for that? Any protocol using DMA already does exactly this through interrupts or other message passing systems.

DMA capable controllers are everywhere, I don't think you'll find any storage controllers in your modern computer that don't do this.

Of course DMA only operates on byte ranges and not on files, but adding file system parsers to drives and disk controllers sounds like a way to introduce awful bugs and corruption. Assuming the OS keeps the necessary file system structures cached in RAM, plain old DMA should be good enough here.


> Of course DMA only operates on byte ranges and not on files, but adding file system parsers to drives and disk controllers sounds like a way to introduce awful bugs and corruption.

There's an interesting intermediate solution in the form of SSDs that provide a key-value interface instead of a linear block device. That gives the SSD more explicit knowledge about which chunks of data should be kept contiguous and can be expected to have the same lifetime.


>"Any protocol using DMA already does exactly this through interrupts or other message passing systems."

Interesting, I'm only familiar with the classic interrupts approach. What are some of the other common message passing systems used in DMA?


NVMe has memory-mapped doorbell registers associated with each submission or completion queue, which are ring buffers typically allocated in host RAM and accessed by the SSD using DMA. Generating interrupts when a new entry is posted to a completion queue is optional. The host system can instead simply poll the completion queue to check whether any new IO operations have been completed. Depending on how busy the drive is vs how busy the CPU is, forgoing interrupts in favor of polling may reduce latency (fewer context switches when handling a new completion), but can also burn a lot of CPU time.


This is really interesting. I wasn't aware that interrupts were handled differently on NVMe disks. The falling back to polling during busy times is similar to the way network device drivers do NAPI then?


Polling is optional. The Linux driver defaults to using interrupts only, but the nvme driver can be passed an option to allocate one or more queues with interrupts disabled to be used for polled IO. (You can also configure it to use separate queues for reads and writes, instead of just trying to allocate one queue per CPU core.)


Isn't this what a DMA controller does?


That is how I/O works already.

The problem is dealing with an asynchronous filesystem API provided by the kernel.


You mean DMA?


This is already how stuff works.

There are even OS APIs for this - DirectStorage on Windows.


The most common performance problems I've encountered in our projects are: 1) lack of indexes resulting in extensive table scans 2) I/O calls in a loop without batching.

I don't know if it counts as "I/O bottlenecks" or not.
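
The batching point, as a hedged Go sketch (the users table, columns, and DSN are made up): one query with an IN list instead of one round trip per id.

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "strings"

        _ "github.com/go-sql-driver/mysql" // driver choice is just for illustration
    )

    // fetchNames loads many rows in one round trip instead of one query per id.
    func fetchNames(db *sql.DB, ids []int) (map[int]string, error) {
        placeholders := strings.TrimSuffix(strings.Repeat("?,", len(ids)), ",")
        args := make([]interface{}, len(ids))
        for i, id := range ids {
            args[i] = id
        }
        rows, err := db.Query(
            "SELECT id, name FROM users WHERE id IN ("+placeholders+")", args...)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        out := make(map[int]string, len(ids))
        for rows.Next() {
            var id int
            var name string
            if err := rows.Scan(&id, &name); err != nil {
                return nil, err
            }
            out[id] = name
        }
        return out, rows.Err()
    }

    func main() {
        db, err := sql.Open("mysql", "user:pass@/app") // placeholder DSN
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        names, err := fetchNames(db, []int{1, 2, 3})
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(names)
    }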


> converting to lowercase

in regards to accuracy, uppercase is the better option:

https://stackoverflow.com/a/65433126


there doesn't seem to be any reasoning or evidence in that post supporting "uppercase is the better option", just that uppercase produces a larger number of word classes, which might be correct or incorrect

tchrist explains in that thread why neither uppercase nor lowercase is the best option:

> Mapping to lowercase doesn’t work for Unicode data, only for ASCII. You should be mapping to Unicode foldcase here, not lowercase. Otherwise yours is a Sisyphean task, since lowercase of Σίσυφος is σίσυφος, while lowercase of its uppercase, ΣΊΣΥΦΟΣ, is the correct σίσυφοσ, which is indeed the foldcase of all of those. Do you now understand why Unicode has a separate map? The casemappings are too complex for blindly mapping to anything not designed for that explicit purpose, and hence the presence of a 4th casemap in the Unicode casing tables: uppercase, titlecase, lowercase, foldcase.

of course 'σίσυφοσ' is not correct as a written word but if you were to encounter it then you should clearly consider it equivalent to 'σίσυφος'
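
a small go illustration of the difference (my sketch, using strings.EqualFold as the case-folded comparison; it's just to show the behaviour, not to claim go's simple fold handles every language):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        name := "Σίσυφος"
        upper := strings.ToUpper(name)      // "ΣΊΣΥΦΟΣ"
        roundTrip := strings.ToLower(upper) // "σίσυφοσ" (non-final sigma at the end)

        // lowercasing the original vs lowercasing its uppercase disagree...
        fmt.Println(strings.ToLower(name) == roundTrip) // false

        // ...but a case-folded comparison treats all of these as equivalent.
        fmt.Println(strings.EqualFold(name, upper))     // true
        fmt.Println(strings.EqualFold(name, roundTrip)) // true
    }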


> there doesn't seem to be any reasoning or evidence in that post supporting "uppercase is the better option", just that uppercase produces a larger number of word classes, which might be correct or incorrect

this sentence appears to be nonsense. the code doesn't check "word classes", it case-folds two characters and compares them.


character classes then


it doesn't check character classes either. It literally takes two characters, then uppercases both and compares, then lowercases both and compares. I have no idea where you are getting that it has anything to do with word or character classes, it doesn't.


by 'word class' i meant 'a set of words that are considered equivalent by whatever your equivalency relation is'

similarly for 'character class'

cf. https://en.wikipedia.org/wiki/Equivalence_class

what i thought the linked program did was that it counted how many of those there were

now on looking at it further i can see that it doesn't seem to be doing that but i don't have any idea what it does do

however, it definitely doesn't take into account the information you would need to learn anything about which candidate equivalency relation is better, which is something you'd need to examine at at least a word level, considering examples like größte, Σίσυφος, and the notoriously fatal sıkışınca/sikişinca pair


> doesn't take into account the information you would need to learn anything about which candidate equivalency relation is better

OK, no one said it did that. It's purely comparing characters, which is and always was what I said it was doing. And somehow it took 5 comments before you even decided to actually read the answer. Maybe next time you should start by actually reviewing and understanding what you are commenting on, before making multiple comments.


you cited it to support your proposition, 'in regards to accuracy, uppercase is the better option'

i reviewed it sufficiently to see that it's irrelevant to the question of whether that's true or not, and to pull the actually right answer out of the thread, and quote it above


> you cited it to support your proposition, 'in regards to accuracy, uppercase is the better option'

which is true

> i reviewed it sufficiently

good joke


Further evidence is the fact that optimized SIMD JSON and UTF-8 libraries exist. If I/O were the bottleneck, there wouldn't be a need to parse JSON using SIMD.


Compared to L1 cache reference, it certainly still is.


Wouldn’t it be nice if we could specify the allocators in GC languages? Like, why not expose a bump allocator arena to Python with a manual release?


I don't know if C++ counts as a "GC language" per se, but you can write custom allocators for STL containers (eg here: https://betterprogramming.pub/c-memory-pool-and-small-object...).


Usually we mainly run jobs on NFS or similar network disks, where I/O time would be more significant. Would be nice to run those tests on AWS.


I have a hunch that rewriting the program in C/C++ or Rust would change this significantly.


Writing this in Perl will have significantly different results. Can’t speak for Go but text processing in Python is slow.


Not necessarily, considering that he wrote that the Go version tries to avoid memory allocations.


Does anyone have a good resource for reasoning about how to avoid allocations in JavaScript?


The problem here, as with most interview problems, is that it is wholly dissociated from its context; memory constraints, disk I/O, and file size are non-trivial considerations. Shame on people for stare-at-one's-shoes thinking.


People can make reasonable assumptions for interview problems. In this case, assume the code runs on a modern developer laptop, with an SSD and lots of RAM, using some modern OS, is a fine assumption.

If someone asks you a physics question about a car being dropped into the ocean, generally you don't have to worry about the make and model of the car.


But if you read the article, you'd see that you'd run afoul of this interviewer by doing that, because he has utterly convinced himself that I/O is no longer the bottleneck, and coming to the same conclusion is your unspecified task in the interview.


If you have to handle millions of I/Os or files, or GB/s to TB/s of bandwidth, just use DAOS.io storage. Awesomely fast, scale-out, and self-healing.


Can we just give it more resources? /s


Does any of this really matter?


It would be nice to see this benchmark in Node.js with streams


Loading 1GB json files is still slow.


That's usually not because of IO but because of dumb memory management in the JSON parser implementation. Some JSON parsers build a huge tree out of individually allocated items for each JSON property, or in higher level languages, build an object/property tree out of the JSON structure. That's guaranteed to be slow for large trees.


Not if you use a SAX-like (or "evented") parser.
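
For example, a rough Go sketch of the streaming approach using encoding/json's token API (file name made up): memory stays flat because no object/property tree is ever built.

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("big.json") // hypothetical 1 GB input
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Decoder.Token streams one token at a time instead of
        // materializing the whole document in memory.
        dec := json.NewDecoder(f)
        var tokens int
        for {
            if _, err := dec.Token(); err != nil { // io.EOF (or a parse error) ends the loop
                break
            }
            tokens++
        }
        fmt.Println("tokens:", tokens)
    }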


If you specify the type of the data, not just the size, then yep: it wasn't really about the number of bytes read, but what the CPU had to do to process it.


I don't want to be that guy but having 1GB json files to begin with seems like the bigger problem.


It's true, I/O is no longer the bottleneck.

The bottleneck now is interviewers who think they have the knowledge and expertise, but do not. However their authoritative position ends up distorting everything, and then said person blogs about it and causes even more problems.


"Don't be snarky." - https://news.ycombinator.com/newsguidelines.html

If you know more than someone else, that's great - but in that case please share some of what you know, so the rest of us can learn. Just putting another person down doesn't help.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...


OMG what did I miss, please expand. What is being distorted and what problems are caused?


Most people don't build their programs for local NVMe drives, but for the cloud. Try benchmarking in a regular AWS/GCP instance using the default SSD and you will see that what the interviewees said is true, as they are likely not building it to run on "your" laptop.


Holy god what a burn! I absolutely agree though.


It's still very bizarre to me to see people completely write off spinning platter drives as 'old tech'. They're still used everywhere! (At least in my world)

We have all of my team's data on an array of 18 TB drives in a 100 TB RAID 10 setup, and a NAS at home doing the same, etc. Even some of our OS drives at work are 7200 RPM drives - and we're a computational lab. Why is everyone so intent that these effectively no longer exist? The cost of a decent amount of space on NVMe drives is just far too astronomical. We're not all millionaires.


Spinning rust is still by far the cheapest way to store bulk data, and it is fast enough for infrequent access. It's perfectly fine for archival purposes. Large disks are $0.02 / GB while large SSDs are closer to $0.07 / GB.

On the other hand, ignoring my media collection, both my personal computer and server only need a few hundred GB of storage. SSDs are cheap enough that they are a no-brainer: they are physically a lot smaller, quieter, and more resilient. The orders-of-magnitude better IO speed doesn't hurt either, even if most of it is wasted on me.

1TB NVMe drives are now less than $75. I could get a 1TB HDD for $40 or a 3TB one for $75, but why bother?


Even 2TB SSDs can be bought for < $130 now. Plus, consumer motherboards usually have two M.2 slots, so you can have 4TB of ultra-fast storage in a normal consumer-level computer without PCIe expansion cards. That's overkill for most people; most consumers probably don't even need that much storage space.


until they start shooting video or finetuning stable diffusion or something

video and archived neural net snapshots are pretty sequential


I don't think you need an NVMe SSD just to play video; for that kind of usage, a SATA SSD is good enough. And neural net snapshots...? Do most people even do AI?


yeah, that's what i'm saying about video, raw capacity matters more than bandwidth, and below the 100ms level, random seek latency hardly matters at all for video

right now most people don't use neural net snapshots but 25 years ago most people didn't use 3-d rendering, encryption, or video codecs either


I don't see anywhere in this article where spinning disks are written off.

It is sensible to consider the I/O speed of things other than spinning disks. Especially when they are exceptionally commonplace in consumer devices and developer workstations.


They're not used everywhere; most desktops and servers are now based on SSDs or equivalent.

Your use case of running a desktop with 100TB is niche; for $100 you can get a very fast 1TB NVMe drive nowadays, which is fine for 99.99% of the population.


>a 100TB raid10 setup, and a NAS at home doing the same

How do you deal with these onerous constraints? Do you have a system for deciding how many copies of archive.org to keep locally?


I know you're being facetious, but here's some fun stats from archive.org.

A few highlights from the Petabox storage system:

No Air Conditioning, instead use excess heat to help heat the building.

Raw Numbers as of December 2021:

4 data centers, 745 nodes, 28,000 spinning disks

Wayback Machine: 57 PetaBytes

Books/Music/Video Collections: 42 PetaBytes

Unique data: 99 PetaBytes

Total used storage: 212 PetaBytes

Considering this data is from a year ago, it's got to be substantially larger now.

Link: https://archive.org/web/petabox.php


I'm not sure what exactly you mean - but we typically store DNA sequencing data from people that want us to analyze it, databases of proteins and scientific articles, ML models, etc. Proper considerations of how to store the data and compress it correctly help a lot, but unfortunately academics are pretty terrible at choosing the right data structures for the job, and often several copies are needed for mirroring the raw data, then the processed versions are updated.


Spinning disks are fine for some workloads. However, I've definitely seen teams engineering around slower spinning disks. NVMe isn't egregiously expensive, and if teams need to spend too much time thinking about disk speed, it's probably cheaper in the long run to switch.


I think you know the answer to your question, which is that normal end users who build PCs basically never use them. Power users with NAS and huge porn collections and servers, of course they still do.


Most people are conscious that they're making a performance trade off when they decide to store data in one.


And there is still demand; otherwise they wouldn't still be produced.


Picking something counterintuitive as an interview question is not a great idea. It defeats the purpose - it becomes harder to tell whether the candidate is going with conventional wisdom because that's what they think the answer is, or because they think that's the answer the interviewer expects.

i.e. you could get sharp candidates who know the correct technical answer but intentionally give the wrong one, because they rightly conclude that, statistically, the odds are the interviewer is on the conventional-wisdom wavelength.

Could be good if you're specifically looking for someone good at mind games, I guess.


And then compound that by your counterintuitive assertion being generally wrong in most useful cases. He is either interviewing very junior candidates or frustrating a lot of people.



