It would be a more interesting challenge if the data file were larger than the memory. I would love to see what people would come up with on some baby VM with 512 MB of RAM.
Even more interesting would be small RAM, little local storage, and a large file only available over the network. I would like to see something other than HTTP, but realistically it would be HTTP.
As the file-to-memory ratio changes, the problem becomes more and more a stream-processing problem, right? And if the number of cities becomes too large to keep in memory, it turns into a database problem: a "let's see who can find an index data structure better suited to these I/O patterns and hardware" game.
Right, it's the five times in a row thing that makes an in-memory solution faster. Otherwise, this is a purely sequential one-pass problem, which is how you'd do it in practice.
Parallelism with edge effects is pretty common. Weather simulation, finite element analysis, and big-world games all have that issue. The middle of each cell is local, but you have to talk to the neighbor cells a little.
You wouldn't want to do this for a huge file. A very fast solution would use a small number of buffers and io_uring (or equivalent), keeping the page table and cache footprint small.
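For concreteness, here is a minimal sketch of that pattern using liburing: a small fixed pool of buffers kept in flight, each one resubmitted at the next offset as its completion arrives. The chunk size, queue depth, and the byte-counting stand-in for real parsing are illustrative choices, not tuned values:

```cpp
// Sketch: sequential file scan with liburing, a small fixed pool of buffers,
// and completions reaped as they arrive. CHUNK and DEPTH are illustrative,
// not tuned. Error handling is minimal.
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    constexpr size_t CHUNK = 1 << 20;  // 1 MiB per read
    constexpr unsigned DEPTH = 8;      // reads kept in flight

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    io_uring ring;
    if (io_uring_queue_init(DEPTH, &ring, 0) != 0) { perror("queue_init"); return 1; }

    std::vector<std::vector<char>> bufs(DEPTH, std::vector<char>(CHUNK));
    off_t next_offset = 0;
    unsigned in_flight = 0;
    bool stop_submitting = false;

    // Prime the ring with one read per buffer.
    for (unsigned i = 0; i < DEPTH; ++i) {
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i].data(), CHUNK, next_offset);
        io_uring_sqe_set_data(sqe, &bufs[i]);
        next_offset += CHUNK;
        ++in_flight;
    }
    io_uring_submit(&ring);

    size_t total = 0;
    while (in_flight > 0) {
        io_uring_cqe* cqe;
        if (io_uring_wait_cqe(&ring, &cqe) != 0) break;
        auto* buf = static_cast<std::vector<char>*>(io_uring_cqe_get_data(cqe));
        int n = cqe->res;                  // bytes read, 0 at EOF, <0 on error
        io_uring_cqe_seen(&ring, cqe);
        --in_flight;
        if (n <= 0) { stop_submitting = true; continue; }  // stop queueing past EOF

        total += size_t(n);                // stand-in for real parsing of buf->data()[0..n)

        if (!stop_submitting) {            // reuse this buffer for the next chunk
            io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf->data(), CHUNK, next_offset);
            io_uring_sqe_set_data(sqe, buf);
            next_offset += CHUNK;
            ++in_flight;
            io_uring_submit(&ring);
        }
    }

    printf("read %zu bytes\n", total);
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

Note that completions can arrive out of order, so real parsing has to deal with lines straddling chunk boundaries; that comes up further down the thread.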
Yeah, so I had a discussion on Twitter about this. It turns out 12 GB is small enough to fit into memory, and the author evaluates submissions by running a solution 5 times in a row, so using direct I/O actually hurts: the kernel page cache is what keeps the file in memory for the 4 runs after the first. I have a direct I/O solution with SIMD string search and double parsing, just in C++ (using libraries). It runs in 6 seconds on my 24-core Linux box (NVMe).
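For anyone curious what "SIMD string search" means here, a rough illustration of the idea (a generic AVX2 find-a-byte helper I'm sketching, not the library code the parent actually uses; it assumes gcc/clang and a build with -mavx2):

```cpp
// Sketch: find the first occurrence of byte `c` in [p, p + n) using 256-bit
// compares, with a scalar memchr for the tail. Illustrative only.
#include <immintrin.h>
#include <cstddef>
#include <cstring>

inline const char* find_byte_avx2(const char* p, size_t n, char c) {
    const __m256i needle = _mm256_set1_epi8(c);
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i block = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + i));
        unsigned mask = unsigned(_mm256_movemask_epi8(_mm256_cmpeq_epi8(block, needle)));
        if (mask) return p + i + __builtin_ctz(mask);   // lowest set bit = first match
    }
    return static_cast<const char*>(memchr(p + i, c, n - i));   // scalar tail
}
```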
In case you haven't noticed yet, the input format guarantees exactly one fractional digit, so you can read a single signed integer followed by `.` and one digit instead.
Yeah, I missed this originally; things could be faster with this assumption, without a full double parser. The fastest Java solutions do some nearly branchless decoding for this.
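A rough sketch of that idea (the helper name and interface are mine, not taken from any particular submission): treat the value as an integer number of tenths and only divide by 10 at output time.

```cpp
// Sketch: parse a 1BRC temperature such as "-12.3" or "4.7" into integer
// tenths (-123, 47), relying on the guaranteed format: optional '-', one or
// two integer digits, '.', exactly one fractional digit. Returns a pointer
// just past the parsed value.
#include <cstdint>

inline const char* parse_tenths(const char* p, int32_t& out) {
    int32_t sign = 1;
    if (*p == '-') { sign = -1; ++p; }
    int32_t v = *p++ - '0';                     // first integer digit
    if (*p != '.') v = v * 10 + (*p++ - '0');   // optional second digit
    ++p;                                        // skip '.'
    v = v * 10 + (*p - '0');                    // the single fractional digit
    out = sign * v;
    return p + 1;
}
```

Keeping the per-station min/max/sum as integers also keeps floating-point work out of the hot loop.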
So you are basically at the mercy of the OS caching algorithm. That sounds like a bad plan for a benchmark. You are not measuring what you think you are (your code), you are measuring the OS caching policy.
Dumb question: with io_uring, how do you handle lines that straddle chunk boundaries? I'm asking since, AFAIU, the submitted requests are not guaranteed to complete in order.
(The easiest approach I can think of is submitting reads for "overlapped" chunks. I'm not sure there is an easier way, or how much performance overhead it adds.)
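The overlapped-chunk approach can be made completely order-independent with a simple ownership rule: a chunk parses every line that starts after its begin offset and at or before its end offset (chunk 0 also owns the line at offset 0), and it over-reads past its end by more than the longest possible line so the straddling line is fully in its buffer. A sketch using plain pread for readability, assuming the caller splits the file so one chunk's end is the next chunk's begin; the same offsets and lengths would go into the io_uring reads:

```cpp
// Sketch: each chunk covers [begin, end) of the file but reads up to
// end + OVERLAP. Ownership rule: this chunk parses every line that starts
// at an offset s with begin < s <= end (chunk 0 also parses the line at
// offset 0). With OVERLAP larger than the longest possible line, every
// line is parsed by exactly one chunk, regardless of completion order.
#include <unistd.h>
#include <algorithm>
#include <cstring>
#include <string_view>
#include <vector>

constexpr off_t OVERLAP = 128;   // must exceed the longest line incl. '\n'

void process_chunk(int fd, off_t begin, off_t end, off_t file_size) {
    size_t want = size_t(std::min(end + OVERLAP, file_size) - begin);
    std::vector<char> buf(want);
    ssize_t n = pread(fd, buf.data(), want, begin);
    if (n <= 0) return;

    const char* p = buf.data();
    const char* hard_end = buf.data() + n;
    const char* nominal_end = buf.data() + std::min<off_t>(end - begin, n);

    if (begin != 0) {
        // The line straddling `begin` (or ending exactly at it) belongs to
        // the previous chunk: skip up to and including the first newline.
        p = static_cast<const char*>(memchr(p, '\n', size_t(hard_end - p)));
        if (!p) return;
        ++p;
    }
    while (p < hard_end && p <= nominal_end) {   // line starts at or before `end`
        const char* nl = static_cast<const char*>(memchr(p, '\n', size_t(hard_end - p)));
        if (!nl) nl = hard_end;
        std::string_view line(p, size_t(nl - p));
        // ... hand `line` to the real station/temperature parser ...
        p = nl + 1;
    }
}
```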
What is the downside of memory mapping in this scenario? Shouldn't the page table handle a single sequential read over a range of pages just fine? Accessing the contents of a file doesn't seem like something where caching would matter much. Or do you mean that reading sequential pages keeps adding to the page cache, compared to reusing a single buffer? That seems like the same situation as before: those pages will be the first things evicted from the cache anyway.
Caching and paging almost always matter, even on things like this. The core problem is that the filesystem won't prefetch for you, so you end up waiting on page faults many times over the length of the file. Another problem with a working set that size is that you incur many slow calls to (essentially) malloc in the kernel to hold all of that data, whereas a small, preallocated set of buffers avoids that entirely.
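Not disputing the preallocated-buffer point, but if someone does want to stay with mmap, the usual mitigation is to tell the kernel the access pattern is sequential. These calls are purely advisory and their effect varies by kernel, so treat this as a sketch rather than a fix:

```cpp
// Sketch: map the file and hint the kernel that access will be sequential,
// so it can read ahead (and drop pages behind). Purely advisory.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

const char* map_file_sequential(int fd, size_t& len_out) {
    struct stat st;
    if (fstat(fd, &st) != 0) return nullptr;
    len_out = size_t(st.st_size);

    void* p = mmap(nullptr, len_out, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return nullptr;

    madvise(p, len_out, MADV_SEQUENTIAL);   // aggressive readahead hint
    // The equivalent hint for plain read()/pread() is:
    //   posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    return static_cast<const char*>(p);
}
```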