It would be a more interesting challenge if the data file were larger than the memory. I would love to see what people would come up with on some baby VM with 512 MB of RAM.
Even more interesting would be small RAM, little local storage, and a large file only available over the network. I would like to see something other than HTTP, but realistically it would be HTTP.
As the file-to-memory ratio changes, the problem becomes more and more a stream-processing problem, right? And if the number of cities becomes too large to keep in memory, it turns into a database problem: a "let's see who can find an index data structure better suited to these I/O patterns and hardware" game.
Right, it's the five times in a row thing that makes an in-memory solution faster. Otherwise, this is a purely sequential one-pass problem, which is how you'd do it in practice.
Parallelism with edge effects is pretty common. Weather simulation, finite element analysis, and big-world games all have that issue. The middle of each cell is local, but you have to talk to the neighbor cells a little.
You wouldn't want to do this for a huge file. A very fast solution would use a small number of buffers and io_uring (or equivalent), keeping the page table and cache footprint small.
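For concreteness, here is a minimal sketch of that pattern using liburing: a small fixed pool of buffers kept in flight, each one resubmitted at the next offset as its completion arrives. The chunk size, queue depth, and the byte-counting stand-in for real parsing are illustrative choices, not tuned values:

```cpp
// Sketch: sequential file scan with liburing, a small fixed pool of buffers,
// and completions reaped as they arrive. CHUNK and DEPTH are illustrative,
// not tuned. Error handling is minimal.
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    constexpr size_t CHUNK = 1 << 20;  // 1 MiB per read
    constexpr unsigned DEPTH = 8;      // reads kept in flight

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    io_uring ring;
    if (io_uring_queue_init(DEPTH, &ring, 0) != 0) { perror("queue_init"); return 1; }

    std::vector<std::vector<char>> bufs(DEPTH, std::vector<char>(CHUNK));
    off_t next_offset = 0;
    unsigned in_flight = 0;
    bool stop_submitting = false;

    // Prime the ring with one read per buffer.
    for (unsigned i = 0; i < DEPTH; ++i) {
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i].data(), CHUNK, next_offset);
        io_uring_sqe_set_data(sqe, &bufs[i]);
        next_offset += CHUNK;
        ++in_flight;
    }
    io_uring_submit(&ring);

    size_t total = 0;
    while (in_flight > 0) {
        io_uring_cqe* cqe;
        if (io_uring_wait_cqe(&ring, &cqe) != 0) break;
        auto* buf = static_cast<std::vector<char>*>(io_uring_cqe_get_data(cqe));
        int n = cqe->res;                  // bytes read, 0 at EOF, <0 on error
        io_uring_cqe_seen(&ring, cqe);
        --in_flight;
        if (n <= 0) { stop_submitting = true; continue; }  // stop queueing past EOF

        total += size_t(n);                // stand-in for real parsing of buf->data()[0..n)

        if (!stop_submitting) {            // reuse this buffer for the next chunk
            io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf->data(), CHUNK, next_offset);
            io_uring_sqe_set_data(sqe, buf);
            next_offset += CHUNK;
            ++in_flight;
            io_uring_submit(&ring);
        }
    }

    printf("read %zu bytes\n", total);
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

Note that completions can arrive out of order, so real parsing has to deal with lines straddling chunk boundaries; that comes up further down the thread.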
Yeah, so I had a discussion on Twitter about this. It turns out 12 GB is small enough to fit into memory, and the author evaluates submissions by running a solution 5 times in a row, so using direct I/O actually hurts: the kernel page cache is what keeps the file in memory for the 4 runs after the first. I have a direct I/O solution with SIMD string search and double parsing, just in C++ (using libraries). It runs in 6 seconds on my 24-core Linux box (NVMe).
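For anyone curious what "SIMD string search" means here, a rough illustration of the idea (a generic AVX2 find-a-byte helper I'm sketching, not the library code the parent actually uses; it assumes gcc/clang and a build with -mavx2):

```cpp
// Sketch: find the first occurrence of byte `c` in [p, p + n) using 256-bit
// compares, with a scalar memchr for the tail. Illustrative only.
#include <immintrin.h>
#include <cstddef>
#include <cstring>

inline const char* find_byte_avx2(const char* p, size_t n, char c) {
    const __m256i needle = _mm256_set1_epi8(c);
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i block = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + i));
        unsigned mask = unsigned(_mm256_movemask_epi8(_mm256_cmpeq_epi8(block, needle)));
        if (mask) return p + i + __builtin_ctz(mask);   // lowest set bit = first match
    }
    return static_cast<const char*>(memchr(p + i, c, n - i));   // scalar tail
}
```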
In case you haven't noticed yet, the input format guarantees exactly one fractional digit, so you can read a single signed integer followed by `.` and one digit instead.
Yeah, I missed this originally; things could be faster with this assumption, without a full double parser. The fastest Java solutions do some nearly branchless decoding for this.
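A rough sketch of that idea (the helper name and interface are mine, not taken from any particular submission): treat the value as an integer number of tenths and only divide by 10 at output time.

```cpp
// Sketch: parse a 1BRC temperature such as "-12.3" or "4.7" into integer
// tenths (-123, 47), relying on the guaranteed format: optional '-', one or
// two integer digits, '.', exactly one fractional digit. Returns a pointer
// just past the parsed value.
#include <cstdint>

inline const char* parse_tenths(const char* p, int32_t& out) {
    int32_t sign = 1;
    if (*p == '-') { sign = -1; ++p; }
    int32_t v = *p++ - '0';                     // first integer digit
    if (*p != '.') v = v * 10 + (*p++ - '0');   // optional second digit
    ++p;                                        // skip '.'
    v = v * 10 + (*p - '0');                    // the single fractional digit
    out = sign * v;
    return p + 1;
}
```

Keeping the per-station min/max/sum as integers also keeps floating-point work out of the hot loop.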
So you are basically at the mercy of the OS caching algorithm. That sounds like a bad plan for a benchmark. You are not measuring what you think you are (your code), you are measuring the OS caching policy.
Dumb question: with io_uring, how do you handle lines that straddle chunk boundaries? I'm asking since, AFAIU, the submitted requests are not guaranteed to complete in order.
(The easiest approach I can think of is submitting reads for "overlapped" chunks. I'm not sure there is an easier way, or how much performance overhead it adds.)
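The overlapped-chunk approach can be made completely order-independent with a simple ownership rule: a chunk parses every line that starts after its begin offset and at or before its end offset (chunk 0 also owns the line at offset 0), and it over-reads past its end by more than the longest possible line so the straddling line is fully in its buffer. A sketch using plain pread for readability, assuming the caller splits the file so one chunk's end is the next chunk's begin; the same offsets and lengths would go into the io_uring reads:

```cpp
// Sketch: each chunk covers [begin, end) of the file but reads up to
// end + OVERLAP. Ownership rule: this chunk parses every line that starts
// at an offset s with begin < s <= end (chunk 0 also parses the line at
// offset 0). With OVERLAP larger than the longest possible line, every
// line is parsed by exactly one chunk, regardless of completion order.
#include <unistd.h>
#include <algorithm>
#include <cstring>
#include <string_view>
#include <vector>

constexpr off_t OVERLAP = 128;   // must exceed the longest line incl. '\n'

void process_chunk(int fd, off_t begin, off_t end, off_t file_size) {
    size_t want = size_t(std::min(end + OVERLAP, file_size) - begin);
    std::vector<char> buf(want);
    ssize_t n = pread(fd, buf.data(), want, begin);
    if (n <= 0) return;

    const char* p = buf.data();
    const char* hard_end = buf.data() + n;
    const char* nominal_end = buf.data() + std::min<off_t>(end - begin, n);

    if (begin != 0) {
        // The line straddling `begin` (or ending exactly at it) belongs to
        // the previous chunk: skip up to and including the first newline.
        p = static_cast<const char*>(memchr(p, '\n', size_t(hard_end - p)));
        if (!p) return;
        ++p;
    }
    while (p < hard_end && p <= nominal_end) {   // line starts at or before `end`
        const char* nl = static_cast<const char*>(memchr(p, '\n', size_t(hard_end - p)));
        if (!nl) nl = hard_end;
        std::string_view line(p, size_t(nl - p));
        // ... hand `line` to the real station/temperature parser ...
        p = nl + 1;
    }
}
```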
What is the downside of memory mapping in this scenario? Shouldn't the page table handle a single sequential read over a range of pages just fine? Accessing the contents of a file doesn't seem like something where caching would matter much. Or do you mean that reading sequential pages keeps adding to the page cache, compared to reusing a single buffer? That seems like the same situation as before: those pages will be the first things evicted from the cache anyway.
Caching and paging almost always matter, even on things like this. The core problem is that the filesystem won't prefetch for you, so you end up waiting on page faults many times over the length of the file. Another problem with a working set that size is that you incur many slow calls to (essentially) malloc in the kernel to hold all of that data, whereas a small, preallocated set of buffers avoids that entirely.
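Not disputing the preallocated-buffer point, but if someone does want to stay with mmap, the usual mitigation is to tell the kernel the access pattern is sequential. These calls are purely advisory and their effect varies by kernel, so treat this as a sketch rather than a fix:

```cpp
// Sketch: map the file and hint the kernel that access will be sequential,
// so it can read ahead (and drop pages behind). Purely advisory.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

const char* map_file_sequential(int fd, size_t& len_out) {
    struct stat st;
    if (fstat(fd, &st) != 0) return nullptr;
    len_out = size_t(st.st_size);

    void* p = mmap(nullptr, len_out, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return nullptr;

    madvise(p, len_out, MADV_SEQUENTIAL);   // aggressive readahead hint
    // The equivalent hint for plain read()/pread() is:
    //   posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    return static_cast<const char*>(p);
}
```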