
I'd like to see it speed tested against an instance of Postgres using the file Foreign Data Wrapper https://www.postgresql.org/docs/current/file-fdw.html

    CREATE EXTENSION file_fdw;
    CREATE SERVER stations FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE records (
      station_name text,
      temperature float
    ) SERVER stations OPTIONS (filename 'path/to/file.csv', format 'csv', delimiter ';');
    SELECT station_name, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
    FROM records
    GROUP BY station_name
    ORDER BY station_name;
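
If you want to see where the time actually goes, a minimal sketch (assuming the foreign table above has been created) is to wrap the query in EXPLAIN ANALYZE, which separates the Foreign Scan cost (reading and parsing the CSV) from the aggregation itself:

    EXPLAIN ANALYZE
    SELECT station_name, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
    FROM records
    GROUP BY station_name
    ORDER BY station_name;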



Just modified the original post to add the file_fdw example. Again, none of the instances (PG or ClickHouse) were optimised for the workload: https://ftisiot.net/posts/1brows/


>The result is 44.465s!

Very nice! Is this the time for the first run or the second? Is there a big difference between the first and second runs?


First run! I just reran the experiment: ~46 secs on the first run, ~22 secs on the following runs.
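
The gap between the two is most likely the OS page cache: file_fdw re-reads and re-parses the CSV on every query, so the second run benefits from the file already being cached in memory. A quick sketch for reproducing the comparison (assuming the foreign table defined above) is psql's \timing:

    \timing on
    -- cold run: the CSV is read from disk
    SELECT station_name, MIN(temperature), AVG(temperature), MAX(temperature)
    FROM records GROUP BY station_name ORDER BY station_name;
    -- warm run: the file is (at least partly) in the OS page cache
    SELECT station_name, MIN(temperature), AVG(temperature), MAX(temperature)
    FROM records GROUP BY station_name ORDER BY station_name;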


Not including this in the benchmark time is cheating:

    \copy TEST(CITY, TEMPERATURE) FROM 'measurements.txt' DELIMITER ';' CSV;


The loading time is included in both examples.

In the first one, the table is dropped, recreated, populated and queried. In the second example, the table is created via the file FDW pointing at the CSV file. In both examples the loading time is included in the total time.
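
For comparison, a minimal sketch of that first variant with the load step spelled out (table and column names follow the \copy snippet above; they are illustrative, not necessarily the exact ones from the post):

    DROP TABLE IF EXISTS test;
    CREATE TABLE test (city text, temperature float);
    \copy test(city, temperature) FROM 'measurements.txt' DELIMITER ';' CSV
    SELECT city, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
    FROM test
    GROUP BY city
    ORDER BY city;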


For some reason I'm impressed that the previous page in the documentation is https://www.postgresql.org/docs/current/earthdistance.html. Lotta fun stuff in Postgres these days.


Man, Postgres is so cool and powerful.


Using FDWs on a daily basis, I fully appreciate their power and appeal, but this exclamation gave me pause: is this really how we think today? That reading a CSV file directly is a cool, state-of-the-art feature? Sure, FDWs are much more than that, but I would assume we could achieve much more with machine learning, and not even just the current wave of LLMs.

Why not have the machine consider the data it is currently seeing (types, even actual values), think about what end-to-end operation is required and how often it needs to be repeated, make a time estimate (then verify the estimate, adjust it for the next run if needed, and keep a history for future use), and choose one of the methods at its disposal (index autogeneration, conversion of raw data, denormalization, efficient use of the memory hierarchy, ...)? Yeah, I'm not focusing on this specific one-billion-rows challenge, but rather on what computers today should be able to do for us.



