-- Read the CSV directly through a foreign table instead of loading it first
CREATE EXTENSION file_fdw;
CREATE SERVER stations FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE records (
    station_name text,
    temperature float
) SERVER stations OPTIONS (filename 'path/to/file.csv', format 'csv', delimiter ';');
SELECT station_name, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
FROM records
GROUP BY station_name
ORDER BY station_name;
I just modified the original post to add the file_fdw example. Again, neither instance (PG or ClickHouse) was optimised for the workload: https://ftisiot.net/posts/1brows/
In the first one, the table is dropped, recreated, populated and queried (a sketch follows below).
In the second example, the table is created as a foreign table via the file FDW pointing at the CSV file.
In both examples the loading time is included in the total time.
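For reference, here is a minimal sketch of what that first approach could look like, assuming a plain table with the same name and columns and the same placeholder CSV path as above; the original post's exact statements may differ:

-- Hypothetical sketch of the first approach: drop, recreate, bulk-load, then query
DROP TABLE IF EXISTS records;
CREATE TABLE records (
    station_name text,
    temperature float
);
-- Bulk-load the CSV into the local table (placeholder path, ';' delimiter as above)
COPY records FROM 'path/to/file.csv' WITH (FORMAT csv, DELIMITER ';');
-- Then run the same aggregation as in the file_fdw example
SELECT station_name, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
FROM records
GROUP BY station_name
ORDER BY station_name;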
Using FDWs on a daily basis, I fully realize their power and appeal, but this exclamation made me pause and think: is this really how we think today? That reading a CSV file directly is a cool feature, state of the art? Sure, FDWs are much more than that, but I would assume we could achieve far more with machine learning, and not just the current wave of LLMs.
Why not have the machine consider the data it is currently seeing (types, even actual values), think about what end-to-end operation is required and how often it needs to be repeated, make a time estimate (then verify the estimate, adjust it for the next run if needed, and keep a history for future use), and choose one of the methods at its disposal (index autogeneration, conversion of raw data, denormalization, efficient allocation of the memory hierarchy, ...)? Yeah, I'm not focusing on this specific one billion rows challenge, but rather on what computers today should be able to do for us.