> wow! awesome tips. I knew COPY rocks but didn't realize it would win vs INSERT INTO SELECT FROM !
We (postgres) should fix that at some point... The difference is basically that there's a dedicated path for inserting many tuples at once that COPY often uses but INSERT INTO ... SELECT doesn't. The logic for determining when that optimization is correct (consider e.g. after-insert per-row triggers: the trigger invocation for row N may not yet see row N+1) is specific to COPY right now. We need to generalize it so it can be used in more places.
To be fair, part of the reason the COPY approach is faster is that the generate_series() query uses a fair bit of CPU on its own, and the piped psql invocations let the data generation and data loading run separately. Of course, that's partially paid for by having to serialize/deserialize the data and by handling all the data in four processes.
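For anyone who wants to see the two paths side by side, a minimal comparison could look like the sketch below (the table shape mirrors the benchmark used later in this thread; timings will obviously vary):
\timing on
CREATE UNLOGGED TABLE t (city int2 NOT NULL, temp int2 NOT NULL);
-- row-at-a-time insertion through the regular executor path
INSERT INTO t SELECT (1000*random())::int2, (random()*random()*1100)::int2 FROM generate_series(1, 1e7::int8);
-- dump and reload the same data; COPY FROM goes through the batched multi-insert path (heap_multi_insert)
COPY t TO '/tmp/t.pgcopy' WITH BINARY;
TRUNCATE t;
COPY t FROM '/tmp/t.pgcopy' WITH BINARY;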
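(The piped variant referred to above presumably looked roughly like this; the exact invocation isn't reproduced here. One psql plus its backend generates the data, another psql plus its backend loads it, i.e. four processes touching the data:)
psql -Xq -c "COPY (SELECT (1000*random())::int2 AS city, (random()*random()*1100)::int2 AS temp FROM generate_series(1,1e8::int8)) TO STDOUT WITH BINARY" \
  | psql -Xq -c "COPY temps_int2 FROM STDIN WITH BINARY"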
When doing the COPYs separately to/from a file, it actually takes longer to generate the data than loading the data into an unlogged table.
# COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e8::int8))) TO '/tmp/data.pgcopy' WITH BINARY;
COPY 100000000
Time: 21560.956 ms (00:21.561)
# BEGIN;DROP TABLE IF EXISTS temps_int2; CREATE UNLOGGED TABLE temps_int2 (city int2 NOT NULL, temp int2 NOT NULL); COPY temps_int2 FROM '/tmp/data.pgcopy' WITH BINARY;COMMIT;
BEGIN
Time: 0.128 ms
DROP TABLE
Time: 0.752 ms
CREATE TABLE
Time: 0.609 ms
COPY 100000000
Time: 18874.010 ms (00:18.874)
COMMIT
Time: 229.650 ms
Loading into a logged table is a bit slower, at 20250.835 ms.
> Of course, if we really want to "go fast" then we want parallel loading, which means firing up N postgresql backends and each generate and write the data concurrently to different tables, then each compute a partial summary in N summary tables, and finally merge them together.
With PG >= 16, you need a fair bit of concurrency to hit bottlenecks due to multiple backends loading data into the same table with COPY. On my ~4 year old workstation I reach over 3GB/s, with a bit more work we can get higher. Before that the limit was a lot lower.
If I use large enough shared buffers so that IO does not become a bottleneck, I can load the 1e9 rows fairly quickly in parallel, using pgbench:
c=20; psql -Xq -c "COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e6::int8))) TO '/tmp/data-1e6.pgcopy' WITH BINARY;" -c 'DROP TABLE IF EXISTS temps_int2; CREATE UNLOGGED TABLE temps_int2 (city int2 NOT NULL, temp int2 NOT NULL); ' && time pgbench -c$c -j$c -n -f <( echo "COPY temps_int2 FROM '/tmp/data-1e6.pgcopy' WITH BINARY;" ) -t $((1000/${c})) -P1
real 0m26.486s
That's just 1.2GB/s, because with rows this narrow the bottleneck is the per-row and per-field processing.
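(For reference, "large enough shared buffers" just means sized so the whole unlogged table fits in memory; 1e9 rows of this shape come to a few tens of GB of heap. The exact setting used isn't stated here, but it would be set along these lines, followed by a restart:)
ALTER SYSTEM SET shared_buffers = '64GB';  -- illustrative value only, not the one actually used
-- shared_buffers only takes effect after a server restart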
> The ultimate would be to hack into postgres and skip everything and just write the actual filesystem files in-place using knowledge of the file formats, then "wire in" these files to the database system tables. This normally gets hairy (e.g. TOAST) but with a simple table like this, it might be possible. This is a project I've always wanted to try.
I doubt that will ever be a good idea. For one, the row metadata contains transactional information that would be hard to create correctly outside of postgres. It'd also be too easy to cause issues with corrupted data.
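(Concretely, that transactional metadata lives in hidden system columns on every heap row; you can peek at it like this:)
-- xmin/xmax are the inserting/deleting transaction IDs, ctid is the tuple's physical location
SELECT ctid, xmin, xmax, city, temp FROM temps_int2 LIMIT 3;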
However, there's a lot we could do to speed up data loading further. The parsing that COPY does, uhm, shows signs of iterative development over decades. Absurdly enough, that's where the bottleneck most commonly is right now. I'm reasonably confident that at least a 3-4x gain is possible there without going to particularly extreme lengths. I think there's also at least a not-too-hard 2x for the portion that actually loads the data into the table.
thx! Indeed, after reading the source and sleeping on it, I agree. In fact, it looks like a single-user postgres backend running COPY FROM <filename> BINARY is approximately the same architecture as writing the database files in-place, and of course it includes support for default values, triggers, constraints, TOAST and more.
I've reproduced the speed difference between pg COPY and various other cases.
Some results (middle result of 3 stable runs) from 1.4GB BINARY dump:
echo "drop table if exists tbl; create table tbl(city int2, temp int2);copy tbl FROM '/citydata.bin' binary;" | ./pg/bin/postgres --single -D tmp -p 9999 postgres;
real 0m34.508s
# switching to unlogged table
real 0m30.620s
# hardcoding heap_multi_insert() to be a NOOP (return early if ntuples>100)
# fyi, heap_multi_insert() gets called with n=1000 tuples per call
real 0m11.276s
# hardcoding skip_tuple = true in src/backend/commands/copyfrom.c:1142
real 0m6.894s
# after testing various things
time sh -c "tar cf - citydata.bin | (cd /tmp; tar xf -)"
real 0m2.811s
Note: I tried increasing the blocksize (--with-blocksize) and also MAX_BUFFERED_TUPLES (copyfrom.c:65), but as expected they didn't help; I guess n=1000 tuples already amortizes the overhead.
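(If you want to repeat those two knobs, the rebuild is roughly as below; the exact values I tried aren't listed here, so treat them as examples.)
# --with-blocksize takes the block size in kB (default 8, max 32)
./configure --with-blocksize=32
# MAX_BUFFERED_TUPLES is the #define at copyfrom.c:65 mentioned above; edit it, then rebuild
make -j"$(nproc)" && make install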