
A bit of curiosity: how did Postgres choose 8k pages? Shouldn't it be the FS page size to help with atomicity?



Depends very much on how the SSDs are designed internally. I think these days we have to settle for "can never be sure" of the real page size for atomicity. Pages can also become corrupt in other ways.

It is weird that "--data-checksums" isn't the default for new databases, even if it costs a bit in performance. Integrity should be more important than performance.


Was thinking the same thing when I saw those zeros in the checksum field. Perhaps the consequences are significant.
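For anyone who wants to check for those zeros themselves, here is a rough sketch (my own, not from the docs) that reads the first block of a relation file and prints the pd_checksum field. It assumes the default 8 kB block size and the standard PageHeaderData layout, where pd_checksum is the 2-byte field at offset 8, right after the 8-byte pd_lsn; the relation file path is a made-up placeholder.

    /*
     * Rough sketch: peek at the checksum field of a heap page on disk.
     * Assumes BLCKSZ = 8192 and the standard PageHeaderData layout
     * (pd_checksum at offset 8). With data checksums disabled the
     * field stays zero. The file path is a hypothetical placeholder.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    int
    main(void)
    {
        unsigned char page[BLCKSZ];
        uint16_t checksum;
        FILE *f = fopen("base/16384/16385", "rb");   /* hypothetical heap file */

        if (f == NULL || fread(page, 1, BLCKSZ, f) != BLCKSZ)
        {
            perror("read block 0");
            return 1;
        }
        memcpy(&checksum, page + 8, sizeof(checksum));  /* pd_checksum */
        printf("pd_checksum of block 0: 0x%04x\n", (unsigned) checksum);
        fclose(f);
        return 0;
    }

In practice, "SHOW data_checksums;" in psql or the pg_checksums utility is the easier way to tell whether checksums are enabled cluster-wide.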

Here's a benchmarking exercise I found: https://www-staging.commandprompt.com/uploads/images/Command...

With a tidy summary:

> Any application with a high shared buffers hit ratio: little difference.

> Any application with a high ratio of reads/writes: little difference.

> Data logging application with a low ratio of reads/inserts, and few updates and deletes: little difference.

> Application with an equal ratio of reads/inserts, or many updates or deletes, and a low shared buffers hit ratio (for example, an ETL workload), especially where the rows are scattered among disk pages: expect double or greater CPU and disk I/O use.

> Run pg_dump on a database where all rows have already been previously selected by applications: little difference.

> Run pg_dump on a database with large quantities of rows inserted to insert-only tables: expect roughly double CPU and disk I/O use.


On my M1 Mac, "dd ... | cksum" takes 3 seconds while "dd | shasum" (sha1) takes 2 seconds. So cksum might not be the best tool for a performance check.

There is CPU-specific code in the PG source in src/include/storage/checksum_impl.h.

It is written as a plain nested loop in C, so performance depends entirely on the compiler being able to parallelize or vectorize the code.

I would not be surprised if manually written SIMD code would be faster.
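For context, here is a simplified sketch of the structure used in checksum_impl.h: an FNV-1a-style hash run over many independent lanes precisely so the inner loop can be auto-vectorized. This is a from-memory approximation, not the actual source; the real code seeds each lane with a fixed base value, adds extra mixing rounds, and mixes the block number in before truncating to 16 bits for the page header.

    /*
     * Simplified sketch (not the actual PostgreSQL source) of the shape of
     * pg_checksum_block(): a modified FNV-1a hash over N_SUMS parallel lanes.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define BLCKSZ    8192
    #define N_SUMS    32
    #define FNV_PRIME 16777619u

    /* One round of the modified FNV-1a hash, applied per 32-bit word. */
    #define CHECKSUM_COMP(sum, value) \
        do { \
            uint32_t tmp_ = (sum) ^ (value); \
            (sum) = (tmp_ * FNV_PRIME) ^ (tmp_ >> 17); \
        } while (0)

    static uint32_t
    checksum_block_sketch(const uint32_t *page)   /* BLCKSZ bytes of page data */
    {
        uint32_t sums[N_SUMS] = {0};              /* the real code uses fixed seeds */
        uint32_t result = 0;

        /*
         * The plain nested loop: lane j consumes every N_SUMS-th 32-bit word.
         * Whether the compiler vectorizes this inner loop decides throughput.
         */
        for (uint32_t i = 0; i < BLCKSZ / (sizeof(uint32_t) * N_SUMS); i++)
            for (uint32_t j = 0; j < N_SUMS; j++)
                CHECKSUM_COMP(sums[j], page[i * N_SUMS + j]);

        /* XOR-fold the partial sums into a single 32-bit value. */
        for (uint32_t j = 0; j < N_SUMS; j++)
            result ^= sums[j];
        return result;
    }

    int
    main(void)
    {
        uint32_t page[BLCKSZ / sizeof(uint32_t)];
        for (uint32_t i = 0; i < BLCKSZ / sizeof(uint32_t); i++)
            page[i] = i;                          /* dummy page contents */
        printf("sketch checksum = 0x%08x\n", checksum_block_sketch(page));
        return 0;
    }

Built with -O2/-O3 and a suitable -march, gcc and clang can usually auto-vectorize that inner loop, which is exactly why hand-written SIMD may or may not win.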


The bottleneck isn't the checksum computation itself at all. It's that to keep checksums valid we need to protect against the possibility of torn pages even in cases where it wouldn't matter without checksums (i.e. where just individual bits are flipped). That in turn means we need to WAL-log changes we wouldn't need to without checksums - which can be painful.


Interesting. I guess M1 doesn't have the 'crc32' "acceleration" that is included in SSE4.2.



So when using these intrinsics an Intel Core i7 can do 30 GB/s, but the performance check linked above (by isosphere) says only 300 MB/s, i.e. about 1% of that.

Something is amiss here.

If a CPU can do 30 GB/s then a CRC check should not have any real performance impact.
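For comparison, here is a minimal sketch of a loop built on that SSE4.2 instruction (via the _mm_crc32_u64 / _mm_crc32_u8 intrinsics); the function name and test buffer are mine. Note the instruction computes CRC-32C (Castagnoli), a different polynomial than cksum's CRC-32, so the values are not comparable; the point is only the per-core throughput the instruction can sustain.

    /*
     * Minimal sketch of a checksum loop on the SSE4.2 crc32 instruction.
     * Build with e.g.: gcc -O2 -msse4.2 crc32c_demo.c
     */
    #include <nmmintrin.h>   /* SSE4.2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t
    crc32c_hw(uint32_t crc, const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint64_t c = crc;

        while (len >= 8)             /* 8 bytes per instruction */
        {
            uint64_t word;
            memcpy(&word, p, 8);     /* avoid unaligned loads */
            c = _mm_crc32_u64(c, word);
            p += 8;
            len -= 8;
        }
        while (len--)                /* leftover tail bytes */
            c = _mm_crc32_u8((uint32_t) c, *p++);
        return (uint32_t) c;
    }

    int
    main(void)
    {
        static uint8_t block[8192]; /* one PostgreSQL-sized page, zero-filled */
        printf("crc32c = 0x%08x\n", crc32c_hw(~0u, block, sizeof(block)));
        return 0;
    }

Getting to tens of GB/s per core generally also needs several independent CRC streams running in parallel and then combined, which is where the heavily optimized implementations spend their effort.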


I don't know where you're getting 300 MB/s from.


Page 5 of https://www-staging.commandprompt.com/uploads/images/Command... says "This system can checksum data at about 300 MB/s per core."

The document lacks page numbers; page 5 is the first page with a gray box at the top.


That's measuring 'cksum', which must have an awfully slow implementation. The document notes that this is distinct from measuring PG's checksum performance. (I think it's a pretty useless measurement.)

Earlier (page 4):

> How much CPU time does it take to checksum...

> ...a specific amount of data? This is easy to estimate because PostgreSQL uses the crc32 algorithm which is very simple, and (GNU) Linux has a command line program that does the same thing: cksum.

Yeah, using cksum as an estimate here appears to be very flawed.


That is weird. Seems like crc optimization is quite a rabbit hole.

https://github.com/komrad36/CRC has a massive section about it in the README. Really interesting.


Yeah. crc32 may be simple in theory, but doing it as fast as possible utilizing the various execution units of modern hardware is challenging.


I'm not sure how much bearing internal storage organization should have on Postgres' page size. Since pg explicitly chooses not to implement its own storage organization layer, there's always a filesystem between a pg database and the underlying storage.


> Integrity should be more important than performance.

Most often it is. But not always. There certainly are cases where speed is far more important than integrity in databases. I cannot think of a case where this would be true for an RDBMS or even a document DB (though MongoDB had different opinions on this...).

But for e.g. Redis as a caching server, or memcached, or even the non-normalized data I keep in PG that can easily be reproduced from other sources in case of corruption or staleness: it's fine to trade integrity for speed there.


If physical integrity is already provided by the backing filesystem, such as ZFS, wouldn't --data-checksums be redundant?


If data is served from the ARC, its primary cache, ZFS does not perform a checksum check before handing it to you, as the data was only checked when it was read into the ARC.

If you use ECC you're quite safe, but standard ECC can only correct single-bit and detect double-bit errors; larger multi-bit errors can slip through.

So if you care much about your integrity, you might want Postgres to do its checksum check as well.


If you somehow have 3+ bit errors coming out of RAM on an ECC board, you have much bigger problems than trying to verify your Postgres data via checksum.


AFAIK RowHammer attacks can cause multiple bits to flip in a row[1], no?

But sure, it's not a concern for the vast majority of folks.

[1]: https://www.vusec.net/projects/eccploit/


Rowhammer would qualify as a bigger problem, yes.


The 8k size long predates SSDs and 4k sectors.


8k is a very common page size, but 4k isn't unheard of. Oracle's default is 4k.

The issue is that page size caps row size (for on-row storage). Also, if you have a smart clustering index, larger pages can make more efficient use of index addressing. So it's a trade-off.


Oracle defaults to 8k as well:

https://docs.oracle.com/en/database/oracle/oracle-database/1...

> Default value 8192


Historically there was no atomicity at 4k boundaries, just at 512-byte boundaries (sectors). That'd have been too limiting. Lowering the limit now would prove problematic due to the smaller row sizes / lower number of columns it would allow.


I guess it's a tradeoff. Ideally you'd want a small enough page size such that pages are unlikely to be split across multiple LBAs, but the performance wouldn't be good. Standard filesystems don't really guarantee LBA alignment anyway.



