
> "Our preliminary work has traced the outage to a damaged database file."

> Speculation: corrupting input, either international or North American? E.G. UTF-8, SQL escape, CSV quoting.

I read it as filesystem corruption from a bad disk, coupled with redundancy that doesn't actually work.




What kind of database are they using, I wonder, to end up with such a spectacular failure?


Why would this be an indictment of any specific database technology? If your disk fails and corrupts the filesystem, you're toast, regardless of what database you are using.


The technology to detect and recover from disk failures does exist. RAID and ZFS, for example.

I would not expect a disk failure to replicate to the backup.


Having worked with critical infrastructure that lacked true in-depth oversight, it wouldn't surprise me if the DR plans were never executed or exercised in a meaningful manner.


This is quite common.

Comprehensive DR testing is really difficult. Many orgs settle for “on paper,” or “in theory” substitutions for real testing.

If they do it right, no problem.

Doing it right, though … there’s the rub …


Yep, and if you ship WAL/transaction logs to standby databases/replicas, corrupt blocks or lost writes in the primary database won't be propagated to the standbys (unlike with OS filesystem or storage-level replication).

Edit: Should add "won't be silently propagated"
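
A minimal sketch of that setup, assuming PostgreSQL (we don't know what this system actually runs; paths and hostnames are placeholders). The point is that the standby replays WAL records rather than copying data-file blocks, so a corrupt page on the primary's disk isn't blindly shipped across:

    # postgresql.conf on the primary: generate and archive WAL
    wal_level = replica
    archive_mode = on
    archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'

    # postgresql.conf on the standby (plus an empty standby.signal file):
    # stream WAL from the primary, fall back to the archive if needed
    primary_conninfo = 'host=primary.example user=replicator'
    restore_command = 'cp /wal_archive/%f %p'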


Neither checks the checksum on every read as that would be performance-prohibitive. So "bad data on drive -> db does something with corrupted data and saves corrupted transformation back to disk" is very much possible, just extremely unlikely.

But they said nothing about it being a bad drive, just a corrupted data file, which very well might be a software bug or operator error.


This is wrong, both ZFS and btrfs verify the checksum on every read.

It's not typically a performance concern because computing checksums is fast on modern hardware. Besides, historically IO was much slower than CPU.
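
On ZFS, for instance, you can see this in action (the pool name "tank" is a placeholder):

    # checksumming is on by default; confirm the property
    zfs get checksum tank

    # walk every allocated block and verify it against its checksum,
    # repairing from redundant copies where possible
    zpool scrub tank

    # report any checksum errors found
    zpool status -v tank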


> Neither checks the checksum on every read as that would be performance-prohibitive.

It is expensive. It might be prohibitive in a very competitive environment. This is hardly the case here. Safety first!


RAID does not really protect you from bit rot that tends to happen from time to time. ZFS might because it checksums the blocks. But if the corruption happens in memory and then it is transferred to disk and replicated, then from a disk perspective the data was valid.


> If your disk fails and corrupts the filesystem, you're toast, regardless of what database you are using.

There are databases that maintain redundant copies and can tolerate disk / replica failure. e.g. Cassandra.
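
A minimal CQL sketch of that (the keyspace name and replication factor are made up for illustration):

    -- every partition is written to 3 replicas, so a single failed
    -- disk or node still leaves two intact copies to read and repair from
    CREATE KEYSPACE notam_example
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};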


Journal databases are specifically designed to avoid catastrophic corruption in the event of disk failure. The corrupt pages should be detected and reported, and the database will function fine without them.


If you mean journaling file systems, no. They prevent data corruption in the case of system crash or power outage.

That's different from filesystems that do checksumming (zfs, btrfs). Those can detect corruption.

In any case, if you use a database, it handles these things by itself (see ACID). However, I don't believe databases can necessarily detect disk corruption in all cases (the way checksumming file systems can).


We had Oracle corrupt itself due to a software bug. It similarly went undetected for some time and thus ended up in the backups.


Well, for example, MySQL/MariaDB using utf8 tables will instantly go down if someone inserts a single multibyte emoji character, and the only way out is to recreate all tables as utf8mb4 and reimport all data.


Surely nobody would use that format and allow a commit message including emojis to cause an effective DOS for a large Sonarqube project.


It doesn't block inserts with invalid data? I thought that was the whole point of telling the database what types you're using


MySQL historically isn't very good about blocking bad data. Sometimes it would silently truncate strings to fit the column type, for example. It's getting better as time goes on, though.


It does, and the poster above is incompetent


I have had customer production sites go down due to this issue when emojis first arrived. It was a common issue in 2015. I would hope it is fixed by now!


Having dealt with utf8mb4 data being inserted into utf8mb3 columns many, many times in the past, I've never had a table "instantly go down". You either get silent truncation or a refusal to insert the data.
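
Roughly what that looks like, with a made-up table (the outcome depends on sql_mode):

    CREATE TABLE t (s VARCHAR(20) CHARACTER SET utf8);  -- i.e. utf8mb3

    -- With strict mode (STRICT_TRANS_TABLES / STRICT_ALL_TABLES) this insert
    -- is refused with error 1366 "Incorrect string value".
    -- Without strict mode the value is silently truncated at the emoji and
    -- stored with only a warning.
    INSERT INTO t (s) VALUES ('ok 😀');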


Well, your applications haven’t used a serialized or JSON column. That’s how you go from truncation to downtime.

That said, I do remember this being an issue even with plain text.


I need more info about this.


In MySQL, the `utf8` character set is historically an alias for `utf8mb3`. The alias is deprecated as of 8.0 and will eventually be switched to mean `utf8mb4` instead. The `utf8mb3` charset means the data is UTF-8 encoded but only supports up to 3 bytes per character, instead of the full 4 bytes needed for characters like emoji.

https://en.wikipedia.org/wiki/UTF-8#MySQL_utf8mb3
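
To move a table to utf8mb4 (a sketch; the table name is made up):

    ALTER TABLE messages
      CONVERT TO CHARACTER SET utf8mb4
      COLLATE utf8mb4_unicode_ci;
    -- caveat: indexed VARCHAR(255) columns can exceed the old 767-byte
    -- InnoDB index-key limit at 4 bytes per character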


Imagine you have one node running as a replica of another, and that node takes the backups. Now suppose that every so often it backs up the corrupted data, and that it happened to overwrite their cold backup. They could have used any number of databases and still had this failure. It's more about their methodology for taking backups. They should have many points in time to choose from when rebuilding their database, and they should be testing their databases instead of blindly backing them up.


> They should be testing their databases before backing them up blindly.

Oh you mean they should be testing/validating the generated backup db file before replicating it to long-term archive ...
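
E.g. something along these lines in the backup job (a sketch, assuming mysqldump-style backups; hostnames and file names are made up):

    # restore last night's dump into a scratch instance ...
    mysql --host=scratch-db -e 'CREATE DATABASE restore_test'
    mysql --host=scratch-db restore_test < nightly_backup.sql

    # ... and sanity-check it before shipping the dump to the archive
    mysqlcheck --host=scratch-db --check restore_test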


Way back when use cases were a thing, I used to chide people for saying that Backup was a use case.

No, Restore is a use case.

(Replace "use case" with "requirement" or "user story"...)


A corollary to this would be: “Backups are worthless. Restores are priceless.”


semantics but yes


Maybe Bobby Tables is getting into the NOTAM system now

https://xkcd.com/327/


It could also be corruption caused by log-based database replication



