TLDR: on big machines with lots of memory (72GB), vanilla MySQL has great performance but suffers from minutes-long "stalls" during which throughput drops to zero. The only viable way the author found to avoid the stalls is to decrease throughput by 20x by cutting the amount of memory MySQL uses (thus wasting the hardware).
This is really terrible news for anyone trying to do high-throughput DB operations on MySQL.
The only "good" news is that the average web app isn't going to be running on hardware anywhere near as powerful as this, so you are highly likely to already be in the "slower", consistent state (e.g. anything less than 40GB of RAM).
Although abhorred by software makers, such as most HN readers, throwing hardware at the problem is often the fastest and cheapest way to tackle performance issues.
Especially when it indeed is "crappy software" that will take major effort to rewrite/refactor.
Hardware is cheaper than developers, and the former doesn't fail to deliver half as often as the latter...
The problem with this principle is that adding hardware tends to scale linearly at best, whereas improving an algorithm sometimes buys you orders-of-magnitude improvements.
Also, you need to compare the cost of a single developer who fixes InnoDB against the hardware cost incurred by _all_ users of InnoDB who would benefit from the fix.
Adding SSDs is not linear scaling. SSDs have entirely different characteristics from rotational drives, which by itself brings a 1-3 order-of-magnitude improvement in I/O performance.
Trying to fit high-I/O workloads onto spinning disks is a lot like spending your effort making your program run in 64K of RAM. An interesting challenge that would surely exercise your coding skills, but it would not deliver optimal performance.
I'm not denying the benefits of SSD. I just don't think throwing hardware at a problem caused by bad algorithms is a good general principle. It may still be the best choice in a particular situation and timeframe.
We do understand why this problem is happening. The algorithm that takes the nice fast (but limited in size) sequential log writes and tries to intelligently write them (mostly randomly) in the background to the data/index blocks where they live can't keep up, and eventually stalls out as the amount of free log space dwindles.
SSDs really help this process happen faster because they have way better random write performance than hard disks.
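If you want to watch that log-space pressure building on your own server, a couple of stock MySQL commands are enough; this is just a sketch of what to look at, not output from the benchmark in the article:

    -- In the LOG section of the output, the checkpoint "age" is
    -- "Log sequence number" minus "Last checkpoint at"; when it gets
    -- close to the total redo log size, InnoDB has to flush dirty pages
    -- aggressively and throughput collapses.
    SHOW ENGINE INNODB STATUS\G

    -- The redo log capacity the background flusher has to stay ahead of:
    -- innodb_log_files_in_group * innodb_log_file_size
    SHOW GLOBAL VARIABLES LIKE 'innodb_log_file%';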
I wish I could say you should use Percona Server with XtraDB. If we were using SSD as storage, then I would recommend it. Vanilla MySQL performs equally bad on SSD and HDD, while for SSD in Percona Server we have "innodb_adaptive_flushing_method = keep_average". Unfortunately on spinning disks (as in this case), Percona Server may not show significant improvement. I am going to followup with results on Percona Server.
=============================
SSD is the solution because unlike rotational disks it allows for lots of random accesses. While they are too expensive to use as "archiving" storage, SSD excels as a layer between traditional disks and RAM. Put your high I/O workload on SSD and keep your historic data on spinning disks.
As for the cloud, yes, Amazon EC2 offers 68GB of RAM; however, if you need hardware of that size it is not hard to justify setting up your own servers. It would be cheaper too, and you would be able to fully optimize your storage.
Maybe someone didn't like your comment because of this quote _from the article_:
If we were using SSD as storage, then I would recommend it.
Vanilla MySQL performs equally bad on SSD and HDD,
while for SSD in Percona Server we have
innodb_adaptive_flushing_method = keep_average.
Doesn't seem like a solution at all, if we believe the author.
Of course, you'd have to upgrade from vanilla MySQL to Percona to take advantage of this option.
Given that Percona is fully backwards compatible with MySQL, while adding this and many other tuning options, I just cannot see a reason to use vanilla MySQL ever.
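For reference, the option quoted from the article is just a my.cnf line on Percona Server; a sketch, assuming XtraDB and SSD-backed storage (vanilla MySQL will not recognize it):

    [mysqld]
    # Percona Server / XtraDB only -- not a vanilla MySQL option
    # per the article, keep_average is aimed at SSD storage
    innodb_adaptive_flushing_method = keep_average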
This is awesome data as usual from Vadim and team. I always thought the upper bound of InnoDB performance was limited by memory and how much you can afford. Now I know different.
I wanted to share some of our data to help in your decision making:
A typical server in our cluster gets 5000 to 7500 TPS:
This is a 24 hour graph and as you can see there are no multi-minute lockups. We've been extremely happy with the performance our servers are delivering.
The config is 8GB of memory, RAID 1, 2X 15K RPM disks, single Intel E5410 quad core CPU.
Total DB size is around 12GB with around 15 million rows. Our TPS is higher than Vadim's but our data is smaller than the benchmark he uses which is 200 million rows in a 58 GB database.
So my approach to avoiding this issue will be to shard at finer granularity and avoid architectures with monolithic DBs. Keep in mind that this only bites once you are caching more than around 16GB, so spreading massive data across multiple servers with fast drives and moderate memory will help bring your TPS back up. I would also add that servers with more than 32GB of memory are very expensive, especially the memory itself, so it may not cost much more to have 5 servers with fast disks and moderate memory than one server with massive memory.
Having said all that this is obviously a very serious problem and perhaps an opportunity for a talented computer scientist to branch InnoDB and solve it if Oracle doesn't want to.
A value of 1 for innodb_flush_log_at_trx_commit is required for ACID compliance. You can get better performance by setting it to something other than 1, but then you can lose up to one second's worth of transactions in a crash. With a value of 0, any mysqld process crash can erase the last second of transactions. With a value of 2, only an operating system crash or a power outage can erase the last second of transactions. However, InnoDB's crash recovery is not affected and works regardless of the value.
Just depends whether or not you can deal with a small amount of lost data in the event of a crash.
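For the record, that trade-off is a single variable, and it can be changed at runtime; a minimal sketch:

    -- 1 = flush the log to disk at every commit (ACID, the default)
    -- 2 = write at commit, flush ~once a second; an OS crash or power
    --     outage can lose up to the last second of transactions
    -- 0 = write and flush ~once a second; even a mysqld crash can lose
    --     up to the last second of transactions
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;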
I switched to PostgreSQL several years ago. Bulletproof, exactly the features I need (and more), and no problems like these. Speaking of which, I'm about to upgrade my installations to 9.1 now.
As much as I like pg, I don't think there's anything magical it can do to fix this when the database is given a write-heavy workload such that the [sequential] commitlog appends outstrip the ability to do the [random] b-tree updates. Which is a pretty easy situation to set up on modern hardware, really.
When faced with this situation it seems to me you have three choices at a high level:
- throttle writes closely to your random i/o limit, eliminating your ability to accommodate brief write spikes
- throttle writes less aggressively, accommodating brief spikes but forcing stalls under sustained load as seen here
- don't throttle at all and postpone catching up until commitlog replay on restart, if necessary
> Are you running an equivalent hardware setup?
> there is no proof that PostgreSQL does not have any teething problems on high capacity hardware either
At Disqus we run Postgres on even larger boxes without the pauses described here. We would have had to migrate off long ago if this were par for the course.
Not nearly as bad in magnitude, and with some of the responsibility shared with Linux and filesystem code (since PG relies on the filesystem's caching). Greg Smith talked about it in some detail during PGWest 2010. One of his customers had long stalls in production, and Greg worked out how to reproduce and explain them.
He had some patches he was planning for Postgres 9.1, not sure if they went in.
Is your database extremely write heavy? You don't usually see this problem in a more balanced read/write workload. Postgres has in the past suffered from stalls and lockups that can happen at checkpoint intervals, but it appears they mostly happen for seconds rather than minutes.
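If someone wants to check whether checkpoints are what is biting them on Postgres, the usual knobs in that era's postgresql.conf look roughly like this (the values are just illustrative, not a recommendation):

    # log every checkpoint with timing details so stalls can be correlated
    log_checkpoints = on
    # spread checkpoint writes across the interval instead of bursting them
    checkpoint_completion_target = 0.9
    # allow more WAL between checkpoints so they happen less often
    checkpoint_segments = 64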
The issue is x86/x64 being a turd. Throughput of just about everything drops when you go over the 36GB boundary for some reason. I have no idea why. It can drop by as much as 30%. This is from experience operating VMware on physically the same type of kit (except with FC SAN / NetApp).
I'd personally rather buy high end SPARC64 kit which has linear performance scalability but we all know what happened to Sun, plus it doesn't run Windows anyway :( whimpers a little
That's what I thought too. There are a lot of weird problems that can happen due to NUMA architecture issues. Reading the numactl documentation really drives home for me just how much I don't understand it.
I've seen DBs run on 128GB machines on x64 and have no experience of such problems. Do you have a link or hard evidence for this?
I have seen throughput issues on VMware machines (and false reporting from standard Linux utilities, e.g. "time" output, and various other issues that make me distrust it for performance-sensitive software), but I don't know if this is to do with 128GB machines; usually it's down to contention on one network card handling all the traffic to the disks.
I haven't heard/experienced the 36 GB thing since I don't run anything that big right now. Are you aware of any websites or blog posts that talk about it?
Yet another self-promotion article. Oh, we stuff data in memory and then, when it's time to write the changes to disk, everything freezes! MySQL sucks, buy our solution! Nice guys.
First, it is pointless to hold data in memory if you need it committed to disk storage. The real point here isn't switching to some new version of MySQL or to SSDs; it is realizing that your data flow has outgrown your one-server architecture.
Second, checkpoint intervals should be adjusted to your actual data flow, which means checkpoints should run often enough. If you are in a state of nearly constant checkpointing, with non-stop data writes, that is the sign that you need to consider a sharded/multi-server solution.
The hints that there must be no other disk activity on the same hardware volume, and no swapping by the OS, are, I suppose, obvious. People who keep /var/log and /var/db on the same volume are idiots.
It is also a good idea to use file-per-table storage and to put the data and the physical logs on separate hardware volumes (symlinks are your friends). One RAID-X volume that fits all is a rather naive solution. RAID isn't a guarantee of reliability; replication to backup servers is.
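In my.cnf terms that is roughly the following; the paths and sizes here are made-up examples, not a recommendation:

    [mysqld]
    # one .ibd file per table instead of a single shared tablespace
    innodb_file_per_table = 1
    # data files and redo logs on separate hardware volumes (example paths)
    datadir                   = /vol_data/mysql
    innodb_log_group_home_dir = /vol_log/mysql
    # size the redo logs for the actual write rate so checkpointing
    # is not effectively constant
    innodb_log_file_size      = 512M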
Third, when you test your configuration before putting it into production, you should tune your servers to perform with data and log syncing enabled, and then work out appropriate buffer sizes and checkpoint intervals. Later, in production, as the query flow grows, you can choose to switch to a different syncing strategy and/or to more frequent but somewhat faster checkpoints.
Configuring MySQL with huge buffers and no syncing shows a lack of understanding of the basic concepts, self-delusion, and misuse of both software and hardware. ^_^
"Yet another self-promotion article. Oh, we stuff data in memory and then when it is time to write changes on disk everything got frozen! MySQL suck, buy our solution! Nice guys.)"
You got this part all wrong. He explicitly said that 'their solution' is unlikely to do any better. This is a real problem.
It's not. The adaptive flushing algorithm has improved over the various versions, but checkpointing has been the cause of performance stalls for a long time. Here is an early bug report of the problem from 2006.
Last time I checked that was a recipe for terrible performance. We normally leave the flush_method set to default and keep a couple slaves around to minimize the potential for data loss.
I can't see what was wrong with the second graph. Regular bursts of writes for flushing leave the disks more idle for low-latency reads. Optimising throughput doesn't mean you have to have a smooth, flat write graph.
I wonder if sharding in some clever way could alleviate the problem. Some way of having one instance processing queries while another one is flushing its logs.
I have personally noticed MySQL having horrible performance for views that use joins. Every time a view does a join, it creates a temporary table, and you might as well throw indexing out once that happens. I had a query that was taking 4 seconds with a view; using the same SQL minus the view got it down to 5ms. I know this isn't specific to flushing, but it is still an example of a potential performance problem.
I only recently learned about the difference between the MERGE and TEMPTABLE algorithms a MySQL view can use. I was using the default TEMPTABLE approach and did not know about the MERGE algorithm that can be specified for views. In my experience with MySQL's views, MySQL does not do much to optimize execution plans for views.
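For anyone who has not run into it, the algorithm is just a clause on CREATE VIEW; a minimal sketch with made-up table and column names:

    -- MERGE: the view's SELECT is folded into the outer query,
    -- so indexes on the base table can still be used
    CREATE ALGORITHM = MERGE VIEW active_users AS
        SELECT id, name FROM users WHERE active = 1;

    -- TEMPTABLE: the view is materialized into a temporary table first,
    -- which is where the "throw indexing out" behaviour comes from
    CREATE ALGORITHM = TEMPTABLE VIEW active_users_tt AS
        SELECT id, name FROM users WHERE active = 1;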
By default MySQL will use the MERGE algorithm. It will create a temp table if the result of the view is large, and you can control what "large" means. That is a better way of controlling the behavior than telling MySQL that it needs to create a multi-GB result set but must keep it all in RAM. Read the manual or the O'Reilly book for more info.
Another option is to use materialized views, which are not natively supported in MySQL but can easily be simulated.
That is essentially what they are. The merge algorithm merges your query with the query of the view and returns the result. However, if the query of the view returns a very large result set, it is much faster to pre-flight it, then select against it.
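A crude way to simulate a materialized view, since MySQL has no native support; the table, columns, and refresh strategy are made up for illustration:

    -- pre-flight the expensive view into a real, indexable table
    CREATE TABLE order_totals_mv AS
        SELECT customer_id, SUM(total) AS lifetime_total
        FROM orders GROUP BY customer_id;
    CREATE INDEX idx_mv_customer ON order_totals_mv (customer_id);

    -- refresh periodically from cron, an event, or application code
    TRUNCATE TABLE order_totals_mv;
    INSERT INTO order_totals_mv
        SELECT customer_id, SUM(total)
        FROM orders GROUP BY customer_id;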
Yeah, but letting users specify whether views are MERGE or TEMPTABLE seems quite pointless. Why not always use UNDEFINED and let the planner choose? The planner knows which tables are large.