TLDR: on big machines with lots of memory (72GB), vanilla MySQL has great performance but suffers from minutes-long "stalls" during which throughput drops to zero. The only viable way the author found to avoid the stalls is to decrease throughput by 20x by cutting the amount of memory MySQL uses (thus wasting the hardware).
This is really terrible news for anyone trying to do high-throughput DB operations on MySQL.
The only "good" news is that the average web app isn't going to be running on hardware anywhere near as powerful as this, so you are highly likely to already be in the "slower", consistent state (e.g. anything less than 40GB of RAM).
Although abhorred by software makers, such as most HN readers, throwing hardware at the problem is often the fastest and cheapest way to tackle performance issues.
Especially when it indeed is "crappy software" that will take major effort to rewrite/refactor.
Hardware is cheaper than developers, and the former doesn't fail to deliver half as often as the latter...
The problem with this principle is that adding hardware tends to scale linearly at best, whereas improving an algorithm sometimes buys you orders-of-magnitude improvements.
Also, you need to compare the cost of a single developer who fixes InnoDB against the hardware cost incurred by _all_ users of InnoDB who would benefit from the fix.
Adding SSDs is not linear scaling. SSDs have entirely different characteristics from rotational drives, which by itself brings a 1-3 order-of-magnitude improvement in I/O performance.
Trying to fit high-I/O workloads onto spinning disks is a lot like spending your effort making your program run in 64K of RAM. An interesting challenge that would surely exercise your coding skills, but it would not deliver optimal performance.
I'm not denying the benefits of SSD. I just don't think throwing hardware at a problem caused by bad algorithms is a good general principle. It may still be the best choice in a particular situation and timeframe.
We do understand why this problem is happening. The algorithm that takes the nice fast (but limited in size) sequential log writes and tries to intelligently write them (mostly randomly) in the background to the data/index blocks where they live can't keep up, and eventually stalls out as the amount of free log space dwindles.
SSDs really help this process happen faster because they have way better random write performance than hard disks.
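If you want to watch that log-space pressure building on your own server, a couple of stock MySQL commands are enough; this is just a sketch of what to look at, not output from the benchmark in the article:

    -- In the LOG section of the output, the checkpoint "age" is
    -- "Log sequence number" minus "Last checkpoint at"; when it gets
    -- close to the total redo log size, InnoDB has to flush dirty pages
    -- aggressively and throughput collapses.
    SHOW ENGINE INNODB STATUS\G

    -- The redo log capacity the background flusher has to stay ahead of:
    -- innodb_log_files_in_group * innodb_log_file_size
    SHOW GLOBAL VARIABLES LIKE 'innodb_log_file%';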
I wish I could say you should use Percona Server with XtraDB. If we were using SSD as storage, then I would recommend it. Vanilla MySQL performs equally bad on SSD and HDD, while for SSD in Percona Server we have "innodb_adaptive_flushing_method = keep_average". Unfortunately on spinning disks (as in this case), Percona Server may not show significant improvement. I am going to followup with results on Percona Server.
=============================
SSD is the solution because unlike rotational disks it allows for lots of random accesses. While they are too expensive to use as "archiving" storage, SSD excels as a layer between traditional disks and RAM. Put your high I/O workload on SSD and keep your historic data on spinning disks.
As for the cloud, yes, Amazon EC2 offers 68GB of RAM; however, if you need hardware of that size it is not hard to justify setting up your own servers. It would be cheaper too, and you would be able to fully optimize your storage.
Maybe someone didn't like your comment because of this quote _from the article_:
If we were using SSD as storage, then I would recommend it.
Vanilla MySQL performs equally bad on SSD and HDD,
while for SSD in Percona Server we have
innodb_adaptive_flushing_method = keep_average.
Doesn't seem like a solution at all, if we believe the author.
Of course, you'd have to upgrade from vanilla MySQL to Percona to take advantage of this option.
Given that Percona is fully backwards compatible with MySQL, while adding this and many other tuning options, I just cannot see a reason to use vanilla MySQL ever.
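For reference, the option quoted from the article is just a my.cnf line on Percona Server; a sketch, assuming XtraDB and SSD-backed storage (vanilla MySQL will not recognize it):

    [mysqld]
    # Percona Server / XtraDB only -- not a vanilla MySQL option
    # per the article, keep_average is aimed at SSD storage
    innodb_adaptive_flushing_method = keep_average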
This is awesome data as usual from Vadim and team. I always thought the upper bound of InnoDB performance was limited by memory and how much you can afford. Now I know different.
I wanted to share some of our data to help in your decision making:
A typical server in our cluster gets 5000 to 7500 TPS:
This is a 24 hour graph and as you can see there are no multi-minute lockups. We've been extremely happy with the performance our servers are delivering.
The config is 8GB of memory, RAID 1, 2X 15K RPM disks, single Intel E5410 quad core CPU.
Total DB size is around 12GB with around 15 million rows. Our TPS is higher than Vadim's but our data is smaller than the benchmark he uses which is 200 million rows in a 58 GB database.
So my approach to avoiding this issue will be to shard at finer granularity and avoid architectures with monolithic DBs. Keep in mind that this only bites once you are caching more than around 16GB, so spreading massive data across multiple servers with fast drives and moderate memory will help bring your TPS back up. I would also add that servers with more than 32GB of memory are very expensive, especially the memory itself, so it may not cost much more to have 5 servers with fast disks and moderate memory than one server with massive memory.
Having said all that this is obviously a very serious problem and perhaps an opportunity for a talented computer scientist to branch InnoDB and solve it if Oracle doesn't want to.
A value of 1 for innodb_flush_log_at_trx_commit is required for ACID compliance. You can get better performance by setting it to something other than 1, but then you can lose up to one second's worth of transactions in a crash. With a value of 0, any mysqld process crash can erase the last second of transactions. With a value of 2, only an operating system crash or a power outage can erase the last second of transactions. However, InnoDB's crash recovery is not affected and works regardless of the value.
Just depends whether or not you can deal with a small amount of lost data in the event of a crash.
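For the record, that trade-off is a single variable, and it can be changed at runtime; a minimal sketch:

    -- 1 = flush the log to disk at every commit (ACID, the default)
    -- 2 = write at commit, flush ~once a second; an OS crash or power
    --     outage can lose up to the last second of transactions
    -- 0 = write and flush ~once a second; even a mysqld crash can lose
    --     up to the last second of transactions
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;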
I switched to PostgreSQL several years ago. Bulletproof, exactly the features I need (and more), and no problems like these. Speaking of which, I'm about to upgrade my installations to 9.1 now.
As much as I like pg, I don't think there's anything magical it can do to fix this when the database is given a write-heavy workload such that the [sequential] commitlog appends outstrip the ability to do the [random] b-tree updates. Which is a pretty easy situation to set up on modern hardware, really.
When faced with this situation it seems to me you have three choices at a high level:
- throttle writes closely to your random i/o limit, eliminating your ability to accommodate brief write spikes
- throttle writes less aggressively, accommodating brief spikes but forcing stalls under sustained load as seen here
- don't throttle at all and postpone catching up until commitlog replay on restart, if necessary
> Are you running an equivalent hardware setup?
> there is no proof that PostgreSQL does not have any teething problems on high capacity hardware either
At Disqus we run Postgres on even larger boxes without the pauses described here. We would have had to migrate off long ago if this were par for the course.
Not nearly as bad in magnitude, and with some of the responsibility shared with Linux and filesystem code (since PG relies on the filesystem's caching). Greg Smith talked about it in some detail during PGWest 2010. One of his customers had long stalls in production, and Greg worked out how to reproduce and explain them.
He had some patches he was planning for Postgres 9.1, not sure if they went in.
Is your database extremely write heavy? You don't usually see this problem in a more balanced read/write workload. Postgres has in the past suffered from stalls and lockups that can happen at checkpoint intervals, but it appears they mostly happen for seconds rather than minutes.
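If someone wants to check whether checkpoints are what is biting them on Postgres, the usual knobs in that era's postgresql.conf look roughly like this (the values are just illustrative, not a recommendation):

    # log every checkpoint with timing details so stalls can be correlated
    log_checkpoints = on
    # spread checkpoint writes across the interval instead of bursting them
    checkpoint_completion_target = 0.9
    # allow more WAL between checkpoints so they happen less often
    checkpoint_segments = 64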
The issue is x86/x64 being a turd. Throughput of just about everything drops when you go over the 36GB boundary for some reason. I have no idea why. It can drop by as much as 30%. This is from experience operating VMware on physically the same type of kit (except with FC SAN / NetApp).
I'd personally rather buy high end SPARC64 kit which has linear performance scalability but we all know what happened to Sun, plus it doesn't run Windows anyway :( whimpers a little
That's what I thought too. There are a lot of weird problems that can happen due to NUMA architecture issues. Reading the numactl documentation really drives home for me just how much I don't understand it.
I've seen DBs run on 128GB machines on x64 and have no experience of such problems. Do you have a link or hard evidence for this?
I have seen throughput issues on VMware machines (and false reporting from standard Linux utilities, e.g. "time" output, and various other issues that make me distrust it for performance-sensitive software), but I don't know if this is to do with 128GB machines; usually it's down to contention on one network card handling all the traffic to the disks.
I haven't heard/experienced the 36 GB thing since I don't run anything that big right now. Are you aware of any websites or blog posts that talk about it?
Yet another self-promotion article. Oh, we stuff data in memory and then, when it's time to write the changes to disk, everything freezes! MySQL sucks, buy our solution! Nice guys.
First, it is pointless to hold data in memory if you need it committed to disk storage. The real point here isn't switching to some new version of MySQL or to SSDs; it is realizing that your data flow has outgrown your one-server architecture.
Second, checkpoint intervals should be adjusted to your actual data flow, which means checkpoints should run often enough. If you are in a state of nearly constant checkpointing, with non-stop data writes, that is the sign that you need to consider a sharded/multi-server solution.
The hints that there must be no other disk activity on the same hardware volume, and no swapping by the OS, are, I suppose, obvious. People who keep /var/log and /var/db on the same volume are idiots.
It is also a good idea to use file-per-table storage and to put the data and the physical logs on separate hardware volumes (symlinks are your friends). One RAID-X volume that fits all is a rather naive solution. RAID isn't a guarantee of reliability; replication to backup servers is.
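In my.cnf terms that is roughly the following; the paths and sizes here are made-up examples, not a recommendation:

    [mysqld]
    # one .ibd file per table instead of a single shared tablespace
    innodb_file_per_table = 1
    # data files and redo logs on separate hardware volumes (example paths)
    datadir                   = /vol_data/mysql
    innodb_log_group_home_dir = /vol_log/mysql
    # size the redo logs for the actual write rate so checkpointing
    # is not effectively constant
    innodb_log_file_size      = 512M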
Third, when you test your configuration before putting it into production, you should tune your servers to perform with data and log syncing enabled, and then work out appropriate buffer sizes and checkpoint intervals. Later, in production, as the query flow grows, you can choose to switch to a different syncing strategy and/or to more frequent but somewhat faster checkpoints.
Configuring MySQL with huge buffers and no syncing shows a lack of understanding of the basic concepts, self-delusion, and misuse of both software and hardware. ^_^
"Yet another self-promotion article. Oh, we stuff data in memory and then when it is time to write changes on disk everything got frozen! MySQL suck, buy our solution! Nice guys.)"
You got this part all wrong. He explicitly said that 'their solution' is unlikely to do any better. This is a real problem.
It's not. The adaptive flushing algorithm has improved over the various versions, but checkpointing has been the cause of performance stalls for a long time. Here is an early bug report of the problem from 2006.
Last time I checked that was a recipe for terrible performance. We normally leave the flush_method set to default and keep a couple slaves around to minimize the potential for data loss.
I can't see what was wrong with the second graph. Regular bursts of writes for flushing leave the disks more idle for low-latency reads. Optimising throughput doesn't mean you have to have a smooth, flat write graph.
I wonder if sharding in some clever way could alleviate the problem. Some way of having one instance processing queries while another one is flushing its logs.
I have personally noticed MySQL having horrible performance for views that use joins. Every time a view does a join, it creates a temporary table, and you might as well throw indexing out once that happens. I had a query that was taking 4 seconds with a view; using the same SQL minus the view got it down to 5ms. I know this isn't specific to flushing, but it is still an example of a potential performance problem.
I only recently learned about the difference between the MERGE and TEMPTABLE algorithms a MySQL view can use. I was using the default TEMPTABLE approach and did not know about the MERGE algorithm that can be specified for views. In my experience with MySQL's views, MySQL does not do much to optimize execution plans for views.
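For anyone who has not run into it, the algorithm is just a clause on CREATE VIEW; a minimal sketch with made-up table and column names:

    -- MERGE: the view's SELECT is folded into the outer query,
    -- so indexes on the base table can still be used
    CREATE ALGORITHM = MERGE VIEW active_users AS
        SELECT id, name FROM users WHERE active = 1;

    -- TEMPTABLE: the view is materialized into a temporary table first,
    -- which is where the "throw indexing out" behaviour comes from
    CREATE ALGORITHM = TEMPTABLE VIEW active_users_tt AS
        SELECT id, name FROM users WHERE active = 1;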
By default MySQL will use the MERGE algorithm. It will create a temp table if the result of the view is large, and you can control what "large" means. That is a better way of controlling the behavior than telling MySQL that it needs to create a multi-GB result set but must keep it all in RAM. Read the manual or the O'Reilly book for more info.
Another option is to use materialized views, which are not natively supported in MySQL but can easily be simulated.
That is essentially what they are. The merge algorithm merges your query with the query of the view and returns the result. However, if the query of the view returns a very large result set, it is much faster to pre-flight it, then select against it.
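A crude way to simulate a materialized view, since MySQL has no native support; the table, columns, and refresh strategy are made up for illustration:

    -- pre-flight the expensive view into a real, indexable table
    CREATE TABLE order_totals_mv AS
        SELECT customer_id, SUM(total) AS lifetime_total
        FROM orders GROUP BY customer_id;
    CREATE INDEX idx_mv_customer ON order_totals_mv (customer_id);

    -- refresh periodically from cron, an event, or application code
    TRUNCATE TABLE order_totals_mv;
    INSERT INTO order_totals_mv
        SELECT customer_id, SUM(total)
        FROM orders GROUP BY customer_id;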
Yeah, but letting users specify whether views are MERGE or TEMPTABLE seems quite pointless. Why not always use UNDEFINED and let the planner choose? The planner knows which tables are large.