I've read about Redis before and heard how companies are using it, but never completely understood its purpose. After reading this I can actually say I understand Redis now and how it's useful. Amazing that after hearing so much about it all it took was a relatively simple article.
We use a ton of Redis, but I think the main takeaway from this article applies to all "NoSQL databases".
The "movement" is about polyglot persistence and not leaving RDBMS completely. Pull pain points out into something that's a better fit. Rinse and repeat.
I'm a little concerned with the added complexity of decoupling the datastore into two different systems that are relied on for application logic. What are good strategies to maintain consistency between the two, in the event of a failure?
In the first example, what if the SQL transaction succeeds but the Redis one fails? Would you roll back the SQL transaction?
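One possible answer, as a minimal sketch: do the cache update before the SQL commit and roll back if the cache write fails. Everything here is an assumption for illustration: `FakeCache` is an in-memory stand-in for a Redis list (not a real client), the `comments` table is made up, and sqlite3 stands in for whatever database the article uses.

```python
import sqlite3

class CacheDown(Exception):
    """Raised by the stand-in cache to simulate a Redis outage."""

class FakeCache:
    """Hypothetical in-memory stand-in for a Redis list of recent ids."""
    def __init__(self, fail=False):
        self.fail = fail
        self.latest_ids = []

    def push_latest(self, comment_id):
        if self.fail:
            raise CacheDown("cache unavailable")
        self.latest_ids.insert(0, comment_id)

def add_comment(db, cache, body):
    """Insert a comment, committing only if the cache update also succeeds."""
    try:
        cur = db.execute("INSERT INTO comments (body) VALUES (?)", (body,))
        cache.push_latest(cur.lastrowid)
        db.commit()
        return cur.lastrowid
    except CacheDown:
        # The cache write failed, so undo the uncommitted SQL insert.
        db.rollback()
        return None

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
db.commit()

ok_id = add_comment(db, FakeCache(), "first")
failed_id = add_comment(db, FakeCache(fail=True), "second")
row_count = db.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
```

This only covers one direction: if the cache write succeeds and the commit then fails, the cache holds an id the database never stored. The alternative is to always commit and treat the cache as something that can be rebuilt from SQL.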
Yes, you can tell it to either make a copy to disk every N seconds or use an AOF mechanism, but that doesn't mean it's in sync with your database. Depending on how your DB can get updated, you'll want to think about how the cache can get stale and whether you're expiring data in Redis periodically, or removing keys/using different hashes for keys. The specific way to do this will depend on the app that you're using.
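To make the two staleness strategies concrete, here is a rough sketch of time-based expiry versus explicit key removal. `TTLCache` is a hypothetical in-memory stand-in, not Redis itself; in Redis the corresponding commands would be EXPIRE and DEL. The clock is injected so the example doesn't need to sleep.

```python
class TTLCache:
    """Hypothetical in-memory cache with per-key expiry; not a Redis client."""
    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds
        self.clock = clock          # callable returning the current time in seconds
        self._data = {}

    def set(self, key, value):
        # Store the value together with the time at which it goes stale.
        self._data[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]     # expired: behave as a cache miss
            return None
        return value

    def delete(self, key):
        # Explicit invalidation, e.g. right after the DB row changes.
        self._data.pop(key, None)

now = [0.0]                          # fake clock we can advance by hand
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])

cache.set("user:1:name", "alice")
fresh = cache.get("user:1:name")     # still within the TTL
now[0] = 61.0
expired = cache.get("user:1:name")   # past the TTL, treated as a miss

cache.set("user:1:name", "alice")
cache.delete("user:1:name")          # invalidate on write instead of waiting
after_delete = cache.get("user:1:name")
```

Expiry bounds how stale a value can get without the writer knowing about the cache; explicit deletion is tighter but every write path has to remember to do it.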
In most systems, you're probably already doing it. Any ORM that has a level 1/2 cache built in (like Hibernate and EHCache/terracotta/etc or SQL caching in Rails) is storing the data in more than one place. If the data isn't in the cache (or if the cache fails for whatever reason), you're in the same boat you are now.
If you're just using Redis for write-through caching/memoization and there's some failure, you still have the answer in hand and can return it to the user; you just lose the benefit of the speedup from the cache.
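A minimal sketch of that failure mode, assuming the cache client raises `ConnectionError` when it is unreachable. `DictCache`, `DownCache`, and `cached` are made-up names for illustration, not part of any Redis library.

```python
class DictCache:
    """Healthy in-memory stand-in for a cache."""
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

class DownCache:
    """Stand-in for a cache that is unreachable."""
    def get(self, key):
        raise ConnectionError("cache unreachable")

    def set(self, key, value):
        raise ConnectionError("cache unreachable")

def cached(cache, key, compute):
    """Return compute()'s result, using the cache only when it is available."""
    try:
        hit = cache.get(key)
    except ConnectionError:
        return compute()            # cache is down: just do the work
    if hit is not None:
        return hit
    value = compute()
    try:
        cache.set(key, value)
    except ConnectionError:
        pass                        # we still have the answer in hand
    return value

calls = []

def slow_square(n=12):
    calls.append(n)                 # record each real computation
    return n * n

healthy = DictCache()
first = cached(healthy, "sq:12", slow_square)           # miss, computes
second = cached(healthy, "sq:12", slow_square)          # hit, no compute
during_outage = cached(DownCache(), "sq:12", slow_square)  # computes again
```

The user gets the same answer in all three cases; only the number of real computations changes.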
It's a big problem, at least for our application (chattybar.com - a chat plugin). We have 2 copies of every chat message sent, one in the DB (for persistence) and one in Redis for quick retrieval. Keeping them synced has a lot of edge cases, especially since multithreading means that adding things to the DB and then to Redis doesn't necessarily guarantee they'll end up in the same order.
If all you're doing is keeping a copy of objects in a memory store for quick retrieval you might want to implement a read-through caching pattern using Memcached (or Redis but I'd turn any persistence off). Most chat clients don't allow for editing of posts, so you could just use Redis as your persistent storage using lists and RPUSH.
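For the append-only chat case, the list idea might look roughly like this. `ChatLog` is a hypothetical in-memory stand-in whose method names mirror the Redis RPUSH and LRANGE commands (inclusive stop, negative indices counting from the end); the `room:42` key is made up.

```python
class ChatLog:
    """In-memory stand-in for Redis lists; not a real client."""
    def __init__(self):
        self._lists = {}

    def rpush(self, key, value):
        # Append to the tail and return the new length, as RPUSH does.
        self._lists.setdefault(key, []).append(value)
        return len(self._lists[key])

    def lrange(self, key, start, stop):
        # Inclusive range; negative indices count from the end, as in LRANGE.
        items = self._lists.get(key, [])
        n = len(items)
        if start < 0:
            start = max(n + start, 0)
        if stop < 0:
            stop = n + stop
        if stop < 0:
            return []
        return items[start:stop + 1]

log = ChatLog()
log.rpush("room:42", "alice: hi")
log.rpush("room:42", "bob: hello")
length = log.rpush("room:42", "alice: how are you?")

last_two = log.lrange("room:42", -2, -1)   # most recent two messages, oldest first
everything = log.lrange("room:42", 0, -1)  # full history in send order
```

Since messages are never edited, there is no second copy to keep in sync; the list order is the message order.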
What I get from this is that Redis is so powerful that it's best to not use it as a simple read-cache where the database is still the canonical source.
It's better to use it as the write-cache for complex datasets with the database being the backup.
Right. I believe we already have plenty of read-caches out there. Our web servers can implement them, we have Varnish and Memcached and ~100 others.
What Redis appears to do great is that it is much more than just a key-value store with CRUD operations. You can model sets, lists, queues, counters, and do complex operations on these in-memory values.
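To illustrate what "more than key-value CRUD" means, here is a toy sketch. `MiniRedis` is a hypothetical in-memory stand-in that mirrors the semantics of a few real Redis commands (INCR, SADD, SINTER, LPUSH, RPOP); it is not a client library, and the key names are invented.

```python
from collections import defaultdict, deque

class MiniRedis:
    """Tiny in-memory stand-in mirroring a few Redis commands."""
    def __init__(self):
        self.counters = defaultdict(int)
        self.sets = defaultdict(set)
        self.lists = defaultdict(deque)

    def incr(self, key):
        # INCR: increment a counter and return the new value.
        self.counters[key] += 1
        return self.counters[key]

    def sadd(self, key, *members):
        # SADD: add members, return how many were newly added.
        before = len(self.sets[key])
        self.sets[key].update(members)
        return len(self.sets[key]) - before

    def sinter(self, *keys):
        # SINTER: members present in every named set.
        return set.intersection(*(self.sets[k] for k in keys))

    def lpush(self, key, value):
        # LPUSH: push onto the head of a list.
        self.lists[key].appendleft(value)
        return len(self.lists[key])

    def rpop(self, key):
        # RPOP: pop from the tail; with LPUSH this gives a FIFO queue.
        return self.lists[key].pop() if self.lists[key] else None

r = MiniRedis()

r.incr("page:home:views")
views = r.incr("page:home:views")                    # counter

r.sadd("online", "ann", "bob")
r.sadd("friends:ann", "bob", "cat")
online_friends = r.sinter("online", "friends:ann")   # set intersection

r.lpush("jobs", "resize-image")
r.lpush("jobs", "send-email")
next_job = r.rpop("jobs")                            # queue: oldest job first
```

The point is that the server does the set intersection or the queue pop, so the application never has to read a whole value, modify it, and write it back.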
It looks to me like it is best thought of as providing an interface to data-structures that are well optimized and persist across requests.
Resque and redis-store are like auto-adds for almost any Rails project I work on these days.
Resque is for background jobs (with many add-ons for locking, scheduling, retries, etc.), and redis-store is a drop-in store for Rack::Session, Rack::Cache and Rails.cache. Easy and super fast.
In your first example, you use Redis to cache the IDs of the latest comments, with a fallback to SQL in order to populate the list. However, you still need to call the DB to load the comments. I don't see the gain here.
Yes, you've replaced a "select * from comments order by created_at limit 10" with a "select * from comments where id in (list_of_ids_from_redis)".
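A rough sketch of that swap, using sqlite3 so it runs anywhere. The `comments` schema and the `latest_ids` list (standing in for what would be read from Redis) are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT, created_at INTEGER)"
)
db.executemany(
    "INSERT INTO comments (id, body, created_at) VALUES (?, ?, ?)",
    [(i, f"comment {i}", 1000 + i) for i in range(1, 21)],
)

# Stand-in for the list of recent ids kept in Redis, newest first.
latest_ids = [20, 19, 18, 17, 16]

def load_comments(db, ids):
    """Fetch rows by primary key, returned in the same order as ids."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        f"SELECT id, body FROM comments WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not guarantee row order, so restore the cached order.
    return [by_id[i] for i in ids if i in by_id]

via_ids = load_comments(db, latest_ids)
via_sort = db.execute(
    "SELECT id, body FROM comments ORDER BY created_at DESC LIMIT 5"
).fetchall()
```

Both paths return the same rows; the difference is only in which part of the work the database has to do.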
Wouldn't you cache the comment models in a top-10 list?
Often the DB will show unacceptable response times for the ORDER BY part but will fetch comments by single ID without problems. But when that second part is a problem as well, it is a good idea to use Redis as a "vulgar" cache for the items too, so that the recent items are probably in a Redis hash and you can fetch everything with a single pipelined MULTI/EXEC call (and fetch all the items returned as nils from the DB).
When dealing with a lot of data, the latter SQL query would potentially be significantly faster than the former. If your goal is simply to show a top-10, then caching the entire results would be a great idea, but if your goal (like in the article) is to make retrieving the first 5000 comments quick, then this implementation is pretty solid.
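The "fetch the nils from the DB" step could be sketched like this. A plain dict stands in for the Redis hash, and the list comprehension stands in for the single pipelined round trip; `fetch_comments` and `load_from_db` are hypothetical names.

```python
def fetch_comments(ids, cache, load_from_db):
    """Return items for ids in order, filling cache misses from the database."""
    # Stand-in for one pipelined round trip: None marks a miss (a nil).
    cached = [cache.get(comment_id) for comment_id in ids]
    missing = [cid for cid, item in zip(ids, cached) if item is None]
    # Only the misses go to the database, in a single call.
    loaded = load_from_db(missing) if missing else {}
    cache.update(loaded)            # warm the cache for next time
    return [
        item if item is not None else loaded.get(cid)
        for cid, item in zip(ids, cached)
    ]

db_rows = {1: "first", 2: "second", 3: "third"}
db_calls = []

def load_from_db(ids):
    db_calls.append(list(ids))      # record which ids actually hit the DB
    return {cid: db_rows[cid] for cid in ids if cid in db_rows}

cache = {3: "third"}                # only the newest comment is cached so far
result = fetch_comments([3, 2, 1], cache, load_from_db)
again = fetch_comments([3, 2, 1], cache, load_from_db)
```

The first call asks the DB for ids 2 and 1 only; the second call is served entirely from the cache.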
I don't think the latter SQL would be significantly faster assuming the appropriate index on created_at.
The database can read the last 10 via the index directly and they would all most likely be on the same index page. Assuming any sort of normal caching, this would be at most 1-3 random reads and most likely none since I presume that the created_at index is generally written in ascending order.
Once that step is done, it is essentially identical to your IN statement you did.
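If you want to check this on your own data, one way is to ask the database for its query plan. The sketch below uses SQLite purely because it runs without a server; it says nothing about how MySQL or Oracle plan the same query, and the schema and index name are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT, created_at INTEGER)"
)
db.execute("CREATE INDEX idx_comments_created_at ON comments (created_at)")
db.executemany(
    "INSERT INTO comments (body, created_at) VALUES (?, ?)",
    [(f"comment {i}", 1000 + i) for i in range(1, 1001)],
)

query = "SELECT id FROM comments ORDER BY created_at DESC LIMIT 10"

# The last column of each EXPLAIN QUERY PLAN row is a human-readable step.
plan_text = " | ".join(
    row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + query)
)
newest_ids = [row[0] for row in db.execute(query)]

print(plan_text)   # look for the index name rather than a temporary sort step
```

The exact plan wording varies by engine and version, so read it rather than parse it.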
This is MySQL-specific, but we do this and it is a HUGE win for us.
MySQL doesn't support descending indexes, so for a large class of problems you will have to scan the entire index to find the last 10 items, especially when sorting by items in two directions. This is really slow when you have one hundred million entries (a huge events table). Looking up by primary key is very very fast in MySQL with InnoDB. If you profile this query you can see MySQL spends most of its time figuring out the IDs, and almost no time reading them back to you. Using Redis in this manner is very memory efficient, easy to update, and gets you 95%+ of the potential performance gains. It means we don't have to keep Redis up to date with edits or changes, because the PKs are set in stone.
If you replace "especially when sorting by items in two directions" with "only when sorting by more than one item in different directions" above when talking about composite keys, then yes that's a factual statement of MySQL and something I believe Postgres supports.
The same composite index in MySQL can be used to do ASC-ASC and DESC-DESC sorts but cannot be used to perform an ASC-DESC sort.
Can MySQL really not use an index to do a top / bottom 10 query? In Oracle it can do a top n query avoiding a sort very quickly on massive datasets given an index on the sort column. Given that this is a common pattern in web apps, which is probably MySQL's target market, you'd think that would be high up the dev priority list!
Uhm, MySQL indexes efficiently sort in both directions. You can use a composite index to achieve an ordered set with a set of WHERE conditions, even with millions of rows. As long as the app is only fetching like 10 rows, it should be lightning fast.
Does anyone know if hdf5 would be an acceptable optional replacement for the current Redis disk format?
I have a console app that's backed by Redis (in much the same manner as described in this post), but I save my sessions to h5 when I switch between datasets. That means I need to combine the Redis data with my app data and export -- I do this using two separate h5 files, with the appropriate links.
It would be nice (for me anyway) if I could do a Redis-native save, and move the resulting file. That would also improve my startup times when I reverse the process.
But, while h5 is nice for my data, I can't say it would be any good for generic Redis data...
These little fixes are how I got into Redis and a month or so later, it's a primary data store (with disk based fall back) and I find myself doing 99% of aggregation and temporary storage operations with it.
Usually the game server would keep everything in the game instance in memory, for an FPS or RTS. As great as Redis is, it's way more cumbersome than having everything in your data structures ready to go.
Redis is suitable for persistence, so where it would really be appropriate would be storing things like player profiles and other misc data that is not tied to a specific server. In fact, that's exactly what I'm using it for.
I realize that for most FPS or RTS titles, Redis might not be a perfect fit because the dataset is small for each map and the number of players never exceeds some preset limit like 64, but I think Redis might really shine as a means of storing persistent world data - for example MMOs, without having to shard heavily. Most modern MMOs use heavy sharding and can really only support a couple hundred players in the same area or shard at the same time. With Redis, you might be able to handle many thousands of players in the same area without sharding. You don't need sub-millisecond response time - as long as you can do most operations in a few milliseconds, things will be fine, since most players are on ~100 millisecond WAN connections anyway.
You might finally be able to break the model that current large games like WoW use - heavily sharded with all persistent data stored in Oracle. Redis should allow you to shard less and use SQL less, resulting in a much better experience, especially when parts of the virtual world get crowded.
I would be extremely surprised if WoW has issues when lots of people are crammed in a small space due to their Oracle backend. More likely it is all the calculations that the serverside is performing. I know a few people who work on other MMOs and they keep everything in memory, even accounts for logged off users.
I'm curious: what do you think (or know?) they are doing that is causing the database to be a limiting factor?
I didn't mean to imply that the bottleneck was due to Oracle. I was saying the exact same thing you are saying - it's due to memory or CPU limitations in their application server, which forces them to heavily shard.
If Redis can help them scale from the small shard to larger shards that support more players in an area, it might be interesting.
Those kinds of game servers are typically a binary blob that you cannot configure externally at all. Your only configuration options are in-game settings like the map list, server name, player limit, etc.
"The Cloud" isn't always the right solution for everything. If you're looking to go inexpensive, you can put it on a 2 GB Linode VPS on a ramen budget.
The confusing thing is some people refer to VPSs as 'cloud', which sort of makes sense when you compare them to EC2. EC2 seems more 'cloudy' because... um, we don't know what kind of hardware it's running on? I don't know. I was trying to explain what 'cloud' means to my tech-savvy, but non-industry brother and was puzzled by the issue of whether Linode is a 'cloud' service or not.
I guess it sort of could be, but I'd consider a "cloud" service something where you either a) can spin up or shut down servers dynamically from an image without any additional configuration (à la EC2 or Heroku), or b) a commodity service like Amazon S3 or MongoHQ where you pay by usage and they handle resource allocation.
I see scriptability and automation through an API as important to 'cloud' classification too. Good point about the automatic resource allocation, I think that's actually the key. On Linode you're paying for a segment of a specific machine, more like a dedicated server than EC2.
Why can't we call a VPS "cloud"? I think the "cloudy" character does not come just from not knowing how or where it is running or what the hardware is; it comes more from the fact that I can deploy an application or store data somewhere on the earth right from my home. A better way to put it could be - VPSs are first-generation cloud, a kind of raw version, whereas EC2, GAE, Heroku et al. are the advanced cloud systems where some things can happen automatically.
How about classic shared hosting? You can go get a server on dreamhost or something and have a PHP app up in a couple of hours. It's not very advanced, but the effect is somewhat the same...
Thanks for the suggestion. I've been working on a project that does dozens of writes per second and on my local machine I used Redis. The cost of running Redis for this on EC2 is prohibitive. The cost at Linode for even a 1 GB VPS is quite reasonable and scaling that up to larger sizes isn't that much more.
RAM is expensive in the cloud. So unless the cloud is a strict requirement it will normally be much cheaper to build out a box with 32 GB (or whatever fits your need) and run Redis on that.
Yes. RAM pretty much decides how many instances you can sell per rack-unit. Disks can be virtualized, CPU can be spread thinly - but a slice of RAM needs to be committed, no matter what.
This is a great example of how to promote adoption of a new technology. More companies should pay attention to how their product can be used rather than what their product is.