Clouds are not a cure-all

Make sure that you understand what you are paying for in the cloud. The ability to scale on demand is nice. Having your apps run on random hardware that is failing or overloaded is not so nice.

Excerpted from: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of...

Amazon's EBSs are a barrel of laughs in terms of performance and reliability, and are a constant (and the single largest) source of failure across reddit. reddit's been in talks with Amazon all the way up to CIOs about ways to fix them for nearly a year, and they've constantly been making promises they haven't been keeping, passing us to new people (who "will finally be able to fix it"), and variously otherwise been desperately trying to keep reddit while not actually earning it. reddit's been trying very hard to avoid bad-mouthing Amazon in public (in fact I was fascinated to see that jedberg was willing to see them mentioned at all here), but the fact is that their EBS product alone accounts for probably 80% of reddit's downtime. Note that with 80% of reddit's downtime removed, they'd be on par with services of a similar size.

The root of the problem is that EBSs are too slow to use a single EBS for a database machine servicing as many requests as reddit's are, so you have to RAID them together. But that increases your surface area for risk of failure. Couple that with ridiculously high variability in their actual performance: one will all of a sudden slow down 10x to 100x, often across a large quantity of disks. This tarpits any requests sent to that DB (for the query timeout plus the detection period until we can route around it), causing many, many micro-downtimes throughout the day and accounting for the vast majority of the "you broke reddit" pages.
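The detection half of that problem (spotting the one volume that has suddenly slowed down so you can route around the database host it backs) can be illustrated with a small script. The sketch below samples /proc/diskstats twice and flags any volume whose average await crosses a threshold; the device names, the 100 ms cutoff (borrowed from the figure quoted later in this section), and the sampling window are assumptions for illustration, not reddit's actual tooling.

```python
#!/usr/bin/env python3
"""Flag block devices whose average I/O await exceeds a threshold.

A minimal sketch, assuming the RAID members are the devices listed below
and that anything averaging more than 100 ms of await during the sample
window should be treated as a sick volume.
"""
import time

DEVICES = ["xvdf", "xvdg", "xvdh", "xvdi"]   # hypothetical RAID members
AWAIT_THRESHOLD_MS = 100.0                   # "greater than 100ms of await"
SAMPLE_SECONDS = 5

def read_diskstats():
    """Return {device: (ios_completed, ms_spent_on_io)} from /proc/diskstats."""
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            if name in DEVICES:
                reads, read_ms = int(fields[3]), int(fields[6])
                writes, write_ms = int(fields[7]), int(fields[10])
                stats[name] = (reads + writes, read_ms + write_ms)
    return stats

def slow_devices():
    """Sample twice and return devices whose mean await crossed the threshold."""
    before = read_diskstats()
    time.sleep(SAMPLE_SECONDS)
    after = read_diskstats()
    slow = []
    for dev in DEVICES:
        if dev not in before or dev not in after:
            continue
        d_ios = after[dev][0] - before[dev][0]
        d_ms = after[dev][1] - before[dev][1]
        if d_ios and (d_ms / d_ios) > AWAIT_THRESHOLD_MS:
            slow.append(dev)
    return slow

if __name__ == "__main__":
    bad = slow_devices()
    if bad:
        # In a real deployment this is where you would mark the replica
        # unhealthy so the application stops sending queries its way.
        print("slow volumes detected:", ", ".join(bad))
```

Run periodically from a health check, a script like this shortens the "detection period until we can route around it" that the excerpt blames for the query-timeout tarpits.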

More recently we also discovered that these disks will frequently report that a disk transaction has been committed to hardware when they are flat-out lying, so after the wave of slowness passes we find that the data on a machine may be physically corrupted. That's bad. If it's a slave, you have to re-replicate the whole slave (which hurts the master as well as the other slaves as they pick up its slack). If it's a master, you have a crapload of manual data cleanup to do.

Just anecdotally, when trying to bring up some new Cassandra nodes a few weeks ago, alienth ran performance tests on every disk he allocated to make sure that we didn't get any impacted ones. I don't remember the exact number, but it was something like four or five consecutive bad disks (a disk operating at greater than 100ms of await under minimal load is how Amazon measures this).
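That anecdote suggests a simple acceptance test before handing a fresh volume to a database: write to it synchronously, time each write, and read the data back. The sketch below is one hedged way to do that; the mount point, block sizes, and 100 ms cutoff are assumptions for illustration, the latency of a large fsync'd write is not the same metric as Amazon's await figure, and a read-back that may be served from page cache cannot prove durability across a crash. It will, however, catch volumes that are grossly slow or that return the wrong bytes.

```python
#!/usr/bin/env python3
"""Burn-in a freshly attached volume before trusting it with data.

A minimal sketch, assuming the new volume is mounted at MOUNT_POINT and
should be rejected if its mean fsync'd write latency exceeds the threshold
or if any block fails its checksum on read-back.
"""
import hashlib
import os
import time

MOUNT_POINT = "/mnt/new-volume"       # hypothetical mount point
LATENCY_THRESHOLD_MS = 100.0          # mirrors the 100 ms await figure above
BLOCK_SIZE = 4 * 1024 * 1024          # 4 MiB per write
BLOCK_COUNT = 64                      # 256 MiB total

def burn_in(path=os.path.join(MOUNT_POINT, "burnin.dat")):
    checksums, latencies = [], []

    # Write phase: fsync each block and time it.
    with open(path, "wb") as f:
        for _ in range(BLOCK_COUNT):
            block = os.urandom(BLOCK_SIZE)
            checksums.append(hashlib.sha256(block).hexdigest())
            start = time.monotonic()
            f.write(block)
            f.flush()
            os.fsync(f.fileno())
            latencies.append((time.monotonic() - start) * 1000.0)

    # Read-back phase: verify every block matches what we think we wrote.
    # Note: this may be satisfied from cache, so it does not prove the data
    # actually reached the hardware.
    with open(path, "rb") as f:
        for i, expected in enumerate(checksums):
            block = f.read(BLOCK_SIZE)
            if hashlib.sha256(block).hexdigest() != expected:
                return False, f"checksum mismatch in block {i}"

    mean_ms = sum(latencies) / len(latencies)
    if mean_ms > LATENCY_THRESHOLD_MS:
        return False, f"mean fsync'd write latency {mean_ms:.1f} ms"
    return True, f"ok ({mean_ms:.1f} ms mean write latency)"

if __name__ == "__main__":
    accepted, detail = burn_in()
    print("accept" if accepted else "reject", "-", detail)
```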