Amazon cloud glitch highlights cloud fallibility

According to Computerworld:

“Amazon began reporting trouble on its Service Health Dashboard about 5 a.m. Eastern today [April 21, 2011]. At 5:16 a.m., the site reported connectivity issues that were affecting its Relational Database Service, which is used to manage a relational database in the cloud, across multiple zones in the eastern U.S.”

Many customers lost the ability to host on EC2 (Elastic Compute Cloud), Amazon’s pay-as-you-go compute service, and there were also problems with EBS (Elastic Block Store), the storage back end used by EC2 instances.  When EBS has problems, customers’ data volumes become unavailable, causing outages.  In this case, the outages were quite lengthy.

Amazon reported that a networking event triggered massive re-mirroring of EBS volumes.  EBS is a “protected data service,” which means that your data is mirrored on several servers so it won’t get lost if one of them goes down.  When links between these servers are broken, the copies must be re-synchronized once the links come back up.  This is known as re-mirroring.  Normally this isn’t a problem, but when a large number of volumes all try to re-sync at the same time, they compete for the same bandwidth, each re-sync takes far longer than usual, and the volumes stay unavailable until it finishes.  So, while you don’t lose data, you lose access to it for a time.
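To get a feel for why simultaneous re-mirroring hurts, here is a rough back-of-the-envelope sketch in Python.  It assumes every volume shares one network link equally; the function name, volume size, and link speed are made-up illustrative values, not figures from Amazon.

```python
def resync_hours(volumes, volume_gb, link_gbps):
    """Hours until all volumes finish re-mirroring over one equally shared link.

    Hypothetical model: with fair sharing, every volume finishes at the same
    time, which is simply total data divided by link bandwidth.
    """
    total_gigabits = volumes * volume_gb * 8  # gigabytes -> gigabits
    seconds = total_gigabits / link_gbps
    return seconds / 3600


# Assumed numbers for illustration only: 100 GB volumes on a 10 Gbps link.
for n in (1, 100, 10_000):
    print(f"{n:>6} volumes: {resync_hours(n, 100, 10):.2f} hours")
```

In this simplified model the wait grows in direct proportion to the number of volumes re-syncing at once, which is the point: a re-sync that takes about a minute for one volume stretches into hours or days when thousands start together.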

Many customers were blindsided by this and scrambled to find some way to get their services back online.  People tend to forget that cloud-based services run in a datacenter, just like the computers they have in-house, and are subject to the same problems and outages.  While there are a lot of high-availability features, and it’s not your in-house people who have to solve the problems and keep things running, problems can and do occur.

What’s the moral of this story?  The cloud needs to be treated like any other datacenter.  Disaster Recovery and Business Continuity need to be addressed just as they are for the computers in your company datacenter.  Putting processes out into the cloud does not absolve you of that responsibility.  Hopefully, we’ll all benefit from this wake-up call and plan appropriately in the future.

