Story line of the Amazon EC2 and RDS failure and recovery

By Brian April 23, 2011 Random thoughts 1 Comment

I wanted to read the full timeline of the Amazon “Networking Event” in one place, so I’ve taken the logs of the EC2 and RDS status updates and put them together for a post.
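For anyone who wants to do the same merge by script rather than by hand, here is a minimal sketch in Python. It assumes each status update has already been copied into a tuple of service, timestamp text, and message. The sample entries, the names `parse_when` and `merge_logs`, and the fixed year of 2011 are all my own illustrative assumptions, not anything Amazon provides.

```python
from datetime import datetime

# Hypothetical sample entries: (service, timestamp text, message).
# Timestamps follow the status-page style, with the year and "PDT" left off.
RDS_LOG = [
    ("RDS", "Apr 21, 1:48 AM", "Investigating connectivity and latency issues."),
    ("RDS", "Apr 23, 12:00 AM", "Continuing to restore access."),
]
EC2_LOG = [
    ("EC2", "Apr 22, 2:41 AM", "Restoring volumes, no ETA yet."),
    ("EC2", "Apr 23, 1:55 AM", "Working on the bottleneck."),
]


def parse_when(text, year=2011):
    """Parse a status-page timestamp such as 'Apr 21, 1:48 AM'."""
    return datetime.strptime(f"{year} {text}", "%Y %b %d, %I:%M %p")


def merge_logs(*logs):
    """Combine several logs into one list ordered by timestamp."""
    entries = [entry for log in logs for entry in log]
    return sorted(entries, key=lambda entry: parse_when(entry[1]))


timeline = merge_logs(RDS_LOG, EC2_LOG)
for service, when, message in timeline:
    print(f"{service} {when} PDT {message}")
```

Sorting on the parsed timestamp instead of the raw text matters because "Apr 23, 12:00 AM" would otherwise sort after "Apr 23, 1:55 AM" as a string.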

If you want to learn about how to make a more robust Amazon Web Services (AWS) configuration, read my article on the WebDevStudios blog.

RDS Apr 21, 1:48 AM PDT We are currently investigating connectivity and latency issues with RDS database instances in the US-EAST-1 region.

RDS Apr 21, 2:16 AM PDT We can confirm connectivity issues impacting RDS database instances across multiple availability zones in the US-EAST-1 region.

RDS Apr 21, 3:05 AM PDT We are continuing to see connectivity issues impacting some RDS database instances in multiple availability zones in the US-EAST-1 region. Some Multi AZ failovers are taking longer than expected. We continue to work towards resolution.

RDS Apr 21, 4:03 AM PDT We are making progress on failovers for Multi AZ instances and restoring access to them. This event is also impacting RDS instance creation times in a single Availability Zone. We continue to work towards the resolution.

RDS Apr 21, 5:06 AM PDT IO latency issues have recovered in one of the two impacted Availability Zones in US-EAST-1. We continue to make progress on restoring access and resolving IO latency issues for remaining affected RDS database instances.

RDS Apr 21, 6:29 AM PDT We continue to work on restoring access to the affected Multi AZ instances and resolving the IO latency issues impacting RDS instances in the single availability zone.

RDS Apr 21, 8:12 AM PDT Despite the continued effort from the team to resolve the issue we have not made any meaningful progress for the affected database instances since the last update. Create and Restore requests for RDS database instances are not succeeding in the US-EAST-1 region.

RDS Apr 21, 10:35 AM PDT We are making progress on restoring access and IO latencies for affected RDS instances. We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance – currently those requests are not being processed.

RDS Apr 21, 2:35 PM PDT We have restored access to the majority of RDS Multi AZ instances and continue to work on the remaining affected instances. A single Availability Zone in the US-EAST-1 region continues to experience problems for launching new RDS database instances. All other Availability Zones are operating normally. Customers with snapshots/backups of their instances in the affected Availability zone can restore them into another zone. We recommend that customers do not target a specific Availability Zone when creating or restoring new RDS database instances. We have updated our service to avoid placing any RDS instances in the impaired zone for untargeted requests.

RDS Apr 21, 11:42 PM PDT In line with the most recent Amazon EC2 update, we wanted to let you know that the team continues to be all-hands on deck working on the remaining database instances in the single affected Availability Zone. It’s taking us longer than we anticipated. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.

RDS Apr 22, 7:08 AM PDT In line with the most recent Amazon EC2 update, we are making steady progress in restoring the remaining affected RDS instances. We expect this progress to continue over the next few hours and we’ll keep folks posted.

RDS Apr 22, 2:43 PM PDT We are continuing to make progress in restoring access to the remaining affected RDS instances. We expect this progress to continue over the next few hours and we’ll keep folks posted.

EC2 Apr 22, 2:41 AM PDT We continue to make progress in restoring volumes but don’t yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available.

EC2 Apr 22, 6:18 AM PDT We’re starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we’ll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we’ll let folks know. As volumes are restored, they become available to running instances, however they will not be able to be detached until we enable the API commands in the affected Availability Zone.

EC2 Apr 22, 8:49 AM PDT We continue to see progress in recovering volumes, and have heard many additional customers confirm that they’re recovering. Our current estimate is that the majority of volumes will be recovered over the next 5 to 6 hours. As we mentioned in our last post, a smaller number of volumes will require a more time consuming process to recover, and we anticipate that those will take longer to recover. We will continue to keep everyone updated as we have additional information.

EC2 Apr 22, 2:15 PM PDT In our last post at 8:49am PDT, we said that we anticipated that the majority of volumes “will be recovered over the next 5 to 6 hours.” These volumes were recovered by ~1:30pm PDT. We mentioned that a “smaller number of volumes will require a more time consuming process to recover, and we anticipate that those will take longer to recover.” We’re now starting to work on those. We’re also now working to enable customers to be able to launch EBS backed instances and create, delete, attach and detach EBS volumes in the affected Availability Zone. Our current estimate is that this will take 3-4 hours until full access is restored. We will continue to keep everyone updated as we have additional information.

EC2 Apr 22, 6:27 PM PDT We’re continuing to work on restoring the remaining affected volumes. The work we’re doing to enable customers to be able to launch EBS backed instances and create, delete, attach and detach EBS volumes in the affected Availability Zone is taking considerably more time than we anticipated. The team is in the midst of troubleshooting a bottleneck in this process and we’ll report back when we have more information to share on the timing of this functionality being fully restored.

EC2 Apr 22, 9:11 PM PDT We wanted to give a more detailed update on the state of our recovery. At this point, we have recovered a large number of the stuck volumes and are in the process of recovering the remainder. We have added significant storage capacity to the cluster, and storage capacity is no longer a bottleneck to recovery. Some portion of these volumes have lost the connection to their instance, and are waiting to be connected before normal operations can resume. In order to re-establish this connection, we need to allow the instances in the affected Availability Zone to access the EC2 control plane service. There are a large number of control plane requests being generated by the system as we re-introduce instances and volumes. The load on our control plane is higher than we anticipated. We are re-introducing these instances slowly in order to moderate the load on the control plane and prevent it from becoming overloaded and affecting other functions. We are currently investigating several avenues to unblock this bottleneck and significantly increase the rate at which we can restore control plane access to volumes and instances, and move toward a full recovery. The team has been completely focused on restoring access to all customers, and as such has not yet been able to focus on performing a complete post mortem. Once our customers have been taken care of and are fully back up and running, we will post a detailed account of what happened, along with the corrective actions we are undertaking to ensure this doesn’t happen again. Once we have additional information on the progress that is being made, we will post additional updates.

RDS Apr 23, 12:00 AM PDT We are continuing to work on restoring access to the remaining affected RDS instances. We expect the restoration process to continue over the next several hours and we’ll update folks as we have new information.

EC2 Apr 23, 1:55 AM PDT We are continuing to work on unblocking the bottleneck that is limiting the speed with which we can re-establish connections between volumes and their instances. We will continue to keep everyone updated as we have additional information.

RDS Apr 23, 8:45 AM PDT We have made significant progress in resolving stuck IO issues and restoring access to RDS database instances and now have the vast majority of them back operational again. We continue to work on restoring access to the small number of remaining affected instances and we’ll update folks as we have new information.

EC2 Apr 23, 8:54 AM PDT We have made significant progress during the night in manually restoring the remaining stuck volumes, and are continuing to work through the remainder. Additionally we have removed some of the bottlenecks that were preventing us from allowing more instances to re-establish their connection with the stuck volumes, and the majority of those instances and volumes are now connected. We’ve encountered an additional issue that’s preventing the recovery of the remainder of the connections from being established, but are making progress. Once we solve for this bottleneck, we will work on restoring full access for customers to the control plane.

EC2 Apr 23, 11:54 AM PDT Quick update. We’ve tried a couple of ideas to remove the bottleneck in opening up the APIs, each time we’ve learned more but haven’t yet solved the problem. We are making progress, but much more slowly than we’d hoped. Right now we’re setting up more control plane components that should be capable of working through the backlog of attach/detach state changes for EBS volumes. These are coming online, and we’ve been seeing progress on the backlog, but it’s still too early to tell how much this will accelerate the process for us. For customers who are still waiting for restoration of the EBS control plane capability in the impacted AZ, or waiting for recovery of the remaining volumes, we understand that no information for hours at a time is difficult for you. We’ve been operating under the assumption that people prefer us to post only when we have new information. We think enough people have told us that they prefer to hear from us hourly (even if we don’t have meaningful new information) that we’re going to change our cadence and try to update hourly from here on out.

EC2 Apr 23, 12:46 PM PDT We have completed setting up the additional control plane components and we are seeing good scaling of the system. We are now processing through the backlog of state changes and customer requests at a very quick rate. Barring any setbacks, we anticipate getting through the remainder of the backlog in the next hour. We will be in a brief hold after that, assessing whether we can proceed with reactivating the APIs.

RDS Apr 23, 12:54 PM PDT As we mentioned in our last update at 8:45 AM, we now have the vast majority of affected RDS instances back operational again. Since that post, we have continued to work on restoring access to the small number of remaining affected instances. RDS uses EBS, and as such, our pace of recovery is dependent on EBS’s recovery. As mentioned in the most recent EC2 post, EBS’s recovery has gone a bit slower than anticipated in the last few hours. This has slowed down RDS recovery as well. We understand how significant this service interruption is for our affected customers and we are working feverishly to address the impact. We’ll update folks as we have new information. Additionally we have heard from customers that you prefer more frequent updates, even if there has been no meaningful progress. We have heard that feedback, and will try to post hourly updates here. Some of these updates will point to EC2’s updates (as they continue to recover the rest of EBS volumes), but we’ll post nonetheless.

About Author

Capt. Queeg

One Comment

Malcolm August 7, 2013

It’s remarkable for me to have a web page, which is valuable designed for my know-how. thanks admin

The Code Cave
