
Recent Major Website Outage Was Caused By A Simple Mistake

March 14, 2017 | May 25th, 2021 | Technology News

Last week, Amazon Web Services (AWS) suffered an 11-hour outage that caused dramatic slowdowns or complete unavailability for more than 100 large internet retailers and a number of the web’s top sites, including Amazon itself, Netflix, Imgur, Reddit and a host of others.

The situation has now been resolved, and Amazon has published an incident postmortem, which revealed the root cause: A typo.

A company employee was performing routine maintenance intended to remove a small number of servers for one of the S3 subsystems used for billing. Unfortunately, the command was entered incorrectly and inadvertently took down a much larger number of servers, which took the company far longer to restart than it had originally anticipated.

According to the incident postmortem, Amazon had not fully restarted some of the impacted servers for several years, which further complicated the restart. The company has modified the tool that was used to take the servers down so that it removes them more slowly in the future, giving the company’s staff more time to intervene if they notice any unanticipated complications.
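To make the idea of a slower removal tool concrete, here is a minimal sketch in Python. It is not Amazon's tool: the function name, parameters, and server names are hypothetical, and the cap on how many servers a single command may remove is an extra assumption that the paragraph above does not describe. It only illustrates how working in small batches with pauses gives an operator a chance to stop a mistyped command.

```python
import time
from typing import Callable, List


def remove_servers_gradually(
    servers: List[str],
    requested: int,
    max_fraction: float = 0.05,   # assumed safety cap, not from the article
    batch_size: int = 2,
    pause_seconds: float = 0.0,
    should_abort: Callable[[], bool] = lambda: False,
) -> List[str]:
    """Take servers out of service in small batches (hypothetical tool).

    Returns the list of servers that were actually removed.
    """
    # Refuse requests that exceed the cap, so a typo in the count
    # cannot take down a large share of the fleet in one command.
    limit = int(len(servers) * max_fraction)
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} servers; limit is {limit}"
        )

    removed: List[str] = []
    for start in range(0, requested, batch_size):
        # Check before every batch whether an operator wants to stop.
        if should_abort():
            break
        batch = servers[start:min(start + batch_size, requested)]
        removed.extend(batch)      # a real tool would drain the servers here
        time.sleep(pause_seconds)  # the pause is the time window to intervene
    return removed


if __name__ == "__main__":
    fleet = [f"server-{i}" for i in range(100)]
    print(remove_servers_gradually(fleet, requested=4))
```

In this sketch, a request for 40 of the 100 servers is rejected outright, while a request for 4 proceeds two servers at a time and can be halted between batches.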

While this is not the first time a cloud services provider has experienced an outage, it was easily the largest one we’ve ever seen, and it has raised questions about the reliability of cloud services.

With a growing number of companies migrating to cloud-based service providers, the ripple effects can be enormous when one of those providers suffers an outage.

That said, cloud-based providers have an exceptionally good record to this point, and it should be noted that no company is immune to outages. Amazon’s recent incident isn’t really a viable argument against cloud migration. At the end of the day, equipment is still equipment, and no matter who manages it, there’s always a chance, however slight, of a complication resulting in downtime.

Jason Manteiga


Jason J. Manteiga, Vice President of Olmec Systems, has been with the company for over 20 years. He believes that a great work environment and a supportive team are the ultimate key to success. During his more than 25 years in IT, Jason, along with Olmec Systems, has been named to the Inc. 5000 “List of America’s Fastest Growing Private Companies” and the Channel Futures MSP 501 “Top Managed Service Providers in North America,” along with other awards and nominations. Jason earned his bachelor’s degree in Information Systems from the New Jersey Institute of Technology. He also holds Microsoft MCSE, VMware VCP, and Cisco CCNA certifications. In his spare time, Jason is a contributor to The Center for Social & Legal Research (Privacy Exchange) and a member of the Morris County Chamber of Commerce. His hobbies include cycling and kayaking. He currently lives in New Jersey with his wife, two daughters and son.
