Lessons from American Eagle's devastating e-commerce site failure.
American Eagle’s e-commerce website was completely down from July 19th to July 23rd, 2010, and only partially functioning until July 27th. The company sells, on average, almost a million dollars a day across its e-commerce sites (ae.com, aerie.com, 77kids.com and martinandosa.com). So this eight-day outage was an expensive fiasco, in terms of both revenue and reputation.
A post in StoreFrontBackTalk reports on some of the technical details: a disk drive at the IBM hosting site went down, and then the backup drive also failed. The replacement drive could not be restored using Oracle’s backup utility. Then, although American Eagle was paying for a backup site to be maintained, when they tried to switch to the disaster recovery site, they could not “get the active logs rolling,” which meant that they could not restore the site. This was reportedly due to a lapse in maintaining the backup site.
Who’s Managing the Managed Service Provider?
As in the Apollo 13 mission, this was a failure of multiple redundant backup systems, except that here some negligence appears to have been involved. This raises the question: Who is managing the Managed Service Provider? The question has become very important as more and more companies rely on Managed Service Providers (MSPs), as well as Software-as-a-Service (SaaS) and other “as-a-Service” providers, for critical functions in their enterprise. When these systems go down, the cost can be in the millions of dollars. This type of high-visibility failure gives a black eye to hosted services in general, including SaaS vendors, whose biggest sales hurdle is often reservations about the security and availability of an outsourced service. Of course, these types of disasters can happen on the watch of in-house operations as well. There is no guarantee that an in-house staff will be more diligent.
SLAs to the Rescue? Maybe Not.
So what’s an enterprise to do? Most MSP and SaaS contracts include SLAs (Service Level Agreements), sometimes with a penalty for non-performance (e.g. 5% of the monthly fee for each 30 minutes of downtime, up to 100% of the customer's monthly fee). But the penalty is typically tiny compared to the damage the customer could incur. One potential solution is to try to get the vendor to share the pain, perhaps through clauses that let the customer recover a significant portion of the damages sustained in case of negligence. If IBM had been on the hook for lost revenues from American Eagle’s site, maybe things would have turned out differently. However, it can be extremely difficult to get service providers to take on this type of risk. If you can get one of those clauses into your contract, that’s great, but in return the service provider may charge a “risk premium” that customers of MSPs are not willing to pay.
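To make the gap between penalty and damage concrete, here is a minimal sketch of the example penalty clause above (5% of the monthly fee per 30 minutes of downtime, capped at 100%). The function name, the assumption that only full 30-minute blocks count, and the $50,000 monthly hosting fee are hypothetical and chosen purely for illustration; the roughly $1 million per day in sales and the eight-day outage come from the article, and treating all eight days as fully lost revenue is an upper-bound simplification.

```python
def sla_credit(monthly_fee, downtime_minutes,
               rate_per_block=0.05, block_minutes=30, cap=1.0):
    """Credit owed under a clause like the example in the text.

    Assumption: only completed 30-minute blocks of downtime count.
    """
    blocks = downtime_minutes // block_minutes
    fraction = min(blocks * rate_per_block, cap)  # never more than 100% of the fee
    return monthly_fee * fraction


if __name__ == "__main__":
    monthly_fee = 50_000            # hypothetical hosting fee, not from the article
    outage_minutes = 8 * 24 * 60    # eight-day outage
    daily_revenue = 1_000_000       # "almost a million dollars a day"

    credit = sla_credit(monthly_fee, outage_minutes)
    lost_revenue = 8 * daily_revenue  # upper bound: the site was partly up for some days

    print(f"SLA credit:   ${credit:,.0f}")
    print(f"Lost revenue: ${lost_revenue:,.0f} (upper bound)")
    print(f"Credit covers {credit / lost_revenue:.2%} of the loss")
```

Under these assumptions the cap is reached after only ten hours of downtime, so every additional hour of outage costs the provider nothing more while the customer's losses keep growing.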
Learning from the Manufacturing Outsourcing Experience
We can also learn some lessons from the migration to outsourced manufacturing that has become so prevalent in recent decades. In the early stages of this migration, many companies dramatically underestimated the degree of responsibility that they must retain to manage and continually monitor the performance and quality of their outsourced partners’ manufacturing operations. As a result, there were many painful episodes of worsening lead times, quality, and service levels. Over time, many firms have put in place the systems, processes, and personnel needed to provide more visibility, accountability, and ability to rectify problems.
We will likely see a similar evolution in how managed services and SaaS relationships are managed. Companies will realize that they must take on greater responsibility for the reliability and security of the services they consume. So what exactly does that mean? In addition to the upfront due diligence, it means ongoing practices: regular checks of service providers, their procedures, and their processes; verifying that they regularly perform disaster recovery tests, and examining the results; establishing and verifying checklists of critical activities; and so forth. I think that we are in an industry-wide learning process to develop effective strategies for ensuring the reliability, and for that matter the security, of managed services. In fact, it is in the interest of service providers to encourage this maturing of these relationships, to prevent outages like the recent one at American Eagle. Only then will these new service-based models reach their full potential.
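As one possible illustration of "establishing and verifying checklists of critical activities," the sketch below tracks when each activity was last verified and flags anything that failed or has gone stale. The class, field names, and review intervals are all hypothetical; a real program would depend on what evidence the provider contractually agrees to share.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ProviderCheck:
    """One critical activity the customer verifies with the provider (hypothetical)."""
    name: str
    last_verified: date
    passed: bool
    max_age_days: int  # how long a verification stays valid


def needs_attention(checks, today):
    """Return names of checks that failed or were not verified recently enough."""
    flagged = []
    for check in checks:
        stale = today - check.last_verified > timedelta(days=check.max_age_days)
        if stale or not check.passed:
            flagged.append(check.name)
    return flagged


if __name__ == "__main__":
    # Illustrative entries only; dates and intervals are made up.
    checks = [
        ProviderCheck("Disaster recovery failover test", date(2010, 1, 15), True, 90),
        ProviderCheck("Database backup restore test", date(2010, 7, 1), True, 30),
        ProviderCheck("Standby site log shipping verified", date(2010, 7, 10), False, 7),
    ]
    for name in needs_attention(checks, today=date(2010, 7, 19)):
        print("Follow up with provider:", name)
```

The point of such a list is less the code than the habit: the customer, not only the provider, keeps a dated record of when each safeguard was last shown to work.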