What the Bing outage tells us about Release Management

As many of you are now aware, Microsoft’s Bing search engine suffered a 30-minute outage yesterday. According to CNET:

A Microsoft representative told CNET on Friday that the problem appears to have come when something being tested was moved onto the live site.

I spent a large portion of my career managing mission-critical IT Services at GoDaddy.com, an environment with an absolute “zero tolerance” attitude toward unplanned downtime. So, before I climb up onto my soapbox, allow me to acknowledge that I was not there, and that it’s always easier to play “Monday Morning Quarterback” than to manage these types of situations as they happen. </disclaimer>

Now, then… when looking at mission-critical IT failures, Gartner showed that 80% are rooted in “people and process failures”, and that only 20% are collectively caused by hardware failures, software failures, natural disasters, etc. This outage is no exception: while the failure was code (or configuration) related, the ultimate root cause would seem to fall under the “Release Management” process.

Although we have an incomplete picture, most of what I’ve read on the subject seems to suggest that changes were made in the test environment and then deployed directly into the live (production) environment. This is precisely the scenario the ITIL books guard against when they explicitly state that the Definitive Software Library (DSL) is the “sole provider of software for use in a release”.

Along the same lines, the test environment is generally not the best place to be making changes.  Changes are made in Dev, tested in Test, and (if successful) checked into the DSL for subsequent deployment to production.  The integration between Change Management and Release Management should have provided for a review / risk analysis / approval process prior to deployment into production.
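To make that flow a bit more concrete, here is a rough sketch of the kind of gate that sits between the DSL and production. Everything in it is hypothetical: the `Release` record, its field names, and the `deployment_blockers` helper are my own illustration, not anything from ITIL or from Microsoft’s actual tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Release:
    """Hypothetical release record; all field names are illustrative."""
    name: str
    source: str            # where the artifact was pulled from: "dev", "test", or "dsl"
    tested: bool           # did it pass in the Test environment?
    change_approved: bool  # did Change Management approve it?


def deployment_blockers(release: Release) -> List[str]:
    """Return the reasons this release may not go to production (empty if none)."""
    blockers = []
    if release.source != "dsl":
        # The DSL is the sole provider of software for use in a release.
        blockers.append("artifact did not come from the DSL")
    if not release.tested:
        blockers.append("release has not passed testing")
    if not release.change_approved:
        blockers.append("change has not been reviewed and approved")
    return blockers


def can_deploy_to_production(release: Release) -> bool:
    return not deployment_blockers(release)
```

Under this sketch, a build pushed straight out of the test environment is refused even if it was tested and approved, because its source is not the DSL.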

Additionally, the majority of the outage window was spent rolling back the change. This would seem to indicate that the rollback / remediation plan was either nonexistent or woefully inadequate. Had the plan been adequate, a change known to require a 30-minute rollback would seem to fall under a risk category that would have had it scheduled outside of peak hours.
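As a minimal sketch of that reasoning, the rule might look something like the following. The peak window and the five-minute threshold are numbers I made up for illustration; a real Change Management process would set its own.

```python
from datetime import time
from typing import Optional

# Assumed values for illustration only.
PEAK_START = time(8, 0)
PEAK_END = time(20, 0)
MAX_PEAK_ROLLBACK_MINUTES = 5


def is_peak(t: time) -> bool:
    """True if the given time of day falls inside the assumed peak window."""
    return PEAK_START <= t < PEAK_END


def may_deploy_at(t: time, rollback_minutes: Optional[int]) -> bool:
    """Decide whether a change may be scheduled at time t.

    rollback_minutes is the estimated time to back the change out,
    or None if no rollback plan exists.
    """
    if rollback_minutes is None:
        # No rollback plan at all: never schedule it.
        return False
    if is_peak(t) and rollback_minutes > MAX_PEAK_ROLLBACK_MINUTES:
        # A slow rollback is too risky during peak hours.
        return False
    return True
```

With these assumptions, a change carrying a 30-minute rollback is rejected for a 2:00 pm slot but allowed at 2:00 am, and a change with no rollback plan is rejected at any hour.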

With our currently incomplete information, it’s impossible to know whether steps in the process were bypassed altogether, or whether the error fell in the execution. In other words, it’s possible that the analysis and approval process existed and was followed, and that the Change Advisory Board (CAB) simply made a mistake or made the decision based on faulty information.

While no amount of planning and process can prevent every outage, a significant number can be avoided.  On the face of it, this would seem to fall squarely into that category.
