At nearly every DevOps forum, meet-up, or group meeting, I hear the battle cry from management or Ops: It's OK to fail. That we learn from failure. That they (whoever they might be) shouldn't fear failure since the changes are small and thus recovery is rapid. I have also observed that these same people (mostly developers) have not worked in heavily regulated industries where failure has serious consequences, and people can endure real harm. In my experience, there is a great deal of naiveté among the "OK-to-fail" crowd.
It's a well-worn story, but it bears repeating. On August 1, 2012, Knight Capital went live with software to take advantage of a NYSE retail investor trading program. Eight minutes after the market opened, market experts noted that fourteen stocks were already trading at higher volume than the S&P 500 SPDR (SPY), something that never happens. By 10 a.m., fifty-one stocks had higher trading volumes than the SPY. Knight Capital was the culprit. A rogue trader pressing the wrong button? A cyber-attack? Nope. It was a computer glitch.
Some of the older code that was part of the software release triggered waves of buy-sell orders when the system went live. Knight accumulated a $7 billion position before the NYSE took it offline. By the end of the day, Knight Capital had managed to limit the damage to itself to $440 million in losses and had to sell an ownership stake for an infusion of capital to help cover them. However, people, real human beings, lost a lot of money that day because of a bad release.
The NYSE was calling Knight Capital to fix the problem just five minutes after the market opened; five minutes was already too long to wait to recover from a production failure. Ultimately Knight Capital was bought by Getco, and the incident remains a landmark lesson in software development. It's fine to fail in non-prod (or safe) environments, but not in prod. Prod failures expose companies to a multitude of risk factors, with bad PR being the least of these.
NASA experienced a similar prod failure. Years back, its Mars Polar Lander crashed due to a malfunction that shut down the main engines before the craft reached the planet's surface. Like Knight Capital, NASA did not fail safely! Prod was a critical environment millions of miles away. DevOps experimentation needs to be kept safely on the ground.
Fail Frequently, Fail Safely
I am a believer in failing and learning from failure within the DevOps context. I just believe that these valuable experiences should occur in non-prod environments and without exposing your company to unnecessary risk.
You can understand why DevOps engineers are screaming for the ability, or the permission, to fail frequently. In every industry, the business is piling on the pressure to develop faster, beat the competition to the next innovation, and stay agile. Release cycles reflect this, and DevOps teams feel pushed to release code at almost any price. DevOps teams are supposed to work like Ford's moving assembly line: collaborating closely to move ideas and innovations into production.
Too often, though, the rush to deliver code takes precedence over other considerations. Developers get frustrated with security, QA, ops, and other teams that are seen as slowing down the release process while they check, consider, and weigh everything. Too often development is measured by deliverable frequency rather than prod outcomes. The problem is that recklessly giving in to this pressure undoes DevOps principles: the deployment pipeline goes retro as code is thrown over the wall to non-Dev teams and begins to pile up. Projects get delayed.
I believe there has to be a slight culture shift within DevOps: it is OK to fail frequently, as long as you also fail safely. Experiment in safe environments and learn from mistakes. We all know that some of our best ideas come out of learning from failure. But did you learn to ride a bike on the shoulder of a busy road? Do athletes perfect their technique and training at game time? The answer is no. So why do we, within DevOps, think it is OK to fail in prod?
Every human endeavor is fraught with the unexpected, the unforeseen, and imperfection. This being the case, why compound the human part of the equation by allowing failures in prod that could reasonably have been discovered or experienced in safe non-prod environments?
Knight Capital, NASA, and a shopping list of other names learned the hard way that failure in prod can be lethal, so keep that failure in non-critical environments. Prod is where proven best practice must be executed.
Prod releases should be the execution of proven steps that are the product of multiple rehearsals, failures, and input from subject matter experts across disciplines. That also demands automation: streamlining the different testing phases through automation reduces the opportunity for error, accelerates the DevOps process, and gets new services launched faster. A sketch of what such an automated release gate might look like follows.
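To make the automation point concrete, here is a minimal sketch of a release gate, in Python. It is purely illustrative: the stage names, test paths, and helper scripts (./scripts/smoke_test.sh, ./scripts/deploy.sh) are assumptions, not a reference to any real pipeline or CI tool. The idea is simply that prod is never touched until every non-prod verification stage has passed.

```python
import subprocess
import sys

# Hypothetical verification stages, run in order against non-prod environments.
# The commands and paths are illustrative assumptions, not a real pipeline.
STAGES = [
    ("unit tests", ["pytest", "tests/unit"]),
    ("integration tests", ["pytest", "tests/integration"]),
    ("staging smoke tests", ["./scripts/smoke_test.sh", "--env", "staging"]),
]

def run_stage(name: str, command: list[str]) -> bool:
    """Run one stage; a non-zero exit code means the stage failed."""
    print(f"Running {name} ...")
    return subprocess.run(command).returncode == 0

def main() -> None:
    for name, command in STAGES:
        if not run_stage(name, command):
            # Fail safely: the failure happens here, in non-prod, and the
            # release stops before prod is ever touched.
            print(f"Release blocked at {name}; prod untouched.")
            sys.exit(1)
    # Only a fully rehearsed, fully green pipeline reaches this point.
    print("All gates passed; promoting the tested artifact to prod.")
    subprocess.run(["./scripts/deploy.sh", "--env", "prod"], check=True)

if __name__ == "__main__":
    main()
```

In a real pipeline these stages would live in your CI system rather than a script, but the gating logic is the same: any failure stops the promotion, and the failure stays safely in non-prod.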
Here's a final thought. During my career as a developer, I found the quality of my code was at its highest when I was also responsible for supporting what I released in prod for some initial "warranty" period. I got tired of dealing with problems off-hours and quickly became more conservative in my sprint commitments and in the impact analysis for the changes I was making. I also provided more insightful logging, but that is perhaps a whole other discussion. The point is this: holding developers accountable for frequency AND outcomes will increase the velocity and reduce the failure rate of your software releases.