
CrowdStrike: The Blame Game

Reading Time: 5 minutes

 

So, another huge IT outage has occurred, this time involving CrowdStrike.  It seems like everyone and their pet tortoise has an opinion on this, so I didn’t jump in immediately.  Here are some things I haven’t personally seen mentioned.


Where’s Your DevOps Now?

 

Remember when DevOps was this shiny, new thing that everyone wanted to get involved in, while many completely missed the point of what it actually is?  Well, I won’t get into all that, except to say that people were into it and keen to be seen as “doing DevOps”.

 

One key practice of DevOps is the blameless post-mortem.  Blameless.  The idea is that it could have happened to anyone, because the systems and processes in place weren’t enough to prevent it, so we need to focus on those and improve them, rather than pointing the finger at the poor individual(s) who happened to fall into the gap.

 

It seems that the tech community as a whole has forgotten this, as I see two main camps (my network being skewed towards people specialising in the testing discipline):

  1. Let’s all laugh at their misfortune and the fact (is it a fact?) that they didn’t care enough to invest in quality and testing / fired a bunch of testers / failed to test adequately.  Main point: lol.
  2. Hey, don’t criticise the testers, it wasn’t their fault, things happen.  Main point: God, I hope this doesn’t make all testers look bad.

 

Common theme?  Blame.  Either assigning it, or trying to dodge it.  It seems we’ve forgotten that dwelling on whose fault it was isn’t actually very helpful in pinpointing how it happened, or in coming up with solutions to avoid the problem in future.  Just firing that person, or even giving that one person further training, doesn’t actually do anything to prevent it from happening to someone else.


A Whole Team Approach – to Success and Failure

 

“If it’s everyone’s responsibility, it’s no one’s responsibility.” => “If it’s not one person’s fault, it’s everyone’s fault.”

 

Well… Kind of.  In the sense that everyone has a personal responsibility to do what they can to increase, or at least maintain, the level of quality.  Not just the quality of the product or system, but also that of the processes and practices.

 

Did you know: Testing doesn’t have to be exclusively performed by testers??  Testing is an activity, not a person.  I find that teams are generally more successful when there’s a testing specialist on board, but that doesn’t mean that other people on the team (developers, product owners, UI designers, etc.) can’t also test and care about quality.  This is often referred to as a “whole team approach” to quality.

 

Sadly, it’s often the case that successes are attributed to the whole team, whereas failures are pinned on individuals – usually a tester who was expected to “assure” quality or act as a “gatekeeper”.

 

But when we see something, we should say something, regardless of our official job title or role.  #OneTeam, right?  If you see something that isn’t right, or that could be improved, say so.  If that thing causes issues in future, you’ve contributed to that failure by failing to act on your observations.

 

But let’s be clear.  Taking action doesn’t have to mean single-handedly implementing the fix or improvements.  I’ve been unfortunate enough to fall victim to the attitude of “whoever smelt it dealt it” => “whoever brought it up is responsible for it; all of it”.  Unsurprisingly, this leads to people seeing something, yet saying nothing.  That’s how quality stays low and processes remain inadequate.

 

If you agree with me on this principle, great.  (If not, let’s discuss it in the comments!)  But be prepared for negative reactions.  I once spotted some issues with a product on which I was conducting user-journey-based integration testing, but the affected area fell within the responsibility of a different team.  When I informed them of the issues, the response was extremely hostile.  I was essentially told, “keep your nose out of our team’s stuff, it’s none of your business,” despite the fact that we worked for the same company and their work was entirely relevant to the testing I was doing.  Luckily, someone with a more impressive job title than mine stepped in before I had the chance to respond, and explained why it was absolutely my business.  Just a cautionary tale to keep in mind while doing the right thing.


Surprises from the Preliminary Post Incident Review

 

I took a look at the preliminary PIR directly from CrowdStrike, and a couple of things stood out to me, which I’d like to briefly touch upon.

 

… Due to a bug in the Content Validator … [and the] trust in the checks performed in the Content Validator …

 

I don’t know whether the “Content Validator” is a separate program in its own right, or just the name they gave to this particular subset of automated checks.  Regardless: review your code, test your code, write quality code – whether it’s part of a consumer product or not.  Test code is still code, and will do precisely what we humans (even if via AI) tell it to.  Code is fallible because people are fallible.  If you wouldn’t trust a human blindly, don’t trust a computer blindly either.  You can use the principles of mutation testing to test your tests.
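
To make that last point concrete, here’s a minimal sketch of the mutation testing principle, written in Python.  The validate_content function, its “mutant”, and both tests are hypothetical examples invented for this post – they have nothing to do with CrowdStrike’s actual Content Validator.

    # Mutation testing in miniature: deliberately break ("mutate") the code under
    # test and check whether the existing tests notice.  A mutant that survives
    # tells you the tests don't test what you think they test.

    def validate_content(entries):
        """Pretend production check: every entry must be non-empty."""
        return all(len(entry) > 0 for entry in entries)

    def validate_content_mutant(entries):
        """Mutant: the check has been broken on purpose (it always passes)."""
        return True

    def weak_test(validator):
        # Only exercises valid input, so it can't tell the real code from the mutant.
        return validator(["rule-1", "rule-2"]) is True

    def stronger_test(validator):
        # Also feeds in invalid input; this is enough to "kill" the mutant.
        return validator(["rule-1"]) is True and validator([""]) is False

    for name, test in [("weak_test", weak_test), ("stronger_test", stronger_test)]:
        survived = test(validate_content_mutant)
        print(f"{name}: mutant {'survived (test too weak)' if survived else 'killed'}")

Tools exist that automate generating these mutants, but the principle is the same: if a deliberately broken version of the code still passes your tests, the tests need strengthening.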

 

In addition, even if your testing is performed by humans, don’t blindly trust that a high number of “passing” tests means there are no issues.  There will always be unknowns.  Techniques such as exploratory testing can help uncover those.

 

… How Do We Prevent This From Happening Again? … Local developer testing …

 

Your developers weren’t testing locally before?!  This is all too common.  People make jokes about the phrase “it works on my machine”, but that seems to be a step up from not even having run it on your own machine…  But, hey!  Blameless.  The dev team(s) didn’t have the culture or processes to test locally before; they’ve now realised that was an issue, and they’re going to address it.  Good for them.  Perhaps try pairing for fast feedback and fixes.
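
As one purely illustrative way of baking local testing into the everyday workflow, a pre-push hook can refuse to push anything until the local suite passes.  The sketch below assumes a Python project tested with pytest; the hook location is standard git, but the test command is a placeholder for whatever your project actually runs.

    #!/usr/bin/env python3
    # Sketch of a git pre-push hook: save as .git/hooks/pre-push and make it
    # executable.  git aborts the push if this script exits with a non-zero code.
    import subprocess
    import sys

    result = subprocess.run(["pytest", "--quiet"])  # run the local test suite
    if result.returncode != 0:
        print("Local tests failed - push aborted.  Fix (or pair on) the failures first.")
        sys.exit(1)
    sys.exit(0)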

 

… Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment. …

 

Honestly, with such a large user base, and such severe consequences when anything goes wrong, I’m surprised they weren’t already doing this.  But I’d say this case shows that delivering high-quality software isn’t just about testing.  Things like deployment strategy also play a part in delivering value, as well as in mitigating risk.  We need to broaden our mindset a little and look at all the cogs in this complicated machine.  Don’t know where to start?  Perhaps the Cynefin framework can help.  Want to manage risks better?  Look into RiskStorming.
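
For illustration, here’s a minimal sketch of what a staggered (canary) rollout loop can look like.  The stage sizes, the error budget, and the deploy_to / error_rate / rollback functions are hypothetical stand-ins, not anything CrowdStrike has described.

    import random

    STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of the fleet updated at each stage
    ERROR_BUDGET = 0.001                # halt if more than 0.1% of updated hosts misbehave

    def deploy_to(fraction):
        print(f"Deploying update to {fraction:.0%} of the fleet...")

    def error_rate():
        # Stand-in for real telemetry gathered from the hosts updated so far.
        return random.uniform(0.0, 0.0005)

    def rollback():
        print("Error budget exceeded - rolling back and halting the rollout.")

    def staggered_rollout():
        for fraction in STAGES:
            deploy_to(fraction)
            if error_rate() > ERROR_BUDGET:
                rollback()
                return False
        print("Rollout complete across the whole fleet.")
        return True

    staggered_rollout()

The exact numbers don’t matter; the point is that each stage is a chance to observe real-world behaviour and stop before a defect reaches every machine.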


Takeaways

 

  • Focus on finding solutions, not assigning blame
  • Quality is everyone’s responsibility
  • See something, say something
  • Don’t trust automation blindly
  • It doesn’t stop at test cases – explore to uncover more
  • Test that your tests test what you think they test 😉
  • Think beyond testing activities and the system under test (SUT) to further improve quality


Like what you’ve read and think you could use someone like me on your team?  You’re in luck!  I’m looking for work in Quality Engineering and Scrum Master roles, and am available immediately.  Get in touch to discuss how I can support you.  You can also find me on LinkedIn.

2 thoughts on “CrowdStrike: The Blame Game”

  1. Developers sometimes will not do local testing, and in this case it’s blindingly obvious (to me at least) why they would not do “local” testing. I did this a few times in one job, and when I failed I wasted a day re-installing the operating system on my workstation. Local testing here really means the company providing you with two boxes – sometimes with a virtual machine, but probably bare metal rather than a pointless VirtualBox virtual machine. Although a VirtualBox guest would probably have caught this one this time. SUTs are really hard to get right.

    1. Thanks for sharing, Conrad. I think you’re right that doing testing literally on your local, everyday machine sometimes isn’t the right option. I suppose what I was thinking about in this context was the difference between trying to run and operate your own changes before pushing them to shared environments, and just not checking your own work at all, on any machine. I’ve seen devs on both sides of that spectrum.
