Automating Crisis Response: Lessons from the CrowdStrike Incident


What could companies that suffered have done better, and how can automation help mitigate the results of similar events in the future?


CrowdStrike Incident: What Do We Know About It

CrowdStrike Holdings, Inc. is an American cybersecurity technology company based in Austin, Texas. It provides cloud workload protection and endpoint security, threat intelligence, and cyberattack response services. CrowdStrike collaborates with companies like Microsoft to deploy tools such as Falcon to protect against hacking and security threats.

On Friday, July 19, 2024, CrowdStrike released a configuration update for its Falcon Sensor software, which is installed on Windows PCs to detect intrusions and hacking attempts. The update was intended to bring minor improvements that customers would barely have noticed. Instead, a logic error in the update caused significant problems: many computers running the Falcon Sensor went into repeated reboots and showed the notorious Blue Screen of Death.

Impact of the CrowdStrike Incident

The CrowdStrike update had a profound impact, affecting about 8.5 million Windows devices across various user groups. It caused a major IT outage that reverberated globally and disrupted critical systems. The resulting downtime led to significant financial losses from halted operations, lost productivity, potential fines, and remediation costs. According to an analysis by Parametrix, Fortune 500 companies lost an estimated $5.4 billion in the outage (Parametrix, 2024).

What Affected Organizations Could Have Done Better

This incident underscores the necessity for several strategic changes, from more effective patch management to better incident management. New approaches must be adopted to ensure that organizations can restore normal operations after an incident as soon as possible.

According to coverage in respected tech publications, organizations that had trouble restoring operations struggled because of several key factors. Some were beyond their control, since they could not influence how CrowdStrike deployed its updates. However, several aspects of incident management, if implemented properly, could have saved millions.

Let’s look closer at these factors.

Incident Response Plan

A well-thought-out incident response plan must integrate the automation of incident access management.

Think of your company’s incident response plan. Is it smooth, efficient, and mostly automated to ensure consistency of operations during turmoil? Or does it resemble the response plan most companies have? Or maybe it’s nonexistent?

When it comes to user access management, most incidents require you to grant emergency access to certain users so they can limit the effects of the outage on your customers. In many companies, handling access is a cumbersome process that requires multiple approvals over email. In a crisis, this kind of access handling is unrealistic. The best alternative to this back-and-forth is automation. An effective incident response scenario can include automated access provisioning that is limited to specific roles or personnel, so that emergency access does not put data at risk. Similarly, automated access deprovisioning should be triggered by a specific event, such as an email, or by the end of a designated time period.

Zenphi can automate these processes, ensuring rapid and secure access management.
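To make the idea concrete, here is a minimal sketch of time-boxed emergency access in plain Python. It is not Zenphi's API: the `EmergencyAccessRegistry` class, the role name, and the in-memory storage are all hypothetical stand-ins for whatever identity system you actually use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    user: str
    role: str
    expires_at: datetime


class EmergencyAccessRegistry:
    """In-memory stand-in for a real identity system (hypothetical)."""

    def __init__(self, eligible_roles: set) -> None:
        self.eligible_roles = eligible_roles
        self.grants = {}  # user -> AccessGrant

    def grant(self, user: str, role: str, now: datetime, ttl: timedelta) -> AccessGrant:
        # Only pre-approved roles may skip the email approval chain.
        if role not in self.eligible_roles:
            raise PermissionError(f"role {role!r} is not eligible for emergency access")
        access = AccessGrant(user, role, now + ttl)
        self.grants[user] = access
        return access

    def revoke_expired(self, now: datetime) -> list:
        # Time-based deprovisioning: drop every grant whose window has passed.
        expired = sorted(u for u, g in self.grants.items() if g.expires_at <= now)
        for user in expired:
            del self.grants[user]
        return expired

    def revoke_all(self) -> list:
        # Event-based deprovisioning, e.g. when the incident is closed.
        users = sorted(self.grants)
        self.grants.clear()
        return users
```

In this sketch, a scheduler would call `revoke_expired` periodically, and the "incident closed" event would call `revoke_all`, so no emergency grant outlives the crisis.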

Incident Reporting

Have you ever tried to report a failing device from the very device that has failed? Now imagine doing the same when every available device is failing and your IT support is buried under incoming messages. Designing your incident reporting workflow around the worst-case scenario is a must, and automation is a great help here. Automated workflows can trigger alerts to relevant teams, enabling faster response times and minimizing the impact of incidents. [Read a detailed blog post here on how to automate incident reporting with Zenphi].
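One way to keep IT support from drowning in duplicate tickets is to group reports by symptom and raise a single alert once enough distinct devices are affected. The sketch below illustrates that idea only; the `IncidentAggregator` class, the threshold, and the alert text are assumptions, not part of any particular product.

```python
from collections import defaultdict
from typing import Optional


class IncidentAggregator:
    """Collapses many device reports into one escalation per symptom (hypothetical)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.devices_by_symptom = defaultdict(set)
        self.escalated = set()

    def report(self, device_id: str, symptom: str) -> Optional[str]:
        devices = self.devices_by_symptom[symptom]
        devices.add(device_id)  # repeat reports from one device count once
        if len(devices) >= self.threshold and symptom not in self.escalated:
            self.escalated.add(symptom)  # alert the team only once per symptom
            return f"MAJOR INCIDENT: {symptom} on {len(devices)} devices"
        return None
```

Each call to `report` returns either `None` or the one alert message that should go to the on-call team, so a thousand identical reports still produce a single page.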

Automating IT Operations

Automating IT operations can save valuable time for your team, allowing them to focus on strategic tasks rather than mundane daily routines. Your team is made up of smart, well-paid people; make that investment pay off by giving them time to devise solid plans for preventing information leaks and for recovering quickly after infrastructure shutdowns.

Zenphi can handle routine tasks for your IT team, especially automating Google Workspace admin tasks, freeing up valuable time for your team to concentrate on critical activities.

Crisis Communication

Effective communication is crucial during a crisis. Did all the companies affected by the CrowdStrike incident handle their communication perfectly? Probably not. To avoid a lengthy recovery and customer backlash in the future, make sure your crisis communication workflow is fully automated and triggered by a defined event (one we all hope will never happen, but there are no guarantees). Zenphi, among other solutions, can automate communication workflows to ensure timely and accurate dissemination of information to stakeholders. This includes sending automated notifications to employees, customers, and partners, providing updates on the situation, and outlining the steps being taken to address the issue.
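As a rough illustration, an event-triggered communication step can be as simple as filling pre-approved templates for each audience. The sketch below uses Python's standard `string.Template`; the audiences, field names, and wording are hypothetical, and actually sending the messages is left to whatever channel you use.

```python
from string import Template

# Pre-approved wording per audience, written before any crisis happens.
TEMPLATES = {
    "employees": Template("[$severity] $summary. What to do now: $next_steps"),
    "customers": Template("We are investigating an issue: $summary. Next update by $next_update."),
    "partners": Template("Service notice ($severity): $summary. Next update by $next_update."),
}


def build_notifications(event: dict) -> dict:
    """Return one ready-to-send message per audience for the given event.

    Template.substitute raises KeyError if the event lacks a required field,
    so an incomplete announcement fails loudly instead of going out half-filled.
    """
    return {audience: tpl.substitute(event) for audience, tpl in TEMPLATES.items()}
```

Keeping the wording in templates means the only thing anyone has to write during the incident is the short event description.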

What Employees Who Deploy Software Could Have Done Better

By now, almost everyone in the IT industry has pointed out that similar deployment mistakes can lead to disastrous outcomes in the future. In the case of CrowdStrike, bad luck apparently played a part. But even if we set luck aside, certain principles are crucial for every department or employee who deploys software.

Detailed Checklists and Validation Steps

Checklists and validation steps can be created and enforced using automation platforms like Zenphi. Automating the deployment protocol ensures that every checklist item is completed before an update is deployed. This makes it far less likely that a step is overlooked, and it greatly reduces the role of luck and human error.
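A checklist gate of this kind can be sketched in a few lines of Python. The checks, the field names on the release record, and the `deploy_if_ready` helper below are all hypothetical examples, not a description of how any specific vendor deploys updates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    name: str
    passed: Callable[[dict], bool]


# Illustrative checklist; a real one would mirror your own deployment policy.
CHECKS = [
    Check("tested in staging", lambda r: r.get("staging_tests_passed") is True),
    Check("rollback plan attached", lambda r: bool(r.get("rollback_plan"))),
    Check("two approvers signed off", lambda r: len(r.get("approvers", [])) >= 2),
]


def failed_checks(release: dict) -> list:
    """Return the names of checklist items the release does not satisfy."""
    return [check.name for check in CHECKS if not check.passed(release)]


def deploy_if_ready(release: dict, deploy: Callable[[dict], None]) -> list:
    """Call deploy only when every check passes; return any failed item names."""
    failures = failed_checks(release)
    if not failures:
        deploy(release)
    return failures
```

The point of the design is that the deploy step is unreachable unless the list of failures is empty, so skipping an item is no longer a matter of memory or goodwill.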

Consistency Across Deployments

By automating the deployment process, Zenphi ensures that each update follows the same rigorous process. This consistency reduces variability and helps maintain high standards across all deployments. Automated workflows can be configured to enforce specific policies and procedures, supporting security as well as compliance with internal standards and industry regulations.


Other Aspects of IT Professional Routine That Could and Should Be Automated

Pre-Deployment Testing: Automating the creation of test environments and running stress tests before updates reach production enforces rigorous adherence to deployment procedures and reduces the risk of massive shutdowns.

Rollback Procedures: Implementing automated rollback procedures that can be triggered in case of deployment failures will ensure minimal downtime.

Compliance Checks: Automating consistent compliance checks to ensure that all updates adhere to industry standards and internal policies makes your IT infrastructure more robust and crisis-resistant.
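To illustrate the rollback idea from the list above, here is a minimal sketch of a staged rollout that undoes itself when a health check fails. The ring names and the `apply_update`, `is_healthy`, and `rollback` callables are assumptions you would replace with your own tooling.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> dict:
    """Update one ring at a time; on the first unhealthy ring, roll back everything."""
    updated = []
    for ring in rings:
        apply_update(ring)
        updated.append(ring)
        if not is_healthy(ring):
            # Undo in reverse order so the most recent change is reverted first.
            undone = list(reversed(updated))
            for done in undone:
                rollback(done)
            return {"status": "rolled_back", "failed_ring": ring, "rolled_back": undone}
    return {"status": "completed", "failed_ring": None, "rolled_back": []}
```

Starting with a small ring of test machines means a bad update is caught and reverted before it reaches the rest of the fleet.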

Let us know if you can come up with other processes that can be automated, and we’d be excited to help.

About The Author
Fernanda López Guerra, CS @Zenphi

Fernanda is an experienced Customer Success manager with over 9 years in Tech and B2B SaaS. She has automated multiple operations for Zenphi customers in Education, Retail, Tech, and other verticals.