Configuration Drift: Why It’s Bad and How to Eliminate It
Configuration drift is when the configuration of an environment gradually changes and is not in line with requirements. Eliminate it with immutability and GitOps practices.
What Is Configuration Drift?
Configuration drift is when the configuration of an environment “drifts”, or in other words, gradually changes, and is no longer consistent with an organization’s requirements. Configuration drift happens when changes to software and hardware are made ad hoc, without being recorded or tracked. This can lead to unexpected system behavior, instability, or downtime. Configuration drift also has a major impact on security vulnerability management.
In the DevOps world, it is critical to ensure development, test, and production environments are as similar as possible. Configuration drift in any of these environments can explain why code works in one environment but not in another. For example, code that worked perfectly in a testing environment can fail to deploy in a production environment – this could be due to configuration drift of the software itself or its surrounding environment.
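To make the idea concrete, drift between two environments can be pictured as a diff between their settings. The following is a minimal sketch, not a real tool: the function name `diff_config` and the settings in `test_env` and `prod_env` are hypothetical and chosen only for illustration.

```python
from typing import Any


def diff_config(expected: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (expected_value, actual_value)} for every key that differs.

    A key that exists on only one side is reported with None on the other side,
    so this sketch cannot distinguish "missing" from an explicit None value.
    """
    missing = object()  # sentinel for keys present on only one side
    drift = {}
    for key in sorted(expected.keys() | actual.keys()):
        want = expected.get(key, missing)
        have = actual.get(key, missing)
        if want != have:
            drift[key] = (
                None if want is missing else want,
                None if have is missing else have,
            )
    return drift


# Hypothetical settings for two environments that are supposed to match.
test_env = {"tls_min_version": "1.2", "max_connections": 200, "debug": False}
prod_env = {"tls_min_version": "1.0", "max_connections": 200, "log_level": "warn"}

for key, (want, have) in diff_config(test_env, prod_env).items():
    print(f"{key}: expected {want!r}, found {have!r}")
```

In this toy case the diff reports three drifted keys (`debug`, `log_level`, and `tls_min_version`), which is the kind of difference that can explain why code behaves differently in production than in testing.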
Why Configuration Drift Matters
Configuration drift can lead to serious consequences, including:
- Security vulnerabilities—misconfigurations or unauthorized changes to configuration can lead to issues like escalation of privilege, use of vulnerable open source components, vulnerable container images, images pulled from untrusted repositories, or containers running as root.
- Inefficient resource utilization—configuration drift can result in over-provisioned workloads or older workloads that keep running when they are no longer needed, which can significantly impact cloud costs.
- Reduced resilience and reliability—configuration problems in production can cause crashes, bugs, and performance issues, which can be difficult to debug and resolve.
These consequences underscore the need for a robust approach to configuration management.
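Some of the security issues listed above can be checked mechanically. The sketch below assumes a simplified, hypothetical workload description (a plain dictionary with `image`, `run_as_user`, and `privileged` keys) and a hypothetical trusted registry, `registry.example.com`. It does not follow the schema of any particular orchestrator.

```python
# Hypothetical internal registry; replace with whatever your organization trusts.
TRUSTED_REGISTRIES = ("registry.example.com/",)


def find_violations(workload: dict) -> list[str]:
    """Return human-readable findings for one workload description."""
    violations = []

    image = workload.get("image", "")
    if not image.startswith(TRUSTED_REGISTRIES):
        violations.append(f"image {image!r} is not from a trusted registry")

    # Assumption of this sketch: a missing user is treated as UID 0 (root).
    if workload.get("run_as_user", 0) == 0:
        violations.append("container runs as root")

    if workload.get("privileged", False):
        violations.append("container runs in privileged mode")

    return violations


# Hypothetical workloads: the first matches policy, the second has drifted.
workloads = {
    "billing": {"image": "registry.example.com/billing:1.4.0", "run_as_user": 1000},
    "debug-tools": {"image": "docker.io/someone/tools:latest", "privileged": True},
}

for name, spec in workloads.items():
    for finding in find_violations(spec):
        print(f"{name}: {finding}")
```

Running a check like this on every change, rather than occasionally, is what turns it from an audit into drift detection.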
Configuration Drift Examples
Here are two examples of how configuration drift can present a risk to an organization.
Example 1: Acceptable Risk
Suppose a company adds a new section to its application that lets customers use its services more easily. To support it, the business team opens a communication port on the firewalls and servers for a proprietary protocol and creates a change ticket.
If this change is not properly documented, an auditor will eventually find the port and ask why it’s open and whether its risk is acceptable. It could take hours for the security team to track the origin of the open port.
The risk in this scenario is acceptable, but the rationale for opening the port is not readily apparent to the auditor. The security team would have saved time if it had tracked the configuration drift.
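One way to picture tracking in this scenario is to compare the ports that are actually open against a record of approved changes. The following is a minimal sketch under stated assumptions: the port numbers, the ticket identifiers such as `CHG-0001`, and the function name `audit_open_ports` are all hypothetical, and the set of open ports is assumed to come from a scan performed elsewhere.

```python
def audit_open_ports(open_ports: set[int], change_log: dict[int, str]) -> tuple[list[int], list[int]]:
    """Compare observed open ports with documented ones.

    Returns (undocumented, stale):
      undocumented - ports that are open but have no change ticket
      stale        - ports that have a ticket but are no longer open
    """
    documented = set(change_log)
    undocumented = sorted(open_ports - documented)
    stale = sorted(documented - open_ports)
    return undocumented, stale


# Hypothetical scan result and change log (port -> ticket and rationale).
observed = {22, 443, 9443}
change_log = {
    22: "CHG-0001: SSH for administration",
    443: "CHG-0002: HTTPS for the customer portal",
    8080: "CHG-0003: temporary test endpoint",
}

undocumented, stale = audit_open_ports(observed, change_log)
print("open without a ticket:", undocumented)
print("ticket without an open port:", stale)
```

With a record like this, the auditor's question about an open port is answered by a lookup rather than hours of investigation.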
Example 2: Unacceptable Risk
An application developer must repeatedly log into a server to check things or make small changes. The developer can log in with a regular account but must obtain special credentials to make administrative changes in production. Requesting the admin credentials each time is tedious, so the developer adds the regular account’s user group to various privileged groups.
Modifying even one server to enable easier access poses a significant risk to the company. The security team must manually audit the server to find out what changes the user made.
7 Common Causes of Configuration Drift
Here are some of the most common reasons for configuration drift:
- Software/firmware patches—applying patches and updating network equipment is an important part of maintaining a secure network. Unfortunately, it can also lead to configuration changes. For example, updating the firmware of a network switch can enable new network services or change its settings.
- New resources—in many environments, both on-premises and cloud-based, new devices are constantly added to the network. It is important to ensure that each new device or resource matches the expected configuration, but under pressure this is often overlooked.
- Temporary fixes—a major day-to-day function of IT operations is to solve problems in computing systems and restore them to normal operations. These could be networking issues, malfunctioning endpoints, or application issues. IT staff will often change a system’s configuration to resolve a problem and get it to work again. These temporary fixes provide an immediate solution but will result in configuration drift down the line.
- Insufficient communication of changes—when anyone in a DevOps team makes a change to configuration, they need to make this change known to the entire team, and also receive permission to make the change. In reality, many changes are made without proper communication, leading to unknown configuration drift.
- Lack of clarity about expected state—in many cases, the expected state of a system is not well defined before teams start using it. Individuals in a team might decide to make changes without realizing that these changes cause the system to diverge from its expected state.
- Changes made by end users or customers—in many business environments, customers can also make changes to infrastructure, especially if a company deploys software or equipment in a customer’s data center. Without close communication with the customer, it is easy for customer staff to make changes that introduce configuration drift.
- Out of date documentation—when documentation is created manually, it inevitably gets out of date. Teams working on urgent developments or fixes rarely have time to update documentation to reflect all their changes. Because documentation is a low priority, changes are not documented and the system drifts out of its expected state.
Key Strategies for Eliminating Configuration Drift
It is possible to minimize configuration drift with the right tools and automation techniques. However, as long as DevOps teams are able to make manual changes to applications and infrastructure, drift is inevitable. These gradual changes in software configuration are often not observed until an application failure occurs. When that happens, engineers spend valuable time trying to understand why something that worked in one environment behaves differently in another.
There are two key strategies that can eliminate configuration drift once and for all:
- Immutability—organizations are transitioning to immutable infrastructure, such as containers, which by definition cannot be changed after it is deployed.
- GitOps—this is a new development process that stores all configuration as code in a central Git repository. Any changes to live environments require a pull request to this repository, which must be properly authorized and has a full audit trail.
Immutable Infrastructure
Immutability is the quality of not changing. Using immutable infrastructure requires provisioning components up front and never touching them after deployment. This approach means an organization never updates components like containers and servers. Instead, the infrastructure management destroys old components and replaces them with new ones.
Immutable servers are the most common implementation of immutable infrastructure. When a server requires a patch or update, it is decommissioned and a new server is deployed. Rather than logging into the server via SSH and updating its software, engineers make each change to the application by pushing code to a Git repository (i.e., git push).
The drawback of allowing changes to deployed infrastructure is the lack of certainty about the deployed system’s state. Immutable infrastructure, by contrast, is more reliable, predictable, and consistent. It simplifies the software development process and facilitates operations, preventing issues common in mutable infrastructure.
The key to maintaining visibility into the infrastructure’s state is frequently replacing servers using version-controlled, known configurations. Thus, each update resets the infrastructure to a known state to avoid configuration drift. Every configuration change starts with a documented, verified push to the Git code repository.
Immutability allows organizations to remove SSH access to prevent undocumented and manual fixes. This makes the deployed infrastructure safer and reduces the risk of complicated ad-hoc setups that result in downtime.
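The replace-instead-of-modify idea can be modeled in a few lines. This is a toy model only, assuming a hypothetical `Server` record and `roll_out` helper. It uses language-level immutability to illustrate the principle and does not provision real infrastructure.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Server:
    """A deployed server, identified by the prebuilt image version it was created from."""
    name: str
    image_version: str


def roll_out(fleet: list[Server], new_version: str) -> list[Server]:
    """Build a replacement fleet from the new image; old records are discarded, not edited."""
    return [replace(server, image_version=new_version) for server in fleet]


fleet = [Server("web-1", "1.4.0"), Server("web-2", "1.4.0")]

try:
    # An in-place "hotfix" is rejected because the record is frozen.
    fleet[0].image_version = "1.4.1"
except FrozenInstanceError:
    print("in-place change refused")

# The only way to change the fleet is to replace it from a known version.
fleet = roll_out(fleet, "1.4.1")
print(fleet)
```

Because every server in the new fleet is created from the same known version, there is no per-server history of manual changes to reconstruct.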
GitOps
GitOps helps modernize software operations and management, allowing developers to manage code and infrastructure with a declarative approach. Typically, a Git repository is the single source of truth that keeps applications in sync.
GitOps tools can help manage configuration drift by establishing a source of truth (i.e., Git) across all software deployments. In standard DevOps workflows, understanding what caused a failed deployment can be challenging because changes can be made outside the CI/CD pipeline. With GitOps, developers can use Git tools to track the deployment history, so teams know who deployed each change and when.
In a GitOps environment, all deployments are traceable to a single Git repository, with the commit history in Git serving as a deployment history. GitOps forces updates through an approved, unified CI/CD channel. When the GitOps tool detects a difference between the live cluster’s state and the desired state declared in the Git manifests, it marks the application as “out of sync” and can revert it to the desired configuration, thus preventing configuration drift.
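The detect-and-revert behavior described above can be sketched as a single reconciliation pass. This is a simplified illustration, not the implementation of any specific GitOps tool: the `reconcile` function and the resource names and specs are hypothetical, and both states are modeled as plain dictionaries instead of a Git repository and a live cluster.

```python
import copy


def reconcile(desired: dict, live: dict) -> list[str]:
    """Make `live` match `desired` in place and return the names that were out of sync."""
    out_of_sync = []
    for name in sorted(set(desired) | set(live)):
        if name not in desired:
            # Resource exists only in the live state: it was created by hand, so prune it.
            del live[name]
        elif name not in live or live[name] != desired[name]:
            # Resource is missing or was modified: restore the declared spec.
            live[name] = copy.deepcopy(desired[name])
        else:
            continue  # already in sync
        out_of_sync.append(name)
    return out_of_sync


# Desired state, as it would be declared in the Git repository (hypothetical values).
desired = {
    "web": {"image": "web:1.4.0", "replicas": 3},
    "worker": {"image": "worker:2.1.0", "replicas": 1},
}

# Live state after someone scaled "web" by hand and started a debug workload.
live = {
    "web": {"image": "web:1.4.0", "replicas": 5},
    "worker": {"image": "worker:2.1.0", "replicas": 1},
    "debug-shell": {"image": "busybox:latest", "replicas": 1},
}

print("out of sync:", reconcile(desired, live))
print("live matches desired:", live == desired)
```

Running a pass like this on a schedule means a manual change survives only until the next reconciliation, which is why the lasting way to change the system is a commit to the repository.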
Preventing Configuration Drift with Aqua
Aqua can prevent containers from drifting away from their original purpose. Cloud-native workloads are designed to be immutable. This immutability principle underpins a cloud-native security strategy known as drift prevention.
Drift prevention keeps your containers immutable and protects you from both malicious attacks and bad habits. It does so by blocking executables that were not part of the original image, by preventing the container from running when image parameters have changed, or both.
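Conceptually, blocking executables that were not part of the original image resembles an allowlist keyed by file content. The sketch below illustrates that general idea only. It is not Aqua's implementation, and the function names, paths, and file contents are hypothetical.

```python
import hashlib


def build_manifest(image_files: dict[str, bytes]) -> dict[str, str]:
    """Map each executable path in the original image to the SHA-256 of its contents."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in image_files.items()}


def may_execute(path: str, data: bytes, manifest: dict[str, str]) -> bool:
    """Allow execution only if the file was in the image and is byte-for-byte unchanged."""
    return manifest.get(path) == hashlib.sha256(data).hexdigest()


# Hypothetical image contents recorded at build time.
manifest = build_manifest({"/usr/local/bin/app": b"original application binary"})

# At run time: the original binary, a modified copy, and a file dropped after deployment.
print(may_execute("/usr/local/bin/app", b"original application binary", manifest))
print(may_execute("/usr/local/bin/app", b"tampered application binary", manifest))
print(may_execute("/tmp/miner", b"downloaded after deployment", manifest))
```

Only the first check passes, because it is the only file whose path and contents both match what was recorded when the image was built.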