Infrastructure as Code and Continuous Delivery - A happy couple?
Hesitant to use Continuous Delivery for your infrastructure? Let's find out why and see how to fully embrace the combination of Infrastructure as Code and Continuous Delivery.
In today’s fast-paced world of software development, the concepts of Infrastructure as Code (IaC) and Continuous Delivery (CD) have become paramount. IaC represents the practice of managing infrastructure through code, while CD is the approach of delivering software in a rapid and automated fashion. Together, they form the foundation of a modern, agile development pipeline. However, despite their potential benefits, many organizations are hesitant to fully embrace them together. They practice CD for their application software and IaC to manage their infrastructure, but they don’t use CD for their infrastructure. Weird, right? In this blog post, we’ll explore what IaC and CD are, delve into the reasons behind the hesitation, and discuss countermeasures to address those concerns.
What Is Infrastructure as Code and Continuous Delivery?
Infrastructure as Code is a methodology that treats infrastructure provisioning, management, and configuration as code. Instead of manually setting up servers, networks, and other infrastructure components, developers describe their infrastructure in code, ideally in a declarative manner, i.e. by defining the desired state rather than the steps to reach it. Popular tools like Terraform, AWS CloudFormation, and Ansible enable teams to automate infrastructure deployment, making it more predictable, scalable, and efficient.
Continuous Delivery means automating the entire software delivery process, from code commit to delivery to the customer. This could mean deploying a SaaS service to the production environment or publishing an app to the official app stores. CD aims to reduce manual intervention and increase the speed and reliability of software releases, ensuring that new features, bug fixes, and improvements are delivered to end-users as quickly as possible. A key part of CD is trunk-based development, meaning you will have either no feature branches or only short-lived ones in your source code repository. Commits are made frequently to the main branch, and each one triggers a deployment into the production environment. Each commit is tested thoroughly in your pipeline, so you have built-in QA and don’t need to rely on external testing.
Why the Hesitation?
Many teams are hesitant to automatically apply changes to their infrastructure within their pipelines. The fear is that a change will ‘break production’, causing downtime for customers and requiring immediate remediation. Let’s look into it in more detail. Here are some common concerns and viable countermeasures:
1. Lack of infrastructure knowledge
In many cases, teams have developed application code for a long time and then turned into ‘DevOps’ teams with responsibility for their infrastructure and deployments. While this is the right thing to do, infrastructure management might not feel like their ‘home turf’ and maybe there is also a knowledge gap compared to programming in a familiar language. This leads to a bigger fear of breaking production with infrastructure changes. As a result, manual approvals are set in place instead of a fully automated process.
Countermeasure 1.1 : Training and coaching
This sounds like a no-brainer, but it’s sometimes forgotten: it might not be easy for seasoned developers to go back into ‘beginner mode’ and embrace something new. They might even hide the fact that a knowledge gap exists. From my perspective, a combination of training and coaching is needed to bring them up to speed and make them comfortable with this new piece of technology:
- Training passes on the necessary general knowledge via presentations, workshops and other means.
- Coaching means a person from outside (‘the coach’) helps the team apply that knowledge to their real-life setup and provides best practices and tips.
Countermeasure 1.2 : Keep changes as small as possible
Breaking a bigger change into several smaller ones (also known as ‘reducing the batch size’) has a few positive effects. The most prominent one is that the risk of every single change is decreased, and it’s easier to understand what went wrong in case something unexpected happens. You also harden your CD process by running it more often, and you will build more and more trust in it until a deployment becomes a non-event for you.
2. Lack of testing
Ensuring that infrastructure changes won’t negatively impact existing systems is a complex challenge. Many organizations struggle with comprehensive testing strategies for their infrastructure to validate the changes made through IaC, making them reluctant to fully embrace automation. Instead, they rely on manual testing on a staging environment, and this manual step drastically increases the time it takes to deploy to production.
Countermeasure 2.1 : Use of a staging system with automated End-to-End Testing
Implement a staging environment that mirrors your production infrastructure. This allows you to test changes in a controlled environment before deploying to production. Automate end-to-end testing to validate the functionality of your infrastructure. A staging environment only violates the principles of CD if it requires a manual testing step, which it shouldn’t. You can easily deploy to staging, run tests and deploy to production in one pipeline. A variation of this is to use a continuous health check instead of a one-time test. Tools like Pingdom enable you to periodically check the health of your system from the outside, and you can apply this to both staging and production.
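To make the ‘one pipeline’ idea concrete, here is a minimal sketch in Python. It is not tied to any particular CI system: the function name run_pipeline and the two callables are hypothetical placeholders for whatever your pipeline uses to apply infrastructure changes and to run end-to-end tests.

```python
from typing import Callable


def run_pipeline(
    deploy: Callable[[str], None],
    run_e2e_tests: Callable[[str], bool],
) -> str:
    """Deploy to staging, test it, and promote to production only on success."""
    deploy("staging")
    if not run_e2e_tests("staging"):
        # Stop here: production is never touched by a change that failed on staging.
        return "stopped: staging tests failed"
    deploy("production")
    # The same checks can double as a health check after the production rollout.
    if not run_e2e_tests("production"):
        return "alert: production checks failed"
    return "released"


if __name__ == "__main__":
    deployed = []
    result = run_pipeline(
        deploy=deployed.append,          # stand-in for an IaC apply step (assumption)
        run_e2e_tests=lambda env: True,  # stand-in for a real test suite (assumption)
    )
    print(result, deployed)
```

The point of the sketch is that there is no manual gate between the two deployments: the only thing that can stop a change on its way to production is a failing automated test.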
Countermeasure 2.2 : Use tools to deploy and decommission your infrastructure automatically
Alternatively, you can also use a tool like Terratest. Terratest creates infrastructure, runs tests against it and then removes it again. It is meant for automation, so it can be integrated very easily into your CD pipeline. Just note that creating a system from scratch is not what you are doing in production, where you are usually updating existing infrastructure. This might or might not make a difference.
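The create, test, remove cycle can be sketched as follows. Terratest itself is a Go library, so this Python snippet is not its API. It only illustrates the pattern, and the names ephemeral_environment, create and destroy are hypothetical.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def ephemeral_environment(
    create: Callable[[], str],
    destroy: Callable[[str], None],
) -> Iterator[str]:
    """Create a throwaway environment and always remove it afterwards."""
    env_id = create()
    try:
        yield env_id
    finally:
        # Runs even when a test fails, so no test infrastructure is left behind.
        destroy(env_id)


if __name__ == "__main__":
    events = []
    with ephemeral_environment(
        create=lambda: "env-1",  # stand-in for provisioning (assumption)
        destroy=lambda env: events.append("destroyed " + env),
    ) as env:
        events.append("tested " + env)
    print(events)
```

The important design choice is the cleanup in the finally block: a failing test must not leave infrastructure running, otherwise every red pipeline run costs money and clutters your account.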
3. Deletion of critical data
Cloud resources can hold critical data, for example in databases. When they are accidentally deleted, it is not sufficient to recreate them. One must ensure that the actual data is restored as well, from backups or similar. Potential data loss is always a hot topic.
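Since the failure mode here is a deletion nobody intended, a pipeline can also look for deletions before applying anything. The following is a minimal sketch, assuming Terraform is used and the plan has been exported as JSON with ‘terraform show -json’. The function name and the list of protected resource types are hypothetical and would need to match your own setup.

```python
import json
from typing import Iterable, List


def find_planned_deletions(plan_json: str, protected_types: Iterable[str]) -> List[str]:
    """Return the addresses of protected resources the plan would delete."""
    protected = set(protected_types)
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement is listed as both "delete" and "create", so it is caught too.
        if "delete" in actions and change.get("type") in protected:
            flagged.append(change["address"])
    return flagged


if __name__ == "__main__":
    # Hand-written sample in the shape of Terraform's JSON plan output.
    sample_plan = json.dumps({
        "resource_changes": [
            {"address": "aws_db_instance.main", "type": "aws_db_instance",
             "change": {"actions": ["delete", "create"]}},
            {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
             "change": {"actions": ["update"]}},
        ]
    })
    print(find_planned_deletions(sample_plan, ["aws_db_instance", "aws_s3_bucket"]))
```

A pipeline step could fail the run whenever this list is not empty, which turns an accidental deletion into a red build instead of a data-loss incident.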
Countermeasure 3.1 : Be aware of cloud providers’ resource deletion policies
Understand the resource deletion policies of your cloud provider. Cloud platforms typically have mechanisms in place to prevent the accidental deletion of non-empty resources. In AWS, for example, you can only delete an S3 bucket if it is empty, so you need to empty it first, either with a dedicated ‘empty bucket’ operation or by deleting all objects yourself. For RDS databases, deletion protection blocks any delete request until it has been explicitly turned off. Be aware that, to my knowledge, it is only enabled by default when you create a production database through the AWS console. When you create databases through the API, the CLI or an IaC tool, you typically have to switch it on yourself, so verify that it is actually set.
Countermeasure 3.2 : Prevent accidental deletion with Terraform’s prevent_destroy option
If you’re using Terraform, take advantage of the prevent_destroy lifecycle argument to protect critical resources. When it is set on a resource, Terraform rejects with an error any plan that would destroy that resource, providing an additional layer of safety. This obviously includes the ‘destroy’ command, but also any change to the infrastructure that would require a re-creation of the resource. Keep in mind that the protection lives in the resource block itself, so it no longer applies once that block is removed from the configuration.
Final notes - Make a conscious decision where to invest
If you don’t have CD in place, you will always carry unnecessary process overhead. This could be the use of a staging branch (including multiple merges, pipeline runs, etc.), manual testing efforts and the like. All of this is continuous effort.
If you instead invest in implementing a simple but robust CD pipeline using the safety measures mentioned above, you will be able to treat your infrastructure code in the same way as your application code, and bring changes into production much more quickly.