My issues with Terraform
Gripes on Terraform in the context of Amazon Web Services
Published: Thursday, Nov 17, 2022 Last modified: Thursday, Nov 14, 2024
Terraform language (HCL) is painful
Terraform 0.12 comes to mind as a particularly awkward upgrade, since it overhauled the language itself.
Gotchas, like not being able to use variables in backend blocks (the stanzas terraform init consumes), lead teams to adopt Terragrunt, which can add a lot of complexity for basic functionality.
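A minimal sketch of the gotcha, with a hypothetical S3 backend (bucket and table names invented):

```hcl
terraform {
  backend "s3" {
    # Expressions are not allowed in backend blocks, so everything here
    # must be a literal (or passed via `terraform init -backend-config=...`).
    # bucket = var.state_bucket  # <- error: "Variables may not be used here."
    bucket         = "my-org-terraform-state"
    key            = "network/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
  }
}
```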
The future of IaC looks like using familiar bindings in common languages such as JavaScript and Go via the CDK. No one enjoys HCL, and HCL tooling is lacking.
Developers hate HCL, and hence DevOps / SRE is born: a developer who endures HCL and slow iteration.
Terraform state is non-trivial to manage
If you rename a stack, you must remember to rename the state file or its location to match, or else Terraform will no longer see your existing infrastructure.
Don’t forget to lock access to state when working in a team, and if you hit timeouts, prepare to run terraform force-unlock manually outside of your pipeline.
Blast radius
Most organisations split their IaC arbitrarily into different stacks to limit the blast radius. The more stacks, the more difficult it is to deploy the complete infrastructure, or even a single service that someone decided to split across stacks. Painful.
Refactoring someone’s overzealous splits into something sensible is even more painful, since resources must be moved between state files.
Timeouts
If you’re waiting for infrastructure to provision and there is some break in connectivity or the operation, your state file will be left inconsistent.
You then need to manually import resources to repair the situation (see the sketch below).
This easily happens when provisioning modern infrastructure (cough Kubernetes) takes well over twenty minutes.
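Since Terraform 1.5 you can at least declare the repair in configuration with an import block rather than hand-running terraform import; the address and ID here are hypothetical:

```hcl
# Adopt a resource that exists in AWS but is missing from state,
# e.g. after an interrupted apply.
import {
  to = aws_cloudwatch_log_group.cluster
  id = "/aws/eks/my-cluster/cluster"
}
```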
Serverless is super complex in Terraform
Cloudformation has the AWS::Serverless-2016-10-31 transform, whereas Terraform requires you to manually set up everything (a sketch follows the list):
- the log group
- execution role
- IAM awkwardness aws_iam_role_policy_attachment
- the zip archive process
- the s3 bucket
- the networking
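Wired together, a minimal and entirely hypothetical version of the above looks like this (the networking and S3 upload omitted for brevity):

```hcl
# The zip archive process
data "archive_file" "lambda" {
  type        = "zip"
  source_file = "lambda/handler.py"
  output_path = "build/handler.zip"
}

# The log group (Lambda logs to /aws/lambda/<function name>)
resource "aws_cloudwatch_log_group" "hello" {
  name              = "/aws/lambda/hello"
  retention_in_days = 14
}

# The execution role
resource "aws_iam_role" "hello" {
  name = "hello-lambda"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

# The IAM awkwardness
resource "aws_iam_role_policy_attachment" "hello_logs" {
  role       = aws_iam_role.hello.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

# Finally, the function itself
resource "aws_lambda_function" "hello" {
  function_name    = "hello"
  role             = aws_iam_role.hello.arn
  handler          = "handler.lambda_handler"
  runtime          = "python3.12"
  filename         = data.archive_file.lambda.output_path
  source_code_hash = data.archive_file.lambda.output_base64sha256
  depends_on       = [aws_cloudwatch_log_group.hello]
}
```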
Update: You might want to try serverless.tf
Helm (Kubernetes) doesn’t make sense with Terraform
Using helm_release is awkward with Terraform.
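For illustration, a hypothetical helm_release (chart and values invented); note how the chart’s values end up as YAML serialised into HCL strings:

```hcl
resource "helm_release" "ingress" {
  name             = "ingress-nginx"
  repository       = "https://kubernetes.github.io/ingress-nginx"
  chart            = "ingress-nginx"
  namespace        = "ingress-nginx"
  create_namespace = true

  # YAML-in-HCL: values are serialised strings, opaque to `terraform plan`
  # beyond a text diff.
  values = [yamlencode({
    controller = { replicaCount = 2 }
  })]
}
```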
With Kubernetes objects (YAML) you are better off with a Gitops workflow using ArgoCD.
Two IaC git repos are very common: org-{tf,k8s}. Do not mix k8s declarations with Terraform!
Drift
The Terraform AWS provider is not perfect. AWS might change a value behind the scenes, and then your plan is inconsistent.
This is called drift and it’s hard to manage across stacks and environments.
Drift also happens when someone changes something manually and doesn’t update the code, and it’s hard to tell such manual changes apart from provider-side updates.
Manual remediation of mismatched actual and desired state across stacks is unsustainable. Better ideas are Gitops workflows and higher-level constructs like Cloudformation, which try to enforce IaC intent.
Destroy
You can detect destroys from a saved plan (terraform plan -out tfplan) like so:
terraform show --json tfplan | jq -r '.resource_changes[].change.actions[]' | grep -q delete && exit 2 || exit 0
Nonetheless, there are times when you need to refactor or rename something, and any resulting destroy will be treated carefully in a production environment.
Often a destroy has led to a downed service or, worse, lost data, and it’s difficult to know the impact (without a lot of experience), let alone roll back.
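One partial mitigation: since Terraform 1.1, a moved block records a rename in configuration so a refactor doesn’t plan a destroy/create cycle. The addresses here are hypothetical:

```hcl
# Rename without destroying: Terraform updates the state address instead
# of deleting the old resource and creating a new one.
moved {
  from = aws_s3_bucket.old_name
  to   = aws_s3_bucket.new_name
}
```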
Poor support
If you have a problem with Terraform on AWS, will AWS help? No, they will expect you to use Cloudformation, where they can actually see your stack (see “Not easy to debug”).
The forum is hit and miss, and good luck on GitHub.
How do you get support from HashiCorp? By deploying Terraform Enterprise, or by being foisted onto Terraform Cloud, which isn’t an option for some highly regulated companies.
Bootstrapping
It’s typical that you need to bootstrap each AWS account with a state bucket and a DynamoDB lock table.
This manual, “one off” step is difficult to automate: the stack that creates the state backend has nowhere to store its own state yet.
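The resources themselves are simple; it’s the chicken-and-egg that hurts. A hypothetical bootstrap (names invented), usually applied once per account with local state:

```hcl
# State bucket; versioning gives you a recovery path for corrupted state.
resource "aws_s3_bucket" "state" {
  bucket = "my-org-terraform-state"
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table; the S3 backend requires the hash key to be exactly "LockID".
resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```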
Not easy to debug
AWS resources are often interdependent, and Terraform doesn’t know about these complex relationships. Getting depends_on correct is non-trivial (a sketch follows), and when it fails you don’t know why, because you can’t see the stack’s events.
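A classic hypothetical case: an instance whose user_data needs outbound internet, which depends on a NAT route Terraform cannot infer because nothing references it:

```hcl
resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # hypothetical
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.private.id   # assumed declared elsewhere

  # Without this, the instance may boot before the NAT route exists and
  # its user_data will silently fail to download anything.
  depends_on = [aws_route.private_nat]    # assumed declared elsewhere
}
```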
Provisioning even trivial stacks creates hundreds of events!
When any of them goes wrong, you don’t know the reason why, so you can’t act on it. That’s because you’re using Terraform.
AWS’s native tooling, Cloudformation, rolls back by default when provisioning goes wrong, or at least gives a clear indication of why something failed. With Terraform you will be left picking up the pieces when changes inevitably go wrong.
Re-run plan, apply and pray
When things go wrong, the plan is inconsistent. The typical solution is to re-run. It sounds silly, but it actually happens that you need to re-run {plan,apply} 3-4 times for AWS to provision a complex shared networking infrastructure! Especially for “Pending Acceptance” cases with RAM (Resource Access Manager).
This might be avoided with depends_on, though as mentioned this is really not easy to get right without knowing the details of the underlying cloud service.
Dumb and Slow
Terraform tries to create a resource and then polls until it exists, unlike native tooling, which is aware of the underlying events and dependencies.
You won’t have a good idea of progress, and you will often hit timeouts on complex Kubernetes stacks, especially if you do things like manipulate aws_auth_roles, which only works once the EKS cluster is up after ~20 minutes.
Upgrade issues
Upgrading Terraform and the provider lock file (.terraform.lock.hcl) is not the same as bumping Node’s package-lock.json or Go’s go.mod. Watch out!
Terraform is constantly changing, and it’s difficult to find a stable upgrade path. As a result, many organisations sit on old Terraform versions, which only makes the problem worse!
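Version constraints help a little. A hypothetical pinning setup, keeping in mind that .terraform.lock.hcl records provider hashes rather than a full dependency graph, and terraform init -upgrade will happily rewrite it:

```hcl
terraform {
  required_version = "~> 1.9" # hypothetical

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # hypothetical; allows any 5.x minor upgrade
    }
  }
}
```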
Modules misunderstood
The beauty of using Terraform modules is thought to be re-using existing third-party solutions. In practice, off-the-shelf modules are a leaky abstraction and introduce dependencies that are difficult to manage (hint: .terraform.lock.hcl won’t help, since it pins providers, not modules!).
The real reason to use modules is organisational compliance.
Do not think of open-source Terraform community modules as plug-and-play components that solve your infrastructure gaps. They are often generic, bloated and won’t meet your organisation’s (compliance) requirements.
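For instance, pulling in the popular community VPC module looks trivial, but you inherit its entire resource graph, defaults and provider constraints (version pin hypothetical):

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # hypothetical pin; the lock file won't record this

  name = "main"
  cidr = "10.0.0.0/16"
}
```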
Conclusion
Terraform makes it harder to provision, debug and refactor IaC than using native tooling.
The future is probably a mix of something like aws cloudformation deploy after using the CDK to generate the YAML, and managing your Kubernetes with a Gitops workflow.
Gone is Terraform’s beloved plan and apply, with its naive polling to ensure resources are there. Infrastructure is no longer a set of static, independent resources, a world that Terraform served well.