
Terraform State Management: Lessons from Production Incidents

Real lessons on Terraform state - setting up S3 backends, handling locks, recovering from disasters, and mistakes I've made along the way.

Terraform · IaC · AWS · Advanced

Terraform state is the source of truth for your infrastructure. It maps your configuration to real resources, tracks dependencies, and determines what changes are needed on each apply. Mismanaging state is one of the fastest ways to corrupt your infrastructure.

I learned this the hard way. Here's what I know now.

Why Remote State Matters

Early in my Terraform journey, I used local state files. It worked fine until a colleague and I ran terraform apply simultaneously. The resulting state corruption took hours to untangle, and we had to manually reconcile resources in the AWS console.

Remote state with locking solves this completely. For AWS, I use S3 with DynamoDB:

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}

The DynamoDB table provides locking - only one operation can run at a time. The S3 bucket stores state with versioning enabled for recovery.

Setting Up the Backend

I create the backend infrastructure separately, usually manually or with a bootstrap script:

resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

Versioning is critical. It's saved me multiple times when state got corrupted or accidentally modified.
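
The bootstrap above covers versioning and locking. I'd also add default encryption and a public access block to the bucket - a sketch using the standard AWS provider resources (AES256 shown; swap in aws:kms if you manage a key):

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}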

Handling Lock Issues

When someone else is running Terraform, you'll see:

Error: Error acquiring the state lock

Lock Info:
  ID:        abc123def456
  Operation: OperationTypeApply
  Who:       jenkins@ci-runner-5

Usually, just wait for the other operation to complete. But if a CI job crashed mid-apply, you'll need to force unlock:

terraform force-unlock abc123def456

Be careful. I once force-unlocked while a colleague was still running an apply. The resulting state corruption required manual cleanup. Always verify no operation is running before force unlocking.

State Commands I Use Regularly

Listing resources:

terraform state list
terraform state list module.database

Viewing resource details:

terraform state show aws_instance.web

Moving resources (when refactoring):

terraform state mv aws_instance.web module.compute.aws_instance.web
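
Since Terraform 1.1, a moved block does the same refactor declaratively, so everyone who pulls the config gets the move on their next plan instead of running the command by hand - mirroring the mv above:

moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}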

Removing from state (when moving to manual management):

terraform state rm aws_instance.legacy

After any state manipulation, I immediately run terraform plan to verify no unintended changes.
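
Before any mv or rm, I also pull a local snapshot so there's something to roll back to:

terraform state pull > backup.tfstate
terraform state mv aws_instance.web module.compute.aws_instance.web
terraform plan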

Importing Existing Resources

When taking over existing infrastructure, import brings resources under Terraform management:

terraform import aws_instance.web i-0abc123def456789

Then run terraform state show to see current attributes and update your configuration to match. The goal is terraform plan showing no changes.

Terraform 1.5+ has import blocks which I prefer:

import {
  to = aws_instance.web
  id = "i-0abc123def456789"
}
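
Import blocks also pair with config generation, which drafts HCL for the imported resource so you have a starting point to edit rather than writing it from scratch (review the generated file before applying):

terraform plan -generate-config-out=generated.tf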

Recovering from Disasters

Corrupted state: S3 versioning saves you. List previous versions and restore:

aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix prod/infrastructure.tfstate

aws s3api get-object \
  --bucket mycompany-terraform-state \
  --key prod/infrastructure.tfstate \
  --version-id "abc123" \
  recovered.tfstate

terraform state push recovered.tfstate
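
One caveat: state push checks the lineage and serial of the remote state, so pushing an older copy can be rejected. If you're sure the recovered file is the right one, force it:

terraform state push -force recovered.tfstate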

Lost state without versioning: This is painful. You'll need to import every resource manually. I've done this once - it took a full day for a moderately complex environment. Enable versioning.

Stuck lock: Check DynamoDB directly - the LockID is the bucket name plus the state key:

aws dynamodb get-item \
  --table-name terraform-state-locks \
  --key '{"LockID": {"S": "bucket/key.tfstate"}}'

Handling State Drift

Drift happens when someone changes resources outside Terraform. I detect it during regular plans:

terraform plan
# Shows unexpected changes

Options:

  1. Accept the drift - Update your configuration to match reality
  2. Revert the drift - Apply to restore desired state
  3. Refresh only - When drift is in attributes you don't manage:

terraform apply -refresh-only
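
To catch drift before the next routine plan, the -detailed-exitcode flag makes this easy to script in a scheduled CI job (a sketch - the alerting is up to your pipeline):

terraform plan -refresh-only -detailed-exitcode
# exit code 0: no drift, 1: error, 2: drift detected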

Mistakes I've Made

Running apply without a plan review. I changed a security group rule that broke production traffic. Now I always review plans, even for "simple" changes.

Force unlocking during an active operation. Cost me hours of manual reconciliation. Always verify the operation is truly abandoned.

Not enabling versioning from day one. Lost state on an early project. Recovery was manual and painful. Now it's the first thing I configure.

Storing sensitive data in terraform.tfvars. Accidentally committed database passwords. Now I use SSM Parameter Store exclusively for secrets.
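
For reference, reading a secret from SSM in Terraform looks like this (the parameter name is illustrative). Note that the decrypted value still ends up in state, so restrict who can read the state object:

data "aws_ssm_parameter" "db_password" {
  name = "/prod/db/password"   # illustrative parameter name
}

# Reference it where needed, e.g. password = data.aws_ssm_parameter.db_password.value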

Key Takeaways

  • Always use remote state with locking - S3 + DynamoDB for AWS is the standard
  • Enable versioning immediately - It's your disaster recovery lifeline
  • Never force-unlock without verification - Confirm no operation is running
  • Backup before state manipulation - terraform state pull > backup.tfstate
  • Plan after every state operation - Verify no unexpected changes
  • Keep secrets out of state files - Use SSM Parameter Store or Secrets Manager

Written by Bar Tsveker

Senior CloudOps Engineer specializing in AWS, Terraform, and infrastructure automation.

Thanks for reading! Have questions or feedback?