cremich.cloud — ~/blog/aws-cdk-deployed-to-the-wrong-region:-the-agnostic-stack-trap

AWS CDK Deployed to the Wrong Region: The Agnostic Stack Trap

A red pipeline, and no time to chase it

It was an ordinary sprint day when our INT pipeline went red. The deploy stopped partway through with a CloudFormation error:

Unable to fetch parameters [/my-app-int/database,/my-app-int/network] from parameter store for this account.

We had a sprint goal in front of us, a board full of committed work, and a deployment that suddenly refused to go through. And here is the honest part: we did not know where the problem was. The error pointed at Parameter Store, the parameters clearly existed, and nothing lined up.

Under sprint pressure, this is the exact moment a team quietly loses a day. Pulling two or three engineers off the goal to spelunk through CloudFormation is the kind of context switch I try hard to protect people from. But a broken deploy does not wait for your sprint review either.

The error that lied

The parameters existed. I checked. Both /my-app-int/database and /my-app-int/network were sitting in eu-central-1 with correct StringList values. So why would CloudFormation claim it could not fetch them?

That is the trap. The message says “parameter store”, so you go and stare at Parameter Store. You verify the values, the types, the permissions, and everything looks fine, which only deepens the confusion. The error was telling the truth about the symptom and lying about the cause.

Handing the investigation to Kiro

Instead of pulling the team off the goal, I decided to offload the whole investigation. I gave Kiro what it actually needed to be useful here: the AWS Agent Toolkit with the AWS MCP server, so it had real read access to our AWS account, plus the CDK and CloudFormation skills.

That single step, giving the agent real visibility into the live environment instead of just the error string, is what changed everything. Kiro did not guess. It went and looked.

It started where I would have, suspecting an IAM gap on the pipeline’s deploy role. Then it checked that hypothesis against reality, saw that both the pipeline’s OIDC role and the CloudFormation execution role had sufficient permissions, and discarded its own first theory. Then it did the thing that actually cracked the case: it compared the synthesized output of the failing stacks against the healthy ones.

What “environment” really means to CDK

To understand what it found, you need one CDK concept: a stack’s environment. In CDK, a stack’s environment is the concrete account and region it is bound to, set through env: { account, region } in the stack props.

Set it, and the stack is environment-specific: CDK knows exactly where it will live while it is still synthesizing. Leave it out, and the stack is environment-agnostic: it has no home until deploy time, when CDK resolves the target from whatever ambient AWS configuration the CLI happens to see. The CDK documentation spells out both the definition and the risk:

If either region or account are not set nor inherited from Stage, the Stack will be considered “environment-agnostic”. Environment-agnostic stacks can be deployed to any environment but may not be able to take advantage of all features of the CDK.

Source: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.StackProps.html

When you provide environment information in this way, you can use environment-dependent code and logic within your CDK app. This also means that the synthesized template could be different, based on the machine, user, or session that it’s synthesized under. This approach is often acceptable or desirable during development, but is not recommended for production use.

Source: https://docs.aws.amazon.com/cdk/v2/guide/configure-env.html

Most of the time you get away with it, because your laptop has a default region and it usually happens to be the right one. That is exactly why this class of bug is so quiet: it works on every developer machine and only misbehaves where the ambient region is different, or missing.

In a reliable CI/CD setup, that is exactly what you do not want.
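
To make the distinction concrete, here is a minimal sketch in plain TypeScript. It is a hypothetical model, not CDK source code: describeEnvironment and EnvProps are names I made up for illustration. It only mirrors the rule from the documentation quoted above, that a stack missing either value counts as agnostic, and formats the result in the aws://account/region style CDK uses to name an environment.

```typescript
// Hypothetical model, not CDK source: maps a stack's env props to an
// environment string and an "agnostic" flag.
interface EnvProps {
  account?: string;
  region?: string;
}

function describeEnvironment(env?: EnvProps): { environment: string; agnostic: boolean } {
  const account = env?.account ?? "unknown-account";
  const region = env?.region ?? "unknown-region";
  return {
    environment: `aws://${account}/${region}`,
    // Agnostic as soon as either value is missing.
    agnostic: env?.account === undefined || env?.region === undefined,
  };
}

const pinnedEnv = describeEnvironment({ account: "111122223333", region: "eu-central-1" });
const droppedEnv = describeEnvironment(undefined);

console.log(pinnedEnv.environment, pinnedEnv.agnostic);   // aws://111122223333/eu-central-1 false
console.log(droppedEnv.environment, droppedEnv.agnostic); // aws://unknown-account/unknown-region true
```

The second call is the interesting one: nothing fails, nothing warns, the stack simply has no address yet.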

What synth actually wrote down

Every CDK deploy goes through synth first. Synth produces the cloud assembly in cdk.out: the CloudFormation templates plus a manifest.json that tells the CLI, for each stack, where to deploy it and which roles to use. This is where the bug stopped being a mystery.

For every healthy stack, the manifest had a concrete environment, aws://111122223333/eu-central-1, and a fully resolved cloudFormationExecutionRoleArn. For the two failing stacks, the manifest said "environment": "aws://unknown-account/unknown-region", and their role ARNs still contained unresolved ${AWS::AccountId} and ${AWS::Region} tokens. Same code base, same synth run, two stacks that had quietly opted out of knowing where they belonged.

Those leftover tokens are the tell. When CDK has a concrete environment, it substitutes real values at synth time. When it does not, it leaves the placeholders in and defers the decision to deploy time. So the manifest was, in effect, a signed confession: two stacks with no account and no region, waiting for the CLI to fill in the blank later.
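
That comparison is easy to automate. The sketch below scans a parsed manifest for stack artifacts without a concrete environment. The Manifest and StackArtifact types cover only the fields mentioned above and are my assumption about the minimal shape; real manifests carry more. The stack ids and role ARNs in the sample data are illustrative, not our actual stacks.

```typescript
// Assumed minimal shape of the parts of cdk.out/manifest.json this check reads.
interface StackArtifact {
  type: string;
  environment?: string;
  properties?: { cloudFormationExecutionRoleArn?: string };
}

interface Manifest {
  artifacts: Record<string, StackArtifact>;
}

// Returns the ids of stack artifacts that have no concrete account or region.
// Only account and region placeholders are checked; a ${AWS::Partition}
// placeholder can remain in an otherwise pinned ARN.
function findAgnosticStacks(manifest: Manifest): string[] {
  return Object.entries(manifest.artifacts)
    .filter(([, artifact]) => artifact.type === "aws:cloudformation:stack")
    .filter(([, artifact]) => {
      const env = artifact.environment ?? "";
      const roleArn = artifact.properties?.cloudFormationExecutionRoleArn ?? "";
      return (
        env.includes("unknown-account") ||
        env.includes("unknown-region") ||
        roleArn.includes("${AWS::AccountId}") ||
        roleArn.includes("${AWS::Region}")
      );
    })
    .map(([id]) => id);
}

// Illustrative data modelled on the two cases described in this post.
const sampleManifest: Manifest = {
  artifacts: {
    HealthyStack: {
      type: "aws:cloudformation:stack",
      environment: "aws://111122223333/eu-central-1",
      properties: {
        cloudFormationExecutionRoleArn:
          "arn:${AWS::Partition}:iam::111122223333:role/cdk-hnb659fds-cfn-exec-role-111122223333-eu-central-1",
      },
    },
    AgnosticStack: {
      type: "aws:cloudformation:stack",
      environment: "aws://unknown-account/unknown-region",
      properties: {
        cloudFormationExecutionRoleArn:
          "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/cdk-hnb659fds-cfn-exec-role-${AWS::AccountId}-${AWS::Region}",
      },
    },
    Tree: { type: "cdk:tree" },
  },
};

const agnosticIds = findAgnosticStacks(sampleManifest);
console.log(agnosticIds); // [ 'AgnosticStack' ]
```

In a pipeline, the input would come from reading and JSON-parsing cdk.out/manifest.json after synth, and a non-empty result would fail the build.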

Why the deploy went ahead anyway

At deploy time, CDK treats these two categories of stack very differently.

For the environment-specific stacks, the region is baked into the assembly, so they went to eu-central-1 regardless of anything the CLI environment said. For the two agnostic stacks, CDK had to resolve a region from the ambient configuration chain: AWS_REGION, then AWS_DEFAULT_REGION, then the profile, and so on. Our Bitbucket pipeline authenticates with AWS through OIDC and sets AWS_ROLE_ARN and the web-identity token, but it never sets a region. And when nothing in that chain is configured, the CDK CLI falls back to its default region: us-east-1.
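
The fallback is easier to see as code. This is a simplified model of the lookup order described above, not the CLI's actual implementation: resolveRegion is a hypothetical function, the profile lookup is reduced to one optional argument, and the role name and token path are made up.

```typescript
type EnvVars = Record<string, string | undefined>;

// Simplified model of the lookup order: AWS_REGION, then AWS_DEFAULT_REGION,
// then the profile's region, then the us-east-1 fallback.
function resolveRegion(env: EnvVars, profileRegion?: string): string {
  return env["AWS_REGION"] || env["AWS_DEFAULT_REGION"] || profileRegion || "us-east-1";
}

// An OIDC-authenticated pipeline that sets credentials but no region.
const pipelineEnv: EnvVars = {
  AWS_ROLE_ARN: "arn:aws:iam::111122223333:role/pipeline-deploy", // hypothetical role name
  AWS_WEB_IDENTITY_TOKEN_FILE: "/tmp/web-identity-token",          // hypothetical path
};

const unpinnedRegion = resolveRegion(pipelineEnv);
const pinnedRegion = resolveRegion({ ...pipelineEnv, AWS_DEFAULT_REGION: "eu-central-1" });

console.log(unpinnedRegion); // us-east-1
console.log(pinnedRegion);   // eu-central-1
```

Credentials and region travel separately. A pipeline can be perfectly authenticated and still have no opinion about where it is deploying.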

So the two agnostic stacks aimed themselves at us-east-1. The real question is why the deploy did not fail instantly. It should have hit a wall the moment the CLI went looking for its deployment roles in that region.

It did not, because us-east-1 happened to be bootstrapped with the same default qualifier, hnb659fds. Bootstrapping provisions a whole contract of resources in a region: the deploy role, the CloudFormation execution role, the S3 asset bucket, and the /cdk-bootstrap/hnb659fds/version parameter the CLI checks before it does anything else. All of that existed in us-east-1, so the CLI’s pre-flight checks passed. It assumed the us-east-1 deploy role, published the Lambda asset to the us-east-1 asset bucket, and asked CloudFormation to create the changeset there. Every step looked healthy.

The last step is where it finally broke. The first agnostic stack's template declared its inputs as AWS::SSM::Parameter::Value<List<String>> parameters, and CloudFormation resolves those at changeset creation time, in the region it is running in. Our parameters live only in eu-central-1. In us-east-1 they simply do not exist, so resolution failed with that misleading “unable to fetch parameters” message, four stacks into a deploy that had otherwise looked perfectly normal.
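
A toy model shows why the message is accurate and misleading at once. Parameter Store is regional, so the same lookup gives different answers depending on where it runs. The parameterStore map and missingParameters function are hypothetical stand-ins for the service, with contents taken from this post. I am assuming the bootstrap version parameter exists in both regions, since both were bootstrapped.

```typescript
// Toy model of a regional parameter store: region name -> parameter names.
const parameterStore: Record<string, Set<string>> = {
  "eu-central-1": new Set([
    "/cdk-bootstrap/hnb659fds/version",
    "/my-app-int/database",
    "/my-app-int/network",
  ]),
  "us-east-1": new Set(["/cdk-bootstrap/hnb659fds/version"]),
};

// Returns the requested names that do not exist in the given region.
function missingParameters(region: string, names: string[]): string[] {
  const present = parameterStore[region] ?? new Set<string>();
  return names.filter((name) => !present.has(name));
}

const templateInputs = ["/my-app-int/database", "/my-app-int/network"];

console.log(missingParameters("eu-central-1", templateInputs)); // []
console.log(missingParameters("us-east-1", templateInputs));    // both names
```

From inside us-east-1, “unable to fetch parameters” is simply true. The error never mentions which region it was asked in.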

CloudTrail nailed it down precisely: the failing CreateChangeSet calls were in us-east-1, and a GetParameters there returned both names as invalid, while eu-central-1 returned them fine. One pipeline session had even fanned out across both regions inside the same eleven-second window, sending the environment-specific stacks to eu-central-1 and the two agnostic ones to us-east-1.

There was a quieter casualty too. The second agnostic stack did not depend on any SSM parameters from eu-central-1, so it did not fail at all. It deployed cleanly into us-east-1 and is still sitting there as a phantom, a real stack running in a region we never intended to touch.

The easy fix

After all that, the bug itself was almost anticlimactic. Two stack constructors called super() without forwarding their props:

// before: env is silently dropped, the stack becomes environment-agnostic
super(scope, id, stage);

// after: env flows through, the stack resolves to eu-central-1
super(scope, id, stage, props);

What makes this snippet matter is not the syntax but what props carries: env: { account, region }. Our app was passing it in correctly, and every other stack forwarded it. BaseApiStack, the shared base class, accepts those props as its last argument. These two constructors just dropped them on the floor, and that one omission was enough to strip the environment off the stack, turn it agnostic, and quietly send it to us-east-1. After the fix, cdk synth emits both stacks as aws://111122223333/eu-central-1, and the changeset lands where the parameters live.
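
Here is a self-contained sketch of why nothing complained. The classes are stand-ins I wrote for illustration: this BaseApiStack does not extend a real CDK Stack, and BrokenStack and FixedStack are hypothetical names. I am assuming the props parameter is optional in the base class, which it presumably was, since the call without it compiled.

```typescript
interface SketchStackProps {
  env?: { account: string; region: string };
}

// Stand-in for the shared base class; the real one builds on a CDK Stack.
class BaseApiStack {
  readonly id: string;
  readonly environment: string;

  constructor(scope: unknown, id: string, stage: string, props?: SketchStackProps) {
    this.id = `${stage}-${id}`;
    this.environment = props?.env
      ? `aws://${props.env.account}/${props.env.region}`
      : "aws://unknown-account/unknown-region";
  }
}

class BrokenStack extends BaseApiStack {
  constructor(scope: unknown, id: string, stage: string, props?: SketchStackProps) {
    super(scope, id, stage); // props is accepted but never forwarded
  }
}

class FixedStack extends BaseApiStack {
  constructor(scope: unknown, id: string, stage: string, props?: SketchStackProps) {
    super(scope, id, stage, props); // env flows through to the base class
  }
}

const appProps: SketchStackProps = { env: { account: "111122223333", region: "eu-central-1" } };
const brokenStack = new BrokenStack({}, "api", "int", appProps);
const fixedStack = new FixedStack({}, "api", "int", appProps);

console.log(brokenStack.environment); // aws://unknown-account/unknown-region
console.log(fixedStack.environment);  // aws://111122223333/eu-central-1
```

Both subclasses type-check, because an optional trailing argument is exactly the kind of thing a compiler lets you forget.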

What actually changed

If I stopped at the fix, the takeaway would be “the AI was faster.” That is not the point, and it undersells what happened. The whole time from red pipeline to confirmed root cause was roughly 60 to 90 minutes of Kiro session, while the team kept moving toward the sprint goal the entire time.

The point is that a genuinely important but not sprint-critical investigation got done, thoroughly and with evidence, without any of us stepping off the goal to do it. A well-equipped agent is not a faster autocomplete. It is something you can delegate a real problem to, as long as you give it a proper harness: access to the environment, the right tools, and the room to check its own assumptions against reality. The turning point came when I installed the AWS MCP server and let Kiro read the actual account.

We also came out with more than a patch: three guardrails. First, pin AWS_DEFAULT_REGION in the pipeline so an agnostic stack can never silently escape to another region. Second, add a synth-time assertion that every stack has a resolved account and region, and fail synth otherwise, so a dropped env gets caught before it ever reaches a deploy. Third, be suspicious of a same-qualifier bootstrap in a region you never deploy to, because that is precisely what turned what should have been a loud, early failure into a quiet, wrong-region one.
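
The synth-time assertion could look roughly like this. It is a sketch, not what we shipped: StackLike and assertStacksResolved are hypothetical names, and the unresolved check is injected so the example runs without any dependency. In a real CDK app you would presumably collect the stacks from the app's construct tree and pass Token.isUnresolved from aws-cdk-lib as the predicate. Here a stub stands in for it.

```typescript
// Structural subset of a CDK stack that this check needs.
interface StackLike {
  stackName: string;
  account: string;
  region: string;
}

// Throws if any stack still carries an unresolved account or region.
function assertStacksResolved(
  stacks: StackLike[],
  isUnresolved: (value: string) => boolean,
): void {
  const offenders = stacks
    .filter((stack) => isUnresolved(stack.account) || isUnresolved(stack.region))
    .map((stack) => stack.stackName);
  if (offenders.length > 0) {
    throw new Error(`Environment-agnostic stacks found: ${offenders.join(", ")}`);
  }
}

// Stub predicate for this sketch only: treats "${...}" placeholders as unresolved.
const looksUnresolved = (value: string): boolean => value.includes("${");

// Illustrative stacks; the names and token strings are made up.
const pinnedStacks: StackLike[] = [
  { stackName: "int-network", account: "111122223333", region: "eu-central-1" },
];
const mixedStacks: StackLike[] = [
  ...pinnedStacks,
  { stackName: "int-api", account: "${Token[AWS.AccountId.0]}", region: "${Token[AWS.Region.1]}" },
];

assertStacksResolved(pinnedStacks, looksUnresolved); // passes silently
```

Calling it with mixedStacks throws and names int-api, which is the loud, early failure we wanted in the first place.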

So here is what I keep turning over. We are getting good at deciding what to build and what to automate. Are we as deliberate about what to delegate, and about giving our agents enough real access to earn that trust?
