
Cost-Effective Ephemeral Environments with Selective Deployment


Ephemeral environments have become one of the most popular ways to speed up testing and improve developer experience. The idea is simple: whenever you open a pull request (PR), a temporary environment spins up that mirrors production. You can test your changes in isolation, validate them with QA, and share a live URL with teammates — all before merging to main.

But there’s a catch.

The most common way teams implement ephemeral environments is by provisioning everything from scratch for each PR: a dedicated VPC, database, load balancer, and all microservices. This ensures full isolation, but it also comes with serious downsides:

  • High cost – duplicating the entire stack for every PR quickly adds up, especially for microservice-heavy architectures.

  • Slow feedback loops – full environment provisioning can take a long time, delaying reviews.

  • Operational overhead – managing and cleaning up all those environments is painful.

When we set out to adopt ephemeral environments at Rewaa, we knew the full-stack approach wouldn’t work for us. Our infrastructure is large and cost-sensitive, and we couldn’t justify spinning up a copy of every service for every PR.

So, we asked ourselves:

👉 What if we could keep the benefits of ephemeral environments while avoiding the cost explosion?

That’s how we ended up designing a cost-effective, selective deployment approach — one where we reuse shared infrastructure and only spin up the services that actually change in a PR.

Our Approach: Shared Infra + Selective Deployment

When we looked at existing patterns for ephemeral environments, two models stood out:

  1. Full Infrastructure Deployment

    Each PR provisions its own copy of the entire stack — VPC, database, queues, and all microservices.

    • ✅ Pros: Full isolation, minimal risk of cross-environment interference.

    • ❌ Cons: Extremely expensive, slow to provision, and hard to manage at scale.

  2. Selective Deployment (Our Approach)

    Each PR reuses shared infrastructure — the same cluster, VPC, database, and default services — while only deploying the services that have actually changed in that PR.

    • ✅ Pros: Cost-efficient, faster spin-up, minimal duplication.

    • ❌ Cons: Less isolation, needs careful handling of routing and environment overrides.

We decided early on that full duplication was not sustainable. The cost alone would have outweighed the benefits of ephemeral environments. Instead, we designed a selective deployment strategy:

  • Shared baseline infrastructure:

    All services, databases, VPCs, and queues remain running in shared clusters.

  • PR-specific overrides:

    When a PR modifies a service, we launch a new container for that service only. The load balancer dynamically rewrites rules so that requests from the PR subdomain route to the PR-specific container.

  • Fallback mechanism:

    If a service isn’t changed in the PR, traffic is simply routed to the shared version of that service.

This way, we only pay for what’s changing while still giving each PR a dedicated URL that looks and behaves like its own environment.

Example:

https://pr1.example.com/service1   → PR-specific service
https://pr1.example.com/service2   → Falls back to shared service2

This balance between isolation and efficiency allowed us to scale ephemeral environments without burning through cloud spend — and without slowing down developers.

Architecture Overview

To make selective deployment possible, we designed an architecture that routes requests dynamically based on the PR subdomain. Instead of duplicating infrastructure, we share the base resources and only plug in PR-specific services where needed.

At a high level, the flow looks like this:

   ┌─────────────────┐
   │   Route53       │
   │  (*.example.com)│
   └──────┬──────────┘
          │
   ┌──────▼─────┐
   │ CloudFront │
   └──────┬─────┘
          │
   ┌──────▼─────┐
   │    ALB     │
   │ (Routing)  │
   └──────┬─────┘
          │
   ┌──────▼─────────────────────────────────────────────┐
   │                     ECS Cluster                     │
   │                                                     │
   │  Shared Services   ────────► Default Target Groups  │
   │  PR-Specific Tasks ────────► PR Target Groups       │
   └─────────────────────────────────────────────────────┘

   ┌───────────────────────────────┐
   │         S3 + CloudFront       │
   │ Shared frontend + PR builds   │
   └───────────────────────────────┘

How it Works

  1. Wildcard Routing

    • A Route 53 wildcard record (*.example.com) routes all PR subdomains into CloudFront.
  2. CloudFront → ALB

    • CloudFront forwards API traffic (/api/*) to an Application Load Balancer (ALB).

    • The ALB holds dynamic routing rules that map PR subdomains to PR-specific target groups.

  3. ECS Cluster & Services

    • Each microservice runs as an ECS service in the shared cluster.

    • When a PR needs a custom version of a service, we spin up just that service in a new target group.

    • All unchanged services continue to use the shared deployment.

  4. Frontend Handling

    • All frontend builds (default + PR builds) are stored in a single S3 bucket.

    • CloudFront viewer functions rewrite requests based on the PR subdomain, serving the correct version of the frontend.

This design gives each PR a dedicated URL that looks like a complete environment, but under the hood it’s mostly shared infra with selective overrides.

Backend Ephemeral Environments

The backend was the trickiest part of making selective ephemeral environments work. We wanted every PR to behave like its own isolated stack without duplicating all the services. To achieve that, we leaned heavily on Application Load Balancer (ALB) routing rules and PR-specific target groups in ECS.

1. Target Groups & Task Provisioning

  • For every PR, a new Target Group (TG) is created.

  • Inside that TG, we launch a single ECS task running the microservice image built from the PR branch.

  • This keeps resource usage minimal — you only pay for one extra task per modified service.

2. ALB Rule Updates

  • The ALB listens for incoming requests from PR subdomains.

  • For each provisioned service, a new rule is added:

Host header: pr1.example.com
Path:        /service1
Forward to:  PR1-service1-TG
  • If service1 was changed in PR #1, all traffic for /service1 on that PR’s subdomain is routed to the PR-specific task.

3. Fallback Mechanism

  • If a service wasn’t provisioned for a PR, the ALB falls back to the default shared target group.

  • This ensures PRs don’t need to redeploy the whole stack — only the changed pieces.

Example:

https://pr1.example.com/service1   → PR-specific service1  
https://pr1.example.com/service2   → Shared service2 (fallback)
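The routing decision the ALB rules encode can be sketched as a small function. The rule shape and target-group names here are illustrative, not our exact configuration, but the fallback logic is the same: match a PR-specific rule first, otherwise route to the shared deployment.

```typescript
// Sketch of the ALB routing decision for PR subdomains.
// Rule shape and target-group names are illustrative.

interface PrRule {
  host: string;        // e.g. "pr1.example.com"
  path: string;        // e.g. "/service1"
  targetGroup: string; // e.g. "PR1-service1-TG"
}

// Resolve a request to a target group, falling back to the shared
// deployment when no PR-specific rule matches.
function resolveTarget(
  host: string,
  path: string,
  prRules: PrRule[],
  sharedTargets: Record<string, string>
): string {
  const match = prRules.find(
    (r) => r.host === host && path.startsWith(r.path)
  );
  if (match) return match.targetGroup;  // PR-specific override
  const service = path.split("/")[1];   // "/service2" -> "service2"
  return sharedTargets[service];        // fallback to shared TG
}
```

With one rule registered for `service1` on `pr1.example.com`, a request to `/service1/health` resolves to the PR target group, while `/service2` falls through to the shared one.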

4. Multiple Backend Services

When multiple services are provisioned for the same PR:

  • Each service gets its own PR-specific task and target group.

  • Downstream URLs inside the environment are overridden so that service-to-service calls also respect the PR subdomain.

Example: if service1 depends on service2, the service2_host variable in service1 is set to:

https://pr1.example.com/service2

This ensures internal traffic flows correctly within the PR environment.
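Generating those overrides can be sketched as: for each dependency of a provisioned service, point at the PR subdomain if that dependency was also provisioned, otherwise keep the shared host. The `<service>_host` naming convention shown here is illustrative.

```typescript
// Build environment-variable overrides for service-to-service calls.
// Dependencies provisioned in the PR point at the PR subdomain; all
// others keep the shared default. Variable naming is illustrative.
function buildHostOverrides(
  prSubdomain: string,    // e.g. "pr1.example.com"
  provisioned: string[],  // services deployed for this PR
  dependencies: string[], // services this task calls
  sharedHost: string      // e.g. "staging.example.com"
): Record<string, string> {
  const overrides: Record<string, string> = {};
  for (const dep of dependencies) {
    const host = provisioned.includes(dep) ? prSubdomain : sharedHost;
    overrides[`${dep}_host`] = `https://${host}/${dep}`;
  }
  return overrides;
}
```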

By combining dynamic ALB rules, lightweight ECS tasks, and PR-specific overrides, we gave each pull request its own “environment” — without spinning up an entirely new cluster or database.

Frontend Ephemeral Environments

On the frontend side, we followed the same principle: reuse shared infrastructure and selectively deploy only what changes for each PR. Instead of creating new buckets or distributions per environment, we designed a lightweight approach using a single S3 bucket and CloudFront viewer functions.

1. S3 Bucket Structure

  • A single S3 bucket hosts all frontend builds.

  • The default (main branch) frontend is deployed to a /default folder.

  • PR-specific builds are deployed into subfolders (e.g., /pr1/).

Example structure:

/default/index.html
/pr1/index.html
/pr2/index.html

2. CloudFront Viewer Functions

CloudFront inspects the Host header of each incoming request and rewrites the URI to the correct S3 folder.
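A minimal sketch of such a viewer-request function follows. The `prN` subdomain pattern is an assumption based on the examples in this post; unmatched hosts fall back to the `/default` build. (Real CloudFront Functions run a restricted JavaScript runtime, so the type annotation would be dropped there.)

```typescript
// Sketch of a CloudFront viewer-request function that maps a PR
// subdomain to its S3 folder. The prN-subdomain regex is an
// assumption; unmatched hosts fall back to the /default build.
function handler(event: any) {
  const request = event.request;
  const host = request.headers.host.value; // e.g. "pr1.example.com"
  const match = host.match(/^(pr\d+)\./);  // capture "pr1"
  const folder = match ? match[1] : "default";
  // "/index.html" -> "/pr1/index.html"
  request.uri = "/" + folder + request.uri;
  return request;
}
```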

This means every PR automatically gets its own frontend, available at a PR-specific subdomain.

3. Unified Experience

  • The PR frontend talks to backend services using the same PR subdomain.

  • For example, API calls from https://pr1.example.com will automatically be routed to:

https://pr1.example.com/service1
https://pr1.example.com/service2
  • This makes the frontend + backend integration seamless, and reviewers can test a PR end-to-end with a single URL.

4. Cleanup After Testing

When a PR is closed or merged:

  • The PR folder in S3 is deleted.

  • CloudFront cache is invalidated.

  • The subdomain (pr1.example.com) automatically stops serving content.

This ensures resources don’t linger and costs remain minimal.

By using one bucket and dynamic request rewriting, we avoided the complexity of provisioning new CloudFront distributions or buckets per PR, while still giving each PR a fully functional frontend.

Provisioning Workflow (CDK + Step Functions)

Spinning up an ephemeral environment is more than just deploying tasks and updating routes — it needs to be automated, fast, and repeatable. To achieve that, we built the provisioning workflow using AWS CDK and Step Functions, triggered directly from our Slack bot.

This allowed developers to request a PR environment with a single click and get back a live URL in minutes.

1. Triggering the Workflow

  • A developer opens a pull request.

  • From Slack, they request an ephemeral environment.

  • The Slack bot invokes a Step Function execution with a payload like this:

{
  "commitId": "abcd1234ef567890",
  "servicesToProvision": ["service1", "service2"],
  "provisionQueues": true,
  "environmentVariablesOverrides": {}
}

Generated URL:

https://pr1.example.com
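A quick sanity check of that payload before starting the execution can be sketched as follows. The field names follow the example above; the validation rules themselves are assumptions.

```typescript
// Validate the Slack bot's payload before starting the Step Function.
// Field names follow the payload example; checks are illustrative.
interface ProvisionRequest {
  commitId: string;
  servicesToProvision: string[];
  provisionQueues: boolean;
  environmentVariablesOverrides: Record<string, string>;
}

function validateRequest(input: unknown): ProvisionRequest {
  const p = input as ProvisionRequest;
  if (!/^[0-9a-f]{7,40}$/.test(p.commitId ?? "")) {
    throw new Error("commitId must be a Git SHA");
  }
  if (!Array.isArray(p.servicesToProvision) || p.servicesToProvision.length === 0) {
    throw new Error("servicesToProvision must list at least one service");
  }
  return p;
}
```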

2. Step Function Workflow

The Step Function orchestrates provisioning in well-defined steps:

  1. Extract Configurations

    • Load networking info, routing rules, and pipeline definitions.
  2. Invoke CI/CD Pipelines

    • For each service listed in servicesToProvision, the Step Function triggers an EphemeralPipeline.

    • The EphemeralPipeline reuses the same build steps as the MainPipeline (so there’s no drift).

  3. Provision Backend Services (ECS Fargate)

    • Create ALB Target Groups.

    • Launch ECS tasks with PR-specific images.

    • Register tasks in the TG.

  4. Update ALB Rules

    • Add routing rules for the PR subdomain → service path → PR TG.
  5. Deploy Frontend

    • Upload PR build to S3 under /pr1/.

    • CloudFront viewer functions handle routing.

  6. Wait for Testing

    • Environment stays alive while the PR is open.
  7. Cleanup After Testing

    • Stop ECS tasks, delete ALB rules and TGs.

    • Remove PR frontend folder from S3.

    • Invalidate CloudFront cache.
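The cleanup step mirrors provisioning in reverse. One reasonable ordering, sketched below with illustrative resource names, removes routing before compute so no live ALB rule points at a deregistered target:

```typescript
// Sketch of teardown ordering for a PR environment. Routing is
// removed before compute so no live rule points at a dead target.
// Action and resource names are illustrative.
function cleanupActions(prId: string, services: string[]): string[] {
  const actions: string[] = [];
  for (const svc of services) {
    actions.push(`delete-alb-rule:${prId}/${svc}`); // stop routing first
    actions.push(`stop-ecs-task:${prId}-${svc}`);
    actions.push(`delete-target-group:${prId}-${svc}-TG`);
  }
  actions.push(`delete-s3-prefix:/${prId}/`);       // PR frontend build
  actions.push(`invalidate-cloudfront:/${prId}/*`); // purge cached assets
  return actions;
}
```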

3. Main vs Ephemeral Pipelines

We designed two pipelines for each service:

  • MainPipeline:

    • Runs automatically on merges to main.

    • Builds and deploys production-ready artifacts.

  • EphemeralPipeline:

    • Invoked only by the Step Function.

    • Uses the same build stages as MainPipeline (to avoid drift).

    • Builds PR-specific images or frontend bundles and registers them under the PR environment.

4. Parallel Deployments

The servicesToProvision array is processed in parallel by the Step Function, so multiple services can be provisioned at once. For example:

  • PR #1 modifies both service1 and service2.

  • Step Function launches EphemeralPipelines for both.

  • Each service gets a PR-specific ECS task and routing rule.
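In Step Functions, that fan-out maps naturally onto a Map state over `servicesToProvision`. A trimmed Amazon States Language sketch, expressed as a TypeScript object (state and pipeline names are illustrative, not our actual definitions):

```typescript
// Trimmed ASL sketch: a Map state fans out over servicesToProvision,
// running one EphemeralPipeline per service in parallel.
// State and pipeline names are illustrative.
const provisionServicesState = {
  Type: "Map",
  ItemsPath: "$.servicesToProvision",
  MaxConcurrency: 0, // 0 = no concurrency limit
  Iterator: {
    StartAt: "RunEphemeralPipeline",
    States: {
      RunEphemeralPipeline: {
        Type: "Task",
        Resource: "arn:aws:states:::codepipeline:startPipelineExecution.sync",
        Parameters: { "Name.$": "States.Format('{}-ephemeral', $)" },
        End: true,
      },
    },
  },
};
```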

By automating the entire process with CDK, Step Functions, and Slack, we turned environment provisioning into a self-service workflow: developers request it → pipelines build it → Step Function wires it up → Slack posts back the PR URL.

Developer Experience

A key goal of our ephemeral environments project was to make it developer-friendly. We didn’t want engineers juggling scripts or remembering arcane commands — the experience had to feel natural and integrated into our daily workflow. That’s why we wired everything into Slack.

Developers can manage environments with three simple commands:

  • /provisionenv – Create a new environment.

  • /cleanupenv – Tear it down.

  • /extendenv – Keep it alive longer.

All provisioning and cleanup are fully automated under the hood by Step Functions and pipelines, but from the developer’s perspective it’s just a Slack modal and a notification.

Provisioning a New Environment

  1. Type /provisionenv in Slack.

  2. Fill out a short form:

    • Commit ID – The Git SHA you want to deploy.

    • Services to Provision – Choose one or more services (service1, service2, frontend, etc.).

    • Optional flags

      • Provision queues (if testing SQS-based flows).

      • Override environment variables (via JSON).

  3. Click Provision, and Slack immediately posts back a live URL like:

     https://pr1.example.com
    

    From that moment, reviewers, QA, and product managers can use the PR-specific frontend and backend services.

Cleanup and Lifecycle Management

  • Default lifespan: Each environment lives for 60 minutes.

  • Extend: If you need more time, /extendenv lets you add minutes or hours. Example:

      /extendenv
      Commit ID: abcd1234ef567890
      Extension: 120 minutes
    
  • Cleanup: Once you’re done testing, run /cleanupenv to free resources. This tears down the ECS tasks, ALB rules, S3 frontend folder, and queues.

Notifications for all of these actions appear in a dedicated Slack channel, so developers know when their environment is ready or when cleanup has finished.
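The lifespan bookkeeping behind this is simple. A sketch, assuming expiry is tracked as an epoch-millisecond timestamp:

```typescript
// Sketch of environment lifespan tracking: a 60-minute default TTL,
// extendable via /extendenv. Times are epoch milliseconds.
const DEFAULT_TTL_MINUTES = 60;

function initialExpiry(createdAt: number): number {
  return createdAt + DEFAULT_TTL_MINUTES * 60_000;
}

function extendExpiry(currentExpiry: number, extraMinutes: number): number {
  return currentExpiry + extraMinutes * 60_000;
}
```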

Multi-Tenant Database & Feature Flags

Since environments reuse the base database, we rely on multi-tenancy to keep data isolated:

  • Each tester uses their own tenant or account in the shared DB.

  • Ephemeral SQS queues are namespaced by PR to prevent collisions.

  • Feature flags are connected to the shared staging environment in LaunchDarkly, so developers can toggle flags per user or per test account when needed.
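The queue namespacing amounts to prefixing queue names with the PR identifier so ephemeral queues never collide with shared ones. The exact naming convention here is an assumption:

```typescript
// Sketch of PR-scoped SQS queue naming; the prefix convention is an
// assumption, but any unique PR identifier prevents collisions.
function prQueueName(prId: string, baseQueue: string): string {
  return `${prId}-${baseQueue}`; // e.g. "pr1-orders-events"
}
```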

Why Slack?

  • Familiarity – Developers already live in Slack.

  • Speed – No need to context-switch to AWS console or CI/CD dashboards.

  • Visibility – Notifications in team channels keep everyone aware of which PRs are being tested.

This workflow makes ephemeral environments effortless to use: open a PR, provision from Slack, test your changes at a unique URL, and clean it up when you’re done.

Impact

Implementing ephemeral environments with our selective deployment approach had a noticeable impact on both cost and developer productivity. Instead of simply adding a shiny feature, this change directly improved how our teams build, test, and ship software.

1. Cost Savings

  • Full-stack duplication would have meant spinning up a complete copy of our infrastructure — databases, VPCs, load balancers, and dozens of microservices — for every PR.

  • By contrast, our selective approach only provisions the services that change.

  • Result:

    • We avoided the massive cost of running idle resources.

    • Ephemeral environments became affordable enough to be used daily, even across multiple teams.

2. Faster Feedback Loops

  • Spin-up time dropped significantly: provisioning an environment now takes minutes instead of hours.

  • Developers don’t need to wait for a full stack to boot; they just deploy the affected services.

  • This enabled faster QA handoffs and quicker product reviews.

3. Improved Developer Experience

  • With Slack integration, provisioning became a self-service workflow.

  • No more manual deployments, no waiting for DevOps engineers.

  • PRs now include a live demo URL (https://pr1.example.com) that can be shared instantly with QA or product teams.

4. Confidence in Testing

  • Because each PR has its own URL, developers and QA can validate changes in isolation.

  • Downstream dependencies are correctly routed within the PR environment, reducing the risk of “it worked locally but broke in staging.”

  • Teams gained higher confidence that what they test in ephemeral environments will behave the same way once merged.

5. Cultural Shift

  • Ephemeral environments changed how we collaborate:

    • Product managers review features before they merge.

    • QA teams validate fixes in isolated environments without stepping on each other’s work.

    • Engineers get feedback sooner, which speeds up merges.

In short: lower costs, faster iteration, and smoother collaboration.

Challenges & Limitations

While our selective deployment approach gave us the speed and cost savings we wanted, it wasn’t without trade-offs. Ephemeral environments are never “free,” and reusing shared infrastructure comes with some important considerations.

1. Shared Database

  • Since ephemeral environments reuse the same database as the base environment, they don’t provide full isolation at the data layer.

  • We rely on multi-tenancy to keep test data separate (e.g., each tester uses their own tenant or account).

  • This works well most of the time, but it means you can’t safely test schema migrations or destructive data changes inside an ephemeral environment.

2. Noisy Neighbor Risks

  • Because all ephemeral environments point to the same shared database, heavy test activity in one PR can sometimes impact others.

  • For example, running a large data import or batch operation in one PR may slow down queries for another PR.

  • This is the trade-off for avoiding per-PR database duplication.

3. ALB Rule Scaling

  • Each PR adds new ALB rules for routing.

  • At small scale this is fine, but ALBs have a limit on the number of rules per listener.

  • We had to design cleanup automation carefully so stale PR rules don’t pile up.
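Since ALB listener rules need unique priorities and the per-listener rule quota defaults to 100, allocation has to tolerate gaps left by cleaned-up PRs. A sketch of picking the lowest free priority:

```typescript
// Sketch: allocate the lowest free priority for a new PR rule.
// ALB listeners cap rules per listener (the default quota is 100),
// so stale PR rules must be cleaned up or allocation eventually fails.
function nextFreePriority(usedPriorities: number[], maxRules = 100): number {
  const used = new Set(usedPriorities);
  for (let p = 1; p <= maxRules; p++) {
    if (!used.has(p)) return p;
  }
  throw new Error("Listener rule limit reached; clean up stale PR rules");
}
```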

4. Frontend Caching

  • Since all frontends are served from a single S3 bucket + CloudFront, we ran into caching issues: sometimes PR builds were cached incorrectly.

  • We solved this with targeted cache invalidations, but it added some operational complexity.

5. Queue Isolation

  • PR-specific SQS queues work well for testing event flows, but they introduce extra cleanup steps.

  • Forgetting to provision the right downstream service means your messages may end up “orphaned” in the PR queues.

6. Time-Limited by Design

  • PR environments expire after a fixed window (default 60 minutes, extendable via /extendenv).

  • This keeps costs low, but it also means developers sometimes need to re-provision if reviews drag on.

Overall, these trade-offs were acceptable for our goals — but it’s important to call them out so teams adopting a similar approach understand where the limitations lie.

Lessons Learned

Building ephemeral environments taught us a lot about the trade-offs between cost efficiency and true isolation. While our selective deployment approach isn’t perfect, it struck the right balance for our team.

  1. Selective beats full duplication (for most cases)

    • We confirmed that you don’t need to duplicate everything to get value from ephemeral environments.

    • Sharing infrastructure while selectively overriding services gave us 80% of the benefits at a fraction of the cost.

  2. Cleanup automation is non-negotiable

    • Without strong cleanup, ALB rules, S3 folders, and queues would quickly pile up.

    • Automating cleanup through Step Functions ensured resources don’t leak and costs stay under control.

  3. The database is the hardest piece

    • Sharing a database keeps costs down, but it introduces risks (noisy neighbors, migration blockers).

    • Developers must be mindful when running large data changes.

  4. Slack integration made adoption easy

    • By meeting developers where they already work, we removed friction.

    • A single /provisionenv command was all it took to make ephemeral environments part of daily workflows.

  5. Observability matters even for short-lived infra

    • Debugging failed PR deployments required logs, metrics, and alerts just like production.

    • Adding observability hooks (Datadog, CloudWatch) was key to keeping things reliable.


Ephemeral environments are often seen as powerful but expensive luxuries. What we’ve learned is that they don’t have to be that way.

By reusing shared infrastructure and selectively deploying only what changes, we built a system that gives every PR its own live, testable environment without exploding costs. Developers now spin up environments in minutes, QA validates features earlier, and we ship with greater confidence.

If you’re considering ephemeral environments for your team, ask yourself:

👉 Do you really need full-stack isolation, or would selective deployment give you most of the value at a fraction of the cost?

For us, that answer unlocked a scalable, developer-friendly way to bring ephemeral environments into daily development.
