
Feature Flag Testing: How Enterprise Teams Build Real Product Learning Loops

Asaph Kotzin

Every quarter, somewhere in your organisation, a team is about to spend four months building a feature that will quietly die in production. Not because the engineering was bad. Because nobody built the infrastructure to find out it was the wrong bet before the cost became permanent.

The planning process looked rigorous. OKRs, design sprints, customer advisory calls. There was discipline, but the discipline was wrapped around educated guesses: research sessions capturing what customers say, which correlates loosely at best with what they do.

That gap between what a team believes and what production reveals is where most product investment goes to burn. Feature flag testing, done properly, is the only mechanism that closes it.

Not as a pre-release checkbox. As the backbone of a system that makes production experimentation safe, auditable, and repeatable—one that turns whether a feature will work or not from a matter of opinion into a matter of observation.

At its core, a feature flag is a mechanism for decoupling deployment from release: a new feature is wrapped in a conditional so the code ships to production without being exposed to users. That’s straightforward enough at a small scale. At enterprise scale, the challenge isn’t implementing the flags; it’s building the testing infrastructure around them so that production becomes a place where you learn, not just a place where you pray.
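As a rough illustration of that conditional, here is a minimal Python sketch. The `FLAGS` dictionary, the `new_checkout` flag name, and the `checkout` function are all hypothetical stand-ins; a real setup would read flag state from a flag management service rather than an in-process dictionary.

```python
# Hypothetical in-memory flag store. In practice this state would come
# from a flag management service, not a module-level dictionary.
FLAGS = {"new_checkout": False}


def is_enabled(flag_name: str) -> bool:
    """Return the flag state, treating unknown flags as off."""
    return FLAGS.get(flag_name, False)


def checkout(cart_total: float) -> str:
    # Both code paths are deployed; the flag decides which one is released.
    if is_enabled("new_checkout"):
        return f"new flow: {cart_total:.2f}"
    return f"legacy flow: {cart_total:.2f}"
```

With the flag off, `checkout` keeps serving the legacy path even though the new code is already in production. Flipping the value in `FLAGS` releases the new path without another deployment.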

A real-world product learning loop, disguised as a cautionary tale

The McDonald’s AI drive-through story gets told as a punchline. The system spent a couple of years adding items customers didn’t order, misidentifying requests, and generating exactly the sort of social media attention nobody at corporate wanted. The company announced in June 2024 that it was ending the trial, which had been running at more than 100 locations.

Look at it differently.

A major company built genuine real-world exposure. Accumulated signal across a significant sample of production traffic. Recognised the product wasn’t going to work at scale. Made a decision based on evidence rather than opinion.

That’s what a functioning product learning loop looks like. The system worked. The outcome just wasn’t the one they wanted, and they had the data to know it, which is more than most organisations can say about features they shipped last quarter.

The problem is that the vast majority of businesses can’t absorb the same losses as McDonald’s during this testing phase. Most enterprise teams have never been able to run a real learning loop because the infrastructure to do it safely didn’t exist inside their organisation.

McDonald’s, which ran its AI trial between 2021 and 2024, continued to grow its quarterly revenue throughout. Most companies don’t have that margin for error, which is precisely why the testing infrastructure matters more than the ambition.


How feature flag testing closes the gap

AI coding tools have made this worse, not better. When writing code gets cheaper, the bottleneck shifts upstream—to the quality of the decision about what to build. Moving faster without better validation infrastructure just means arriving at the wrong place sooner.

The answer isn’t better beta testing. It isn’t more user interviews. It’s feature flag testing, operating as the foundation of a production experimentation system rather than a gate at the end of a development cycle.

Why traditional testing misses the point

Beta testing sounds like a learning mechanism. There’s a structural problem with it.

By the time a new feature reaches a beta cohort, the engineering investment is committed and the design is fixed. What development teams are collecting at that stage is their customers’ reactions to something that already exists, not observations of whether the feature changes how users actually behave. That’s a meaningful distinction, and it’s the one most teams blow past.

What’s actually needed is the ability to expose functionality to specific user segments in a production environment and watch what happens—without that exposure requiring a separate deployment or creating a compliance problem.

Testing strategy at the enterprise level has traditionally struggled to accommodate this. Handling multiple flag states and different versions in production takes infrastructure most teams don’t have, so the instinct is to reach for automated testing frameworks. More test cases. A more complete test suite. Extending the test execution pipeline.

Unit tests and integration testing are necessary. They don’t answer the question of how real users respond to a new feature in a live production environment. That question lives in a different layer entirely, and no amount of additional testing overhead in a controlled environment resolves it.

The limitations of traditional testing aren’t a matter of rigour. They’re a matter of what kind of signal traditional testing is actually capable of generating.

The real blocker is governance, not code

Recognising those limitations is the easy part. The harder question is why most enterprise teams aren’t already running production experiments.

Ask a product team and the first answer is usually compliance. Ask the compliance team and they’ll describe an uncontrolled change with no authorisation trail, no segmentation documentation, no reversal mechanism, no audit record.

They’re not being obstructionist; they’re accurately describing what’s been put in front of them. Given that framing, caution is the right move.

The issue is that most experiment requests arrive without the governance infrastructure that would make approval defensible. Implementing feature flags properly changes that conversation.

Progressive rollout means new code reaches 1% of users before the entire user base sees it. A flag that disables a feature is a toggle rather than a redeployment. Approval workflows capture sign-off in the same system where the flag change is recorded, producing an automatic audit trail rather than a reconstructed one.

For regulated industries—financial services, healthcare, government—this is the difference between a proposal that requires trust and a configuration that can be reviewed against a framework other teams helped define.

Flag lifecycle management matters here, too.

Long-lived flags that outlive their original purpose don’t just sit there. They accumulate. And they interact. A dark forest of shadow architecture that engineers navigate by memory and tribal knowledge—until someone leaves, or an incident happens, and suddenly you’re spelunking through flags nobody can explain. Maintaining clear ownership of each flag, scheduling regular flag cleanup, and documenting a flag’s purpose as part of routine flag management aren’t bureaucratic overhead. They’re what keeps the forest navigable.
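One way to make that routine concrete is a periodic check that lists flags with no owner, no documented purpose, or an age past some threshold. The sketch below is illustrative only: the `FlagRecord` fields, the example flags, and the 90-day cutoff are assumptions, not a prescribed policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FlagRecord:
    name: str
    owner: str      # empty string means nobody currently owns the flag
    purpose: str    # empty string means the purpose was never documented
    created: date


def needs_review(flag: FlagRecord, today: date, max_age_days: int = 90) -> bool:
    """True if the flag is unowned, undocumented, or older than the cutoff."""
    too_old = today - flag.created > timedelta(days=max_age_days)
    return not flag.owner or not flag.purpose or too_old


# Hypothetical inventory of flags.
flags = [
    FlagRecord("new_checkout", "payments-team", "Roll out redesigned checkout", date(2025, 1, 10)),
    FlagRecord("legacy_banner", "", "", date(2023, 6, 1)),
]

today = date(2025, 2, 1)
stale = [f.name for f in flags if needs_review(f, today)]
print(stale)
```

Running a check like this on a schedule turns flag cleanup into a list someone can act on, rather than something engineers reconstruct from memory.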

That governance foundation is the prerequisite. Once it’s in place, feature flag testing becomes a continuous feedback loop rather than a one-time exercise before launch.

What feature flag testing actually involves

With the governance infrastructure addressed, it’s worth being precise about what testing with feature flags covers. The topic spans several distinct scopes that behave differently.

Unit tests and code paths

The simplest part. Flags exist in code as conditional branches, so you can write test cases for each code path independently, regardless of which flag state triggers them. The flag’s purpose is to control which code blocks execute at runtime. That doesn’t change what unit tests do.

Test each path in isolation. Testing overhead here is minimal, whether you’re testing simple toggles or more complex flags with multiple variants.
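A minimal sketch of what that looks like with Python’s built-in `unittest` module. The `price_label` function and its `new_format_enabled` parameter are hypothetical; the point is only that each branch gets its own test, with the flag state passed in directly.

```python
import unittest


def price_label(amount: float, new_format_enabled: bool) -> str:
    """Hypothetical function with a single flag-controlled branch."""
    if new_format_enabled:
        return f"${amount:,.2f}"
    return str(amount)


class PriceLabelTests(unittest.TestCase):
    def test_flag_off_keeps_legacy_format(self):
        self.assertEqual(price_label(1234.5, new_format_enabled=False), "1234.5")

    def test_flag_on_uses_new_format(self):
        self.assertEqual(price_label(1234.5, new_format_enabled=True), "$1,234.50")
```

Passing the flag state as an argument keeps the tests independent of any flag service, so they run the same way on a laptop and in CI.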

Integration testing and handling multiple flag states

This is where complexity arrives and doesn’t leave.

When flags exist across a system, integration testing must account for how different flag states interact with one another. A naïve approach—testing every possible flag combination—quickly becomes unmanageable. Ten boolean flags produce 1,024 system states. Real enterprise products often have dozens of operational flags active simultaneously, making exhaustive combination testing impractical and the management of flag dependencies a genuine challenge for any automated testing framework.
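The arithmetic behind that number is simply 2 to the power of the flag count, since each boolean flag doubles the number of possible states. A short sketch, with `count_states` as an illustrative helper name:

```python
from itertools import product


def count_states(num_boolean_flags: int) -> int:
    """Each boolean flag doubles the number of possible system states."""
    return 2 ** num_boolean_flags


# Enumerating every combination is still feasible for ten flags...
all_states = list(product([False, True], repeat=10))
print(len(all_states))    # 1024

# ...but thirty flags already give more than a billion combinations.
print(count_states(30))   # 1073741824
```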

What works instead: test the production state, meaning the configuration real users will encounter. Test the all-off default state to ensure stability. Define user personas that represent meaningful segments, and write test cases around those personas rather than every permutation.

This is how you validate functionality without the test suite becoming unwieldy. End-to-end testing follows the same principle: cover the paths that matter for different user groups, not every theoretical system state.
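The approach described above can be sketched as a small set of named configurations that the same checks run against. Everything here is hypothetical: the flag names, the persona definitions, and the `render_home` function standing in for the system under test.

```python
FLAG_NAMES = ["new_checkout", "dark_mode", "beta_search"]

# Hypothetical configurations worth testing, instead of all 2**3 = 8 combinations.
ALL_OFF = {name: False for name in FLAG_NAMES}
PRODUCTION = {"new_checkout": True, "dark_mode": False, "beta_search": False}
PERSONAS = {
    "internal_staff": {"new_checkout": True, "dark_mode": True, "beta_search": True},
    "beta_user": {"new_checkout": True, "dark_mode": False, "beta_search": True},
}


def render_home(flags: dict) -> list:
    """Hypothetical system under test: returns the UI sections shown."""
    sections = ["header", "search_v2" if flags["beta_search"] else "search"]
    sections.append("checkout_v2" if flags["new_checkout"] else "checkout")
    if flags["dark_mode"]:
        sections.append("theme_toggle")
    return sections


def configs_under_test() -> dict:
    """The all-off default, the production state, and one entry per persona."""
    configs = {"all_off": ALL_OFF, "production": PRODUCTION}
    configs.update(PERSONAS)
    return configs


for label, flags in configs_under_test().items():
    sections = render_home(flags)
    # The same invariants must hold for every configuration we care about.
    assert "header" in sections, label
    assert ("checkout" in sections) != ("checkout_v2" in sections), label
```

With three flags the saving is small, four configurations instead of eight, but the number of named configurations grows with the number of meaningful segments rather than doubling with every new flag.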

Testing in production with real users

This is where feature flag testing delivers something no staging environment can.

When QA discovers unexpected behaviour in a feature that’s live under a flag, the blast radius is limited to the specific user segments that have been given access. Not the entire user base. No full rollback required. The flag is disabled, the affected segment is narrowed, and the issue is addressed before the feature is released more broadly.

This is what makes continuous delivery with feature flags genuinely different from traditional release management. Development teams can ship multiple versions of new code in a single day because each deployment into the production environment is insulated by flags. A deployment is not a release, and a release is not a commitment.

Deployment is separated from release, and quality assurance becomes an ongoing process—gathering feedback, monitoring performance, adjusting rollout—rather than a binary gate before launch.

According to the 2024 DORA State of DevOps report, elite-performing engineering teams—representing fewer than 20% of organisations surveyed—deploy multiple times per day and recover from failed deployments in under an hour.

Feature flags are central to how those teams achieve that cadence, because the ability to release features to specific user groups rather than all users simultaneously removes much of the risk that forces lower-performing teams to deploy less frequently.

Feature flags and A/B testing: what’s the difference?

Understanding the mechanics of feature flag testing naturally leads to two concepts that often get conflated: feature flags and A/B testing. They shouldn’t be.

Feature flags control who sees what. They’re the delivery mechanism; the infrastructure that routes users in different user segments to different versions of a feature.

A/B testing adds the measurement layer: you’re not just serving two variants to different user segments, you’re measuring which one produces the behaviour you want, using key metrics to make data-driven decisions.

When these are integrated in the same feature flag management tool, the loop closes: implement the flag, define the user targeting, ship the feature to a controlled percentage of users, measure the outcome, and make the call.
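A stripped-down sketch of that loop, assuming a deterministic 50/50 split and conversion rate as the only metric. The `Experiment` class and its method names are illustrative, and a real experiment would also need a significance test before anyone makes the call.

```python
import hashlib
from collections import defaultdict


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically split users 50/50 so each user always sees the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


class Experiment:
    """Keeps targeting and measurement in one place (hypothetical, in-memory)."""

    def __init__(self, name: str):
        self.name = name
        self.exposures = defaultdict(int)
        self.conversions = defaultdict(int)

    def expose(self, user_id: str) -> str:
        """Record that the user saw their variant and return it."""
        variant = assign_variant(user_id, self.name)
        self.exposures[variant] += 1
        return variant

    def convert(self, user_id: str) -> None:
        """Record that the user performed the target action."""
        self.conversions[assign_variant(user_id, self.name)] += 1

    def conversion_rate(self, variant: str) -> float:
        exposed = self.exposures[variant]
        return self.conversions[variant] / exposed if exposed else 0.0
```

Because assignment and measurement share the same variant function, there is no second data set to correlate by hand afterwards.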

Without that integration, development teams end up with flags in one system and analytics in another. The signal gets lost between them, and the ability to make genuinely data-driven decisions from production behaviour depends on someone manually correlating two separate data sets. I’ve watched this happen. The data exists in both systems. The insight exists in neither.

A feature flag tool combined with A/B testing capabilities means measurement starts the moment you flip the flag. That’s not a convenience; it’s the difference between a production experiment and an anecdote. User segmentation and feature releases need to live in the same system for the feedback loop to close properly.

What good feature flag testing infrastructure looks like

Services for feature flag rollout and testing at the enterprise level need to cover more than simple on/off toggle functionality. Here’s what a well-structured setup actually requires:

Environment separation

Flags in development, staging, and production are independent. Testing teams can set their own rules in different environments without touching the production configuration, which allows the test suite to run against realistic flag states without risk to real users.
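In code, the idea amounts to keying flag state by environment so that a change in one never leaks into another. A minimal sketch, with hypothetical environment names and a hypothetical `new_checkout` flag:

```python
# Hypothetical per-environment flag state; each environment has its own copy.
ENVIRONMENTS = {
    "development": {"new_checkout": True},
    "staging": {"new_checkout": True},
    "production": {"new_checkout": False},
}


def is_enabled(flag: str, environment: str) -> bool:
    """Look up a flag in one environment only, treating unknown flags as off."""
    return ENVIRONMENTS[environment].get(flag, False)


def set_flag(flag: str, environment: str, value: bool) -> None:
    """Change a flag in a single environment without touching the others."""
    ENVIRONMENTS[environment][flag] = value


# A testing team turns the flag off in staging; production is unaffected.
set_flag("new_checkout", "staging", False)
print(is_enabled("new_checkout", "staging"))      # False
print(is_enabled("new_checkout", "production"))   # False (unchanged)
print(is_enabled("new_checkout", "development"))  # True (unchanged)
```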

Progressive rollout

New flags go to a defined segment—for example, 1%, an internal team, or beta users—before any wider release. This phased rollout enables you to gather feedback from real users without exposing unexpected behaviour to your entire user base. The percentage adjusts without a new deployment.
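One common way to implement a percentage rollout is to hash each user into a stable bucket and compare that bucket with the current percentage. The sketch below assumes SHA-256 hashing and a bucket resolution of 0.01%; the function names are illustrative and not taken from any particular tool.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> float:
    """Map a user to a stable value in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return (int(digest[:8], 16) % 10000) / 100.0


def in_rollout(user_id: str, flag_name: str, percentage: float) -> bool:
    """True if the user's bucket falls below the current rollout percentage."""
    return bucket(user_id, flag_name) < percentage


users = [f"user-{i}" for i in range(10000)]
at_1 = {u for u in users if in_rollout(u, "new_checkout", 1)}
at_10 = {u for u in users if in_rollout(u, "new_checkout", 10)}

# Raising the percentage only adds users; nobody who already has the
# feature loses it when the rollout widens.
assert at_1 <= at_10
```

The percentage is plain configuration here, so widening the rollout from 1% to 10% is a data change rather than a deployment.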

Instant kill switch

Disabling a flag is a toggle, not a redeployment. In regulated industries, this is often a compliance requirement as much as an operational one: the ability to remove a feature immediately, without engineering intervention, needs to be available at all times. The 2 AM version of you will be grateful this exists.
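A minimal sketch of the kill-switch idea, using a hypothetical in-memory `FlagStore` and a hypothetical `ai_order_taking` flag. The application code never changes; only the flag state does.

```python
class FlagStore:
    """Hypothetical in-memory store standing in for a flag service."""

    def __init__(self):
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing flag fails safe.
        return self._flags.get(name, False)

    def kill(self, name: str) -> None:
        """Kill switch: force the flag off with no redeployment."""
        self._flags[name] = False


def take_order(store: FlagStore, text: str) -> str:
    # The flag is evaluated on every request, so a kill takes effect immediately.
    if store.is_enabled("ai_order_taking"):
        return f"ai: {text}"
    return f"human: {text}"


store = FlagStore()
store.set("ai_order_taking", True)
print(take_order(store, "one coffee"))   # ai: one coffee

store.kill("ai_order_taking")
print(take_order(store, "one coffee"))   # human: one coffee
```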

Automatic audit trail

Every flag change, every approval, every rollout decision is logged automatically—captured at the time of the change, not reconstructed after the fact. The right governance infrastructure is what turns compliance from a blocker into a manageable process, and what makes feature flag testing sustainable at enterprise scale.
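To illustrate what "captured at the time of the change" can mean in practice, here is a sketch in which the only way to change a flag also writes the log entry. The `AuditedFlagStore` class and the fields on `AuditEntry` are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime
    actor: str
    flag: str
    old_value: bool
    new_value: bool
    reason: str


class AuditedFlagStore:
    """Hypothetical store where every change is logged as it happens."""

    def __init__(self):
        self._flags = {}
        self._log = []

    def set(self, flag: str, value: bool, actor: str, reason: str) -> None:
        entry = AuditEntry(
            timestamp=datetime.now(timezone.utc),
            actor=actor,
            flag=flag,
            old_value=self._flags.get(flag, False),
            new_value=value,
            reason=reason,
        )
        # The change and its log entry are written together, so the trail
        # never has to be reconstructed after the fact.
        self._flags[flag] = value
        self._log.append(entry)

    def history(self, flag: str) -> list:
        """Return the recorded changes for one flag, oldest first."""
        return [e for e in self._log if e.flag == flag]
```

Because there is no code path that changes a flag without producing an `AuditEntry`, the log is complete by construction rather than by discipline.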

Approval workflows

Sign-off happens in the same system as the flag change. No side-channel email chains, no manual record-keeping. Other teams can review and approve changes against a documented, auditable process.
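A sketch of how sign-off and the flag change can live in the same place. The rule that a requester cannot approve their own change is an assumption added for illustration, and the class and method names are hypothetical.

```python
class ChangeRequest:
    """A proposed flag change that carries its own approval record."""

    def __init__(self, flag: str, new_value: bool, requested_by: str):
        self.flag = flag
        self.new_value = new_value
        self.requested_by = requested_by
        self.approved_by = None


class ApprovalFlagStore:
    """Hypothetical store that refuses to apply unapproved changes."""

    def __init__(self):
        self._flags = {}

    def is_enabled(self, flag: str) -> bool:
        return self._flags.get(flag, False)

    def approve(self, request: ChangeRequest, approver: str) -> None:
        # Assumed policy: a second person must sign off on every change.
        if approver == request.requested_by:
            raise PermissionError("requester cannot approve their own change")
        request.approved_by = approver

    def apply(self, request: ChangeRequest) -> None:
        if request.approved_by is None:
            raise PermissionError("change has not been approved")
        self._flags[request.flag] = request.new_value
```

The approval is a field on the same object that carries the change, so the record of who signed off cannot drift away from the record of what changed.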

Integrated metrics

The ability to monitor performance against the key metrics that define whether a feature is achieving its goals, not just whether it deployed successfully. You need to know whether the thing is working, not just whether it’s running.

Choosing a feature flag tool with built-in testing support

Most feature flag tools handle the bare minimum competently at a small scale: boolean flags, basic user targeting, a simple dashboard. Enterprise development teams need more than that.

The governance layer—audit logs, approval workflows, role-based access control, environment isolation—needs to be built in from the start, not bolted on when the compliance team asks for it six months later, which is when you discover that a decision to add governance later was always a bet against your own growth.

The same applies to the experimentation layer. If A/B testing capability is separate from flag management, the signal gap between them becomes a permanent friction point. User targeting and the measurement of feature releases need to live in the same system.

For organisations in regulated industries, deployment options matter too.

A cloud-only tool creates a hard ceiling for data-sensitive teams. Self-hosted or private cloud options—with the full feature set intact—are often a requirement, not a preference. Many tools that look capable in a proof of concept reveal these limitations when it comes to managing features at production scale, in a real compliance environment, with multiple teams involved.

Here’s a table to help you decide when you need an enterprise-level feature flag testing solution:

Capability	Small-scale tools offer	Enterprise requirement
Environment separation	Single environment or manual duplication	Independent flag states across dev, staging, and production
Progressive rollout	Basic on/off toggle	Percentage-based and segment-based rollout with no redeployment
Kill switch	Requires redeployment to disable	Instant flag disable—a toggle, not a deployment
Audit trail	None, or manual logging	Automatic log of every flag change, approval, and rollout decision
Approval workflows	Side-channel email or Slack sign-off	Sign-off captured in the same system as the flag change
A/B testing	No experimentation support	Native A/B and multivariate flag support, with integrations to your existing analytics tools rather than forcing a new one
Deployment options	Cloud SaaS only	Cloud, self-hosted, and private cloud: a full feature set across all options

The forest doesn’t get smaller

McDonald’s pulled the AI drive-through because they had enough data to know it wasn’t going to work. That’s not failure. That’s a system producing a signal and an organisation acting on it.

Most enterprise teams are still trying to get there—not because they lack ambition, but because they lack the infrastructure to run production experiments safely. The dark forest of unmanaged flags and ungoverned releases keeps growing, and the teams navigating it keep doing so by memory and instinct rather than instrumentation.

Feature flag testing, done properly, isn’t a development practice bolted onto the release process. It’s how you make the forest legible—how you turn every release into a hypothesis with a measurable outcome instead of a bet you can’t unwind.

The teams that win aren’t the ones that build fastest. They’re the ones who structurally eliminated the cost of being wrong, so speed stopped requiring courage.

See how Flagsmith’s controlled rollout and flag governance work in practice →

About the author

Principal PM at Flagsmith.
