Progressive Delivery for Building LLM-Powered Features

Everyone's adding AI-powered features to their products, but many product teams are realising that building the MVP of these features is the easy part. A smart engineer can throw together a really impressive proof-of-concept in a few days, and a product team can polish that demo into a ready-to-launch feature in perhaps a few weeks.
It's the next part—continuing to improve AI performance after that initial launch—where the pain starts. Generative AI is an unintuitive, jagged intelligence. A change that improves how the AI handles certain types of problematic input can introduce unintended issues with other inputs that had previously been working fine. Teams can end up chasing their tails, constantly adjusting prompts and experimenting with new models or context management techniques without making much forward progress on overall performance.
If you're not testing these incremental changes in a systematic way, you're at risk of falling into this game of whack-a-mole. Manual "vibe check" testing, where you test a change against a small set of sample inputs, isn't sustainable.
When you take a step back, this shouldn't be a surprise—we've known for a long time that relying entirely on manual tests to verify software is a bad plan. A suite of automated tests provides a much more robust protection against regressions. Likewise, we've learned that Progressive Delivery—rolling out new features in a controlled way and using experimentation to validate the impact of each change—allows teams to safely increase the pace of software delivery.
We can apply Progressive Delivery to AI-powered product features too: in this article you'll learn how, using feature flags and AI evals.
Progressive Delivery: moving fast with safety
Progressive Delivery is a set of engineering and product management techniques which allow teams to safely accelerate the pace at which they deliver value to users. It builds on the concepts of Continuous Delivery, layering on product experimentation. Continuous Delivery is all about keeping your software in a continuously releasable state. Progressive Delivery adds in the idea of incremental releases using feature flags, along with a feedback loop of technical- and business-focused metrics which allow us to measure the impact of each change.
For example, let's imagine we're an insurance provider and we want to improve our onboarding experience with a smart address form. Rather than forcing users to type street name, city, state, zip code, and so on, we will provide them with a free-form text box and use a third-party library to automatically figure out the full address.
BEFORE: the existing manual address form, with separate fields for street, city, state, and zip.
AFTER: the new smart address form, with a single free-form text box.
To roll out this change using Progressive Delivery techniques, we would introduce a feature flag which controls whether a user sees the existing manual form or the new automatic form. When the shipping address form is being displayed, our code would check the feature flag and render the appropriate UI component.
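To make that concrete, here's a minimal sketch of what that flag check might look like. The feature_flags client, the flag name, and the form components are illustrative placeholders rather than any particular SDK:

def render_manual_address_form() -> str:
    # Existing form: separate street, city, state, and zip fields
    return "<ManualAddressForm/>"

def render_smart_address_form() -> str:
    # New form: a single free-form text box with automatic address parsing
    return "<SmartAddressForm/>"

def render_shipping_address_form(feature_flags) -> str:
    # The feature flag decides which UI component the user sees
    if feature_flags.is_feature_enabled("smart_address_form"):
        return render_smart_address_form()
    return render_manual_address_form()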
We'd deploy the code change to production but initially configure that feature flag to be off for all users. When we're ready to roll out this new feature, we'd first test it internally by turning the feature on in production but only for internal users. Assuming we didn't find any issues, our next step would be a canary release, turning the feature on for 5% of our user-base. Next we'd move on to an A/B test, turning on the feature for 50% of our users so we could measure how the new experience performs side-by-side with the old one. Once our A/B test concludes successfully, we would complete the rollout, turning the feature flag to 100%. Soon after, we would retire the feature flag and remove it from our codebase.
At each stage during this rollout, we'd be watching our technical metrics (error rates, page load time, etc) as well as business metrics (onboarding conversion rate, for example). If we notice an issue, we can immediately turn the feature flag off. This safety net gives teams the courage to make changes at a faster pace, knowing that issues can be detected and remediated quickly.
Progressive Delivery for AI
Now let's see how we can apply these ideas to a change in an AI-powered feature. We're now working on a retail banking app. We've recently added a feature that allows the user to ask questions about their transaction history and get answers in real time, via an LLM.

We've noticed the AI's answers can be overly verbose at times, so we're going to tweak the prompt that powers this feature to try and encourage the LLM to be more succinct (changes are bolded):
UPDATED PROMPT
You are an assistant at a bank who helps customers answer questions about their transaction history.
I've included the user's transaction history and their question below.
**Be sure to answer the question succinctly; use no more than 5 sentences.**
<transactions>
{transaction_history}
</transactions>
<user_question>
{user_question}
</user_question>
We can choose to use Progressive Delivery techniques for this prompt change, similar to any other code change. We'll deploy both the old and the new version of the prompt, and use a feature flag to control which version of the prompt is used to answer a given question. We can then measure the impact of the change by checking our product analytics.
Choosing a prompt via a feature flag
There are a few different ways we can control which prompt is used via a feature flag.
If your prompt templates are deployed as hardcoded strings alongside your code then you can simply ship both versions of the prompt as hardcoded constants, and choose which one to use based on a boolean flag:
if feature_flags.is_feature_enabled("use_succinct_prompt_for_transactions_query"):
    prompt_template = SUCCINCT_TRANSACTIONS_QUERY_PROMPT
else:
    prompt_template = ORIGINAL_TRANSACTIONS_QUERY_PROMPT
# … use prompt_template to send the question to the LLM …

You could also manage the actual prompt template within your feature flag management platform, by creating a string-valued feature flag whose variants contain the different prompts:
prompt_template = feature_flags.get_feature_value("transactions_query_prompt")
# … use prompt_template to send the question to the LLM …

This second option may be tempting. It opens up the possibility for you to iterate on these prompts without even needing a code deploy—you can simply update the feature flag variants in your flag management platform and they'll be live instantly! However, this is also a pretty risky approach, even with A/B testing in place. You're basically live-coding in production, leaving version control behind and bypassing the quality checks of your continuous delivery pipeline.
A middle ground approach is to use a dedicated prompt management platform (langfuse, braintrust, arize, amongst others) to store your prompts, and use your feature flag management platform to choose which prompt ID or version ID to use:
prompt_version = feature_flags.get_feature_value("transactions_query_prompt_version")
prompt_template = prompt_store.get_prompt("transactions_query", version=prompt_version)
# … use prompt_template to send the question to the LLM …

Measuring the impact of a prompt change
Whatever approach we use, with a feature flag decision point in place we are ready to perform a progressive rollout of the new prompt. Just like our smart address form, we can start with a canary release, configuring our feature flag platform so that 5% of transaction queries use our new prompt. If things look good, we can ramp up to a 50% A/B test and let that experiment sit for a while. Again, if things look good, we would ramp to 100% and then shortly after we'd remove our flag and clear out the old version of the prompt.
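Most flag management platforms let you configure this ramp-up through their own UI or API rather than in application code, but to make the stages concrete, here's a purely illustrative summary of the plan:

# Purely illustrative: real flag platforms configure rollout percentages
# in their own UI or API, not in application code.
PROMPT_ROLLOUT_PLAN = [
    {"stage": "internal dogfood", "new_prompt_serves": "internal users only"},
    {"stage": "canary",           "new_prompt_serves": "5% of transaction queries"},
    {"stage": "A/B test",         "new_prompt_serves": "50% of transaction queries"},
    {"stage": "full rollout",     "new_prompt_serves": "100%, then remove the flag"},
]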
There's an important question here though: how exactly do we measure the impact of this change? How do we know if "things look good"? We have a few options:
Traditional metrics
We can look at whether the new prompt has any effect on our existing product metrics: are people more likely to ask multiple questions when using the new prompt, for example? We can also check technical metrics: does the new prompt affect how quickly we're able to retrieve an answer from the LLM, or how many tokens we're spending?
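To compare these metrics between the two prompts, each analytics event needs to record which variant served the request. Here's a rough sketch of what that instrumentation might look like; the llm_client and analytics objects are placeholders rather than a specific SDK, and the prompt constants come from the earlier snippet:

import time

def answer_transaction_question(user_question, transaction_history,
                                feature_flags, llm_client, analytics):
    # Pick the prompt variant, exactly as in the earlier snippet
    use_succinct = feature_flags.is_feature_enabled(
        "use_succinct_prompt_for_transactions_query")
    variant = "succinct" if use_succinct else "original"
    prompt_template = (SUCCINCT_TRANSACTIONS_QUERY_PROMPT if use_succinct
                       else ORIGINAL_TRANSACTIONS_QUERY_PROMPT)

    start = time.monotonic()
    response = llm_client.complete(prompt_template.format(
        transaction_history=transaction_history, user_question=user_question))
    latency_ms = (time.monotonic() - start) * 1000

    # Tag the event with the variant so every metric can be segmented per prompt
    analytics.track("transaction_question_answered", {
        "prompt_variant": variant,
        "latency_ms": latency_ms,
        "output_tokens": response.output_tokens,  # assumes the client reports token usage
    })
    return response.text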
User feedback metrics
We could explicitly ask our users for feedback on the LLM's response: "was this answer helpful?".

You've probably noticed this pattern yourself in AI-powered product features—it's a simple way to get focused feedback on an LLM's performance, with the caveat that you're likely to only get feedback from users who feel strongly, one way or the other.
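If you do collect this kind of feedback, it's worth recording which prompt variant produced each answer, so that helpfulness rates can be compared across the A/B test. A minimal sketch, with hypothetical names throughout:

def record_answer_feedback(analytics, answer_id: str, prompt_variant: str, helpful: bool):
    # Store the explicit feedback alongside the variant that produced the answer,
    # so "was this answer helpful?" rates can be broken down per prompt.
    analytics.track("transaction_answer_feedback", {
        "answer_id": answer_id,
        "prompt_variant": prompt_variant,
        "helpful": helpful,
    })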
Evals
A third option for measuring the impact of this prompt change is via AI evals—purpose-built instrumentation that specifically measures the performance of your LLMs. AI evals are an important topic that's well worth deeper study (Hamel Husain's writing is a particularly great resource), but to give a quick sense of what they look like, here are some example evals we might use for our transaction questions feature (a rough code sketch follows the list):
- How many characters are in the LLM's response? This happens to be exactly the metric we're trying to move with this change!
- Does the response answer the user's question? You might wonder how we measure this in an automated way. A common approach is to feed the question and the answer to another LLM and ask it "does this response answer the user's question" (a technique called LLM-as-a-judge). This won't work perfectly, but it will spot responses like "I'm sorry, I'm not able to answer your question based on the information available".
- Does the response contain information that wasn't explicitly provided in the user's transaction history? Another LLM-as-a-judge eval, this time attempting to spot when the LLM is either hallucinating or pulling in potentially incorrect information from its training set.
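Here's a rough sketch of what the first two of these evals might look like in code. The length check is plain Python; the LLM-as-a-judge check uses the OpenAI Python client as an example judge, with an illustrative choice of judge model and prompt wording:

from openai import OpenAI

judge_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def eval_response_length(answer: str) -> int:
    # Eval 1: a simple deterministic metric - the length of the response
    return len(answer)

def eval_answers_question(question: str, answer: str) -> bool:
    # Eval 2: LLM-as-a-judge - does the response actually answer the question?
    judge_prompt = (
        "Here is a user's question and an assistant's response.\n"
        f"<question>{question}</question>\n"
        f"<response>{answer}</response>\n"
        "Does the response answer the user's question? Reply with only YES or NO."
    )
    result = judge_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")

# The third eval (groundedness) follows the same LLM-as-a-judge pattern,
# with the transaction history included in the judge prompt.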
Running these evals against your LLM inputs and outputs in production will provide additional metrics that you can use to assess your prompt change—both to see whether the change had the impact you intended, and to detect whether it had inadvertent effects on other aspects of your AI's performance.
That second part is what makes a good suite of AI evals so valuable—they help prevent the game of whack-a-mole that you can otherwise fall into when making LLM changes. Imagine, for example, that our prompt change did successfully reduce the verbosity of the AI's responses but also made the LLM more likely to respond with "sorry, I'm not sure" in certain circumstances. This sort of unintuitive regression is quite possible when working with the strange alien brain of an LLM, and it can be impossible to spot if you're only testing your change manually - i.e. "vibe checking" - or via automated tests that use a small "golden dataset" of inputs.
Using feature flags to A/B test your LLM changes in production with evals generating a feedback loop provides much more confidence in the impact of a change, which in turn allows you to rapidly iterate towards more magical AI-powered features.
Not just prompts
In our example above we were A/B testing a prompt change. Prompt changes are a particularly good candidate for progressive delivery, but other LLM changes can also be A/B tested in the same way.
What happens if we switch models?
The AI landscape changes fast. At some point you're going to consider switching to a newer model - or a smaller model, or a larger model. You might even explore switching to a different model provider.
Running different models side-by-side in a canary release or A/B test lets you use evals and product metrics to compare the quality of their responses, while also looking at technical metrics to measure differences in latency and cost.
If you're not comfortable running an experiment in production with your user-base as guinea pigs, you could also perform a Dark Launch - send production traffic to both models and compare the results, but only use the result from your existing model.
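Here's a rough sketch of what a dark launch might look like; the model clients and the comparison log are placeholders:

import concurrent.futures

def answer_with_dark_launch(prompt: str, current_model, candidate_model, comparison_log):
    # Send the same request to both models in parallel
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        current_future = pool.submit(current_model.generate, prompt)
        candidate_future = pool.submit(candidate_model.generate, prompt)
        current_answer = current_future.result()
        candidate_answer = candidate_future.result()

    # Record both responses so evals, latency, and cost can be compared offline
    comparison_log.record(prompt=prompt, current=current_answer, candidate=candidate_answer)

    # Only the existing model's answer is ever returned to the user
    return current_answer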
Tweaking meta parameters
You can also use an A/B test to look at how model settings like temperature, top-p and top-k affect the quality of responses. For scalar values like these you may even want to run a multivariate test.
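For example, if your flag platform supports JSON-valued flags, a single flag variant could carry a whole combination of sampling parameters. A sketch, assuming the variables from the earlier snippets are in scope and llm_client is a placeholder:

# Each flag variant carries one combination of sampling parameters, e.g.
# {"temperature": 0.2, "top_p": 0.9} vs. {"temperature": 0.7, "top_p": 1.0}
sampling_params = feature_flags.get_feature_value("transactions_query_sampling_params")

response = llm_client.generate(
    prompt=prompt_template.format(
        transaction_history=transaction_history,
        user_question=user_question,
    ),
    temperature=sampling_params["temperature"],
    top_p=sampling_params["top_p"],
)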
Advanced AI engineering
You might be considering creating a custom fine-tuned model, or using a different RAG strategy, or even introducing agentic workflows. All of these changes can (and should) be done in a safer way using the Progressive Delivery techniques we've discussed in this article.
Conclusion
Building AI-powered features can feel like wielding a powerful new form of magic: unpredictable, sometimes jaw-dropping, occasionally maddening. But it turns out that the same Progressive Delivery techniques that have helped engineering teams ship regular old software faster and safer—feature flags, canary releases, A/B testing—apply just as well to this new magic.
Just as with other types of software, the key is a controlled rollout with a clear feedback loop. The controlled rollout uses tried-and-true patterns: feature flags, canary releases, A/B tests, and friends. While you can continue to take advantage of traditional feedback mechanisms - product metrics and technical observability - look to AI evals for a purpose-built objective measure of your LLM's performance.
The teams that will create the most compelling AI-powered products aren't necessarily those with the most sophisticated prompts or the latest models. They're the ones who can iterate fastest while maintaining quality, and Progressive Delivery is how you get there.