Our scariest release to date!

Ben Rometsch
June 14, 2022

Our first ever release of any importance was when we open sourced Flagsmith on June 13th, 2018. Looking back, we built a lot of the product in private before we published it on GitHub. I think a lot of projects start this way, where the intention is to open source but the earliest code is written in private. 

Maybe it’s a developer thing; we don’t want to show our earliest ideas before we’ve created something of value...especially while we are still adding a lot of polish to the code? At any rate, if we were doing it again today, we would do all of it in the open, just as we build our product today. The compounding benefits are massive, and it’s one of the best changes we’ve made as a team. #buildingintheopen

Back then, releases weren’t scary. No one was depending on the software except our own team…and to be honest, no one was paying attention to our project on GitHub. Anything that broke would be identified and fixed at the rate that we could get to it. But as we grew, more and more people started to really depend on Flagsmith to deliver consistent product experiences and releases. 

Today, we serve billions of requests each month on our SaaS product and support some of the world’s most successful companies with our self-hosted product. Needless to say, the stakes for a release are much higher than they were back then. 

In addition, feature flagging tools generally sit in a tricky space. They have two pretty thorny requirements: they need to serve flags as fast as they possibly can, and they need to not go down. Those two requirements combine to amp up the pressure on quality; we can’t commit anything that could impact either. 

Every product team has those parts of its product or platform that are particularly hair-raising to alter in production. For e-commerce companies, it tends to be changes to how their shopping carts or checkouts work. At B2B companies, it is anything that impacts their SLAs with customers. 

At Flagsmith, we have a few things that are really important for us to slow down and make sure we get right. We monitor them publicly on our status page, but in detail, these are the things that keep us up at night:

  1. Anything that touches our SaaS API. As a feature flagging platform, our customers make calls to our API endpoints. If our API behavior changes, or if the API is down or slow, it’s a big problem.
  2. Anything that touches our SDKs. Our SDKs are deployed into our customers’ products. Similarly to our API, if those change, we might need customers to update the version they are using. This can be painful for customers!

So, after much building, and many feature requests from our customers, we decided to make a release with fundamental changes to both! Haha. 

Introducing the Edge API

When we first built Flagsmith, we didn’t realise quite how important API latency was going to be to our customers. We chose the AWS London region for our API, not because we were in London ourselves, but because it offers a good latency balance between the Americas and Asia. 

But the speed of light really isn’t quite fast enough for a lot of our customers! Folks in Sydney, for example, were never going to see response times lower than ~600ms. I can’t change the laws of physics.

Edge networks are popular with front-end frameworks and infrastructure like Vercel, but they generally don’t have to deal with state, and dealing with state is a massive challenge. So we came up with a plan: replicate the pieces of data that our SDKs rely on to generate flags, then provide an edge compute platform that pulls from that data to serve flags at the edge. 

This meant that we could serve Sydney in under 200ms. In fact, we realised we could achieve sub-200ms latency anywhere in the world. 

After a lot of prototyping and kicking the tyres of a bunch of different platforms, we decided on DynamoDB global tables and Lambda. And after a lot of work, we were ready to go. But there was a big problem. All our SDKs, and all our customer applications, pointed to api.flagsmith.com. And our new edge API only replicated a fraction of the endpoints that we needed to power our SDKs. So we decided to serve the Edge platform from a new domain: edge.api.flagsmith.com. 
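To make the shape of this concrete, here is a minimal sketch of an edge handler in the style described above: a Lambda-like function that reads a replicated environment document from a DynamoDB global table and computes flags locally, with no call back to the core API. The table schema, field names, and header are illustrative assumptions, not Flagsmith’s actual implementation.

```python
import json

def make_handler(table):
    """Build a handler around any object exposing DynamoDB's get_item().

    In production, `table` would be boto3.resource("dynamodb").Table(...);
    accepting any get_item-compatible object keeps the sketch testable.
    """
    def handler(event, context=None):
        # The SDK identifies its environment via a header (hypothetical name).
        env_key = event["headers"].get("X-Environment-Key")
        if not env_key:
            return {"statusCode": 401,
                    "body": json.dumps({"error": "missing environment key"})}

        # Read the replicated environment document from the global table.
        item = table.get_item(Key={"environment_key": env_key}).get("Item")
        if item is None:
            return {"statusCode": 404,
                    "body": json.dumps({"error": "unknown environment"})}

        # The document contains everything needed to compute flags at the
        # edge, so no round trip to the core API is required.
        flags = [
            {"feature": name, "enabled": f["enabled"], "value": f.get("value")}
            for name, f in item["features"].items()
        ]
        return {"statusCode": 200, "body": json.dumps(flags)}
    return handler
```

Because every region of the global table holds a full replica, this handler can run in whichever region is closest to the caller, which is where the sub-200ms numbers come from.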

This meant a breaking version change with our SDKs. But that was OK, because we were going to completely rewrite them anyway. 

Wait, what?

Introducing Flagsmith v2.0 SDKs

When you are serving flags directly to client-side platforms, like web browsers or mobile apps, the best you can do is a round trip to an API; the data has to come from somewhere. But what if you are powering the flags for a server-rendered website, or an API? Couldn’t the flags come directly from your server application? 

Customers running server side applications commonly requested the ability to do just that. The problem for us was that our API had all the logic for our flag engine. Our SDKs were simply thin wrappers that hit our API. We needed to move the processing from our API to our SDKs. But we provide SDKs in 9 (count them!) different languages. Our API is written in Python. How do we get our rules engine from Python to Ruby? Or Elixir? Or Rust? We realised we were at the bottom of a very tall mountain.

Like most difficult problems, this one could be reduced to digestible chunks. So we rolled our sleeves up and started hacking. We factored our rules engine out into a Python library, then wrote a LOT of test cases to exercise it.

Then we rewrote that engine in all 8 other server-side languages. 
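The pattern above can be sketched as follows: keep the rules engine a pure function, and drive every language port from one shared, language-agnostic set of JSON test cases. The rule format, function names, and cases here are hypothetical stand-ins for the real engine, shown only to illustrate the approach.

```python
import json

def evaluate(feature, identity_traits):
    """Toy rules engine: return a feature's flag state for one identity.

    A real engine supports many operators and segment semantics; this
    sketch handles a single hypothetical EQUAL rule to show the shape.
    """
    for rule in feature.get("segment_rules", []):
        if rule["operator"] == "EQUAL" and \
                identity_traits.get(rule["trait"]) == rule["value"]:
            return rule["enabled"]
    return feature["default_enabled"]

# One shared spec (in practice a JSON file shipped to every SDK repo)
# exercises each port identically, so all 9 languages must agree.
SPEC = json.loads("""
[
  {"feature": {"default_enabled": false,
               "segment_rules": [{"trait": "plan", "operator": "EQUAL",
                                  "value": "enterprise", "enabled": true}]},
   "traits": {"plan": "enterprise"}, "expected": true},
  {"feature": {"default_enabled": false, "segment_rules": []},
   "traits": {}, "expected": false}
]
""")

def run_spec(spec, engine):
    """Assert that an engine implementation matches every shared case."""
    for case in spec:
        assert engine(case["feature"], case["traits"]) == case["expected"]
```

Each port (Ruby, Go, Rust, and so on) reimplements only `evaluate` and runs the same spec file, which is what makes rewriting the engine eight more times tractable.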

It was a lot of work, but it unlocked a huge amount of flexibility in our platform. You can now decide where your flags are generated, and you can use flags to drive server-side applications with no additional latency. You can read more about the design here.
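To illustrate what "no additional latency" means in practice, here is a hedged sketch of local evaluation: the server-side SDK fetches the environment document once, caches it, and answers every flag check in-process. The class, the `fetch_document` callable, and the document shape are assumptions for illustration, not the real SDK API.

```python
import time

class LocalFlagClient:
    """Minimal local-evaluation sketch: one fetch, many in-process checks."""

    def __init__(self, fetch_document, refresh_seconds=60):
        self._fetch = fetch_document      # e.g. one HTTPS GET to the Edge API
        self._refresh = refresh_seconds
        self._doc = self._fetch()         # cache the environment document
        self._fetched_at = time.monotonic()

    def is_enabled(self, feature_name):
        # Refresh the cached document only when it has gone stale; each
        # individual flag check is a dictionary lookup, not a network call.
        if time.monotonic() - self._fetched_at > self._refresh:
            self._doc = self._fetch()
            self._fetched_at = time.monotonic()
        return self._doc["features"].get(feature_name, {}).get("enabled", False)
```

The design choice is the trade-off named earlier in the post: flags may be up to `refresh_seconds` stale, but the per-request cost drops to effectively zero, which is exactly what server-rendered sites and APIs asked for.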

Putting it all together

If you are going to make a breaking change, you should bite the bullet and put as much as you can into the release. Releasing the Edge API at the same time as the new SDKs meant that we could point them all at the new Edge endpoints. Sure, we still needed a breaking change, but at least there were two huge benefits to that change: an Edge API and the server-side flag engine. 

We spent a lot of time designing a migration process for our customers that would eliminate any downtime as a result of that migration.

And then we waited. Bottom line: we were scared to push the button. We have hundreds of customers relying on the platform, which powers applications used by hundreds of millions of their customers. After a lot of testing, checking, and rechecking, we took the plunge and released everything over a couple of hours. 

We work a lot with customers to help them implement feature flags, and in many cases flags can absolutely help with scary releases. But for this release, we didn’t use feature flags: there were way too many moving parts for it to be powered by a single flag. Existing customers can request that we migrate their data over to the new Edge API and then start using the v2 SDKs. New customers deploy onto Edge right away. Lucky them!

We knew that this was a fundamental shift to our platform so we decided to roll it out to all users going forward. We also decided that all of our existing and future customers should benefit from the best technology we have. With that in mind, this is free and included in all of the plans at Flagsmith. We hope it helps your team to release faster and continuously improve your products!



