Why LaunchDarkly Went Dark During the AWS Outage—And Why Flagsmith Didn’t

Matthew Elwell

Context

On October 20th, LaunchDarkly suffered a serious outage caused by the failures in AWS’s us-east-1 region. In this blog post, their SVP of Engineering details the cause and how they were affected.

I wanted to talk about this because, while the AWS service interruption took down large swathes of the internet, Flagsmith didn’t suffer any downtime. I’ll go into how our architecture differs from LaunchDarkly’s and, specifically, why we’ve set it up to be more resilient.

Let’s unpack the key parts of LaunchDarkly’s outage, drawing on their blog post.

Deep dive

us-east-1

“The first phase (between October 19, 11:50 PM and October 20, 11:40 AM) began late on October 19, when AWS us-east-1 experienced a major outage. Between 11:50 PM and 11:40 AM the next day, AWS services such as EC2, Lambda, DynamoDB, and Route 53’s control plane were degraded or unavailable. LaunchDarkly depends on these services, and many of our own capabilities were affected…”

Our Edge API also uses Lambda and DynamoDB, but it was not affected by the outage because we do not have any infrastructure deployed to us-east-1. This was a conscious decision: the region is AWS’s least stable, as suggested by the following data (obtained from Gemini).

AWS Region	Partial Outages (2022)	Total Outage Duration (2022)
N. Virginia (us-east-1)	23	61 hours 7 minutes
Oregon (us-west-2)	5	7 hours 0 minutes
Ohio (us-east-2)	4	4 hours 25 minutes
Ireland (eu-west-1)	1	25 minutes

Exposure to risk

Our service has been designed with redundancy in mind, but also simplicity: we use only a small number of AWS services. This keeps our reliance on their infrastructure to a minimum and limits our exposure to failures in the AWS platform.

This concept of simplicity also ties in to some of the key differences between Flagsmith and LaunchDarkly.

LaunchDarkly, as one of the incumbents in the space, offers a feature-rich platform with a surplus of bells and whistles. It’s an impressive platform, but these extras mean that they have to rely on complex infrastructure to deliver—increasing their exposure to failures.

Flagsmith, on the other hand, keeps things simple and offers all that you need for your feature flagging programs without any unnecessary functionality. In fact, you can self-host Flagsmith with just a single Docker image and a Postgres database. It doesn’t get much simpler than that!

Data sovereignty

“Event ingestion gradually degraded and resulted in data loss for both US and EU environments until 3:00 PM after which data loss stopped due to recovery.”

This statement intrigued me a little bit. From a data sensitivity and sovereignty perspective, I’m surprised to read that an outage in a US region of AWS had any impact on EU data. With Flagsmith, companies that need to ensure their data stays in (or out of) specific regions often opt in to our private cloud or self-hosted offerings, ensuring their data will never leave the locality that they specify.

LaunchDarkly’s widespread outage

“Specifically, server-side SDKs across all regions experienced connection errors, reaching ~99% globally.”

I picked up on two things here:

Firstly, as a mathematician, the “~99%” made me think of this proof, where one can prove that 0.9 recurring (0.999…) is in fact equal to 1. Have LaunchDarkly turned this around to prove that 100% = ~99%…?
Secondly, it is really surprising that an issue started by the failure of one AWS region had such a massive impact as to cause a near-complete loss of connectivity across their server-side delivery network.

When using Flagsmith’s SaaS platform, you immediately get access to our Edge API, which is deployed to eight different regions, each with a local datastore. This means that each region can operate and serve traffic independently. Even if we lose a region or, in a worst-case scenario, the head is cut off and we lose the ability to make changes to flags via the dashboard, our ability to serve flags is not affected, so our customers’ applications keep working. Our network architecture routes traffic to the nearest healthy region, which continues to serve traffic even when other regions among the eight are completely offline.
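To make the “nearest healthy region” idea concrete, here is a minimal sketch of the selection rule in Python. It is only an illustration: the region names, latencies and the `Region` and `pick_region` names are made up for this example, and the real routing happens at the network layer rather than in application code like this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    name: str          # hypothetical region label, not a real deployment name
    latency_ms: float  # round-trip time as seen from the caller
    healthy: bool      # outcome of the most recent health check


def pick_region(regions: List[Region]) -> Optional[Region]:
    """Return the lowest-latency healthy region, or None if none is healthy."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.latency_ms)


# Illustrative data: the closest region is down, so traffic moves to the
# next closest one that still passes its health check.
regions = [
    Region("region-a", latency_ms=18.0, healthy=False),
    Region("region-b", latency_ms=25.0, healthy=True),
    Region("region-c", latency_ms=140.0, healthy=True),
]
chosen = pick_region(regions)
```

Because every region has its own datastore, the chosen region can answer the request without calling back to any of the others.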

How to protect yourself from future outages

The above points are mostly commentary on the article from LaunchDarkly, in addition to covering how Flagsmith’s SaaS architecture was able to survive a large-scale outage in the AWS network. It’s also worth touching on other ways in which you can protect yourself from such occurrences. By choosing Flagsmith’s self-hosted or private cloud offerings, you are in control of the infrastructure and the architecture.

We can deploy our private cloud solution to any hosting provider and any region that you choose, including inside your own network. We then take care of the management, meaning that it still looks like SaaS to you, but with the benefits of self-hosting.

When it comes to self-hosting, you are in full control of the deployment and its uptime, meaning that you’re able to design for failure using your own policies, with our guidance on all things Flagsmith.
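One design-for-failure policy that applies wherever your flag service runs is to keep serving the last known flag values, or safe defaults, when the service cannot be reached. The sketch below shows the idea in Python. `ResilientFlagClient` and the `fetch` callable are hypothetical names for this example and are not part of any Flagsmith SDK; in a real application `fetch` would wrap whatever call your SDK or HTTP client makes.

```python
from typing import Callable, Dict, Optional

Flags = Dict[str, bool]


class ResilientFlagClient:
    """Serves flags from the last successful fetch, falling back to defaults."""

    def __init__(self, fetch: Callable[[], Flags], defaults: Flags) -> None:
        self._fetch = fetch                   # hypothetical: returns current flags or raises
        self._defaults = dict(defaults)       # safe values used before any fetch succeeds
        self._cache: Optional[Flags] = None   # last successfully fetched flags

    def refresh(self) -> bool:
        """Try to update the cache; keep the old values if the fetch fails."""
        try:
            self._cache = dict(self._fetch())
            return True
        except Exception:
            return False

    def is_enabled(self, name: str) -> bool:
        """Look up a flag in the cache, then the defaults, then assume off."""
        if self._cache is not None and name in self._cache:
            return self._cache[name]
        return self._defaults.get(name, False)
```

With this in place, an outage of the flag service means flag changes stop propagating, but the application keeps running with the values it already has.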

Our customer, Delinea, details some of the reasons they chose our self-hosted version here:

Having as much control as possible and ensuring things like business continuity, failover, disaster recovery should a system go down… these are all vital. Flagsmith checked a lot of boxes for us. It was self-hostable, it was feature-rich, and it was affordable. - Dariel Marlow, VP of Cloud Engineering, Delinea

Migrate from LaunchDarkly

If you were one of the customers who were impacted and you’re looking for a change, Flagsmith makes it relatively easy to migrate away from LaunchDarkly. You can learn more about our importer here.