Rill Data

Interview with Michael Driscoll: Founder And CEO, Rill Data
By
Ben Rometsch
on
November 30, 2021
Ben Rometsch - Flagsmith
Host Interview

The virtuous thing often turns out to be good for business. What’s good for society ends up being good for business.

Michael Driscoll
Founder And CEO
https://feeds.podetize.com/ep/OT2aVgHwZ/media

Check out our open-source Feature Flagging system - Flagsmith on Github! I'd appreciate your feedback ❤️

Episode Transcript

Welcome, Michael Driscoll. He’s had a glittering career and some super interesting experiences. Mike, how are you?

I’ve been great. Thanks for having me.

{{divider}}

Tell us a little bit about your background and what you’re working on.

On LinkedIn, I’m a “lapsed computational biologist.” I got my start in software back in the late ’90s when I was working on the Human Genome Project at the National Institutes of Health. That was my first exposure to big data, and it’s been a love affair ever since; I’ve started four companies and a venture capital firm since that period. I’m the CEO and Founder of a company called Rill Data. We’re focused on building an operational intelligence cloud service. We’re in the earliest stages and excited to leverage some of the open-source technologies that I’ve been lucky enough to be involved with over the last couple of decades.

{{divider}}

What was your first interaction with the open-source world? I presume that was back in academia. 

Way back in the day, when I was at the National Institutes of Health working on the Human Genome Project, I have a recollection of this. We were big Sybase users back then; Sybase was the database that ultimately became the technology under the hood of Microsoft SQL Server. That seems to be a common pattern for Microsoft: buy somebody else’s tech and make a lot more money off of it.

One of the engineers showed me what I recall was MySQL. He said, “Mike, check this out. This is an open-source database that we can run here and we don’t have to pay anyone for it.” That was probably one of my first exposures to open-source, and to seeing it move up the stack, because up until that point, open-source had been operating systems and programming languages but not things like databases. Databases were traditionally much more commercial.

When I was in college, I was a member of the computer society. I do remember several of us tinkering with getting Linux running on our machines in the early days, but that felt very hobbyist. It wasn’t until that first exposure to a replacement for Sybase that I could see the utility directly and think, “This could be big.”

{{divider}}

The CD with Slackware or something like that on the front of a magazine is a common theme on this show. It was quite hard to install back then. It was non-trivial and there was no real internet to speak of for trying to figure it out. I remember modifying an X Windows configuration file because I couldn’t get X to work. It was the last thing I tried: I changed one byte in a file from 31 to 33 so that it referenced the right graphics driver. It suddenly started working, and it was one of the greatest experiences of my entire life.

I certainly don’t envy those days of trying to compile Gentoo Linux from source for hours and hours on end. It’s probably instructive that the creator of Gentoo Linux ended up working for Microsoft many years later. Maybe there’s a full circle there. Microsoft has ended up being quite a supporter of open source.

{{divider}}

Gentoo is possibly the least Microsoft-like of all the *nix distributions.

Maybe even he got tired of twiddling bits to get his distribution working.

{{divider}}

I do remember installing Gentoo, when I was feeling like punishing myself for no good reason, a long while ago. It’s interesting as well with these iterations of Linux distributions and *nix. They are constantly reinventing themselves, which is amazing to see.

It’s like the dynastic cycles of history. You only have to wait long enough for people to forget how painful it was to manage your own machine. Memory fades and people get excited about using Linux again.

{{divider}}

I remember back in those days, in the late ’90s to early 2000s, the agency I was working for was using a Java application server called Blue Martini. It was $30,000 per CPU. People forget, and that’s not that long ago. People still do pay huge sums of money for databases. I don’t have a huge amount of experience with big data, but it seems synonymous with open-source products. I can’t think of a vertically specific big data product that is purely commercial. Is that wrong, then?

A few years ago, that would probably have been an inarguable point. Arguably, Snowflake is a big data warehouse and they would probably try to put themselves in that space, working at distributed scale. I certainly think the origins of big data tools have invariably been open-source. Academia is often the breeding ground for pushing the envelope, in fact incubating and creating many of these early tools. The other open-source project that I was very involved with, more in the 2000s when I came out to the Bay Area, was the open-source statistical language R.

I got exposure to that during my PhD; it was the toolchain that a lot of researchers were using, partly because academia never likes to pay for anything. It was always biased towards open-source tools like R, and certainly MySQL and Linux. The world of SciPy and NumPy, scientific programming within the Python ecosystem, has blossomed. The AMPLab out at Berkeley has been a source of a lot of innovation in these big data technologies, Spark most notably. The origins of big data software and tooling have indeed been open-source, but fast-forward to now, and that hobbyist era and chapter of big data are starting to end, like the era of personal computing.

It was a hobbyist culture that became mature and commercialized. Some of those same people who were hobbyists are now the CEOs and founders of these commercial businesses. We’re at a transition point for big data technologies and, frankly, for open-source, in terms of whether open-source is hobbyist or commercial. We’re starting to see some changes there as well.

{{divider}}

That seems counter to almost every other area of computing. I see where you’re coming from, but it runs against a lot of different areas where the migration tends to go the way it has with regular databases, which have tended to get more open. What do you think the reason for that is?

The way I think about open source draws on folks who have studied the maturation of industries, business authors like Clay Christensen and Geoffrey Moore who have written about this. You have periods of innovation and periods of commoditization. Traditionally, during periods of innovation, what was proprietary and gave a unique edge was something that no one else could do, or very few folks could do. They would charge a lot of money for that proprietary database or web server.

My view of the history of open-source is that it’s a commoditization wave that started at the lowest level and slowly makes its way up the stack, from operating systems to programming languages to databases. In general, you’re probably correct that as technologies mature, they tend towards becoming more open-source.

Probably the other axis here is not what the technologies are but how they’re delivered. The new axis of innovation is the movement from the server environment to cloud services. Databases are so commoditized. Then you have the emergence of someone like Snowflake, which is a commercial database provider whose axis of innovation is the delivery of the software as a pure cloud service. It’s fair to say that you don’t need to pay anyone to run a database on your local server, but when you add the layer of running it at scale, elastically, in the cloud, that is challenging. Those are the orthogonal axes of innovation. One is mature and the other is not.

{{divider}}

For someone like me who’s fairly inexperienced around big data, I’ve never worked on projects that generate enough signal to have these huge amounts of data. Interestingly, talking about your experience with the genome projects, how have these quantities of data evolved in size over time? What was the cutting edge years ago? You were involved with a business that was acquired by Snap, who I imagine probably have a fair amount of data and generate a fair amount of signal. For people who don’t know anything about the space, like me, what numbers have historically been talked about?

In some ways, the definition of what you consider big data is always evolving. We have more computing power in our pockets than what we sent to the moon on the first lunar mission. When we are trying to define big data, I’ll define it in a relative sense. I often think about three scales of data, scales of data that you can work with as a developer, and I’ll put some numbers around them. The first is what you can keep in memory on your desktop machine.

Another way to think of it is what you can put in Excel. Generally, that used to be thousands of rows before Excel would fall over. Now, it’s probably millions of rows. That’s also what you can put in a data frame if you’re working in something like R or Python. You can work with that data on your desktop in a versatile way.

The second scale is what you can put on a single server. It might be a beefy server in the cloud but it’s a monolithic machine where you don’t have to think about any real fancy architecture. I would call that medium data. Oftentimes, you’re like, “I can’t fit it in Excel but I could probably put it in a MySQL machine in the cloud.” You probably could safely get into the billions of records with a single machine. The realm of big data is where you have to break out of a monolithic architecture.

You have to go from a single-celled organism to a multicellular design: a distributed system where you’re managing multiple servers. That’s the scale at which every SaaS business running at scale has built these distributed data systems. Depending on the size of the records, tens of billions to trillions of records is where you’re in the realm of big data and distributed systems.

Snap was certainly at that scale. That’s why we created the open-source database Apache Druid at Metamarkets, the company I founded that Snap acquired. That’s how I think of it. Those numbers are always shifting, and there’s always a risk of overkill. You always have to choose the right tool for the job. What required a distributed data store years ago may not require one nowadays.

There’s an open-source data store called ClickHouse that’s quite popular. You can set up ClickHouse pretty quickly on a single machine. If you can store all your data in ClickHouse without having to manage a distributed service, that’s medium data. It’s fast and good enough. Engineers always need to choose the right tool for their scale. For me, the big jump is when you go from a monolithic to a distributed architecture; that’s where there’s a lot of overhead.
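
To make the “medium data on one machine” idea concrete, here is a minimal sketch of a single-node ClickHouse setup, assuming a local server and the clickhouse-driver Python package; the table and column names are hypothetical:

```python
# Illustrative only: a single-node "medium data" setup, assuming a local
# ClickHouse server and the clickhouse-driver package. Schema is made up.
from clickhouse_driver import Client

client = Client("localhost")  # one machine, no distributed cluster to manage

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ts DateTime,
        user_id UInt64,
        action String
    ) ENGINE = MergeTree ORDER BY ts
""")

# At this scale, aggregations over hundreds of millions of rows are
# typically fast enough on a single beefy server.
rows = client.execute("SELECT action, count() FROM events GROUP BY action")
print(rows)
```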

{{divider}}

ClickHouse comes up probably more often than any other project whilst I’ve been recording this show, in terms of people raving about it and knowing the stuff we’re talking about. I wasn’t too well versed in it, but it’s interesting how it almost seems to have won its space. If you want to run a SaaS business, you use PostgreSQL if you need a relational database.


If you didn’t use PostgreSQL, you’d use MySQL, and there’d be an interesting discussion around why you didn’t choose one of those. ClickHouse seems to be falling into that bracket. In terms of the innovation in moving to distributed systems, what were the projects that broke ground in that direction?

Think about what the jobs to be done are in a data-driven organization. There are a few different jobs to be done when you’re dealing with data at scale. One is moving data around at that scale. Increasingly, we’ve got chips in ships, cars, refrigerators and cash registers. Being able to manage those real-time event streams at scale requires a different kind of infrastructure: a distributed message bus. The Apache Kafka project is the most successful open-source technology we’ve seen to date for grappling with data in motion and helping move data from the edge to the cloud, and between different cloud systems as well.
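
As a rough sketch of that first job, here is what publishing an edge event to Kafka can look like, assuming a local broker and the kafka-python package; the topic and fields are made up for illustration:

```python
# Illustrative only: an edge device (cash register, car, thermostat)
# publishing one event per reading to a Kafka topic. Names are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("device-events", {"device_id": "reg-042", "reading": 21.5})
producer.flush()  # make sure the event actually leaves the buffer
```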

The second job to be done inside of organizations is the transformation of data. A very common pattern is to use Kafka to sink all your data to cloud storage, and then you need to do some transformation of that data: you need to normalize it and clean it. The data scientist’s joke is that 90% of the work in data science is data cleansing and wrangling. The king of ETL in the open-source world, to some extent, is Spark. It used to be Hadoop, but Spark is the winner for doing primarily batch-oriented processing, although when the batches get small enough it starts to look a lot like stream processing.
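
A minimal sketch of that cleanse-and-normalize step in PySpark might look like this, assuming events have already landed in cloud storage via Kafka; the paths and column names are hypothetical:

```python
# Illustrative only: a batch "normalize and clean" job in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanse-events").getOrCreate()

raw = spark.read.json("s3://my-bucket/raw/events/")  # landed there via Kafka

cleaned = (
    raw.dropna(subset=["user_id"])                      # drop incomplete rows
       .withColumn("action", F.lower(F.col("action")))  # normalize casing
       .dropDuplicates(["event_id"])                    # de-duplicate events
)

cleaned.write.mode("overwrite").parquet("s3://my-bucket/clean/events/")
```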

If we want to operationalize that data, take action on it and deliver it to downstream applications or dashboards, there’s often the need for a fast serving layer or storage layer. If you don’t need speed, you can probably get by with something like PostgreSQL, which can be distributed. You can certainly scale MySQL. There are a lot of ways to build distributed traditional relational databases; that’s one use case, and your costs will be commensurately lower with those tools. But if you need high-performance serving of data to downstream applications and things like dashboards, there’s a class of databases like Apache Pinot, ClickHouse and Druid.

They are three very popular open-source, time-series OLAP databases that are able to deliver generally sub-second speed to a downstream application. I often think of that as operational intelligence, in contrast with business intelligence, which generally doesn’t have the same performance requirements in terms of data fast in, fast out.
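
As an illustration of that fast serving layer, here is a hedged sketch of querying Apache Druid’s SQL endpoint over HTTP, assuming a local quickstart cluster with the router on port 8888; the datasource and columns are invented:

```python
# Illustrative only: the kind of sub-second OLAP query a dashboard would
# poll against Druid's SQL API. Datasource and columns are hypothetical.
import requests

resp = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": """
        SELECT ad_campaign, COUNT(*) AS impressions
        FROM ad_events
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
        GROUP BY ad_campaign
        ORDER BY impressions DESC
        LIMIT 10
    """},
)
print(resp.json())  # data fast in, fast out
```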

Those are some of the technologies that I associate with the big data stack. When we talk about distributed technologies, there are a lot of other open-source tools out there that are pretty interesting in what we’re calling the modern data stack, but not all of them are built for scale. You might not always push terabytes of data through some of these other tools.

{{divider}}

In terms of the history of some of these tools, a lot of them seem to be projects that were started internally, in a closed manner, at organizations dealing with large amounts of data and then spun out as open-source projects. Was that how Druid sprang into life and became part of the Apache project?

That’s a common thread: companies solve an internal problem. There’s that saying that necessity is the mother of invention. A decade ago, we were attempting to solve the problem of delivering interactive dashboards to a client in the digital media and advertising space. They had tens of billions of records. Our vision was that if we were going to make this data useful, we should allow them to explore the key metrics of that data. In this case, it was things like the price of the advertising, the volume of ads that were served and the percent of ads that were successfully delivered. The advertising platform that was our customer lived and died by these metrics. They needed to be able to track them minute by minute, with many intraday decisions being made on that data.

We started with a partially open-source database called Greenplum, which was based on PostgreSQL. We then migrated to a key-value store; we used HBase. This is the common pattern a lot of companies follow when trying to get performance at scale: they end up using a key-value store and materializing all of the possible combinations of keys that they think they’ll want to explore. The problem is that that keyspace expands exponentially. As you add additional dimensions, at scale, you can end up managing more keys than you have events.
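
A quick back-of-the-envelope sketch shows why this explodes. Assuming four hypothetical dimensions with modest cardinalities, the count of pre-materialized group-by keys is already in the hundreds of millions:

```python
# Illustrative only: counting the keys you'd pre-materialize for every
# combination of dimensions. Cardinalities below are hypothetical.
from itertools import combinations
from math import prod

cardinality = {"campaign": 1_000, "country": 200, "device": 10, "exchange": 50}

total_keys = 0
dims = list(cardinality)
for r in range(1, len(dims) + 1):          # every subset of dimensions
    for combo in combinations(dims, r):
        total_keys += prod(cardinality[d] for d in combo)

print(f"{total_keys:,} pre-computed keys")  # ~113 million for 4 dimensions
```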

That doesn’t work in your favor. Ultimately, an engineer at Metamarkets, Eric Tschetter, said, “I have an idea. Let’s roll our own data store and take advantage of these assumptions that we have about what we’re doing downstream.” The advantage of building it to solve your own problem is that there were a lot of simplifying decisions we could make and things that we deliberately didn’t do.

Druid was not built to be a general-purpose database. It did not support joins at all initially; we eventually supported those. As a result, we were able to deliver to our customers, within a few months, a mind-blowing experience compared to everything else that was out there. We solved our own problem. Druid was the result of having a lot of pain internally that we could not find an existing solution for.

{{divider}}

At that point, you’re designing the most opinionated product you could have, because you only care about fixing this one thing and you’ve got no one else telling you, “Hang on, it would be nice if it did joins.” Can you talk a little bit about when your colleague said, “Let’s build a database”? It’s an amazing idea to have. Did Apache Druid as an open-source project materialize in your head when he said that to you, or did it take some time for the idea to form?

I suspect that, in some ways, engineers carry around ideas for world-changing technologies the way that screenwriters carry around, inside of them, a screenplay or book waiting to be written; the opportunity just needs to come along. I do think that Tschetter had been thinking about this problem for a long time and had been at previous companies where he’d experienced similar pain points. He probably thought that if this was successful, maybe we’d think about open-sourcing it. I think it’s the exception that proves the rule: it’s not usually a good idea for a young startup to roll their own database.

The initial reception on Hacker News, when we wrote an article saying, “We’ve built this data store that can process billions of records sub-second,” was not enthusiastic. People said, “You should have used QlikView.” In general, it takes a certain amount of fearlessness to decide that nothing else out there is good enough and that you’re going to build it yourself. You have to be very thoughtful when you make that choice, and a lot of it depends on whether the individuals involved are the type that might be able to pull it off. This was a 20% project. We said, “Let’s try it. If it doesn’t work, no problem, but we’re certainly not betting the company on this experiment to rebuild the backend data store.”

{{divider}}

There’s not much downside but potentially business-defining upside.

A good way to think about a lot of innovation is as option value. If you can contain those bets and have enough of them, eventually you will hit. There were a bunch of other bets that we made that didn’t go anywhere. We built an AI agent to automate the discovery of anomalies that turned out to be more like Clippy the paperclip than the oracle or sage advisor we had hoped for. That’s the right way to think about it. These bets can have upside, and if you can contain the downside, they’re worth making.

{{divider}}

Thinking from a business point of view almost in a selfish way, for-profit companies are selfish by definition in a way, why do you think it’s so common for these companies to open source projects that seemingly give them this commercial advantage?

There are a host of reasons. Enlightened selfishness is one way to think about it. Generally, when you’re doing the virtuous thing, it often turns out to be good for business anyway. There is an alignment in the long term for doing right; what’s good for society ends up being good for business. The inspiration of open source has always been, the way Tim O’Reilly has said it, “Give back more than you extract.” When we first were thinking about open-sourcing this, I remember talking to our investor Vinod Khosla about it. This was a board-level decision. It was a very powerful piece of tech that we had innovated, and we were talking about putting it out there, giving it away to an extent.

My view is that we had benefited so much from all of the work that came before us. We were using PostgreSQL and Linux. I had been very involved in open source. It felt like the right thing to do, especially because we weren’t directly monetizing that technology; it was embedded in our SaaS. A lot of companies choose to open source, one, because they want to give back. The engineers certainly feel compelled to be contributors, not extractors, to this great ecosystem of open-source technologies that they have benefited from.

Two, it’s not core to the business. That’s why I think a lot of companies like Pinterest, Airbnb and Lyft have been big contributors to open source. They’re advertising, hotel or transportation companies; an open-source caching tool they put out there is not core to who they are. Three, it certainly does help when you give back to an open-source community; it’s inspiring to other engineers. It’s a signal to folks out there that you’re working on cool problems and you’re able to solve them in innovative ways. It ends up being an attractor for talent because engineers love working in open source, and if the project is successful, they’re not captive to one company to work on it. If they leave the business, they can continue to work on the open-source project.

From a purely selfish commercial standpoint, there are two benefits. One is maintenance; we’ve all seen this pattern. If you embed technology into your business and that technology starts to age over time or is not maintained, it goes from being an advantage to a concrete foundation sitting in seawater. If the community gets behind it, you have access to something that will be well maintained and kept up over time, better than if it lived only in your organization. Commercially, it’s the better choice if you depend on it.

Finally, it can provide insight. Even if it’s not core to who you are, it can sometimes provide hints of where your overall service might have value. At Metamarkets, we sometimes found that we got lead generation from a company that attempted to do it themselves and build a stack including Apache Druid, and then ultimately realized it would be better to work with the folks who created it.

{{divider}}

We’ve experienced that a little bit ourselves. We’ve discovered a large organization with a large installation of our product that we had no idea was using it, and that’s a nice moment. In terms of the Apache Foundation, I have this vision in my mind of it being, not quite mysterious, but like the grand old wizard of open source. Can you talk a little bit about how that project became Apache Druid, as opposed to just Druid? What does that process look like? You have so much experience with Apache projects, but I don’t have any idea how that process happens.

There are two very important, distinct but related parts of an open-source project in terms of what makes it open-source. The first is the license. The choice of license is a big one. Initially, Druid was licensed as GPL, not Apache, and there were reasons for that. You see a whole host of companies, like Cockroach Labs and Confluent, with open-source community licenses that put restrictions on, and in some cases offer protection from, a competitor like AWS embedding their tech inside their cloud and monetizing it. Licensing is the first important piece.

The second element is governance. You can have many projects that are Apache-licensed, but they don’t have that moniker of being Apache Pinot, Apache Druid or Apache Kafka. To achieve that designation, you have to agree to follow the governance policies of the Apache Foundation. There’s a whole path by which something becomes incubated and then, over time, evolves to become a top-level project. Candidly, I’ve watched that from the sidelines; I’ve not myself been deeply involved in the Apache governance process. What I think is interesting about it is that it cuts in two directions.

Often you’ll see companies that push for Apache Governance in their early days. Let’s say it’s a group of developers that are inside of a business looking to democratize the governance of this open-source tool and not have it be dominated by one company. Oftentimes, these companies or projects have a singular driver and that’s natural. You’ll have a single commercial business that’s driving.

I’ve certainly witnessed that there’s always a little bit of tension between the Apache governance structure and that commercial company, because Apache takes governance pretty seriously. If they see folks trying to overly promote their commercial interests or do things that are against the Apache way, which they define, they’ve got a manifesto on what the Apache culture is, there are some wrist slaps that go with that, say for misuse of the trademark. One of the bigger elements of being a top-level Apache project is that Apache must own the trademark for the name.

{{divider}}

It is a pretty valuable asset for large projects.

People can’t go around calling themselves the Druid company without making sure that all of their public-facing materials accord with that Apache Trademark Policy.

{{divider}}

Whose idea was it to follow that path, as opposed to just putting it on GitHub and choosing a license?

The choice of the Apache license was ours while we were at Metamarkets, and then the choice to move to Apache governance came post-acquisition. Metamarkets was by then part of Snap, so ultimately Snap was the copyright and trademark holder. Certainly, with a lot of healthy cajoling from the larger community, it went to Apache governance.

{{divider}}

Snap is interested in selling crazy sunglasses and things like that, so it conforms to type again. In terms of what’s going on now and in the future, what projects or concepts are forming that are going to become the next big projects? It’s interesting as well that there has been this constant reinvention, from Hadoop to Spark, as people understand the technical problems and solutions. Where do you think there’s going to be new growth in areas around big data?

The emerging, most important class of data that we have in the world is event streams. As evidenced by the global chip shortage, we have put chips in everything. The consumer internet of things is no longer science fiction; it’s real. As we continue the instrumentation of the planet, we start to have this global digital nervous system, and the infrastructure has been laid down. That’s been a huge driver for a company like Confluent, which went public and is a $10 billion business commercializing Apache Kafka.

We’ve got these sensors, but sensors without actuation have no intelligence. It’s time for us to start building the next layer of intelligence on top of this global digital nervous system: to internalize all of these event streams coming off of cars, thermostats and soda machines, and operationalize that data. That helps drive better decisions and more autonomous decisions for businesses, platforms, services and products. That’s a very broad statement, but there’s a lot of work to be done.

How do you go from an event stream to improving a product or service? It’s very hard to do. It can take months for people to do a better job of routing vehicles in a fleet around hot weather. There are so many use cases and opportunities for real-time intelligence and decision-making over this new class of data. We’re in the early innings of that, and this is certainly where we’re investing our efforts and exploring.

{{divider}}

You’re talking very much about the final stage of doing something useful with all that information.

There’ll be later stages. A lot of folks are talking about AI; that’s the sexy area of data. AI is flying, but as the saying goes, “You need to walk before you run, and you need to run before you fly.” Many companies struggle with the much more pedestrian details of doing sums, counts, averages or distinct counts at scale, continuously, with low latency. Before you can do real-time predictive routing of vehicles, like for FedEx, you first need to know where the trucks are and where the bad weather is, and compute the simple things quickly. Those metrics can then be fed into AI applications and services to do higher-order things.
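
A toy sketch of those pedestrian metrics, running counts, sums and distinct counts over an event stream, might look like this in plain Python; the field names are hypothetical, and a real deployment would push this into a stream processor or an OLAP store like Druid rather than a single process:

```python
# Illustrative only: the "walk before you run" layer of continuously
# maintained simple metrics over an event stream.
from collections import defaultdict

class RunningMetrics:
    def __init__(self):
        self.count = defaultdict(int)    # events seen per truck
        self.total = defaultdict(float)  # sum of speeds per truck
        self.seen = set()                # distinct trucks (a distinct count)

    def update(self, event):
        truck = event["truck_id"]
        self.seen.add(truck)
        self.count[truck] += 1
        self.total[truck] += event["speed_mph"]

    def avg_speed(self, truck):
        return self.total[truck] / self.count[truck]

m = RunningMetrics()
m.update({"truck_id": "T-17", "speed_mph": 54.0})
m.update({"truck_id": "T-17", "speed_mph": 48.0})
print(len(m.seen), m.avg_speed("T-17"))  # 1 51.0
```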

My view from the inside is that many companies are struggling with exactly that. We have a scooter company that struggles to identify when a scooter is offline, abandoned or illegally parked in a city. It seems like a very small problem, but they haven’t found a way to solve it and get that information to their team within minutes.

{{divider}}

Societally, there’s going to come a point where the democratization of that data becomes more of a talking point. Speaking as someone who is technically experienced but not specifically around big data: there’s talk of data being the new oil, and of a certain class of problems where there are 3 or 4 companies on the planet that can work on them, because they’re the only ones that have the data. That never seems to get talked about, and it’s a fairly complex technical topic. The UK government has done a good job of opening up a whole bunch of the data within government. I wonder why no one seems to talk about doing that for private companies.

It’s challenging. People love to talk about the democratization of data. I’ve seen that every decade, there’s always a fascination with open data sets. Here in the United States, we had Data.gov. It was built during the Obama administration but by and large, most of these data democratization efforts have not been that successful. Certainly, with the rise of privacy concerns, that does make the likelihood of many data sets that have some personal, identifiable information in them less likely to be shared.

Some things may bridge this, but I have two parting thoughts on it. One, there is a lot of work being done on privacy that will enable companies to securely share data sets without violating either their commercial interests, keeping what they’re doing private, or consumer privacy laws. Homomorphic encryption is one term that’s been talked about; Facebook announced that they’re trying to analyze encrypted data safely. That’s one area, and the applications are often around marketing.

The second thing is that, rather than looking to share open data that’s already out there, what I often see as a pattern is this: getting access to existing data is sometimes very hard to do, even if people want to give it to you, because there’s so much bureaucratic cruft in getting access. Meanwhile, the cost of instrumentation keeps falling, and there’s never just one way to figure out what’s happening in the world.

If you want to know where people are going, whether that’s foot traffic or automobile traffic patterns, there’s an abundance of ways and data sets to provide insight on that. It could be sensors in cars, phones, traffic lights or street cameras. There are many ways that we’ve instrumented the planet, and we’re going to continue to add more sensors. Many companies will end up having their own private paths to developing large-scale data sets, and those paths will get cheaper and easier to build.

{{divider}}

Mike, thanks so much for your time. It’s been an interesting walk through an area that I’m not very well versed in. I wish you all the best with your new venture and look forward to seeing where it goes.

Thank you, Ben. It’s great to connect and chat. I look forward to talking more in the future.

About
Michael Driscoll

I’ve spent over two decades as a technologist, entrepreneur, and investor. Rill (2020-present) is a cloud service for operational intelligence; Metamarkets (acquired by Snap, Inc. in 2017) was a real-time analytics platform for digital ad firms; Data Collective (2011-present) is investing $2B+ in assets in deep tech; Dataspora (acquired by Via Science in 2011) was an early pioneer in data science; CustomInk.com (1999-present) is a leader in custom apparel online.

