Dask

Interview with Greg Hayes: Director of Engineering, Dask
By Ben Rometsch on June 14, 2022

They may spend a year or two trying to develop and perfect a given machine learning model and then they decide they have got something that is valuable.

Greg Hayes
Director of Engineering
https://pod-feeds.s3.us-east-1.amazonaws.com/qSI4o7GqN.mp3

Check out our open-source Feature Flagging system – Flagsmith on GitHub! I’d appreciate your feedback ❤️

I am excited to introduce to you, Greg Hayes. I will be totally honest about this, Greg. I am completely naive when it comes to big data and data processing, so some of this discussion is going to be for my benefit, where you can explain to me how all this stuff works. I would like to welcome you on and ask you to introduce yourself, your project, and your business.

It is great to be on. I appreciate the opportunity to get a chance to chat. I lead the engineering team at Coiled Computing. Coiled Computing is working on making it easy to deploy Dask in the cloud and to scale Python for scientific computing or any other business purpose that users may have.

Do you want to talk a little bit about the origins of Dask and what was the cause of the first commitment to that project?

When Dask was originally built, Spark had started to take hold, and it is in the JVM. There was a recognition that there was a very large Python-native scientific computing community. A lot of libraries in Python and that entire ecosystem are incredibly solid, but they are limited to single-core, in-memory compute paradigms. There was a recognized need, as data volumes were scaling up, to be able to do distributed computing. People wanted to be able to use native Python in a distributed computing environment. The genesis of Dask was to see if you could find a way to solve that problem and build on top of that existing ecosystem.

How did you find yourself working in this area? Maybe not so much now, but a while back, it was a fairly specialized niche. Is that fair to say?

I think so. A lot of people that come into scientific computing with Python follow a background somewhat similar to mine. My graduate work is in synthetic organic chemistry, in a department that eventually ended up founding a combinatorial chemistry program, combinatorial synthesis for drug design. I had had exposure to big data processing and scientific computing in that context. I left that for a while when I was starting my career and then came back into that domain through product design innovation and the application of big data to product design, trying to do data-led development as opposed to more exploratory processes.

When you say product design, do you mean for new drugs?

I left that. It was industrial coatings and chemistry, leveraging historical data to design products more effectively, faster, and more efficiently. That eventually led to machine learning. I spent some time working in that space for a while. That was my first introduction to Dask. It is when you start working with data that outgrows your laptop. One of the nice things about Dask is it allows people to get started on a local machine with data that does not necessarily fit in memory.
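To make that concrete, here is a minimal sketch of what getting started with Dask on a single machine can look like; the file pattern and column names are hypothetical placeholders.

```python
import dask.dataframe as dd

# Dask reads the files lazily and splits them into partitions, so the full
# dataset never has to be loaded into memory at once.
df = dd.read_csv("data/events-*.csv")

# Operations look like pandas but only build a task graph.
daily_totals = df.groupby("date")["amount"].sum()

# compute() triggers execution, streaming partitions through memory.
print(daily_totals.compute().head())
```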

I did that for a while. I ended up moving into another organization that had committed to building a data science function, and I led the tech stack for that. You then get into questions of, “How do you scale up beyond the local machine? How do you then move those workloads into production?” I led the MLOps effort and the development of the tech stack, where we were looking to choose a platform that would allow data scientists to get work into production faster. That is when we started leveraging Dask, Kubeflow, and a few other open-source tools to try and accelerate that development process.

Out of interest, what size of workloads are you talking about at this point in time?

Prior to coming to Coiled, we were looking at hundreds of gigabytes or maybe up to a terabyte at times. We see people nowadays using Dask for hundreds of terabytes of workloads, processing geospatial images, and things like that. It is massive data sets.

The processing of this stuff, how much of that now is machine learning compared to regular computation?

The way that I think about breaking this up is you have data engineering workloads, which would be bringing in data ETL or ELT processes where you are doing transformation, cleansing, and prepping that data, in some cases for business analytics and in other cases for data science users. They are going to then pick that data up in some raw format. They do munging work with that data in combination with other data sets to do either inference at scale or to train machine learning models in the hopes of finding something they may end up deploying into a production setting.

Is that changing over time? My naive assumption would be that more workloads are getting pushed into machine learning models.

My experience has been you will have significantly more effort in enterprises around data engineering processes because you have a lot of data coming in. The volume of data that organizations are dealing with is growing at an incredible pace, and getting that data into a format that is understandable, usable, and can be leveraged in conjunction with related data sets is a massive undertaking. My experience has been that you typically see more data engineers bringing data in and prepping it. That data also gets used for search and purposes other than data science workloads. You also see business analytics, basic reporting, and a lot of other uses for that data. We have seen more organizations investing in data science. They are building up those functions.

How did you start getting involved in Dask in that first instance?

I was working in another organization. We were looking for ways to accelerate the deployment of machine learning models. It is pretty well documented that one of the biggest challenges that most companies are having nowadays with machine learning is figuring out how to accelerate the deployment of machine learning models. The standard process that a lot of organizations use nowadays is what we refer to as rewrite and deploy. You have a data science team that works separately from the rest of the business. They are off doing their own thing, maybe working in their own language, or even a combination of languages, MATLAB or Python, take your pick. Oftentimes, it is a bit of a mismatch.

They may spend a year or two trying to develop and perfect a given machine learning model and then they decide they have got something that is valuable. The business wants to test it in production. They give it to a set of software engineers who then rewrite that model in some production language that team is comfortable with and then deploy it. That takes another 12 to 24 months.

It is a very long life cycle. What organizations are starting to turn to is figuring out how to let data scientists more readily deploy their work into production in an iterative manner rather than spending years on that rewrite cycle. The approach that we were attempting to adopt was figuring out how to allow data scientists to quickly deploy maybe even prototypes into production and test them rapidly for business value.

If they yield value, you then iterate on them to improve the performance of the models. That brings in a whole host of constraints. You need to be able to write unit tests and do CI/CD. You need to integrate your workflows with the rest of the Agile team, assuming you are using Agile, in a coherent manner. We spent a lot of time trying to figure out what the best language was that we could use and what tools fit within those languages. Eventually, we landed on Python as the language that we wanted to use organization-wide, and then we started exploring how you could develop in an iterative manner. A logical approach seemed to be to start with in-memory datasets, which allows data scientists to do their prototyping and their exploration very quickly.

When they need to scale up or scale out, you can choose a tool that lets you do that. Once we had selected Python as the way that we wanted to work and the language that we were going to use, Pandas becomes the natural in-memory choice, and Dask is a very natural step for scaling out when that becomes necessary. That was my introduction to Dask. It enabled our data scientists to leverage some of the best components of Python for machine learning and develop models in an iterative manner.
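A rough sketch of that pandas-to-Dask path, with hypothetical dataset paths and an illustrative cluster size:

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# 1. Prototype on a sample that fits in memory using plain pandas.
sample = pd.read_parquet("sample.parquet")          # hypothetical sample file
prepared = sample[sample["score"] > 0.5]

# 2. The same logic applied to the full dataset with Dask, running on a
#    local cluster here, but equally against a remote scheduler address.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

full = dd.read_parquet("full-dataset/")             # hypothetical directory of parquet files
prepared = full[full["score"] > 0.5].persist()      # keep the result on the cluster
```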

What is your current day-to-day involvement with the open-source projects?

After building out that team and engaging in that process for a few years, I started looking around. I had always wanted to be more closely involved in open source. I saw that Coiled was hiring and looking to build out the engineering side, so I reached out, and it seemed to be a good fit. I lead the engineering team at Coiled, where we are building the ability to deploy Dask at scale.

Can you talk about the evolution of the project? As far as activity goes, it is definitely up there. You have got a very large number of contributors and a lot of activity going on. How many people are closely involved in looking after that?

In the Coiled engineering team, we are about 13 to 15 people in total, in that ballpark, doing the larger contributions. Dask is also heavily integrated into the RAPIDS project, so you have support out of the Nvidia team, and they have a large number of contributors engaged in that. Anaconda contributes as well, and then there are a lot of other projects. There is an entire ecosystem around Dask. There is work on Xarray. We have been involved with the Optuna open-source project. There is a very long list of other organizations or projects that we engage with to maintain and build out that ecosystem.

That is unusual from the folk that I normally talk to. How did you go about managing the governance of that in terms of making decisions if you have got large organizations like Nvidia? Who is steering the boat?

There is a formal governance doc for the Dask project. We are engaged with NumFOCUS as well, and we follow the governance docs. We have monthly meetings for the entire community. Anyone is welcome to come. That is where decisions get made in aggregate. It follows a model quite similar to Pandas, which is also part of NumFOCUS.

Is it mainly Coiled folks that are working on the project day-to-day or are you getting lots of pull requests and work coming in from outside of that?

There are a lot of contributions from the Coiled folks and a lot of contributions from the Nvidia and CUDA folks, and then it falls off. That is fairly common with a lot of open-source projects. You have a set of people that are making larger contributions, and then there is a long tail. With the Dask project, that tail is long because there are integrations with many other parts of the scientific computing stack.

In terms of the licensing that you chose, has that been consistent throughout the lifetime of the project?

Generally, it has been pretty consistent with a lot of the other projects that you see that are part of NumFOCUS and the Python scientific computing community. It is BSD 3-Clause, and I do not think that has changed for a long time.

Has that ever caused any issues for the projects with a different license or is it always been fairly straightforward from that side of things?

I do not know of any large issues that have ever arisen around licensing. We have had conversations with some people who have wanted some minor adjustments to the licensing so they were comfortable making contributions, but I have never heard of any large issues or concerns around the licensing. I can tell you that licensing as a general topic is something I was engaged in conversations about with other organizations even prior to being part of Coiled. It is considered by enterprises when they look at what open-source projects they are going to bring into their tech stack.

Do you ever get feedback from them like, “We can't use this because it is BSD 3”? I guess BSD 3 is fairly permissive from that side of things.

The general consensus that I have heard from stakeholders since joining Coiled, and also prior to that, is that most are comfortable with any of the BSD licenses, MIT, and Apache. The copyleft licenses are the ones I have heard reservations expressed about when organizations and enterprises are looking at pulling an open-source project into their tech stack.

What tools do you use to manage that community and how effective are they?

On the community side, almost all of the engagement happens through GitHub. There are GitHub issue trackers for the community-level repositories and for each of the individual projects. It seems to work well. Almost all of the interactions happen there, certainly with contributors or people that are deeply embedded in the project. We also have Discourse channels and we see some engagement there, a little bit on Stack Overflow, and several other places. There are some interactions on Reddit, and we see some conversations coming in there. We generally try to meet people where they are. We want to be as welcoming as we possibly can to new people coming into the project.

Are there any aspects of GitHub that you would like to change or things that you would like to see them do differently?

GitHub has worked well for managing interactions with the project. From a community perspective, it has been good. The great thing about GitHub from a community point of view is anyone who comes into the project can go in and see the entire history of the conversation that has happened. If you look at some other channels and more chat-based conversations, it is hard to get that history sometimes. When someone comes into the project and a feature request has been made, they can search the GitHub issues. Oftentimes, they will find that other conversations have already happened and see why things were deliberately removed or added, so they have the entire context. That is invaluable for people that are coming to the project, especially for long-running projects.

We moved all of our engineering to GitHub. I have been blown away by a platform that serves huge projects like yours down to ones with twenty people. The product design is pretty incredible considering the differences in requirements that different folk have. I am always blown away by how good it is. I totally agree with the value of having the code, pull requests, issues, and conversations all there from the start of time. As far as the project is concerned, it is super powerful.

I can't imagine what it would be like to try and migrate all of that history to some other platform.

Like the closed pull requests; the value in that is tremendous. In terms of the usage of the platform, is that something that you guys are tracking in any way?

The usage of GitHub?

No, Dask itself. What we see is the tip of the iceberg; we do not know how much is under the water that we are not aware of.

That is something that we have been looking at quite a bit. We look at PyPI stats, the frequency of downloads from PyPI. We are also looking at traffic coming into our docs pages and seeing how people flow through the website. We are doing some redesigns, trying to make that easier to navigate and easier for users to get the information that they need to be successful with the project. We also look for engagement in other areas. We have a sense of where Dask is being integrated into other tools, so we can see secondary usage as well.

Is that because they are reaching out to you directly?

We have engagement there as well. Some of the other projects leverage Dask as a distributed computing platform, and we have engagement with them too.

How did you deal with breaking changes within that interface to the platform and the design of that side of things?

We try very hard to go through deprecation cycles when we make any changes to the public-facing APIs. We have had a few cases where users have reached out; maybe they have been using a non-public-facing API. We will try to revert those changes and let them go through a deprecation cycle where we can. When you have lots of people using the platform, you do not necessarily always know, and you want to be respectful of other projects that are leveraging your tool. You want to enable them as well.
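For illustration only, and not Dask's actual internal helper, a deprecation cycle for a public keyword argument often looks something like this sketch; the function and argument names are hypothetical.

```python
import warnings

def rechunk(array, chunks=None, chunk_size=None):
    """Hypothetical public function renaming `chunk_size` to `chunks`."""
    if chunk_size is not None:
        # Keep the old keyword working for a few releases while pointing
        # users at the replacement.
        warnings.warn(
            "`chunk_size` is deprecated and will be removed in a future "
            "release; use `chunks` instead.",
            FutureWarning,
            stacklevel=2,
        )
        chunks = chunk_size
    # ... the real implementation would use `chunks` from here on ...
    return array
```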

How did you get a feel for whether you are doing that too quickly or too slowly?

Typically, because we have good solid engagement with our collaborators on GitHub, they will let us know if something has happened and then we can go back and revert those things. 

Have you had any kind of calamitous moments within the lifetime of the project?

Not since I have come into the project in a more active way. I do not know that I would be able to say what maybe happened many years ago, but not since I have come into the project in a more active sense. I am sure that there were some learning moments as we were getting started.

What is the origin of it? Who worked on the first 50 commits of it?

Matt Rocklin was the Founder. He is one of the original contributors. Jim Crist is also one of the original and very active members of the Dask community.  

Is he still actively working on the project?

Matt founded Coiled Computing and Jim is still working on the project. He is part of the Coiled team.

Can you talk a little bit more about Coiled? That is a commercial business that is built partly around Dask.

One of the things that we have found, and I even experienced this in my previous life, is that you decide you want to use Dask, and there are some tools out there for deploying Dask, but when you move into an enterprise setting, there is a whole host of capabilities that companies want to see implemented: security, scalability, reliability, and all those sorts of things.

In my previous life, we had used some of the existing tools and found that we liked what Dask brought to the table, but we wanted a more robust way to manage it. Coiled did not exist then, so we ended up going down a different path. Coiled has since come into being, and it is focused on making it easy to deploy Dask clusters quickly, at scale, for a user or a company in their own cloud-hosted environment. Our mandate is to deploy clusters quickly, scalably, and easily, and to enable users to use Dask and scale their Python code to run in production.
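Launching a cloud-hosted cluster through Coiled looks roughly like the sketch below; the exact arguments, account setup, worker count, and the S3 path are assumptions and may differ from Coiled's current API.

```python
import coiled
from dask.distributed import Client
import dask.dataframe as dd

# Coiled provisions the workers in the user's own cloud account.
cluster = coiled.Cluster(n_workers=20)           # worker count is illustrative
client = Client(cluster)

# From here on, ordinary Dask code runs against the remote cluster.
df = dd.read_parquet("s3://my-bucket/events/")   # hypothetical dataset
print(df.groupby("user_id").size().compute())
```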

Would it be fair to say that these workloads are being run all over the place? It is not like 90% of it is in AWS. I am assuming there are a lot of private-tenancy data centers and things like that. Is that fair to say?

It is true that there are many organizations that deploy Dask on-premise. Coiled's focus is on deploying in a cloud environment. We are deploying in AWS and GCP, with some of the other large public cloud providers on our roadmap.

Does AWS have native products that are similar to this?

I am not aware of any AWS native products that are targeted specifically at scaling Python natively. Dask would be unique in that regard.

What have you got planned for the future? Have you got large things that you are still wanting to work on or are you more maintaining things?

There is a lot that we have left to do. On the Coiled side, we released our V2 deployment, which made a big improvement in the user experience for deploying Dask clusters. We are also moving toward feature parity with V1, which enables GPUs and spot instances. We also want to do a lot of work in the near future around adaptive deployments, because the workloads that you see for data engineering and data science are not necessarily static. They flex, and a lot of our users are expressing a continual interest in that adaptive scaling capability. As the volume of data flowing through a pipeline or being used in a workload grows, we want to be able to scale a cluster up and down to manage their infrastructure costs. Those are some of the things on our near-term roadmap.
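As a reference point, Dask's distributed clusters already expose an adaptive mode of this kind; a minimal sketch with illustrative worker counts:

```python
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=2)
# Let the scheduler add and remove workers based on pending work,
# staying between 2 and 20 workers as the load grows and shrinks.
cluster.adapt(minimum=2, maximum=20)
client = Client(cluster)
```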

These data workloads are getting larger and larger and larger. Do you ever worry that the amount of data that we are generating as a society is going to start out-pacing these tools and bits of hardware that they are running on?

Is it a concern? Yes. As technology developers, as we start to deal with some of these challenges, we are finding different ways to handle that. Combining streaming with large-scale data processing, where you are not handling the same data on an ongoing basis, helps reduce the compute costs, as does incremental processing that updates in place. Those are all things that we have on a longer-term roadmap for Dask, and they basically help make the processing of the data more manageable. That is something that companies are definitely looking at. We have these large volumes of data coming in, and most of the time, that data is not going to be touched. It is at rest.

Long-term, any time a data point changes or we get an update to something, we do not want to reprocess five terabytes of data. We want to reprocess the little bit that may have changed. Those are all capabilities that are being built out in Dask, Coiled, and other places as well. There are unique ways to deal with this; you do not want to reprocess large volumes of data.

Is the question better framed as one of cost in relation to data volume rather than technical capability?

It is certainly both. Petabyte scale is a ridiculous volume of data, but we have tools that can handle it. From a company's perspective, they are going to be looking at ways to extract value from that data in as efficient a manner as possible. That is oftentimes a cost driver. There was a point in time when extracting usable information for businesses from satellite data in a cost-effective, scalable, and time-efficient manner was not achievable.

We can achieve that now, and we can do it in a cost-effective manner. Continuing to build on that is incredibly valuable; as technology developers like Dask, Coiled, and others enable that, those costs keep coming down and things become more efficient. That enables that information, satellite data in this example, to be used in more granular ways or in different ways to add value to other businesses.

How much of this is down to folks like Nvidia and general-purpose GPUs as opposed to CPUs? The satellite example you gave, did that suddenly come into the realm of possibility on account of software, hardware, or both?

I think it is both. Certainly, the work that is being done in GPUs enables things that would not have been manageable in a time-efficient or cost-effective manner many years ago, but there is still a significant place for CPUs in handling a lot of this data.

Is it fair to say that this area of the industry has few purely commercial organizations within it? My impression is that, especially around this side of things, entities, organizations, and software projects seem to be open by default, because if you have not got the data, there is no value.

It is certainly true that if you do not have the data, then there are certain kinds of value you are not going to be able to extract.

Are there pure closed source companies competing with Coiled?

Not that I am aware of. A lot of the tools for big data processing got their start in the Apache ecosystem, but there have been others that have grown out of that. A lot of that ecosystem was developed around JVM-based tools. There is a solid recognition that data scientists want to use Python, and there is a well-developed, mature ecosystem of tools to enable data scientists.

That is where they get started. They can get started easily in a local computing environment. Being able to then take that work, if you are going to have to scale it up or scale it out, is invaluable, both in terms of being able to train people and get them started, and then being able to leverage the skills they have built across a wide variety of domains.

Is there anyone that you want to give a special hello to, somebody who has been working on the project, or members of the community that are stand-out folk in your mind?

I have already mentioned many of those folks. It is everybody on the Nvidia team and all the work they have done. Matt and Jim have been contributing for a long time, but within our own team, Florian Jetter and James Bourbeau, their expertise and contributions to the community have been invaluable. I can't say enough about what they have done to grow the project.

Your API is based on Python. I saw early benchmarks of the 3.11 performance. There are some crazy numbers out there. Can you talk about that? The people in your world, are they running CPython or running crazy versions of the language runtime?

If you look at how Dask is built, we leverage NumPy and Pandas, which pretty much run exclusively on CPython. There is nothing crazy in that regard. The entire Python community is going to benefit tremendously from the work that has been done to try and accelerate the performance of Python. I saw those numbers. They are exciting. It is nice to see the work that is being done there.

It is the middle of May 2022, and there was a story I saw. Some workloads are 50% to 80% faster, which is completely unheard of. Normally, you are talking about a single-digit percentage for a particular workload.

The work that is going on there is exciting. Everybody is going to benefit from that. It is incredibly valuable.

With the workloads that you guys are driving, are they generally running pure Python code as opposed to something that is running in a natively compiled binary?

It depends. If you look at the Dask project, it is broken up into two high-level repositories. You have the Dask collections, which live in what we refer to as the dask repo. The collections are data frames and arrays, and then you have delayed and bags. Data frames and arrays run on top of Pandas and NumPy, so you are running down in C code at the execution level. When you talk about bags and delayed or futures, those can certainly be pure Python. It depends on what specific problem a given user is trying to solve. One of the wonderful things about Dask is you have a lot of flexibility in terms of what you are trying to unpack.
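A small sketch of that split: DataFrame and Array delegate to pandas and NumPy (compiled code under the hood), while Bag and Delayed can wrap arbitrary pure-Python functions; the parsing function here is a made-up example.

```python
import dask.array as da
import dask.bag as db
from dask import delayed

# Array: each chunk is a NumPy array, so the heavy lifting happens in C.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# Bag: arbitrary Python objects and pure-Python functions.
def parse(record):
    return {"length": len(record)}

bag = db.from_sequence(["alpha", "beta", "gamma"]).map(parse)
print(bag.compute())

# Delayed: lazily compose ordinary Python calls into a task graph.
total = delayed(sum)([delayed(len)(s) for s in ["alpha", "beta", "gamma"]])
print(total.compute())
```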

It has been fascinating talking to you. I appreciate your time, and I look forward to seeing where Dask and Coiled land on their current projections. It has been super interesting.

Thank you very much. I have enjoyed being here. It has been a pleasure.

Take care.

About
Greg Hayes

Greg Hayes, Senior Director of Engineering at Coiled. Greg is a technical leader, Data Scientist, and Chemist with over two decades of experience building and leading teams that leverage design thinking to solve real-world problems for users.

In his current role, he leads the Engineering teams at Coiled, a startup that makes it easy for Data Scientists and Engineers to use Dask to scale their Python code to the cloud. In his free time, he enjoys woodworking, reading, and spending time with his family.
