What Netflix can teach us about machine learning infrastructure
In November 2018, Ville Tuulos, a machine learning infrastructure architect, became the first person to publicly dissect and discuss the Netflix machine learning infrastructure, at the annual QCon software development conference in San Francisco.
Although the talk took place almost four years ago, it is still highly relevant and provides a deep dive into the machine learning architecture developed at one of the world’s largest entertainment companies.
In this article, we are going to summarize his talk and discuss some of the concepts that he covers. Alternatively, you can watch the full 49-minute video on YouTube. All image assets used in this article have been taken directly from the video.
Who is Ville Tuulos?
Ville Tuulos is an ML architect who has been developing infrastructure for machine learning for over two decades.
During his career, he has worked as an ML researcher in academia and as an ML leader at several large companies, including Netflix, where he worked between August 2017 and March 2021. By the time he left Netflix, he was the company’s manager of machine learning infrastructure and led Metaflow, a popular open-source framework for data science infrastructure.
Today, Tuulos is the co-founder and CEO of Outerbounds, a company that’s developing modern, human-centric ML infrastructure based on Metaflow. He is also the author of Effective Data Science Infrastructure, a book that discusses how to make data scientists more productive.
Building ML used to be much more difficult than it is today
Tuulos begins his talk by comparing building ML infrastructure to building an online store, which was a huge technical challenge just two decades ago. Back then, store owners were forced to build the whole system themselves, starting with setting up servers, because cloud infrastructure didn’t exist.
Nowadays, however, platforms have emerged (e.g., WordPress with WooCommerce, Shopify, Etsy) that allow pretty much anyone, even those without technical expertise, to build their own online store. As a result, the biggest challenge of setting one up has more to do with having a good product and knowing your customer than with configuring infrastructure.
He goes on to add that the same thing is going to happen with ML infrastructure. While in recent years companies have had to build their own infrastructure from scratch, platforms like Qwak have solved this problem by providing advanced tooling out of the box. Today, building your own ML infrastructure is largely unnecessary; your time and resources are better spent elsewhere.
Why ML infrastructure is important for Netflix
Just a few years ago, machine learning infrastructure was a major technical pain point at Netflix. Today, largely due to the work of Tuulos and his team, the company’s ML development is becoming more human-centric, and its infrastructure is guided by two key principles:
- Data scientists should be more productive
- It should be easier to apply ML to different business problems
At Netflix, machine learning is used company-wide, and the company recognizes that the ML researchers and data scientists who build vision models using Python and TensorFlow, for example, are not necessarily the best people to build models for numeric problems (e.g., revenue models) using R.
Tuulos emphasizes that while it’s important to hire specialized data scientists for each problem domain, these specialists aren’t always DevOps experts who can also take charge of cloud infrastructure setup. This is where ML infrastructure comes into play.
Tuulos expects that, eventually, there will be some sort of standard solution that enables data scientists to apply machine learning to very different types of problems. At Netflix, however, the team wanted to be ahead of the curve and start solving their customers’ problems with ML right away, so they needed to build their own ML infrastructure.
What problems should ML infrastructure solve?
According to Tuulos, ML workflows can be divided into eight building blocks:
- Data store: Data scientists should have access to data, but they shouldn’t be tasked with setting up and maintaining a data store.
- Compute resources: These need to be sufficient for the task at hand, but data scientists don’t necessarily care where they come from.
- Job scheduler: A scheduling system should be deployed to orchestrate jobs and run recurring work, such as daily training.
- Collaboration tools: Collaboration and knowledge sharing are becoming more important as the field develops and teams need to get to deployment faster.
- Versioning: It’s best practice for data scientists to version their experiments and data. Unfortunately, this isn’t always done because it isn’t always easy to do. Tools like Qwak solve this problem.
- Feature engineering: Data scientists need to understand their data thoroughly, so they should be free to spend their time on this critical part of the workflow.
- Deployment: Deployment is a non-trivial task that most data scientists rarely deal with, especially if they come from a purely research background. This isn’t a problem if you have a solution like Qwak that automatically pushes models into production.
- ML libraries: Companies shouldn’t force libraries on their ML teams. As the number of ML applications within a company grows, ML teams should be allowed to choose their own ways of working.
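To make the versioning building block more concrete, here is a minimal Python sketch that uses only the standard library. It is not how Netflix, Metaflow, or Qwak implement versioning; the function name `experiment_version` and the example parameters are hypothetical. The idea is simply that an ID derived from the hyperparameters and the training data changes whenever either of them changes, so nobody has to name versions by hand.

```python
import hashlib
import json


def experiment_version(params: dict, data: bytes) -> str:
    """Return a short, deterministic ID for one (params, data) combination.

    Hypothetical helper, for illustration only.
    """
    digest = hashlib.sha256()
    # sort_keys makes the ID independent of dictionary key order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    digest.update(b"\0")  # separator between the params and the data
    digest.update(data)
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    data = b"user_id,minutes_watched\n1,42\n2,7\n"
    v1 = experiment_version({"lr": 0.1, "epochs": 5}, data)
    v2 = experiment_version({"epochs": 5, "lr": 0.1}, data)   # same params, reordered
    v3 = experiment_version({"lr": 0.01, "epochs": 5}, data)  # changed hyperparameter
    print(v1 == v2)  # True: same experiment, same ID
    print(v1 != v3)  # True: different experiment, different ID
```

A real platform would also record the code revision and the environment, but the principle of deriving the version automatically rather than asking people to maintain it stays the same.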
The above slide, created by Tuulos, illustrates the stages that should be considered when building ML infrastructure. The arrows indicate that, in general, the more infrastructure a step needs, the less data scientists care about how it’s done. In other words, ML infrastructure teams should care most about the things that data scientists care least about.
From idea to deployment in one week
In his talk, Tuulos describes how his team started a project to analyze sentiment in tweets written about a Netflix series. Although Netflix already had different tools that allowed them to execute each step in the model-building process, nothing was connecting them.
Although this didn’t affect the build process, it caused several problems in production and raised a lot of questions:
- How should we monitor the models?
- How do we run and access data at scale?
- How do we schedule the model to update daily?
- How do we make this faster?
- How do we iterate on a new version without breaking the production version?
- How do we let another data scientist iterate on her version of the model safely?
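The last two questions come down to isolation: experiments must never overwrite what production is serving. The following Python sketch shows one way to think about it, assuming a toy in-memory registry. The class `ModelRegistry`, its methods, and the names "alice" and "bob" are hypothetical and are not taken from the talk or from any real library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

RunKey = Tuple[str, int]  # (namespace, run number)


@dataclass
class ModelRegistry:
    """Toy registry: every run is kept, and production is only a pointer."""

    _runs: Dict[RunKey, Any] = field(default_factory=dict)
    _production: Optional[RunKey] = None

    def save_run(self, namespace: str, model: Any) -> RunKey:
        """Store a new run in the caller's namespace without overwriting older runs."""
        run_number = 1 + sum(1 for ns, _ in self._runs if ns == namespace)
        key = (namespace, run_number)
        self._runs[key] = model
        return key

    def promote(self, key: RunKey) -> None:
        """Point production at an existing run; this is the only way to change it."""
        if key not in self._runs:
            raise KeyError(f"unknown run: {key}")
        self._production = key

    def production_model(self) -> Any:
        if self._production is None:
            raise LookupError("no run has been promoted yet")
        return self._runs[self._production]


if __name__ == "__main__":
    registry = ModelRegistry()
    first = registry.save_run("alice", {"threshold": 0.5})
    registry.promote(first)
    registry.save_run("bob", {"threshold": 0.9})    # Bob experiments freely
    registry.save_run("alice", {"threshold": 0.6})  # Alice iterates on a new version
    print(registry.production_model())  # {'threshold': 0.5}
```

Because saving a run and promoting a run are separate operations, anyone can iterate in their own namespace while the production pointer stays where it was.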
Tuulos says that when he and his team checked the code, they found that 60 percent of it was related to infrastructure and only 40 percent was related to data science. Pondering the questions above, they realized that they were missing a piece of infrastructure. Recognizing the cost of this gap, they engineered their own solution: Metaflow, which links all these different technologies by wrapping around them.
Prior to building Metaflow, Netflix’s average time from project idea to deployment was four months. Now, the median is just one week, which shows that good ML infrastructure enables ML teams to iterate much more quickly. This gives a significant advantage to fast-moving companies like Netflix.
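To illustrate what it means for a layer to wrap the infrastructure so that the data science code stays small, here is a rough Python sketch. It is not Metaflow’s API: the `run_flow` function, the step functions, and the sample tweets are all hypothetical. The runner takes care of run IDs and artifact snapshots, and the steps contain only the data science logic.

```python
import copy
import uuid
from typing import Any, Callable, Dict, List

Step = Callable[[dict], dict]


def run_flow(steps: List[Step]) -> Dict[str, Any]:
    """Run steps in order and snapshot the state after each one.

    The runner, not the step code, handles run IDs and artifact storage.
    """
    run_id = uuid.uuid4().hex[:8]
    state: dict = {}
    snapshots: Dict[str, dict] = {}
    for step in steps:
        state = step(state)
        # Deep copy so that later steps cannot mutate an earlier snapshot.
        snapshots[step.__name__] = copy.deepcopy(state)
    return {"run_id": run_id, "snapshots": snapshots, "result": state}


# Everything below is the "data science" part and has no infrastructure code.
def load_data(state: dict) -> dict:
    return {**state, "tweets": ["great show", "boring episode", "great cast"]}


def score_sentiment(state: dict) -> dict:
    # Stand-in for a real model: count the tweets that contain "great".
    positive = sum("great" in tweet for tweet in state["tweets"])
    return {**state, "positive_share": positive / len(state["tweets"])}


if __name__ == "__main__":
    run = run_flow([load_data, score_sentiment])
    print(sorted(run["snapshots"]))                   # ['load_data', 'score_sentiment']
    print(round(run["result"]["positive_share"], 2))  # 0.67
```

In a real system the runner would also handle scheduling, compute, and persistent storage, which is exactly the kind of code that made up the infrastructure share Tuulos describes.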
Don’t underestimate the power of ML infrastructure
Using ML infrastructure doesn’t need to be complicated. While building your own would once have been a huge challenge, modern ML platforms make it easy to set up, and it’s only going to become easier as these platforms continue to improve their products.
If you take anything away from Tuulos’ talk, it’s that this infrastructure is becoming increasingly important for modern ML development. This is because it helps ML teams to iterate and get their models into production sooner, which is critical when models constantly need to be re-trained in response to new data.
Even if you’re already using tooling to get your models into production, remember that using a separate tool for each step in your workflow isn’t enough. As we have just explored, by deploying dedicated ML infrastructure, Netflix cut its time from idea to deployment from four months to one week.
If you want to do the same, your best option is to use a standardized platform that supports all the tools, languages, and frameworks that you use in your workflows. That’s where Qwak comes in.
Qwak is a managed platform that unifies ML engineering and data operations, providing agile infrastructure that enables businesses to continuously productionize their ML models at scale. If you’re interested in learning more, check out our platform here.