Data-centric AI and how to apply it to model development
Two things form the foundations of an AI system: code and data. Both components play an important role in the development of robust architectures and systems. In recent years, however, we’ve seen a shift away from the model-centric approach as the core focus of AI development in favor of the data-centric approach.
The model-centric approach
The model-centric approach focuses on developing experimental research to improve ML model performance. ML teams historically achieved this by selecting the best model architecture and training process from a range of possibilities to develop their own model. Under this approach, data is static; ML teams focus on improving code and model architecture.
Although this approach was considered the optimal one for quite some time, there has been a seismic shift in the way we view and work with data in recent years. As a result, the data-centric approach to AI development has overtaken the model-centric approach as the preferred one.
What is the data-centric approach?
Although the underlying code that sits at the core of an AI system is *obviously* important, we’re currently living in an age where data is at the core of every single decision-making process.
Due to the increasingly important role that data plays in our digitally-connected world, we have progressively moved from a model-centric toward a data-centric approach. This approach prioritizes the systematic refinement of datasets in order to increase the accuracy of ML models and, by extension, AI systems.
You can distinguish the two like this: the model-centric approach treats refining code as the core objective for improving AI systems, while the data-centric approach treats refining datasets as the core objective for building more powerful AI systems.
Data-centric AI, then, is any system that has been developed with a focus on data instead of code. In a data-centric approach, ML teams spend more of their time managing, labeling, augmenting, and curating their datasets, while the ML model itself remains largely unchanged.
The key takeaway here is that data is the core arbiter of success or failure and is the main focus of development as a result. At the same time, it’s important to remember that this is not an either/or situation; successful AI systems require both strong, well-built models and good data. The challenge for ML teams is achieving the right balance.
How to develop more data-centric AI
Although ML teams spend a lot of their time on data preparation and other related tasks, there are other factors to consider when trying to build data-centric systems. Here are three rules of thumb to follow for data-centric AI.
1. Datasets should be created with domain experts
Your dataset should be defined iteratively with relevant business domain experts. ML teams commonly (and wrongly) believe that the role of data teams is to simply take a dataset, work a little bit of magic on it, and that’s it. This is not the case. Data teams are experts in representing your world in a format that enables the machine to learn patterns; they’re not experts in the business domain.
You therefore need to know how the reality of the business problem is represented in the dataset. Let’s put together a quick example and imagine that a company’s product director wants to detect cars from images and asks an ML team to build a model. Without any further input from the product director, the ML team could develop a training set that only includes actual cars. However, perhaps the product director also wanted to identify products that look like or are shaped like cars.
With supervised learning, you need to define the inputs and outputs explicitly. In this example, those are the images in the dataset (the input) and their labels (the output). With unsupervised learning, on the other hand, there are no labels to lean on, so it’s even more important that your problem is well-defined and represented by the data, i.e., by only having cars in the images and nothing else. In practice, this is unrealistic because images of cars typically include other objects.
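To make the input/output definition concrete, here is a minimal sketch of how the car example could be written down as a label schema. The label names, the `LabeledImage` type, and the file paths are all hypothetical; the actual label set is exactly what should be agreed with the product director and other domain experts.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # Hypothetical label set; the real one should come from domain experts.
    CAR = "car"
    CAR_SHAPED_PRODUCT = "car_shaped_product"
    OTHER = "other"


@dataclass(frozen=True)
class LabeledImage:
    image_path: str  # the input
    label: Label     # the output the model should learn to predict


def label_counts(examples):
    """Count examples per label, keeping labels that have no examples yet."""
    counts = {label: 0 for label in Label}
    for example in examples:
        counts[example.label] += 1
    return counts


# Made-up paths, for illustration only.
training_set = [
    LabeledImage("images/0001.jpg", Label.CAR),
    LabeledImage("images/0002.jpg", Label.CAR),
    LabeledImage("images/0003.jpg", Label.CAR_SHAPED_PRODUCT),
]
counts = label_counts(training_set)
```

Listing every label, including those with a count of zero, makes it easy to spot a class that the domain experts care about but the dataset does not yet cover.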
The best way to start is by defining the problem that you’re trying to solve together with data scientists and domain experts, then creating feedback loops and iterating on the data as you progress through the dataset.
2. Datasets should be sufficient
Models need access to sufficient data to learn patterns and cancel out the inevitable noise that’s present in real life.
Let’s return to the example of the cars and say that you’ve got a dataset that consists of 10 images. Seven of these images include cars whereas the other three include cars that have bicycles on top. If you train your model using this dataset, then the model will learn a world in which abnormalities occur 30 percent of the time. By leaving out the noise, however, you might end up with a model that has too narrow a scope to operate in real life because it won’t know how to cope with abnormalities. The best solution here is to have more “normal” examples for the model to learn from.
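The arithmetic behind this example can be sketched in a few lines. The function names and the 5 percent target below are assumptions made for illustration; the right target rate is something to settle with domain experts.

```python
import math
from fractions import Fraction


def abnormal_share(normal, abnormal):
    """Fraction of the dataset made up of abnormal examples."""
    return Fraction(abnormal, normal + abnormal)


def extra_normal_needed(normal, abnormal, target_share):
    """Smallest number of extra normal examples so that abnormal/total <= target_share."""
    needed_total = math.ceil(Fraction(abnormal) / target_share)
    return max(0, needed_total - (normal + abnormal))


# Seven plain cars, three cars with bicycles on top.
share = abnormal_share(normal=7, abnormal=3)

# Hypothetical target: abnormal images should make up at most 5% of the data.
extra = extra_normal_needed(normal=7, abnormal=3, target_share=Fraction(1, 20))
```

Here `share` is 3/10, and `extra` comes out at 50: with 57 normal images and the same 3 abnormal ones, abnormalities make up 3 out of 60 images, or 5 percent.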
How many, then, is enough? This depends entirely on the problem that you are trying to solve. For some problems, a 50/50 chance might be sufficient whereas other problems might need a 99/1 chance. Deciding what is enough is, again, something that needs to be done in collaboration with domain experts.
3. Datasets should be representative
Even if you’ve got what looks like the perfect dataset that represents the problem you’re trying to tackle, you still need to make sure that it’s also representative of the real world; it’s not enough for it to only be representative of your problem.
Furthermore, even if your dataset is representative of the real world now, that’s not to say that it will be further down the line. A dataset that’s representative on day zero might not be on day 100, and this is why checking for data drift is an important, continuous process. Once a change in the underlying data occurs, it’s important to create a new version of your training dataset and re-train your models.
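One way to check for drift on a single numeric feature is the Population Stability Index (PSI), which compares how values are distributed across bins in two samples. The sketch below is one possible implementation, not a prescribed method: the function name, the bin count, the smoothing constant, and the 0.2 threshold are all assumptions, and the sample data is made up.

```python
import math

# Hypothetical cutoff; a value to tune per feature, not a standard.
DRIFT_THRESHOLD = 0.2


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; 0.0 means identical bin shares."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values):
        counts = [0] * bins
        for value in values:
            # Values outside the reference range fall into the first or last bin.
            index = min(max(int((value - lo) / width), 0), bins - 1)
            counts[index] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


day_zero = [float(x) for x in range(100)]        # made-up feature values
day_100 = [float(x) + 50.0 for x in range(100)]  # same feature, shifted upward

needs_retraining = population_stability_index(day_zero, day_100) > DRIFT_THRESHOLD
```

In this toy case the shifted sample scores well above the threshold, so `needs_retraining` is `True`. In practice a check like this would run on a schedule, and a positive result would trigger the new dataset version and re-training described above.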
Applying a data-centric approach to AI development
We’ve talked a lot about how to develop more data-centric AI, but how can you apply the principles of a data-centric approach to AI development?
By leveraging MLOps best practices
Data-centric AI focuses on spending more time on data than on the model. Time spent on model improvement includes model selection, experiment tracking, hyperparameter optimization, and model deployment. Automating and streamlining these lifecycle processes therefore plays a very important role in any data-centric approach. You can read more about the principles of MLOps on our blog.
Use tools to improve data quality
There are three “pillars” of quality when it comes to data:
- Quality of data labels: Labels provide information about the content of data, and it’s important for algorithms to train on accurately and consistently labeled data.
- Representative data: Gaps and missing information in data can lead to inaccurate results. It’s important to have training data that contains enough data points for different classes and accurately represents the real world.
- Unbiased data: Building AI systems involves human decision-making in areas like data collection and labeling, which naturally leads to biases, and the outcome of AI models will reflect these. While it’s impossible to totally eliminate bias, you can minimize it with careful design.
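As an illustration of the first pillar, label consistency can be measured by having two annotators label the same items and computing Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a minimal sketch using made-up annotations; the function name and the sample labels are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their own label rates.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Made-up annotations for four images.
annotator_1 = ["car", "car", "other", "other"]
annotator_2 = ["car", "other", "other", "other"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

The two annotators agree on three of four images, but half of that agreement is expected by chance, so `kappa` is 0.5. A low score is a signal to revisit the labeling guidelines with domain experts before training on the data.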
Involve domain expertise
We have already mentioned this, but it’s so important that it’s worth bringing up again. You must create datasets with domain knowledge; this is essential for a data-centric approach.
Different industries, business functions, and even problems within the same domain can easily have intricacies that will go unnoticed by data teams. Domain expertise can therefore provide the ground truth for a specific business use case where you’re trying to apply AI and determine if the dataset accurately represents the problem that the system is being designed to solve.
For example, if you’re developing an AI system for the predictive maintenance of a large solar deployment, you will need engineers, product experts, and maintenance workers in addition to the data and ML teams that are building the model. This is because these domain experts are well suited to bridging the gap between their own knowledge and the existing knowledge of your teams.
Ready to adopt a data-centric approach to AI?
Qwak is the full-service machine learning platform that enables teams to take their models and transform them into well-engineered products. Our cloud-based platform removes the friction from ML development and deployment while enabling fast iterations, limitless scaling, and customizable infrastructure.
Want to find out more about how Qwak could help you deploy your ML models effectively? Get in touch for your free demo!