Data Engineering vs Feature Engineering: What Every Data Scientist Should Know

Discover the essential difference between data engineering and feature engineering. Learn why it matters for data scientists.
Hudson Buzby
Solutions Architect at Qwak
July 23, 2024

As more and more organizations introduce machine learning and large language models into their workflows and engineering environments, we are starting to see a shift of focus toward the underlying data sources that power and enrich these models. A strong data foundation is essential for any organization looking to implement machine learning and maximize the benefits of generative AI. However, a healthy data environment has many components, and it’s important to distinguish between them clearly so you can focus your efforts where they deliver the most engineering value.

Two methodologies are often compared and confused: data engineering and feature engineering. While the two share many similarities, and feature engineering is a subset of data engineering, they mean distinctly different things. Understanding the nuances helps you structure your product roadmaps properly, hire the right people, and find the right team when something breaks in your machine learning model. In this blog, we’ll define data engineering and feature engineering, distinguish between them, and walk through an example implementation of both processes and how they interact and coexist.

Data Engineering vs Feature Engineering - What is Data Engineering

Data engineering is the practice of ingesting data from various sources into a storage layer, transforming the data into a usable, logical structure, and making that data available to be used by downstream customers for analysis, reporting, or machine learning. At its core, data engineering involves designing, building, and maintaining robust data pipelines that ensure data flows efficiently and reliably from source systems to storage and processing destinations. 

A common framework that is often associated with data engineering is ETL, or extract, transform, and load. There are other frameworks such as ELT or reverse ETL, but at the core, they more or less all follow the same principles. 

For the extraction component, data engineering involves writing data jobs or pipelines that ingest data from some source. This source could be external, like a platform API, or internal, consuming data from your organization's event stream, changelog, or database. Extraction sources take many forms: REST APIs, S3, Kafka streams, relational databases, files, FTP servers, or data integration platforms such as Fivetran or Airbyte. No matter the source, pipelines need to be developed that pull data on a schedule and write it to a storage location. In recent years, this storage layer has typically taken the form of a data lake, built on object storage, relational databases, or cloud data warehouses such as Snowflake, Redshift, or BigQuery.
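To make the extraction step concrete, here is a minimal sketch of a scheduled pull from a REST API into an S3-backed raw zone. The endpoint, bucket name, and response shape are hypothetical, and a production job would add retries, incremental cursors, and schema checks.

```python
import json
from datetime import datetime, timezone

import boto3
import requests

API_URL = "https://api.example.com/v1/events"  # hypothetical source API
BUCKET = "my-data-lake-raw"                    # hypothetical raw-zone bucket

def extract_to_s3(run_date: str) -> str:
    """Pull one day's worth of records from the source API and land them in S3."""
    records = []
    page = 1
    while True:
        resp = requests.get(API_URL, params={"date": run_date, "page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("results", [])  # assumed response shape
        if not batch:
            break
        records.extend(batch)
        page += 1

    # Write raw, unmodified records to the "raw" zone of the data lake,
    # partitioned by ingestion date so downstream jobs can reprocess easily.
    key = f"events/ingest_date={run_date}/part-0001.json"
    body = "\n".join(json.dumps(r) for r in records)
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key

if __name__ == "__main__":
    extract_to_s3(datetime.now(timezone.utc).strftime("%Y-%m-%d"))
```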

Once the data is persisted in a storage layer or data lake, the transformation process can begin. Transformation can mean many different things, but ultimately it entails altering the data to make it usable for analysis, reporting, or downstream processing. This could include cleaning, restructuring the data, mapping it to internal logical layers, unifying schemas, or parsing unstructured sources like documents, HTML, or JSON into a usable form.

The transformation itself can run in many places: distributed compute platforms like Apache Spark, cloud data warehouses like Snowflake or BigQuery, or, if the data is small enough, simple Python or Scala scripts. At the end of transformation, the data should resemble the schema or format of the expected output report or analysis view.
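As a rough illustration, a transformation job in PySpark might look something like the sketch below; the paths and column names are made up, and a real job would encode your own cleaning and schema rules.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clean_events").getOrCreate()

# Read the raw zone written by the extraction job (paths are hypothetical).
raw = spark.read.json("s3://my-data-lake-raw/events/")

# Typical transformation work: drop malformed rows, normalize types and
# values, and unify the schema expected by downstream consumers.
clean = (
    raw
    .filter(F.col("event_id").isNotNull())
    .withColumn("event_time", F.to_timestamp("event_time"))
    .withColumn("country", F.upper(F.trim(F.col("country"))))
    .select("event_id", "event_time", "user_id", "country", "event_type")
    .dropDuplicates(["event_id"])
)

# Persist the cleaned table to the curated zone, partitioned for efficient reads.
clean.write.mode("overwrite").partitionBy("event_type").parquet(
    "s3://my-data-lake-curated/events/"
)
```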

The last stage of data engineering, loading the data, traditionally involved exporting the resulting data into a separate OLAP database where it could then be reviewed, analyzed, exported and processed. However, with the advent of data lake architectures and platforms like Snowflake, the analytics database is often the same as the transformation database, and there is no need to export the data to another system. The final schema or data model of the tables at the load stage should be clean, make logical sense, and fit the needs of the business intelligence or reporting use case. There can still be outbound, on the fly transformations applied to the data for query or analytics purposes, but at this stage, the data should tell a complete story and answer questions without complex joins, mapping or downstream transformations. 
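A minimal sketch of that load stage, using DuckDB purely as a lightweight stand-in for a warehouse like Snowflake or BigQuery, might materialize a denormalized reporting table like this (the table and column names are illustrative):

```python
import duckdb

con = duckdb.connect("analytics.duckdb")

# The "load" stage here is simply materializing a clean, denormalized reporting
# table on top of the curated data, so analysts can query it without joins.
con.execute("""
    CREATE OR REPLACE TABLE daily_events AS
    SELECT
        CAST(event_time AS DATE)  AS event_date,
        country,
        event_type,
        COUNT(*)                  AS events,
        COUNT(DISTINCT user_id)   AS unique_users
    FROM read_parquet('curated/events/*.parquet')
    GROUP BY 1, 2, 3
""")
```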

To summarize, data engineering is a multi-stage pipeline that aggregates data from different sources, regardless of the format or structure, consumes them into a storage layer, applies transformations to make the data logical and usable, and then exposes the data to be analyzed, reported on, or consumed by downstream processes.


Data Engineering vs Feature Engineering - What is Feature Engineering

Feature engineering is a critical step in the machine learning pipeline that involves the development and curation of features, or inputs, to be used in the construction, enhancement and deployment of machine learning models. The process of feature engineering requires a deep understanding of both the data and the problem domain to identify the most relevant attributes that can enhance the model’s ability to make accurate predictions. Feature engineering encompasses a variety of techniques, including scaling, encoding categorical variables, creating interaction terms, and extracting useful information from timestamps or text data.
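As a rough sketch of those techniques, the example below uses pandas and scikit-learn on a made-up dataset to scale numeric columns, one-hot encode a categorical column, build a simple interaction term, and extract features from a timestamp. The column names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw input; in practice this would come from the data lake.
df = pd.DataFrame({
    "signup_ts": pd.to_datetime(["2024-01-03 09:15", "2024-02-11 22:40"]),
    "plan": ["basic", "pro"],
    "monthly_spend": [19.0, 99.0],
    "sessions_per_week": [2, 14],
})

# Extract useful signal from the timestamp.
df["signup_hour"] = df["signup_ts"].dt.hour
df["signup_dayofweek"] = df["signup_ts"].dt.dayofweek

# A simple interaction term.
df["spend_per_session"] = df["monthly_spend"] / df["sessions_per_week"].clip(lower=1)

# Scale the numeric features and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["monthly_spend", "sessions_per_week",
                               "spend_per_session", "signup_hour", "signup_dayofweek"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

features = preprocess.fit_transform(df)
```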

In practice, feature engineering is part science and part witchcraft. It usually takes iteration and experimentation to uncover hidden patterns and relationships within the data. For instance, a data scientist might transform raw sales data into features such as average purchase value, purchase frequency, or customer lifetime value, which can significantly boost the performance of a churn prediction model. By thoughtfully engineering features, practitioners provide machine learning models with the most informative inputs, ultimately leading to better accuracy and more robust predictions.
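A toy version of that sales-data example might look like the following, with a made-up orders table and deliberately simplified feature definitions:

```python
import pandas as pd

# Hypothetical raw order-level sales data.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_ts": pd.to_datetime([
        "2024-01-05", "2024-03-20", "2024-01-10",
        "2024-02-01", "2024-04-15", "2024-02-09",
    ]),
    "order_value": [40.0, 60.0, 25.0, 30.0, 35.0, 120.0],
})

# Roll the raw transactions up into per-customer churn features.
span_days = orders.groupby("customer_id")["order_ts"].agg(
    lambda s: (s.max() - s.min()).days + 1
)
features = orders.groupby("customer_id").agg(
    avg_purchase_value=("order_value", "mean"),
    total_orders=("order_value", "size"),
    lifetime_value=("order_value", "sum"),
)
features["purchase_frequency"] = features["total_orders"] / span_days

print(features)
```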

While feature engineering may take the form of a dedicated pipeline that ingests, transforms, and stores data on a schedule, it can also be less structured. Much of feature engineering takes place in one-off notebooks as data scientists test and curate new features to see which yield the best results in the model. If the model requires fresh features at inference time, or retraining happens on a schedule, it likely makes sense to build a dedicated feature engineering pipeline to supply the model with recent data. For experimentation purposes, though, the process usually involves querying a data lake for the relevant source data, manipulating, cleaning, and formatting it for the model, and selecting the set of features that will be used moving forward.


Data Engineering vs Feature Engineering

As we mentioned above, feature engineering is certainly a subset of data engineering. It involves ingesting data from a source, applying a series of transformations, and making the final result available to be queried by a model for training purposes. You can construct feature engineering pipelines to resemble data engineering pipelines, with schedules, defined source and sink destinations, and availability for querying. However, this configuration really only applies once you have moved past the experimentation stage and determined a need for a consistent flow of new feature data.

Functionally, there is nothing that differentiates data from features: features are data points. Where feature engineering and data engineering really differ is in the objectives and motivations behind the pipelines. In general, data engineering serves a broader, more unified purpose than feature engineering. Data engineering platforms are built to be flexible and universal, ingesting various types and sources of data into a unified storage location where any number of transformations and use cases can be applied. The intent of a well-constructed fact table or gold layer in a data lake is to provide a single source of truth that answers many different questions, produces many reports, and can be consumed by many downstream customers.

And in practice, an organization’s data engineering team is responsible for curating and maintaining all data pipelines, not just those related to machine learning. These pipelines may power BI dashboards used by the C-suite, auditing reports that feed payroll, or event logs that show a user’s history of actions within the application.

Feature engineering, on the other hand, serves a specific purpose: finding the tailored inputs and columns that will generate the best predictive results for a machine learning model. Data scientists and machine learning engineers are not tasked with developing a universal data model that ingests every data point across an organization; they just need to select, curate, and clean the data needed to power their models. As machine learning teams grow and incorporate more and more data sources into their models, their feature engineering platform may start to resemble a larger data engineering platform in the tools and methodologies it employs. But the intent is not to establish flexible data models for use throughout the organization - it is simply to power their machine learning models.

Examples

Data Engineering Perspective

Let’s assume we are a direct-to-consumer (DTC) rubber duck company. We run ads on Google, TikTok, and Facebook advertising our product. We’ve been given a directive from top management to determine which ads are most effective, measured by clicks, views, and impressions. We run the same photo and video ads across each platform, and we want to see which specific ads perform best. Our end goal is a dashboard that our CMO can use to track the best-performing ads so we can decide which content should continue to be used and promoted.

We’ll start by ingesting data from the Google, TikTok, and Facebook ads APIs. This will be our raw data and will provide us with the clicks, views, and impressions per platform. We will ingest this data into our data lake.
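A minimal sketch of that ingestion job is shown below. The fetcher functions are stubs standing in for the real platform clients (Google Ads, TikTok Marketing API, Facebook Marketing API), and the lake layout is illustrative.

```python
import json
from pathlib import Path

# Stub fetchers standing in for the real platform clients. Each should return
# one row per ad per day with clicks, views, and impressions.
def fetch_google_ads(day):
    return [{"platform_ad_id": "g-123", "clicks": 40, "views": 900, "impressions": 5000}]

def fetch_tiktok_ads(day):
    return [{"platform_ad_id": "t-456", "clicks": 75, "views": 2100, "impressions": 8000}]

def fetch_facebook_ads(day):
    return [{"platform_ad_id": "f-789", "clicks": 55, "views": 1500, "impressions": 6500}]

PLATFORMS = {"google": fetch_google_ads, "tiktok": fetch_tiktok_ads, "facebook": fetch_facebook_ads}

def ingest_raw_ads(day: str, lake_root: str = "lake/raw") -> None:
    """Land one raw file per platform per day, preserving the source shape."""
    for platform, fetch in PLATFORMS.items():
        rows = fetch(day)
        out = Path(lake_root) / f"platform={platform}" / f"date={day}.json"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n".join(json.dumps(r) for r in rows))

ingest_raw_ads("2024-07-01")
```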

Next, we’ll begin our transformation layer. We want to find the best-performing ads across all three platforms, so we need a way to bring these three different tables into one. Let’s assume we have a CMS platform that provides a unique internal ID for each ad, along with a mapping table that links the ad ID on each platform to our centralized ad ID in the CMS. In this process we’ll also clean names, metadata, and other categorical information so they are consistent across all ads.
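For one platform, that mapping and cleaning step might look roughly like this pandas sketch; the IDs, column names, and mapping table are made up for illustration.

```python
import pandas as pd

# One platform's raw rows, plus the CMS mapping table that links each
# platform-specific ad ID to our internal ad ID (all values are illustrative).
tiktok = pd.DataFrame({
    "platform_ad_id": ["t-456", "t-457"],
    "ad_name": [" Duck Hero VIDEO ", "duck bath photo"],
    "clicks": [75, 12], "views": [2100, 300], "impressions": [8000, 1200],
})
ad_map = pd.DataFrame({
    "platform_ad_id": ["t-456", "t-457"],
    "cms_ad_id": ["duck-hero-video", "duck-bath-photo"],
})

cleaned = (
    tiktok
    .merge(ad_map, on="platform_ad_id", how="left")  # attach the internal CMS ad ID
    .assign(
        platform="tiktok",
        ad_name=lambda d: d["ad_name"].str.strip().str.lower(),  # consistent naming
    )
)
```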

Lastly, we’ll join all these clean tables together to have one unified table that contains individual click, view, and impression data per ad. We can now expose this table to be queried by BI analysts so they can summarize and aggregate to create the specific views needed for the dashboard. We’ll configure all of this to run daily using Airflow or another scheduler, and we’ll have a continuously updating feed of ad data. We can also expose and apply this table within the organization for other use cases such as ad costs, conversion rates, or accounting. 
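A minimal Airflow sketch of that daily schedule could look like the following. The DAG ID is arbitrary, and the task callables are placeholders standing in for the extraction, mapping, and join logic described above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in a real project these would be imported from the
# pipeline package that implements the steps sketched earlier.
def extract_ads(**context):
    print("pull raw ad data from Google, TikTok, and Facebook")

def clean_and_map_ads(**context):
    print("map platform ad IDs to CMS ad IDs and normalize metadata")

def build_unified_ads_table(**context):
    print("join cleaned tables into one unified ad performance table")

with DAG(
    dag_id="ads_performance_daily",
    start_date=datetime(2024, 7, 1),
    schedule_interval="@daily",  # refresh the unified ad table once per day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_ads", python_callable=extract_ads)
    transform = PythonOperator(task_id="clean_and_map_ads", python_callable=clean_and_map_ads)
    load = PythonOperator(task_id="build_unified_ads_table", python_callable=build_unified_ads_table)

    extract >> transform >> load
```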

Feature Engineering Perspective

On a different team at our duck-as-a-service company, we have interpreted the results from the data engineering platform and determined that some ads are more likely to drive conversions. We’d like to create a model that predicts which ad should be displayed to a user, so we can send a one-time promotional email featuring the ad most likely to drive product conversion and revenue.

To start, we can use the ad data from the transformation layer of the previous step. We don’t need to re-ingest the platform-specific ad data, as it has already been stored and cleaned for us by the data engineering team. We won’t use the final summary table, because we want our model to be specific to each platform. Within the dataset, we’ll select and experiment with different column combinations that we may not have needed for the summary dashboard: event_time, ip_address, zip_code, country, page_of_event, gender, age. In a notebook, we manipulate, clean, and transform these features until we find a subset that produces a model with strong ad predictions.
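A notebook-style sketch of that experimentation might look like this; the rows are made up, and the derived features are just examples of the kind of encoding and bucketing a data scientist might try.

```python
import pandas as pd

# A tiny, made-up slice of the curated ad-event data prepared by the data
# engineering team; in a notebook this would be a query against the data lake.
events = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-07-01 09:10", "2024-07-06 21:45", "2024-07-02 13:30"]),
    "country": ["US", "US", "DE"],
    "page_of_event": ["home", "checkout", "product"],
    "gender": ["f", "m", "f"],
    "age": [31, 47, 24],
    "cms_ad_id": ["duck-hero-video", "duck-bath-photo", "duck-hero-video"],
    "converted": [1, 0, 1],
})

# Notebook-style experimentation: derive and encode a few candidate features.
events["event_hour"] = events["event_time"].dt.hour
events["is_weekend"] = events["event_time"].dt.dayofweek >= 5
events["age_bucket"] = pd.cut(events["age"], bins=[0, 25, 40, 60, 120],
                              labels=["<25", "25-40", "40-60", "60+"])

features = pd.get_dummies(
    events.drop(columns=["event_time"]),
    columns=["country", "page_of_event", "gender", "age_bucket"],
)
```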

Now that we have our features in place, we select as much training data as we have access to, train our model, build our artifact, store the weights in a model registry, generate the predictions, and hand the result set off to our marketing team so they can send the email with the proper ad suggestions. 
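A simplified sketch of that training and hand-off step is below, using synthetic data in place of the real feature set and a local joblib file in place of a model registry.

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix; in practice X would be
# the full historical feature set and y the conversion label.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", round(model.score(X_test, y_test), 3))

# Persist the trained artifact; in a real workflow a model registry
# (for example, the one in your ML platform) would replace this local file.
joblib.dump(model, "ad_conversion_model.joblib")

# Score the audience and hand the predictions off to the marketing team.
ad_scores = model.predict_proba(X)[:, 1]
```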

The feature data exists as a one-off, on-the-fly dataset used to generate the model. It does not require the complexity or persistence of a full-scale data engineering pipeline. If the campaign succeeds and shows value from the predictions generated, the logic of this experiment might be reworked and solidified as a scheduled pipeline. But for now, it is a single-purpose exploration that lives within the larger data engineering environment.

Key Takeaways on Data Engineering vs Feature Engineering

Data engineering provides the necessary infrastructure and processes to ensure data is clean, reliable, and accessible, laying the groundwork for effective feature engineering. Feature engineering, in turn, leverages this well-prepared data to create meaningful features that enhance the predictive capabilities of machine learning algorithms. Both are deeply interconnected and equally essential for building robust, high-performing models, but it’s important to know their distinctions and understand where one ends and the other begins.  

With Qwak, you can easily construct and manage feature engineering pipelines that flow into Qwak’s Feature Store and integrate into your machine learning inference environments. Get started today.

Chat with us to see the platform live and discover how we can help simplify your journey deploying AI in production.
