Uncovering the Power of Real-Time Feature Engineering in Machine Learning

Explore cutting-edge techniques for real-time feature engineering to enhance your ML models' performance and efficiency.
Grig Duta
Solutions Engineer at Qwak
September 16, 2024

Feature engineering is shifting gears from batch to real-time, and it's happening fast. Traditionally, data scientists primarily used batch processing, which was a natural phase in the field's development given the computational constraints and the nature of early big data systems. This approach involved collecting data, processing it offline, and then repeating the cycle. While effective for its time, it was just one step in the ongoing evolution of data processing techniques.

Now, the game's changed. Real-time feature engineering is becoming the norm, driven by the insatiable appetite for instant insights and the competitive edge they bring. Companies aren't content waiting hours or days for data to be crunched - they want value extracted from their streams as events unfold. This shift to streaming feature engineering isn't just a trend; it's a response to the velocity and volume of modern data. It demands a new approach, new tools, and a different mindset.

In this article, we'll dive into the nitty-gritty of building a streaming feature engineering pipeline using JFrog ML, leveraging Pandas-based UDFs in Spark as well as streaming aggregations. We'll show you how to set up an end-to-end system that can handle the firehose of data coming your way, transforming it into valuable features in real-time.

From Batch to Real-Time Feature Engineering

Feature engineering has its roots in the batch processing paradigm. This approach involved collecting large volumes of historical data, processing it offline in scheduled intervals, and using the results to train models. Data scientists would craft features using complex SQL queries, MapReduce jobs, or specialized ETL tools. While effective, this method had inherent limitations - primarily, the lag between data generation and feature availability.

The landscape began to shift as businesses realized the immense value locked in real-time data streams. Real-time feature engineering emerged as a response to this need, allowing for the continuous computation of features as data arrives. This approach enables models to operate on the most current data, dramatically reducing latency between event occurrence and model reaction.

Batch vs Near Real Time (streaming) processing [source]

Key Drivers of the Real-Time Machine Learning Shift

Several factors have accelerated the move towards real-time feature engineering:

1. IoT Explosion: The proliferation of Internet of Things (IoT) devices has led to a deluge of streaming data. These devices generate continuous streams of sensor data, requiring real-time processing to extract meaningful features.

2. Streaming Data Platforms: The maturation of stream processing platforms like Apache Flink, Apache Spark Structured Streaming, and Apache Kafka Streams has made it feasible to process high-velocity data streams at scale.

3. Instant Decision Needs: Many modern applications, from fraud detection to personalized recommendations, require near real-time decisions. Real-time feature engineering enables these systems to make informed choices based on the most up-to-date information.

4. Competitive Advantage: In fast-moving markets, the ability to react instantly to changing conditions can be a significant differentiator. Real-time feature engineering provides the foundation for such agility.

5. Cloud Computing: The widespread adoption of cloud platforms has made it easier and more cost-effective to deploy and scale real-time processing systems.

Keep in mind that this shift isn't just a technological upgrade - it's a fundamental change in how we approach data and extract value from it. Real-time feature engineering is enabling a new class of applications and insights that were previously impractical or impossible. As we move forward, the ability to engineer features in real-time is becoming less of a luxury and more of a necessity for staying competitive in data-driven industries.

The shift to real-time feature engineering, while powerful, isn't without its challenges, which we’ll explore in more detail in the following sections.

Core Concepts in Real-Time Feature Engineering

Real-time feature engineering is all about processing and transforming data on the fly, as it streams in. Unlike batch processing, where we have the luxury of working with static datasets, real-time systems deal with a continuous flow of data. This fundamental difference shapes everything from our architecture to our algorithms.

Definition and Characteristics:

At its core, real-time feature engineering involves creating or updating features as new data arrives, often within milliseconds. The key characteristics include:

  1. Low latency: Processing must occur in sub-second time frames to ensure that insights derived from the data remain relevant. For instance, in a financial trading system, even a delay of a few seconds in processing market data could lead to significant losses. This need for speed often necessitates the use of specialized hardware and optimized algorithms.
  2. Stateful computations: Many features require the system to maintain state over specific time windows. Consider a feature that tracks the average transaction amount for a user over the past hour. As new transactions occur, the system must continuously update this average, adding new data and removing data that falls outside the time window (see the sketch after this list). This requires efficient memory management and the ability to quickly access and update historical information.
  3. Scalability: Real-time systems must handle sudden spikes in data volume without degradation in performance. For example, an e-commerce platform might experience a surge in traffic during a flash sale. The feature engineering system must scale to process this increased data flow, ensuring that features like real-time product recommendations remain accurate and timely.
  4. Fault tolerance: Real-time pipelines often have stricter requirements for minimizing data loss than batch systems, particularly in critical applications like fraud detection in banking. While batch jobs can typically be restarted without significant impact, real-time systems require robust error handling, data replication, and failover mechanisms to ensure reliability. That said, achieving absolute zero data loss is often impractical, and systems are typically designed to balance data integrity with performance and scalability requirements.
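To make the stateful-computation point concrete, here's a minimal sketch of a sliding-window average in plain Python - the kind of windowed state a streaming engine typically manages for you. The class and values here are purely illustrative:

from collections import deque
from datetime import datetime, timedelta

class SlidingWindowAverage:
    """Tracks the average transaction amount over a rolling time window."""

    def __init__(self, window: timedelta):
        self.window = window
        self.events = deque()  # (timestamp, amount) pairs in arrival order
        self.total = 0.0

    def update(self, ts: datetime, amount: float) -> float:
        # Add the new event, then evict everything older than the window
        self.events.append((ts, amount))
        self.total += amount
        cutoff = ts - self.window
        while self.events and self.events[0][0] < cutoff:
            _, old_amount = self.events.popleft()
            self.total -= old_amount
        return self.total / len(self.events)

# Average spend over the past hour, updated per event
avg_1h = SlidingWindowAverage(window=timedelta(hours=1))
avg_1h.update(datetime(2024, 9, 16, 12, 0), 25.0)   # -> 25.0
avg_1h.update(datetime(2024, 9, 16, 12, 30), 75.0)  # -> 50.0
avg_1h.update(datetime(2024, 9, 16, 13, 15), 10.0)  # 12:00 event evicted -> 42.5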

However, the contrast between real-time and batch feature engineering extends beyond these characteristics. Think of batch systems as marathons - you have time to analyze, optimize, and even re-run if needed. They allow for comprehensive analysis of entire datasets and leave room for complex, CPU-intensive computations.

Real-time is more like a sprint relay - you need to be fast, efficient, and there's no room for do-overs. This necessitates the use of techniques like sliding windows or approximate algorithms. For example, instead of calculating the exact median of a dataset (which would require storing all values), a real-time system might use an approximate algorithm that maintains a reasonably accurate estimate of the median with minimal memory usage.
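For illustration, here's one such approximation: estimating a stream's median from a fixed-size reservoir sample (Vitter's Algorithm R), so memory stays constant no matter how many events arrive. This is a simple sketch of the general idea, not the only (or best) streaming quantile algorithm:

import random

class ApproxMedian:
    """Estimates the median from a uniform random sample of the stream,
    using O(k) memory instead of storing all n values."""

    def __init__(self, k: int = 1024, seed: int = 42):
        self.k = k       # reservoir size
        self.n = 0       # events seen so far
        self.sample = []
        self.rng = random.Random(seed)

    def update(self, value: float) -> None:
        self.n += 1
        if len(self.sample) < self.k:
            self.sample.append(value)
        else:
            # Replace a sampled element with probability k/n
            j = self.rng.randrange(self.n)
            if j < self.k:
                self.sample[j] = value

    def median(self) -> float:
        s = sorted(self.sample)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2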

Types of Features Suitable for Real-Time Engineering:

  1. Time-based features: such as the time since a user's last login or the frequency of actions within a recent time frame, are common.
  2. Sliding window aggregations: allow for computations over recent data, like the average purchase amount in the last hour or the count of events in the past five minutes.
  3. Stateful features: which update based on each new piece of data, might include metrics like customer lifetime value that change with each transaction.
  4. Streaming join features: allowing systems to enrich incoming events with the latest reference data. For instance, a real-time product recommendation system might join incoming user activity data with up-to-date product inventory information to ensure recommendations are both personalized and reflective of current stock levels (see the sketch below).
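Here's what such a streaming join might look like as a stream-static join in Spark Structured Streaming, enriching each activity event with the latest inventory snapshot. The Kafka broker, topic, event schema, and inventory path are all hypothetical placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("stream-join-sketch").getOrCreate()

event_schema = StructType([StructField("user_id", StringType()),
                           StructField("product_id", StringType())])

# Streaming user-activity events (hypothetical topic and broker;
# requires the spark-sql-kafka connector on the classpath)
activity = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user-activity")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Static reference table of current product inventory (hypothetical path)
inventory = spark.read.parquet("s3://example-bucket/inventory/")

# Stream-static join: each event is enriched with current stock levels
enriched = activity.join(inventory, on="product_id", how="left")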

The idea of "feature freshness" is central to real-time systems. Unlike batch processing, where all features for a given sample are typically computed simultaneously, real-time features may have varying update frequencies. Some features might update with every new data point, while others might refresh at set intervals. Managing these differences is key to building effective real-time machine learning systems.

Handling late or out-of-order data presents another significant challenge. In the real world, data is almost guaranteed to arrive out of sequence. Network delays, system failures, or other factors can cause data to arrive late or out of order. Robust real-time feature engineering systems must incorporate strategies to handle these scenarios without compromising feature integrity. This might involve techniques like buffering recent data or using time-based windows that can accommodate late-arriving information.
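One widely used mechanism for this is watermarking. As a minimal Spark Structured Streaming sketch - assuming transactions is a streaming DataFrame with timestamp, cc_num, and amt columns - the engine keeps each window's state open for late events until the watermark passes, then finalizes the aggregate and drops older state:

from pyspark.sql.functions import avg, col, window

hourly_avg = (transactions
    .withWatermark("timestamp", "10 minutes")  # tolerate events up to 10 minutes late
    .groupBy(window(col("timestamp"), "1 hour"), col("cc_num"))
    .agg(avg("amt").alias("avg_amt_1h")))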

It's important to note that not all features are suitable for real-time computation. Some complex features might still require periodic batch updates due to their computational intensity or the need for a complete historical dataset. The art of effective system design often lies in combining real-time and batch processing, frequently leveraging a feature store as a unifying layer. This hybrid approach allows systems to benefit from the immediacy of real-time processing while still incorporating more complex, batch-computed features when necessary.

In the following section, we'll demonstrate a practical example of a feature engineering pipeline for fraud detection using JFrog ML's Feature Store. This example will illustrate how the system's capabilities can be applied to real-world machine learning challenges.

Applying Real-Time Feature Engineering: A Credit Card Fraud Detection Case Study

Now that we've explored the core concepts and technologies behind real-time feature engineering, let's see how these principles can be applied to a real-world problem: credit card fraud detection. This use case exemplifies the need for rapid data processing and decision-making, making it an ideal candidate for real-time feature engineering.

In the following sections, we'll walk through a practical implementation of a real-time feature engineering pipeline for fraud detection. We'll examine how the streaming technologies we've discussed can be leveraged to process transaction data on-the-fly, extract meaningful features, and enable quick identification of potentially fraudulent activities. This case study will demonstrate how real-time feature engineering can significantly enhance the responsiveness and effectiveness of fraud detection systems, potentially saving financial institutions and their customers from substantial losses.

Credit Card Fraud Detection in Real Time [source]

Case Study Overview

Fraud detection in credit card transactions is a critical task for financial institutions to protect users from unauthorized and potentially costly transactions. In this context, real-time detection systems offer a significant advantage by identifying and addressing suspicious activities as they occur, rather than after a delay.

Real-time fraud detection focuses on analyzing incoming transaction data immediately, allowing the system to flag potentially fraudulent transactions within seconds. For example, if a credit card is used for a large purchase in another country shortly after a smaller, local transaction, a real-time system can quickly spot this inconsistency and alert the cardholder or block the transaction.

In contrast, batch processing involves collecting and storing data over a period before analyzing it in bulk. While this method is effective for understanding long-term trends and historical patterns, it lacks the immediacy required to detect fraud in real-time. For instance, while batch processing can reveal that a cardholder's spending has increased over several months, it cannot address a sudden suspicious transaction until the next batch analysis is complete.

By leveraging real-time data processing, fraud detection systems can use features that reflect the immediate context of each transaction, such as the time of day or the geographical location of the transaction. This approach helps in identifying anomalies that deviate from normal behavior patterns more quickly and accurately. In contrast, batch systems might rely more on aggregated data and historical trends, which are useful but do not capture immediate transactional anomalies.

The choice between real-time and batch processing depends on the specific needs of the fraud detection system. Real-time systems are essential for scenarios where quick responses are crucial, while batch systems are valuable for in-depth analysis and trend detection over longer periods.

As we delve deeper, we'll examine the dataset used in this case and explore how its attributes are utilized in real-time feature engineering. Understanding the dataset's characteristics will provide context for the specific features we derive and why these features are essential for effective fraud detection.

Dataset 

For this real-time fraud detection use case, we utilize a dataset sourced from Kaggle, specifically designed for credit card transaction fraud detection. The dataset provides a rich set of features related to individual transactions, making it suitable for our analysis.

(Fraud detection dataset sample)

The dataset includes various attributes of credit card transactions, such as transaction amount, date, time, and geographical information. Key columns in the dataset are:

  • trans_date_trans_time: The date and time when the transaction occurred.
  • amt: The amount of the transaction.
  • lat and long: The latitude and longitude of the transaction location.
  • merch_lat and merch_long: The latitude and longitude of the merchant’s location.
  • trans_num: A unique identifier for the transaction.
  • cc_num: The credit card number used for the transaction.

In real-time fraud detection, certain attributes become particularly crucial. Features like the transaction amount, timestamp, and location are pivotal for identifying suspicious activities. For instance, a sudden high-value transaction or a transaction occurring in an unusual location can be indicative of fraud.

We split the feature engineering into three distinct feature sets to handle different aspects of the data effectively. Each feature set focuses on specific types of features to ensure comprehensive coverage and accuracy in fraud detection.

  1. Transaction Features: This set processes basic transaction attributes and derives time-related features such as the hour and minute of the transaction. These features help in understanding immediate patterns and anomalies in individual transactions.
  2. Credit Card Transactions Aggregated: This set aggregates transaction data for each credit card, capturing metrics like the count, average, and sum of transactions over different time windows. These aggregated features help in identifying trends and anomalies based on historical transaction patterns for each card.
  3. Merchant Aggregated Features: This set focuses on aggregating transaction data at the merchant level, analyzing metrics such as transaction counts and amounts across different time windows. This aggregation helps in detecting anomalies related to specific merchants, which might indicate fraudulent activities.

By processing the dataset through these feature sets, we ensure that the model benefits from both immediate and aggregated features, which collectively enhance the accuracy of fraud detection.

With an understanding of the dataset and feature engineering strategy, we can now dive into the specifics of the feature engineering approach using JFrog ML. This will include an overview of the real-time feature engineering pipeline and how each feature set contributes to the overall fraud detection model.

Feature Engineering Pipelines

For this use case, we employ JFrog ML to manage and transform the data efficiently. JFrog ML provides a robust infrastructure for handling real-time data streams and enables seamless integration with feature engineering pipelines.

JFrog ML is designed to support real-time feature engineering and data transformation at scale. Its capabilities are well-suited for processing streaming data, making it an ideal choice for fraud detection systems that need to analyze transactions as they happen. With Qwak, we can define feature sets that continuously update and provide the necessary features for detecting anomalies and potential fraud in real-time.

For this task, we'll use JFrog ML's Feature Store, which uses the concept of Feature Sets. Each Feature Set encompasses the data transformations from raw data to reusable features, which are stored in the Feature Registry and made available offline for training purposes or online for real-time model calls.

Streaming Feature Engineering Pipeline for Financial Transactions Fraud Detection

1. Transaction Features

Overview

This feature set focuses on extracting and engineering features directly from individual transaction records. This is achieved through a combination of Pandas operations and Pandas-based UDFs within Spark Structured Streaming, as implemented in JFrog ML.

Implementation Details

  • Pandas Operations and UDFs: The feature engineering pipeline begins by applying a series of Pandas operations to the dataframe. These operations include extracting time-based features, such as the hour and minute of the transaction, from the trans_date_trans_time column. Additionally, the Haversine formula is used to compute the geographical distance between the transaction location and the merchant's location.
  • Derived Features: The resulting features include transaction time details, geographical distance, and basic transaction attributes, which are crucial for immediate fraud detection based on individual transaction anomalies.

Code Example


import numpy as np
import pandas as pd
from qwak.feature_store.feature_sets import streaming
from qwak.feature_store.feature_sets.transformations import (Column, Schema, Type, UdfTransformation, qwak_pandas_udf)


# Utility method to extract relevant date and time features
def extract_time_features(df: pd.DataFrame) -> pd.DataFrame:

   df['timestamp'] = pd.to_datetime(df['trans_date_trans_time'])
   df['hour']   = df['timestamp'].dt.hour
   df['minute'] = df['timestamp'].dt.minute

   return df

# Utility method to calculate the Haversine distance between the purchase location and the merchant.
def haversine_vectorized(x_lat, x_lon, y_lat, y_lon):

   # Vectorized Haversine distance as explained here - https://stackoverflow.com/a/51722117
   x_lat, x_lon, y_lat, y_lon = map(np.radians, [x_lat, x_lon, y_lat, y_lon])
   a = (np.sin((y_lat - x_lat) / 2.0) ** 2
        + np.cos(x_lat) * np.cos(y_lat) * np.sin((y_lon - x_lon) / 2.0) ** 2)
   # 6371 km is the Earth's mean radius
   distances = 6371 * 2 * np.arcsin(np.sqrt(a))

   return distances


# Feature Set schema
@qwak_pandas_udf(output_schema=Schema([
   Column(name="trans_num", type=Type.string),
   Column(name="merchant", type=Type.string),
   Column(name="category", type=Type.string),
   Column(name="cc_num", type=Type.string),
   Column(name="distance", type=Type.double),
   Column(name="timestamp", type=Type.timestamp),
   Column(name="hour", type=Type.int),
   Column(name="minute", type=Type.int)])
   )
def feature_engineering_pipeline(df: pd.DataFrame) -> pd.DataFrame:

   df = extract_time_features(df)

   df['distance'] = haversine_vectorized(df['lat'].values,
                                         df['long'].values,
                                         df['merch_lat'].values,
                                         df['merch_long'].values)

   schema_columns = ["trans_num", "merchant", "category", "cc_num", 
                     "distance", "timestamp", "hour", "minute"]

   return df[schema_columns]

# Feature Set metadata
@streaming.feature_set(
   name="transaction-features",
   key="trans_num",
   data_sources=["cc_transactions"],
   timestamp_column_name="timestamp")
def transaction_features():
   return UdfTransformation(function=feature_engineering_pipeline)

Key Features

  • Timestamp: Captures the exact time of the transaction, allowing the system to analyze patterns based on time of day. This can be essential for detecting anomalies such as transactions occurring at unusual hours.
  • Amount: The value of the transaction is a key feature for identifying outliers. For instance, a sudden high-value transaction that deviates from typical spending patterns may be flagged as suspicious.
  • Distance: Calculated using the Haversine formula, this feature measures the geographical distance between the transaction location and the merchant's location. A significant deviation from usual transaction locations can indicate potential fraud.

These features help the fraud detection model assess individual transactions in real-time. For example, if a user typically makes small transactions in their local area and a large transaction occurs in a distant location, the system can quickly identify this as a potential red flag and take appropriate action, such as flagging the transaction or notifying the cardholder.

2. Credit Card Transactions Aggregated 

Overview

The second feature set aggregates transaction data for each credit card over different time windows. This aggregation is performed using JFrog ML's proprietary aggregation algorithms in conjunction with Spark SQL.

Implementation Details

  • Spark SQL Transformations: Spark SQL selects and prepares the raw columns that are then fed into the windowed aggregations on the streaming data. This enables real-time window aggregation over large datasets, providing insights into transaction trends and potential anomalies based on aggregated historical data.
  • Window Aggregation Algorithms: JFrog ML's algorithms efficiently compute various metrics, such as transaction counts, averages, and sums, across specified time windows (e.g., 5 minutes, 1 hour, 1 day). This aggregation helps identify patterns and anomalies based on historical transaction behavior for each credit card.

Code Example


from qwak.feature_store.feature_sets import streaming
from qwak.feature_store.feature_sets.transformations import (QwakAggregation, SparkSqlTransformation)

@streaming.feature_set(
   name="credit_card_transactions_aggregated",
   key="cc_num",
   data_sources=["cc_transactions_stream"],
   timestamp_column_name="timestamp"
)
def transform():
   sql = """  
       SELECT  cc_num, trans_num, amt,
                to_timestamp(trans_date_trans_time, 'yyyy-MM-dd HH:mm:ss') AS timestamp,
      
               -- aggregation required data
               offset, topic, partition
              
       FROM cc_transactions_stream
       """

   return (
       SparkSqlTransformation(sql)
       .aggregate(QwakAggregation.count("trans_num"))
       .aggregate(QwakAggregation.avg("amt"))
       .aggregate(QwakAggregation.sum("amt"))
        .by_windows("5 minutes", "1 hour", "1 day", "7 days")
   )

Key Aggregations

  • Transaction counts
  • Average transaction amounts
  • Sum of transaction amounts

These aggregated features provide a historical context for each credit card's transaction patterns. By comparing current transaction behavior with these aggregated metrics, the system can identify sudden changes or anomalies that may indicate fraudulent activity.
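As a rough illustration of how a downstream model or rule might consume these aggregates at serving time, consider the following sketch; the feature names and threshold are hypothetical, not JFrog ML's actual output naming:

def is_amount_anomalous(current_amt: float,
                        avg_amt_1d: float,
                        txn_count_1d: int,
                        ratio_threshold: float = 5.0) -> bool:
    """Flag a transaction whose amount far exceeds the card's recent
    average (hypothetical rule with illustrative thresholds)."""
    if txn_count_1d < 3:  # too little history to judge reliably
        return False
    return current_amt > ratio_threshold * avg_amt_1d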

3. Merchant Aggregated Features

Overview

The third feature set focuses on aggregating transaction data at the merchant level. Similar to the second feature set, this aggregation is handled using JFrog ML's proprietary algorithms and Spark SQL.

Implementation Details

  • Merchant-Level Aggregations: This feature set calculates metrics such as transaction counts, averages, and sums for each merchant over various time windows. By examining transaction patterns at the merchant level, the system can detect anomalies related to specific merchants, which may indicate fraudulent activities.
  • JFrog ML's Aggregation and SQL Capabilities: Utilizing JFrog ML's aggregation algorithms and Spark SQL, this Feature Set processes and aggregates streaming data efficiently, enabling real-time insights into merchant-level transaction trends and anomalies.

Code Example


from qwak.feature_store.feature_sets import streaming
from qwak.feature_store.feature_sets.transformations import (QwakAggregation, SparkSqlTransformation)

@streaming.feature_set(
   name="merchant_aggregated_features",
   key="merchant",
   data_sources=["cc_transactions_stream"],
   timestamp_column_name="timestamp"
)
def transform():
   sql = """  
       SELECT  merchant, trans_num, amt,
                to_timestamp(trans_date_trans_time, 'yyyy-MM-dd HH:mm:ss') AS timestamp,
      
               -- aggregation required data
               offset, topic, partition
              
       FROM cc_transactions_stream
       """

   return (
       SparkSqlTransformation(sql)
       .aggregate(QwakAggregation.count("trans_num"))
       .aggregate(QwakAggregation.avg("amt"))
       .aggregate(QwakAggregation.sum("amt"))
       .by_windows("1 hour", "1 day", "7 days")
   )
   

These features help in detecting anomalies related to specific merchants, such as unusual transaction volumes or amounts, which may indicate fraudulent activity. By analyzing patterns at the merchant level, the system can identify potential compromises of merchant systems or coordinated fraud attempts targeting specific businesses.

By combining these three feature sets, our fraud detection system gains a comprehensive view of each transaction. It considers individual transaction details, historical patterns for each credit card, and merchant-level trends. This multi-faceted approach improves the system's ability to detect potential fraud in real-time, allowing for quick intervention and prevention of fraudulent activities.

The use of JFrog ML's Feature Store and its advanced streaming capabilities enables us to process and analyze transaction data in real-time, creating a robust and responsive fraud detection feature serving system. This approach demonstrates the power of real-time feature engineering in addressing complex, time-sensitive problems like credit card fraud detection.

Wrapping Up

Throughout this article, we've explored the transformative power of real-time feature engineering, from its conceptual foundations to its practical applications in credit card fraud detection. This shift from batch to real-time processing represents more than just a technological upgrade; it's a fundamental reimagining of how we extract value from data and make decisions in time-critical scenarios.

We touched on key technologies facilitating this shift, including Apache Spark Structured Streaming, Apache Flink, and Kafka Streams. Each of these platforms offers distinct capabilities for processing high-velocity data streams, allowing developers to choose the most suitable tool for their specific use case.

The credit card fraud detection case study demonstrated a practical implementation of real-time feature engineering. By processing individual transaction features, computing historical aggregations, and analyzing merchant-level patterns in real-time, the system can detect potentially fraudulent activities with minimal delay. This approach illustrates how real-time feature engineering can enhance the accuracy and responsiveness of machine learning models in time-sensitive applications.

However, real-time feature engineering presents specific challenges. These include managing system complexity, ensuring data quality in streaming environments, and maintaining consistent processing across different time windows. Addressing these challenges requires careful system design and robust error handling mechanisms.

Looking ahead, the field of real-time feature engineering is likely to evolve in several directions. Edge computing may reduce latency by processing data closer to its source. Automated feature engineering techniques could streamline the feature discovery process in real-time streams. Additionally, the integration of explainable AI methods may improve the interpretability of real-time models, particularly in regulated industries.

In conclusion, real-time feature engineering represents a significant advancement in data processing and machine learning. As demonstrated in the fraud detection case study, it enables the development of more responsive and accurate systems across various domains. The ongoing evolution of hardware, algorithms, and methodologies will continue to expand the capabilities and applications of real-time feature engineering in data-driven technologies.

Chat with us to see the platform live and discover how we can help simplify your journey deploying AI in production.
