For many organizations, data is an overwhelming wave of information: a chaotic mess from which it’s impossible to realize any benefit. With feature engineering, organizations can make sense of their data and turn it into something beneficial.
The term feature engineering refers to the process of applying domain knowledge to data by generating features that
transform the data to make it easier to understand and interpret. It usually occurs after the data gathering and
cleaning process and before training machine learning models.
Feature engineering is often part of the ML problem solving workflow:
- Gather data
- Clean it
- Perform feature engineering
- Define the model
- Train the model
- Run tests
- Predict the output
Most of the information used by Artificial Intelligence (AI) is contained in tables, where each row is an observation and each column is a feature. Unfortunately, the data is often complicated, irrelevant, missing, or duplicated.
Feature engineering provides a process for transforming data into a format that better represents the underlying problem. It makes the data more digestible, for example by grouping values into categories that reflect a finite set of outcomes, or by systematically replacing missing values with appropriate estimates.
This process of transforming data with feature engineering is often as much an art as a science. For example, a business may want to predict instances of fraud. Raw timestamped transactions could be fed into AI software, but the output may not be meaningful or actionable. A bit of domain expertise, however, helps the data scientist. Using their knowledge of retail, the scientist creates a new feature that differentiates between the work week and weekends, since retail activity typically spikes on weekends. Once that context is manually established, models are better able to spot anomalies, with fewer false positives. That’s the ‘art’ of feature engineering.
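As a minimal sketch, assuming a pandas DataFrame and purely illustrative column names, that weekday/weekend feature could be derived like this:

```python
import pandas as pd

# Hypothetical transactions frame with a raw "timestamp" column.
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-06 14:00", "2023-01-07 10:30", "2023-01-09 09:15",
    ]),
    "amount": [120.0, 89.5, 42.0],
})

# New feature: flag weekend activity (dayofweek: Monday=0 ... Sunday=6).
transactions["is_weekend"] = transactions["timestamp"].dt.dayofweek >= 5
```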
Done correctly, feature engineering amplifies the predictive power of Machine Learning (ML) algorithms. It achieves this by fashioning features out of raw data that feed and facilitate the ML process. It can be the differentiator between a good data model and a bad one.
Breaking it down further, the feature engineering part comprises the following steps:
- Brainstorm new, possible features for the model
- Create the features
- Test how efficiently these features work with the model
- Tweak the features, repeat, or go back to the drawing board as needed
- Get the features to work seamlessly with the model
Feature engineering should not be considered a one-time step. It can be used throughout the data science process to either clean data or enhance existing results. It is an iterative process, interwoven with data selection, model evaluation, and re-evaluation, and it continues until the data is in a format that ML models can ingest and that enables those models to output actionable results.
Examples Of Feature Engineering For Machine Learning
ML algorithms learn solutions to specific problems using the sample data they are presented with. Feature engineering
helps an organization arrange the best representation of their sample data to give the model a chance to learn the
solution to any specific problem.
In feature engineering, representation and relationships matter, and there are four common engineering strategies:
- Resampling imbalanced data
- Creating new features
- Managing missing values
- Detecting outliers
Resampling Imbalanced Data
In its raw form, data is usually imbalanced. Most of the time this can be resolved with validation techniques, but sometimes the imbalance is large enough to skew the model’s outputs. Feature engineering can resolve this by artificially generating samples for the minority classes. These samples help address variability or uncertainty in the data.
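One simple way to do this, sketched here with scikit-learn’s resample utility and a hypothetical fraud label, is random oversampling of the minority class:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical labeled data where the positive "fraud" class is rare.
df = pd.DataFrame({"amount": [10, 12, 11, 950, 14, 13],
                   "fraud":  [0, 0, 0, 1, 0, 0]})

majority = df[df["fraud"] == 0]
minority = df[df["fraud"] == 1]

# Randomly oversample the minority class until the classes are balanced.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])
```

Random oversampling simply duplicates existing minority samples; techniques such as SMOTE go further and generate synthetic minority samples instead.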
Creation of New Features
Creating new features can be as simple as restating data in a different format to match the context of the question. For example, a company may have the departure and arrival times for trains and turn them into total travel time. Combining the timestamps into one new feature enables the algorithm to fit the business need and produce more actionable results.
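A minimal sketch of that transformation, assuming pandas and illustrative column names, might look like this:

```python
import pandas as pd

# Hypothetical schedule with raw departure and arrival timestamps.
trains = pd.DataFrame({
    "departure": pd.to_datetime(["2023-03-01 08:00", "2023-03-01 09:30"]),
    "arrival":   pd.to_datetime(["2023-03-01 10:15", "2023-03-01 12:05"]),
})

# New feature: total travel time in minutes, combining the two timestamps.
trains["travel_minutes"] = (
    trains["arrival"] - trains["departure"]
).dt.total_seconds() / 60
```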
Users can also combine two features that are only moderately useful, or not useful at all, on their own to create one feature that helps the machine learn better. An example of this is in healthcare, where a variety of risk factors may be present but, individually, don’t indicate the likelihood of a medical event. Age, hypertension, and being a smoker are each weak predictors of a stroke on their own, but the three factors together can be a strong one.
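A hedged sketch of such a combined feature, using hypothetical column names and thresholds, could look like this:

```python
import pandas as pd

# Hypothetical patient records; the column names are illustrative only.
patients = pd.DataFrame({"age": [72, 45, 68],
                         "hypertension": [1, 0, 1],
                         "smoker": [1, 1, 0]})

# Combined feature: three weak individual signals joined into one indicator.
patients["high_risk"] = ((patients["age"] > 65)
                         & (patients["hypertension"] == 1)
                         & (patients["smoker"] == 1)).astype(int)
```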
Feature selection is simply about picking the independent features that correlate most strongly with the dependent feature. All these things combine to make the best possible predictive model. Heatmaps, univariate selection, and the ExtraTreesClassifier method are all tried-and-tested ways of identifying appropriately related features.
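As a small sketch of the ExtraTreesClassifier approach mentioned above, run on synthetic data, features can be ranked by the importances the fitted model learns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data standing in for real independent/dependent features.
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)

# Fit the classifier and rank features by their learned importance.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_,
                        index=[f"feature_{i}" for i in range(8)])
print(importances.sort_values(ascending=False))
```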
Feature engineering also helps pick which buckets to create so that the machine can accurately map relevant data to the right bucket. This includes weeding out unwanted features and noise, which helps the model function more smoothly.
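As an illustration, bucketing might be done with pandas’ cut function; the bin edges here are purely hypothetical:

```python
import pandas as pd

# Hypothetical customer ages bucketed into a finite set of groups.
ages = pd.Series([23, 37, 45, 61, 78])
age_bucket = pd.cut(ages, bins=[0, 30, 50, 70, 120],
                    labels=["<30", "30-49", "50-69", "70+"])
```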
Managing Missing Values
Missing values are a frequent problem in data, but there are many methods for adequately resolving them during the data cleansing process.
There are also several advanced engineering techniques that can use existing data to accurately recreate missing
values and complete the dataset, ensuring the data is in a form that models can better utilize.
One method is data deletion. With this method, feature engineers remove samples that have missing values. This works best when only a few samples are incomplete; the more missing values a dataset contains, the more problematic this method becomes.
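A minimal sketch of data deletion with pandas (the frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a few incomplete samples.
df = pd.DataFrame({"income": [52000, np.nan, 61000, 47000],
                   "age": [34, 29, np.nan, 41]})

# Data deletion: drop any row containing a missing value.
complete = df.dropna()
```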
Another technique involves replacing missing data with the variable’s mean or median. While this approach resolves missing data, it can skew the results. If the data has a Gaussian distribution, the missing values can instead be imputed (a model within a model) so that they match the normal distribution.
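A minimal sketch of mean imputation with pandas (the column is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with gaps.
df = pd.DataFrame({"blood_pressure": [120, np.nan, 135, 128, np.nan]})

# Mean imputation: replace missing values with the column mean.
df["blood_pressure"] = df["blood_pressure"].fillna(df["blood_pressure"].mean())
```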
These are the two main methods. While there are other ways to manage missing values, the general approach is to either remove data or impute estimated values.
Outlier Detection
Outlier detection is another process that crosses the cleansing/engineering barrier. In the data cleansing step, AI may simply remove the outliers, treating them as errors or samples that aren’t relevant to the data. However, that’s a blunt tool and could miss essential information.
In data science, data handling and data processing are key factors influencing a model’s performance. Without proper data handling, a model may only reach an accuracy of around 70%; when feature engineering is applied to the same model, performance can improve greatly.
But a good understanding of the data is still needed for feature engineering, as it allows a data scientist to specify thresholds within which the data is still logical. For example, a business may have a customer who is 100 years old, but definitely not one who is 1,000 years old. A machine may disregard both data points, while a data scientist knows the extra zero is likely an input error.
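A hedged sketch of such a domain-informed threshold, using the age example above:

```python
import pandas as pd

# Hypothetical customer ages; 1,000 is almost certainly an input error.
customers = pd.DataFrame({"age": [34, 52, 100, 1000, 27]})

# Domain-informed threshold: ages above 120 are flagged for review
# rather than silently dropped, so likely input errors can be corrected.
outliers = customers[customers["age"] > 120]
valid = customers[customers["age"] <= 120]
```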
This part of the feature engineering process can be long, frustrating, and rely on the skill and domain knowledge of
a data scientist. This is why some view feature engineering in ML as nothing less than an art form.
Advantages Of Feature Engineering
As the adage goes, AI and ML models are only as good as the data they receive. Including feature engineering in the modeling process helps ensure that the data models receive is of the quality and relevance needed to solve real-world problems. But there are two important things to keep in mind as you proceed:
- Framing The Problem Correctly: Using the right objective measures to estimate the accuracy of the output
- Inter-Dependencies Within The Model: The inherent, underlying structures in the organization’s data. A good structure generally provides far better results.
Once these things are considered when selecting or designing features, the advantages of feature engineering include:
- More flexibility and less complexity in models
- Faster processing
- Clear, easy-to-understand models
- Simpler models that are easier to maintain
- A better understanding of the underlying problem
- Better representation of all the available data that is helpful in characterizing the underlying problem
Challenges of Feature Engineering
Data is often unstructured and messy, containing outliers, redundancy, and missing values. Because data comes from multiple sources, redundancy and duplication are a given. Since data is the starting point for ML, this results in the following challenges for feature engineering:
- Enormous amounts of data from multiple sources that must be cleansed, aggregated, and analyzed
- Data must be organized into a recognizable structure that models and tools can work with
- Business context and processes must be understood to discern patterns and facilitate analysis
- Insights given must be relevant and actionable for the organization
- Data should be presented in a way that’s easy for people to understand, such as dashboards or graphs
- Timeliness can be a problem, with analysis taking so long that the results are no longer applicable
- Processes are labor intensive and often must be completed by a data scientist
The Future Of Feature Engineering
Modern technologies are improving the performance of feature engineering. Deep learning, as a subset of ML, is starting to reshape the process. Autoencoders and restricted Boltzmann machines are showing promise, automatically learning abstract feature representations.
The more that computers ‘think’ like humans, the more helpful their feature engineering becomes. Taking heavily
manual tasks from data scientists and allocating them to machines removes cost and time constraints. This means that
data forms such as images, videos, objects, and speech, which are not easily understood by traditional AI that relies
on tables, may be accurately interpreted by machines soon.
New ML models are increasingly offering human-like thought processes, better feature analysis, and higher model
accuracy.
But for now, the field is still reliant on data scientists. The best interpretations of data not only require
knowledge of data science, but also industry or domain knowledge, making this subset of AI a specialized field. Data
interpretation is vital to organizations wanting accurate predictions, and this is the best way to get valid results.
Does Your Organization Need More Accurate Predictions?
Alteryx’s machine learning package offers Deep Feature Synthesis, which helps create more accurate models by understanding relationships within your data and detecting high-quality features.
These algorithms give a step up for organizations needing accurate models and predictions, allowing for better
explanations, decision making, and future plans.