Let us now understand how to implement feature engineering. Here are the basic feature engineering techniques that are widely used,
- Encoding
- Binning
- Normalization
- Standardization
- Dealing with missing values
- Data Imputation techniques
Encoding
Some algorithms work only with numerical features. But we may have categorical data, like the “genres of content customers watch” in our example. To convert such categorical data into numerical form, we use encoding.
One-hot encoding:
One-hot encoding converts categorical data into multiple binary columns, one for each unique category.
Here is the code snippet to implement one-hot encoding,
#One-hot encoding a categorical column and replacing it in the DataFrame
import pandas as pd

encoded_columns = pd.get_dummies(data['column'])
data = data.join(encoded_columns).drop('column', axis=1)
This is widely used when the categorical feature has only a few unique categories. Keep in mind that as the number of unique categories increases, the number of dimensions increases as well.
Label encoding:
Label encoding converts categorical data into numerical data by assigning each category a unique integer value, for example 0 for ‘comedy’, 1 for ‘horror’, and 2 for ‘romantic’, assigned arbitrarily. However, assigning integers this way may introduce an unintended ordinality among the categories.
This technique is best used when the categories are ordinal (have a specific order), like 3 for ‘excellent’, 2 for ‘good’ and 1 for ‘bad’. In such cases, giving an order to the categories is useful, and the values assigned need not be sequential.
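For such ordinal categories, a simple explicit mapping also works; here is a minimal sketch, assuming a hypothetical 'rating' column that holds the three categories above,
#Mapping ordinal categories of a hypothetical 'rating' column to integers
rating_order = {'bad': 1, 'good': 2, 'excellent': 3}
data['rating'] = data['rating'].map(rating_order)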
Here is the code snippet to implement label encoding, using scikit-learn's LabelEncoder,
#Label encoding the first column of the feature matrix x
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])
Binning
An opposite situation, occurring less frequently in practice, is when we have a numerical feature but need to convert it into a categorical one. Binning (also called bucketing) is the process of converting a continuous feature into a categorical one by grouping its values into bins or buckets, typically based on value ranges.

#Numerical Binning Example
Value       Bin
0-30    ->  Low
31-70   ->  Mid
71-100  ->  High

#Categorical Binning Example
Value       Bin
Germany ->  Europe
Italy   ->  Europe
India   ->  Asia
Japan   ->  Asia
The main motivation of binning is to make the model more robust and prevent overfitting; however, it comes at a cost to performance, because every time we bin, we sacrifice some information.
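As a sketch of numerical binning, assuming a hypothetical 'score' column with values from 0 to 100, pandas' cut function can assign each value to a labeled bin,
#Binning a hypothetical 'score' column into three labeled bins
import pandas as pd

data['score_bin'] = pd.cut(data['score'], bins=[0, 30, 70, 100],
                           labels=['Low', 'Mid', 'High'], include_lowest=True)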
Normalization
Normalization (also called min-max normalization) is a scaling technique that rescales the features so that the data falls in the range [0,1].
The normalized form of each feature can be calculated as follows:

x' = (x - min(x)) / (max(x) - min(x))
Here ‘x’ is the original value and ‘x'’ is the normalized value.


In the raw data, the feature alcohol lies in [11,15] and the feature malic lies in [0,6]. In the normalized data, both alcohol and malic lie in [0,1].
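Here is a minimal sketch of min-max normalization with scikit-learn's MinMaxScaler, assuming the two wine features above are stored as 'alcohol' and 'malic' columns of a DataFrame,
#Min-max normalization of the assumed 'alcohol' and 'malic' columns to [0,1]
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data[['alcohol', 'malic']] = scaler.fit_transform(data[['alcohol', 'malic']])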
Standardization
Standardization (also called z-score normalization) is a scaling technique that rescales the features so that they have the properties of a standard normal distribution with mean μ = 0 and standard deviation σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean.
Standard scores (also called z scores) of the samples are calculated as follows:

z = (x - μ) / σ
Unlike normalization, this does not bound the features to a fixed range; the values are centered around 0, and most of them fall within a few standard deviations of the mean.


In the raw data, the feature alcohol lies in [11,15] and the feature malic lies in [0,6]. In the standardized data, both alcohol and malic are centered at 0.
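A similar sketch for standardization, using scikit-learn's StandardScaler on the same assumed 'alcohol' and 'malic' columns,
#Standardizing the assumed 'alcohol' and 'malic' columns to mean 0 and standard deviation 1
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['alcohol', 'malic']] = scaler.fit_transform(data[['alcohol', 'malic']])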
To know more about feature scaling using normalization and standardization, check my article.
Dealing with missing values
The dataset may contain a few missing values. These may arise while entering the data or due to privacy concerns. Whatever the reason may be, understanding how to lessen their impact on the results is essential. Here is how missing values can be handled,
- Simply drop the data points with missing values (this is preferable when the data is huge and the data points with missing values are few; see the sketch after this list)
- Use algorithms that can handle missing values themselves (this depends on the algorithm and the library with which it is implemented)
- Use Data Imputation techniques (depends on the application and data)
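For the first option, here is a minimal sketch of dropping rows with missing values using pandas,
#Dropping all data points (rows) that contain missing values
data = data.dropna()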
Data Imputation techniques
Data imputation is simply replacing the missing values with a substitute value that would not distort the results.
For numerical features, the missing values can be replaced by,
- simply 0’s or the default value
#Filling all missing values with 0
data = data.fillna(0)
- the most repeated value (mode) of the feature
#Filling missing values with the mode of each column
data = data.fillna(data.mode().iloc[0])
- the mean of that feature (the mean is affected by outliers, so the median of the feature can be used instead)
#Filling missing values with the median of each column
data = data.fillna(data.median())
For categorical features, the missing values can be replaced by,
- most repeated value of the feature
#Filling missing values of a categorical column with its most repeated value
data['column_name'] = data['column_name'].fillna(
    data['column_name'].value_counts().idxmax())
- “others” or any newly named category, which implies that the data point is imputed
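A minimal sketch of the second option, filling missing values of a categorical column with a new “others” category,
#Filling missing values of a categorical column with a new "others" category
data['column_name'] = data['column_name'].fillna('others')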
In this article, we understood the basic feature engineering techniques that are widely used. We can create new features based on the data and the application. However, if the data is small and of poor quality, these techniques might not be very useful.