
Updated Feature Engineering

Let us now understand how to implement feature engineering. Here are the basic feature engineering techniques that are widely used:

  • Encoding
  • Binning
  • Normalization
  • Standardization
  • Dealing with missing values
  • Data Imputation techniques

Encoding

Some algorithms work only with numerical features, but we may have categorical data like “genres of content customers watch” in our example. To convert such categorical data into a numerical form, we use encoding.

One-hot encoding:

Converting categorical data into a set of binary columns, one for each unique category, is called One-hot Encoding.

Here is a code snippet to implement One-hot encoding:

import pandas as pd
# Create one binary column per unique category, then drop the original column
encoded_columns = pd.get_dummies(data['column'])
data = data.join(encoded_columns).drop('column', axis=1)

This is widely used when the categorical feature has few unique categories. Keep in mind that as the number of unique categories increases, the number of dimensions (columns) also increases.
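As a quick sketch of this point (the 'genre' column and its values are hypothetical, made up for illustration), the number of new columns tracks the number of unique categories:

import pandas as pd

# Hypothetical sample with a single categorical column
data = pd.DataFrame({'genre': ['comedy', 'horror', 'romantic', 'comedy']})

# One binary column is created per unique category
encoded_columns = pd.get_dummies(data['genre'])
data = data.join(encoded_columns).drop('genre', axis=1)

print(data.columns.tolist())  # ['comedy', 'horror', 'romantic'] - one column per category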

Label encoding:

Converting categorical data into numerical data by assigning each category a unique integer value is called Label Encoding.

For example, 0 for ‘comedy’, 1 for ‘horror’, and 2 for ‘romantic’, assigned arbitrarily. However, assigning integers this way may impose an unnecessary ordinality on the categories.

This technique is best used when the categories are ordinal (have a specific order), like 3 for ‘excellent’, 2 for ‘good’, and 1 for ‘bad’. In such cases, giving an order to the categories is useful, and the values assigned need not be sequential.

Here is a code snippet to implement Label encoding:

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# Encode the first column of the feature array x as integers
x[:, 0] = labelencoder.fit_transform(x[:, 0])
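For the ordinal case described above, a minimal sketch (the 'rating' column and the mapping are hypothetical) is to define the order explicitly with a pandas mapping instead of letting the encoder assign integers arbitrarily:

import pandas as pd

# Hypothetical ordinal column
data = pd.DataFrame({'rating': ['good', 'bad', 'excellent', 'good']})

# An explicit mapping preserves the intended order of the categories
rating_map = {'bad': 1, 'good': 2, 'excellent': 3}
data['rating_encoded'] = data['rating'].map(rating_map)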

Binning

An opposite situation, occurring less frequently in practice, is when we have a numerical feature but we need to convert it into a categorical one. Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value range.

Binning of numerical data into 4, 8, or 16 bins

# Numerical binning example
Value       Bin
0-30    ->  Low
31-70   ->  Mid
71-100  ->  High

# Categorical binning example
Value       Bin
Germany ->  Europe
Italy   ->  Europe
India   ->  Asia
Japan   ->  Asia

The main motivation for binning is to make the model more robust and prevent overfitting; however, it comes at a cost to performance, because every time we bin, we sacrifice some information.
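Here is a minimal sketch of numerical binning with pandas (the 'score' column is hypothetical; the bin edges follow the example ranges above):

import pandas as pd

# Hypothetical continuous feature
data = pd.DataFrame({'score': [12, 45, 78, 30, 95]})

# Bucket the values into the Low/Mid/High ranges shown above
data['score_bin'] = pd.cut(data['score'],
                           bins=[0, 30, 70, 100],
                           labels=['Low', 'Mid', 'High'])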

Normalization

Normalization (also called Min-Max normalization) is a scaling technique that rescales each feature so that its values fall in the range [0, 1].

The normalized form of each feature can be calculated as follows:

x' = (x - min(x)) / (max(x) - min(x))

Here x is the original value and x' is the normalized value.

Scatter plot of the raw data and the normalized data

In the raw data, the feature alcohol lies in [11, 15] and the feature malic lies in [0, 6]. In the normalized data, both alcohol and malic lie in [0, 1].
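A minimal sketch of Min-Max normalization with pandas, assuming a DataFrame of numerical features similar to alcohol and malic (sklearn's MinMaxScaler would give the same result):

import pandas as pd

# Hypothetical numerical features
data = pd.DataFrame({'alcohol': [11.0, 13.5, 15.0],
                     'malic': [0.0, 3.0, 6.0]})

# x' = (x - min(x)) / (max(x) - min(x)), applied column-wise
normalized = (data - data.min()) / (data.max() - data.min())
print(normalized)  # every column now lies in [0, 1]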

Standardization

Standardization (also called Z-score normalization) is a scaling technique that rescales the features so that they have the properties of a standard normal distribution with mean μ = 0 and standard deviation σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean.

Standard scores (also called z-scores) of the samples are calculated as follows:

z = (x - μ) / σ

Unlike normalization, standardization does not bound the features to a fixed range such as [-1, 1]; the scaled values are centered at 0 with unit variance, and most of them fall within a few standard deviations of 0.

Scatter plot of the raw data and the standardized data

In the raw data, the feature alcohol lies in [11, 15] and the feature malic lies in [0, 6]. In the standardized data, both alcohol and malic are centered at 0.
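A minimal sketch of Z-score standardization on the same hypothetical features (sklearn's StandardScaler is the usual alternative; it uses the population standard deviation rather than the sample one used by pandas):

import pandas as pd

# Hypothetical numerical features
data = pd.DataFrame({'alcohol': [11.0, 13.5, 15.0],
                     'malic': [0.0, 3.0, 6.0]})

# z = (x - mean) / standard deviation, applied column-wise
standardized = (data - data.mean()) / data.std()
print(standardized)  # every column is now centered at 0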

To know more about feature scaling using normalization and standardization, check my article.

Dealing with missing values

The dataset may contain a few missing values. These may arise from errors while entering the data or from privacy concerns. Whatever the reason may be, it is essential to understand how to lessen their impact on the results. Here is how missing values can be handled:

  • Simply drop the data points with missing values (preferable when the dataset is large and the data points with missing values are few); see the sketch after this list
  • Use algorithms that can handle missing values themselves (this depends on the algorithm and the library with which it is implemented)
  • Use data imputation techniques (this depends on the application and the data)
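For the first option, here is a minimal sketch with pandas (the columns and values are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value in each column
data = pd.DataFrame({'age': [25, np.nan, 40],
                     'genre': ['comedy', 'horror', None]})

# Drop every row that contains at least one missing value
data = data.dropna()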

Data Imputation techniques

Data imputation is simply replacing the missing values with a substitute value chosen so that it affects the results as little as possible.

For numerical features, the missing values can be replaced by,

  • simply 0’s or a default value
# Filling all missing values with 0
data = data.fillna(0)
  • the most repeated value (mode) of the feature
# Filling missing values with the mode of each column
data = data.fillna(data.mode().iloc[0])
  • the mean of the feature (the mean is affected by outliers, so the median of the feature can be used instead)
# Filling missing values with the median of each column
data = data.fillna(data.median())

For categorical features, the missing values can be replaced by,

  • the most repeated value of the feature
# Filling missing values with the most repeated value of the column
data['column_name'] = data['column_name'].fillna(data['column_name'].value_counts().idxmax())
  • “others” or any newly named category, which indicates that the data point was imputed (see the sketch below)
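A minimal sketch for the last option (the 'genre' column is hypothetical):

import pandas as pd

# Hypothetical categorical column with a missing value
data = pd.DataFrame({'genre': ['comedy', None, 'horror']})

# Replace missing categories with a new 'others' category
data['genre'] = data['genre'].fillna('others')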

In this article, we covered the basic feature engineering techniques that are widely used. We can also create new features based on the data and the application. However, if the data is small and dirty, these techniques might not be very useful.
