<h1 id="table-of-contents">Table of Contents</h1>
<ul>
<li><a href="#why">Why Do We Need Scaling?</a></li>
<li><a href="#std_vs_norm">Standardization vs. Normalization</a></li>
<li><a href="#deepdive">Scalers Deep Dive</a>
<ul>
<li><a href="#org">Original Data</a></li>
<li><a href="#std">1. Standardization</a></li>
<li><a href="#norm">2. Normalization</a>
<ul>
<li><a href="#minmax">2.1. Min-max Normalization</a></li>
<li><a href="#maxabs">2.2. Maximum absolute normalization</a></li>
<li><a href="#mean">2.3. Mean Normalization</a></li>
<li><a href="#robust">2.4. Median-quantile Normalization</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#summary">Summary</a></li>
<li><a href="#ref">References</a></li>
</ul>
<p><a id="why"></a></p>
<h1 id="why-do-we-need-scaling">Why Do We Need Scaling?</h1>
<p>Features on different scales are a common issue data scientists encounter. Some algorithms can handle them, but others cannot, and if features are not scaled properly beforehand, we will have a hard time finding the optimal solution. So, why does it matter and how can we solve it?</p>
<p>First, let’s think about KNN, which uses Euclidean distance to determine the similarity between points. When calculating the distance, features with a bigger magnitude influence the result far more than those with a smaller magnitude, leading to a solution dominated by the bigger features. Another example is algorithms that use gradient descent: features on different scales have different step sizes, so it takes longer to converge, as shown below.</p>
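<p>As a quick sketch (toy values assumed), consider two points described by age in years and income in dollars. Without scaling, the income axis dominates the Euclidean distance almost entirely:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# two people: (age in years, income in dollars) -- hypothetical values
a = np.array([25, 50_000])
b = np.array([55, 52_000])

# unscaled distance: ~2000.2, driven almost entirely by the income gap;
# the 30-year age difference barely registers
print(np.linalg.norm(a - b))

# after min-max scaling each feature to [0, 1]
# (assumed ranges: age 18-80, income 0-100k)
a_scaled = np.array([(25 - 18) / (80 - 18), 50_000 / 100_000])
b_scaled = np.array([(55 - 18) / (80 - 18), 52_000 / 100_000])

# now both features contribute comparably (~0.48)
print(np.linalg.norm(a_scaled - b_scaled))
</code></pre></div></div>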
<p><br /></p>
<div style="text-align:center">
<img src="/images/2020-12-27-feature-scaling/scaling.png" width="60%" />
<figcaption> Gradient descent without scaling (left) and with scaling (right) (<a href="https://stackoverflow.com/a/46688787/9449085">Image source</a>) </figcaption>
</div>
<p><br /></p>
<p>The main exception is tree-based algorithms, which use Gini impurity or information gain and are therefore not influenced by feature scale. Here are some examples of machine learning models that are sensitive and not sensitive to feature scale:</p>
<p><strong>ML Models sensitive to feature scale</strong></p>
<ul>
<li>Algorithms that use gradient descent as an optimization technique
<ul>
<li>Linear and Logistic Regression (although they may be fit without gradient descent)</li>
<li>Neural Networks</li>
</ul>
</li>
<li>Distance-based algorithms
<ul>
<li>Support Vector Machines</li>
<li>KNN</li>
<li>K-means clustering</li>
</ul>
</li>
<li>Algorithms that find directions that maximize the variance
<ul>
<li>Principal Component Analysis (PCA)</li>
<li>Linear Discriminant Analysis (LDA)</li>
</ul>
</li>
</ul>
<p><strong>ML models not sensitive to feature scale</strong></p>
<ul>
<li>Tree-based algorithms
<ul>
<li>Decision Tree</li>
<li>Random Forest</li>
<li>Gradient Boosted Trees</li>
</ul>
</li>
</ul>
<p><a id="std_vs_norm"></a></p>
<h1 id="standardization-vs-normalization">Standardization vs. Normalization</h1>
<p>How can we scale features then? There are two types of scaling techniques depending on their focus: 1) standardization and 2) normalization.</p>
<p><strong>Standardization</strong> focuses on scaling the <strong><em>variance</em></strong> in addition to shifting the center to 0. It comes from standardization in statistics, which converts a variable into a $z$-score that represents the number of standard deviations away from the mean, no matter what the original value is.</p>
<p><strong>Normalization</strong> focuses on scaling the <strong><em>min-max range</em></strong> rather than the variance. For example, an original value range of [100, 200] is simply scaled to [0, 1] by subtracting the minimum value and dividing by the range. There are a few variations of normalization depending on whether it centers the data and what min/max values it uses: 1) min-max normalization, 2) max-abs normalization, 3) mean normalization, and 4) median-quantile normalization.</p>
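<p>As a minimal sketch of the difference (toy values assumed), here is the same array transformed both ways:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

x = np.array([100.0, 125.0, 150.0, 175.0, 200.0])

# standardization: center at 0, unit variance -- no fixed bounds
z = (x - x.mean()) / x.std()
print(z)  # [-1.41 -0.71  0.    0.71  1.41]

# min-max normalization: squeeze into [0, 1] -- fixed bounds, variance not controlled
n = (x - x.min()) / (x.max() - x.min())
print(n)  # [0.   0.25 0.5  0.75 1.  ]
</code></pre></div></div>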
<p>Each scaling method has its own advantages and limitations and there is no method that works for every situation. We should understand each method, implement them, and see which one works best for a specific problem. In the remaining sections of this post, I will explain the definition, advantages, limitations, and Python implementation of all of the mentioned scaling methods.</p>
<p><a id="deepdive"></a></p>
<h1 id="scalers-deep-dive">Scalers Deep Dive</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define functions used in this post
</span>
<span class="k">def</span> <span class="nf">kdeplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">scaler_name</span><span class="p">):</span>
<span class="n">fix</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
<span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">feature</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">scaler_name</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Density'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Feature value'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">kdeplot_with_zoom</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">scaler_name</span><span class="p">,</span> <span class="n">xlim</span><span class="p">):</span>
<span class="n">fix</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="c1"># original
</span> <span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
<span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">feature</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">scaler_name</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Density'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Feature value'</span><span class="p">)</span>
<span class="c1"># zoomed
</span> <span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="n">features</span><span class="p">:</span>
<span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">feature</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax2</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="n">xlim</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">scaler_name</span> <span class="o">+</span> <span class="s">' (zoomed)'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Density'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Feature value'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<p><a id="org"></a></p>
<h2 id="original-data">Original Data</h2>
<p>We will use the Boston housing prices dataset available in the scikit-learn library to demonstrate the effect of each scaler. Among the 13 variables in total, we will focus on 6 for easier visualization: ‘RM’, ‘LSTAT’, ‘CRIM’, ‘AGE’, ‘DIS’, ‘NOX’. As always, we split the data into train and test sets and use only the train set for feature engineering to prevent data leakage, although we will not cover testing in this post.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># import modules
</span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_boston</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># load data
</span><span class="n">boston_dataset</span> <span class="o">=</span> <span class="n">load_boston</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">boston_dataset</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">boston_dataset</span><span class="p">.</span><span class="n">feature_names</span><span class="p">)</span>
<span class="c1"># take only variables we will experiment
</span><span class="n">features</span> <span class="o">=</span> <span class="p">[</span><span class="s">'RM'</span><span class="p">,</span> <span class="s">'LSTAT'</span><span class="p">,</span> <span class="s">'CRIM'</span><span class="p">,</span> <span class="s">'AGE'</span><span class="p">,</span> <span class="s">'DIS'</span><span class="p">,</span> <span class="s">'NOX'</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">features</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s">'MEDV'</span><span class="p">]</span> <span class="o">=</span> <span class="n">boston_dataset</span><span class="p">.</span><span class="n">target</span> <span class="c1"># add target
</span>
<span class="c1"># split data
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">features</span><span class="p">],</span>
<span class="n">df</span><span class="p">[</span><span class="s">'MEDV'</span><span class="p">],</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p>When the original distributions of all features are displayed in one plot, we can quickly tell that they are not on the same scale. Some features seem to be clustered in a smaller range, such as ‘NOX’ or ‘RM’, and some are spread across a wider range, such as ‘LSTAT’.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="s">'Original'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_10_0.png" alt="png" /></p>
<p>To quantify the difference in scale between features, we can check statistics such as the mean, standard deviation, minimum, and maximum of the observations within each feature. Indeed, they are all very different in scale, and this will be a problem when training the types of models that require data to be on the same scale.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train</span><span class="p">.</span><span class="n">describe</span><span class="p">().</span><span class="n">loc</span><span class="p">[[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'std'</span><span class="p">,</span> <span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">],</span> <span class="p">:]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>RM</th>
<th>LSTAT</th>
<th>CRIM</th>
<th>AGE</th>
<th>DIS</th>
<th>NOX</th>
</tr>
</thead>
<tbody>
<tr>
<th>mean</th>
<td>6.308427</td>
<td>12.440650</td>
<td>3.358284</td>
<td>68.994068</td>
<td>3.762459</td>
<td>0.556098</td>
</tr>
<tr>
<th>std</th>
<td>0.702009</td>
<td>7.078485</td>
<td>8.353223</td>
<td>28.038429</td>
<td>2.067661</td>
<td>0.115601</td>
</tr>
<tr>
<th>min</th>
<td>3.561000</td>
<td>1.730000</td>
<td>0.006320</td>
<td>2.900000</td>
<td>1.174200</td>
<td>0.385000</td>
</tr>
<tr>
<th>max</th>
<td>8.780000</td>
<td>36.980000</td>
<td>88.976200</td>
<td>100.000000</td>
<td>12.126500</td>
<td>0.871000</td>
</tr>
</tbody>
</table>
</div>
<p><a id="std"></a></p>
<h2 id="1-standardization">1. Standardization</h2>
<p>One of the most commonly used techniques is standardization, which scales data so different features have the same mean and standard deviation.</p>
<p><strong>Definition</strong></p>
<ul>
<li>Center data at 0 and set the standard deviation to 1 (variance=1)</li>
</ul>
\[X' = \frac{X - \mu}{\sigma}\]
<p>where $\mu$ is the mean of the feature and $\sigma$ is the standard deviation of the feature</p>
<ul>
<li>The output value is also called Z-score which represents how many standard deviations a value is away from the mean of the feature</li>
</ul>
<p><strong>Advantages</strong></p>
<ul>
<li>All features have the same mean and variance, making it easier to compare</li>
<li>It is less sensitive to extreme outliers than min-max normalizer</li>
<li>It preserves the shape of the original distribution (if the original distribution is normal, the transformed data will also be normal; the same holds for skewed distributions)</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>It works best when the original distribution is normal (it is recommended to transform the data toward normality beforehand)</li>
<li>It is still affected by outliers, as the mean and standard deviation used in the formula are themselves affected by extreme values</li>
<li>There is no fixed bounding range and features have different ranges</li>
<li>It preserves outliers</li>
</ul>
<p>Let’s look at the standardization output for our data below. Every feature is centered around 0 with the same variance (note the width of the main curves). However, the x-axis range of each variable differs; features with extreme outliers, such as ‘CRIM’, have a much longer tail reaching about 10. These extreme outliers might work adversely when training a model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="c1"># fit the scaler to the train set
</span><span class="n">scaler_std</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform data
</span><span class="n">X_train_scaled_std</span> <span class="o">=</span> <span class="n">scaler_std</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># put them in dataframe
</span><span class="n">X_train_scaled_std</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled_std</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># plot
</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train_scaled_std</span><span class="p">,</span> <span class="s">'StandardScaler'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_15_0.png" alt="png" /></p>
<p><a id="norm"></a></p>
<h2 id="2-normalization">2. Normalization</h2>
<p>Normalization overcomes standardization’s limitation of varying ranges across features by limiting the bounding range. The main idea is to divide the values by the maximum or by the total range of the variable so that every value lies within a fixed range.</p>
<p><a id="minmax"></a></p>
<h3 id="21-min-max-normalization">2.1. Min-max Normalization</h3>
<p><strong>Definition</strong></p>
<ul>
<li>Scale the feature so it has a fixed range such as [0, 1]</li>
</ul>
\[X' = \frac{X - \text{min}(X)}{\text{max}(X) - \text{min}(X)}\]
<p><strong>Advantages</strong></p>
<ul>
<li>Every feature has the same range of [0, 1], removing potentially negative impacts of extreme values</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>The mean and variance vary between features</li>
<li>It may alter the shape of the original distribution</li>
<li>It is sensitive to extreme outliers</li>
<li>The majority of data will be centered within a small range if there are extreme outliers</li>
</ul>
<p>When applying this to our data, we can see that every feature now falls within the same range. Note that the graph appears to show values below 0 and above 1, but that is only because we are estimating a smooth density function from non-smooth data; the actual values lie between 0 and 1, as the summary table below shows.</p>
<p>You can also see that ‘CRIM’ has the majority of its observations near 0 and quickly fades out after that. This is caused by extreme outliers. It would be wise to remove those outliers beforehand so the values are spread more evenly, which will help training.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="c1"># fit the scaler to the train set
</span><span class="n">scaler_minmax</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform data
</span><span class="n">X_train_scaled_minmax</span> <span class="o">=</span> <span class="n">scaler_minmax</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># put them in dataframe
</span><span class="n">X_train_scaled_minmax</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled_minmax</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># plot
</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train_scaled_minmax</span><span class="p">,</span> <span class="s">'Min-max Normalization'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_20_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># summary table shows min 0 and max 1 for every feature
</span><span class="n">X_train_scaled_minmax</span><span class="p">.</span><span class="n">describe</span><span class="p">().</span><span class="n">loc</span><span class="p">[[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'std'</span><span class="p">,</span> <span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">],</span> <span class="p">:]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>RM</th>
<th>LSTAT</th>
<th>CRIM</th>
<th>AGE</th>
<th>DIS</th>
<th>NOX</th>
</tr>
</thead>
<tbody>
<tr>
<th>mean</th>
<td>0.526428</td>
<td>0.303848</td>
<td>0.037675</td>
<td>0.680680</td>
<td>0.236321</td>
<td>0.352054</td>
</tr>
<tr>
<th>std</th>
<td>0.134510</td>
<td>0.200808</td>
<td>0.093888</td>
<td>0.288758</td>
<td>0.188788</td>
<td>0.237861</td>
</tr>
<tr>
<th>min</th>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>max</th>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
</tr>
</tbody>
</table>
</div>
<p><a id="maxabs"></a></p>
<h3 id="22-maximum-absolute-normalization">2.2. Maximum absolute normalization</h3>
<p>When there are both positive and negative values, it might be wise to keep the sign and scale only the magnitude, so that the range becomes roughly [-1, 1]. For instance, if the original feature range is [-50, 50], we can map it to [-1, 1] by simply dividing the values by the maximum absolute value. This is where the max-abs normalizer comes in.</p>
<p><strong>Definition</strong></p>
<ul>
<li>Scale the feature so it has a fixed range such as [-1, 1]</li>
</ul>
\[X' = \frac{X}{\text{max}(\lvert X \rvert)}\]
<ul>
<li>This is the same as min-max normalizer if the minimum value is 0 and all values are positive</li>
</ul>
<p><strong>Advantages</strong></p>
<ul>
<li>It is handy for features with both positive and negative values as it keeps the sign of values</li>
<li>It does not shift or center the data, so it does not destroy any sparsity. This technique is often used in sparse data.</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>The mean and variance vary between features</li>
<li>It may alter the shape of the original distribution</li>
<li>It is sensitive to extreme outliers</li>
<li>The majority of data will be centered within a small range if there are extreme outliers</li>
</ul>
<p>The graph below is almost the same as the result of the min-max normalizer, as all of the features are positive values.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MaxAbsScaler</span>
<span class="n">scaler_maxabs</span> <span class="o">=</span> <span class="n">MaxAbsScaler</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform data
</span><span class="n">X_train_scaled_maxabs</span> <span class="o">=</span> <span class="n">scaler_maxabs</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># put them in dataframe
</span><span class="n">X_train_scaled_maxabs</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled_maxabs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># plot
</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train_scaled_maxabs</span><span class="p">,</span> <span class="s">'Maximum Absolute Scaler'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_24_0.png" alt="png" /></p>
<p><a id="mean"></a></p>
<h3 id="23-mean-normalization">2.3. Mean Normalization</h3>
<p>The mean normalizer is the same as the min-max normalizer but, instead of setting the minimum to 0, it sets the mean to 0.</p>
<p><strong>Definition</strong></p>
<ul>
<li>Center the feature at 0 and rescale the feature to [-1, 1]</li>
</ul>
\[X' = \frac{X-\mu}{\text{max}(X) - \text{min}(X)}\]
<p><strong>Advantages</strong></p>
<ul>
<li>Every feature has the same range of [-1, 1], removing potentially negative impacts of extreme values</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>The mean and variance vary between features</li>
<li>It may alter the shape of the original distribution</li>
<li>It is sensitive to extreme outliers</li>
<li>The majority of data will be centered within a small range if there are extreme outliers</li>
</ul>
<p>Unfortunately, there is no dedicated class for mean normalization in scikit-learn. Instead, we can combine StandardScaler, to remove the mean, with RobustScaler, to divide the values by the total value range.</p>
<p>You can see that all features are now centered around 0 while keeping the min-max range the same across them. This will be handy when applying machine learning models. However, the variance still varies between features, leaving the ones with extreme outliers (e.g. ‘CRIM’) mostly clustered near 0.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">RobustScaler</span>
<span class="c1"># StandardScaler to remove the mean but not scale
</span><span class="n">scaler_mean</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">(</span><span class="n">with_mean</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">with_std</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># RobustScaler to divide values by max-min
# Important to keep the quantile range to 0 to 100 (min and max values)
</span><span class="n">scaler_minmax</span> <span class="o">=</span> <span class="n">RobustScaler</span><span class="p">(</span><span class="n">with_centering</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">with_scaling</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">quantile_range</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">))</span>
<span class="c1"># fit the scaler to the train set
</span><span class="n">scaler_mean</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">scaler_minmax</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform train and test sets
</span><span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">scaler_minmax</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">scaler_mean</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">))</span>
<span class="c1"># put them in dataframe
</span><span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># plot
</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="s">'Mean Normalization'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_27_0.png" alt="png" /></p>
<p><a id="robust"></a></p>
<h3 id="24-median-quantile-normalization">2.4. Median-quantile Normalization</h3>
<p>The final method is median-quantile normalization, which is also called a robust scaler. It is called robust because it is robust to extreme outliers.</p>
<p><strong>Definition</strong></p>
<ul>
<li>Set the median to 0 and scale by the interquartile range (the range between the 25th and 75th percentiles)</li>
</ul>
\[X' = \frac{X-\text{median}(X)}{\text{75th percentile}(X) - \text{25th percentile}(X)}\]
<p><strong>Advantages</strong></p>
<ul>
<li>It is robust to outliers so it is used for data with outliers</li>
<li>It produces a better spread of data for skewed distribution</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>The variance and value range differs between features</li>
<li>It may not preserve the shape of the original distribution</li>
<li>It preserves outliers</li>
</ul>
<p>As you can see in the zoomed graph on the right below, the transformed data has a better spread, and none of the features shows a high concentration within a small range, unlike the other scaling techniques we have reviewed so far. However, as the graph on the left shows, features with extreme outliers (e.g. ‘CRIM’) still span a very wide range of values. This method does not impose a fixed value range, so the extreme values remain in the data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">RobustScaler</span>
<span class="n">scaler_rbs</span> <span class="o">=</span> <span class="n">RobustScaler</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_train_scaled_rbs</span> <span class="o">=</span> <span class="n">scaler_rbs</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_train_scaled_rbs</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled_rbs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="n">kdeplot_with_zoom</span><span class="p">(</span><span class="n">X_train_scaled_rbs</span><span class="p">,</span> <span class="s">'Median-quantile Normalization'</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_30_0.png" alt="png" /></p>
<p><a id="summary"></a></p>
<h1 id="summary">Summary</h1>
<p>We have reviewed five scaling methods: one standardization and four normalization methods. As mentioned earlier, no method works for every problem; we need to try different scalers and find the one that works best for a specific application. As a rule of thumb, start with standardization or min-max normalization and then see whether other methods or tweaks are needed. Some criteria to consider: 1) does the algorithm prefer data centered at 0? 2) does the algorithm prefer data in a fixed range? Also, it is wise to handle outliers beforehand if necessary.</p>
<p>Here is the summary of the scaling methods reviewed in this post:</p>
<table>
<thead>
<tr>
<th>Category</th>
<th>Standardization</th>
<th>Min-max Normalization</th>
<th>Max-abs Normalization</th>
<th>Mean Normalization</th>
<th>Median-quantile Normalization</th>
</tr>
</thead>
<tbody>
<tr>
<td>Concepts</td>
<td>Centering + Unit Variance</td>
<td>Fixed Range</td>
<td>Fixed Range</td>
<td>Centering + Fixed Range</td>
<td>Centering + Fixed Quantile Range</td>
</tr>
<tr>
<td>Definition</td>
<td>Convert data to have zero mean and unit variance</td>
<td>Convert data to be within fixed range (e.g. [0, 1])</td>
<td>Convert data to be within fixed range (e.g. [-1, 1])</td>
<td>Convert data to have zero mean and be within fixed range (e.g. [-1, 1])</td>
<td>Convert data to have zero median and unit interquantile range</td>
</tr>
<tr>
<td>Sklearn class</td>
<td>StandardScaler</td>
<td>MinMaxScaler</td>
<td>MaxAbsScaler</td>
<td>StandardScaler + RobustScaler<br /><br />* <em>StandardScaler only for mean removal and RobustScaler for scaling</em><br /></td>
<td>RobustScaler</td>
</tr>
<tr>
<td>Benefits</td>
<td>- Less sensitive to outliers <br />- Easier to compare and learn <br />- Preserves original distribution</td>
<td>- Features in the same range</td>
<td>- Features in the same range <br />- Preserves the sign (good for pos/neg mix)</td>
<td>- Features in the same range</td>
<td>- Least sensitive to outliers <br />- Better spread of values for skewed distribution</td>
</tr>
<tr>
<td>Limitations</td>
<td>- Range varies between variables <br />- Preserves outliers</td>
<td>- Sensitive to outliers <br />- Mean and variance vary between features</td>
<td>- Sensitive to outliers <br />- Mean and variance vary between features</td>
<td>- Sensitive to outliers <br />- Variance varies between features</td>
<td>- Range varies between features <br />- Variance varies between features</td>
</tr>
</tbody>
</table>
<p><a id="ref"></a></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#maxabsscaler">sciki-learn: Compare the effect of different scalers on data with outliers</a></li>
<li><a href="https://en.wikipedia.org/wiki/Feature_scaling">Wikipedia: Feature Scaling</a></li>
<li><a href="https://statisticsbyjim.com/glossary/standardization/">Statistics By Jim: Standardization</a></li>
<li><a href="https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/">Analytics Vidhya: Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization</a></li>
<li><a href="https://sebastianraschka.com/Articles/2014_about_feature_scaling.html#training-a-naive-bayes-classifier">Sebastian Raschka: About Feature Scaling and Normalization</a></li>
<li><a href="https://kharshit.github.io/blog/2018/03/23/scaling-vs-normalization">Technical Fridays: Scaling vs Normalization</a></li>
<li><a href="https://www.udemy.com/course/feature-engineering-for-machine-learning/">Feature Engineering for Machine Learning</a></li>
</ul>
<h1 id="how-to-use-fb-prophet-for-time-series-forecasting-vehicle-traffic-volume">How To Use FB Prophet for Time-series Forecasting: Vehicle Traffic Volume</h1>
<p>Recently, I came across a few articles mentioning Facebook’s Prophet library that looked interesting (although the
initial release was almost 3 years ago!), so I decided to dig more into it.</p>
<p>Prophet is an open-source library developed by Facebook that aims to make time-series forecasting easier and more scalable. It is a type of generalized additive model (GAM): a regression model with potentially non-linear smoothers. It is called additive because it adds up multiple decomposed parts to explain the overall trend. Prophet uses the following components:</p>
\[y(t) = g(t) + s(t) + h(t) + e(t)\]
<p>where,<br />
$g(t)$: Growth. Big trend. Non-periodic changes. <br />
$s(t)$: Seasonality. Periodic changes (e.g. weekly, yearly, etc.) represented by Fourier Series.<br />
$h(t)$: Holiday effect that represents irregular schedules. <br />
$e(t)$: Error. Any idiosyncratic changes not explained by the model.</p>
<p>In this post, I will explore the main concepts and API endpoints of the Prophet library.</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ol>
<li><a href="#prep">Prepare Data</a></li>
<li><a href="#train">Train And Predict</a></li>
<li><a href="#components">Check Components</a></li>
<li><a href="#eval">Evaluate</a></li>
<li><a href="#trend">Trend Change Points</a></li>
<li><a href="#season">Seasonality Mode</a></li>
<li><a href="#save">Saving Model</a></li>
<li><a href="#ref">References</a></li>
</ol>
<p><a id="prep"></a></p>
<h1 id="1-prepare-data">1. Prepare Data</h1>
<p>In this post, we will use the U.S. traffic volume data available <a href="https://fred.stlouisfed.org/series/TRFVOLUSM227NFWA">here</a>: the monthly traffic volume (miles traveled) on public roadways from January 1970 through September 2020, in units of millions of miles.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="c1"># to mute Pandas warnings Prophet needs to fix
</span><span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="p">.</span><span class="n">simplefilter</span><span class="p">(</span><span class="n">action</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">,</span> <span class="n">category</span><span class="o">=</span><span class="nb">FutureWarning</span><span class="p">)</span>

# load the data downloaded from FRED (the local filename here is an assumption)
df = pd.read_csv('TRFVOLUSM227NFWA.csv')
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>DATE</th>
<th>TRFVOLUSM227NFWA</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1970-01-01</td>
<td>80173.0</td>
</tr>
<tr>
<th>1</th>
<td>1970-02-01</td>
<td>77442.0</td>
</tr>
<tr>
<th>2</th>
<td>1970-03-01</td>
<td>90223.0</td>
</tr>
<tr>
<th>3</th>
<td>1970-04-01</td>
<td>89956.0</td>
</tr>
<tr>
<th>4</th>
<td>1970-05-01</td>
<td>97972.0</td>
</tr>
</tbody>
</table>
</div>
<p>Prophet is hard-coded to use specific column names; <code class="language-plaintext highlighter-rouge">ds</code> for dates and <code class="language-plaintext highlighter-rouge">y</code> for the target variable we want to predict.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Prophet requires column names to be 'ds' and 'y'
</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'ds'</span><span class="p">,</span> <span class="s">'y'</span><span class="p">]</span>
<span class="c1"># 'ds' needs to be datetime object
</span><span class="n">df</span><span class="p">[</span><span class="s">'ds'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'ds'</span><span class="p">])</span>
</code></pre></div></div>
<p>When plotting the original data, we can see a <strong>big, growing trend</strong> in the traffic volume, although there are some stagnant or even decreasing stretches (<strong>changes of rate</strong>) around 1980, 2008, and, most strikingly, 2020. Checking how Prophet handles these changes will be interesting. There is also a <strong>seasonal, periodic trend</strong> that repeats each year: volume rises until the middle of the year and falls again. Will Prophet capture this as well?</p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_7_0.png" />
</div>
<p><br /></p>
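<p>The plot above can be reproduced with a couple of lines (a minimal sketch; the original plotting code is not shown in this post):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ax = df.plot(x='ds', y='y', figsize=(12, 5), legend=False)
ax.set_ylabel('Traffic volume (millions of miles)')
plt.show()
</code></pre></div></div>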
<p>For the train/test split, remember that we cannot do a random split with time-series data. Given a cut-off point, we use ONLY the earlier part of the data for training and the later part for testing. Here, we use 2019-01-01 as our cut-off point.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># split data
</span><span class="n">train</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'ds'</span><span class="p">]</span> <span class="o"><</span> <span class="n">pd</span><span class="p">.</span><span class="n">Timestamp</span><span class="p">(</span><span class="s">'2019-01-01'</span><span class="p">)]</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'ds'</span><span class="p">]</span> <span class="o">>=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Timestamp</span><span class="p">(</span><span class="s">'2019-01-01'</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of months in train data: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">train</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of months in test data: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">test</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
Number of months in train data: 588
Number of months in test data: 21
</pre>
<p><a id="train"></a></p>
<h1 id="2-train-and-predict">2. Train And Predict</h1>
<p>Let’s train a Prophet model. You just initialize an object and <code class="language-plaintext highlighter-rouge">fit</code>! That’s all.</p>
<p>Prophet warns that it disabled weekly and daily seasonality. That’s fine because our data set is monthly so there is no weekly or daily seasonality.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet</span> <span class="kn">import</span> <span class="n">Prophet</span>
<span class="c1"># fit model - ignore train/test split for now
</span><span class="n">m</span> <span class="o">=</span> <span class="n">Prophet</span><span class="p">()</span>
<span class="n">m</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
<fbprophet.forecaster.Prophet at 0x121b8dc88>
</pre>
<p>When making predictions with Prophet, we need to prepare a special object called the future dataframe. It is a Pandas DataFrame with a single column <code class="language-plaintext highlighter-rouge">ds</code> that includes all datetimes within the training data plus the additional periods specified by the user.</p>
<p>The parameter <code class="language-plaintext highlighter-rouge">periods</code> is the number of points (rows) to predict after the end of the training data. The interval (parameter <code class="language-plaintext highlighter-rouge">freq</code>) defaults to ‘D’ (day), so we need to change it to ‘MS’ (month start) as our data is monthly. I set <code class="language-plaintext highlighter-rouge">periods=21</code>, the number of points in the test data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># future dataframe - placeholder object
</span><span class="n">future</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">make_future_dataframe</span><span class="p">(</span><span class="n">periods</span><span class="o">=</span><span class="mi">21</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'MS'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># start of the future df is same as the original data
</span><span class="n">future</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ds</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1970-01-01</td>
</tr>
<tr>
<th>1</th>
<td>1970-02-01</td>
</tr>
<tr>
<th>2</th>
<td>1970-03-01</td>
</tr>
<tr>
<th>3</th>
<td>1970-04-01</td>
</tr>
<tr>
<th>4</th>
<td>1970-05-01</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># end of the future df is original + 21 periods (21 months)
</span><span class="n">future</span><span class="p">.</span><span class="n">tail</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ds</th>
</tr>
</thead>
<tbody>
<tr>
<th>604</th>
<td>2020-05-01</td>
</tr>
<tr>
<th>605</th>
<td>2020-06-01</td>
</tr>
<tr>
<th>606</th>
<td>2020-07-01</td>
</tr>
<tr>
<th>607</th>
<td>2020-08-01</td>
</tr>
<tr>
<th>608</th>
<td>2020-09-01</td>
</tr>
</tbody>
</table>
</div>
<p>It’s time to make actual predictions. It’s simple - just <code class="language-plaintext highlighter-rouge">predict</code> with the placeholder DataFrame <code class="language-plaintext highlighter-rouge">future</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># predict the future
</span><span class="n">forecast</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">future</span><span class="p">)</span>
</code></pre></div></div>
<p>Prophet has a nice built-in plotting function to visualize the forecast. The black dots are the actual data and the blue line is the prediction. You can also use matplotlib functions to adjust the figure, such as adding a legend or setting xlim or ylim.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Prophet's own plotting tool to see
</span><span class="n">fig</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">forecast</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'Actual'</span><span class="p">,</span> <span class="s">'Prediction'</span><span class="p">,</span> <span class="s">'Uncertainty interval'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_20_0.png" />
</div>
<p><br /></p>
<p><a id="components"></a></p>
<h1 id="3-check-components">3. Check Components</h1>
<p>So, what is in the forecast DataFrame? Let’s take a look.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">forecast</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ds</th>
<th>trend</th>
<th>yhat_lower</th>
<th>yhat_upper</th>
<th>trend_lower</th>
<th>trend_upper</th>
<th>additive_terms</th>
<th>additive_terms_lower</th>
<th>additive_terms_upper</th>
<th>yearly</th>
<th>yearly_lower</th>
<th>yearly_upper</th>
<th>multiplicative_terms</th>
<th>multiplicative_terms_lower</th>
<th>multiplicative_terms_upper</th>
<th>yhat</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1970-01-01</td>
<td>94281.848744</td>
<td>69838.269924</td>
<td>81366.107613</td>
<td>94281.848744</td>
<td>94281.848744</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>75581.334434</td>
</tr>
<tr>
<th>1</th>
<td>1970-02-01</td>
<td>94590.609819</td>
<td>61661.016554</td>
<td>73066.758942</td>
<td>94590.609819</td>
<td>94590.609819</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>67208.302517</td>
</tr>
<tr>
<th>2</th>
<td>1970-03-01</td>
<td>94869.490789</td>
<td>89121.298723</td>
<td>99797.427717</td>
<td>94869.490789</td>
<td>94869.490789</td>
<td>37.306077</td>
<td>37.306077</td>
<td>37.306077</td>
<td>37.306077</td>
<td>37.306077</td>
<td>37.306077</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>94906.796867</td>
</tr>
<tr>
<th>3</th>
<td>1970-04-01</td>
<td>95178.251864</td>
<td>89987.904019</td>
<td>101154.016322</td>
<td>95178.251864</td>
<td>95178.251864</td>
<td>166.278079</td>
<td>166.278079</td>
<td>166.278079</td>
<td>166.278079</td>
<td>166.278079</td>
<td>166.278079</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>95344.529943</td>
</tr>
<tr>
<th>4</th>
<td>1970-05-01</td>
<td>95477.052904</td>
<td>99601.487207</td>
<td>110506.849617</td>
<td>95477.052904</td>
<td>95477.052904</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>105149.671948</td>
</tr>
</tbody>
</table>
</div>
<p>There are many columns in it, but the main one you care about is <code class="language-plaintext highlighter-rouge">yhat</code>, which holds the final predictions. The <code class="language-plaintext highlighter-rouge">_lower</code> and <code class="language-plaintext highlighter-rouge">_upper</code> suffixes mark the bounds of the uncertainty intervals.</p>
<ul>
<li>Final predictions: <code class="language-plaintext highlighter-rouge">yhat</code>, <code class="language-plaintext highlighter-rouge">yhat_lower</code>, and <code class="language-plaintext highlighter-rouge">yhat_upper</code></li>
</ul>
<p>Other columns are components that comprise the final prediction as we discussed in the introduction. Let’s compare Prophet’s components and what we see in our forecast DataFrame.</p>
\[y(t) = g(t) + s(t) + h(t) + e(t)\]
<ul>
<li>Growth ($g(t)$): <code class="language-plaintext highlighter-rouge">trend</code>, <code class="language-plaintext highlighter-rouge">trend_lower</code>, and <code class="language-plaintext highlighter-rouge">trend_upper</code></li>
<li>Seasonality ($s(t)$): <code class="language-plaintext highlighter-rouge">additive_terms</code>, <code class="language-plaintext highlighter-rouge">additive_terms_lower</code>, and <code class="language-plaintext highlighter-rouge">additive_terms_upper</code>
<ul>
<li>Yearly seasonality: <code class="language-plaintext highlighter-rouge">yearly</code>, <code class="language-plaintext highlighter-rouge">yearly_lower</code>, and<code class="language-plaintext highlighter-rouge">yearly_upper</code></li>
</ul>
</li>
</ul>
<p>The <code class="language-plaintext highlighter-rouge">additive_terms</code> represent the total seasonality effect, which here equals the yearly seasonality because we disabled weekly and daily seasonalities. All <code class="language-plaintext highlighter-rouge">multiplicative_terms</code> are zero because we used the default additive seasonality mode rather than the multiplicative mode, which I will explain later.</p>
<p>The holiday effect ($h(t)$) is not present here because we did not specify any holidays for the model, and our monthly data is too coarse to capture day-level holiday effects anyway.</p>
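<p>For instance, to pull out just the final predictions with their uncertainty bounds (a quick sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># final predictions and their uncertainty interval only
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
</code></pre></div></div>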
<p>Prophet also has a nice built-in function for plotting each component. When we plot our forecast data, we see two components: the general growth trend and the yearly seasonality that repeats throughout the years. If we had more components, such as weekly or daily seasonality, they would be presented here as well.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># plot components
</span><span class="n">fig</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">plot_components</span><span class="p">(</span><span class="n">forecast</span><span class="p">)</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_25_0.png" />
</div>
<p><br /></p>
<p><a id="eval"></a></p>
<h1 id="4-evaluate">4. Evaluate</h1>
<h2 id="41-evaluate-the-model-on-one-test-set">4.1. Evaluate the model on one test set</h2>
<p>How good is our model? One way to understand the model performance in this case is to simply calculate the root mean squared error (RMSE) between the actual and predicted values over the test period above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">statsmodels.tools.eval_measures</span> <span class="kn">import</span> <span class="n">rmse</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">forecast</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="o">-</span><span class="nb">len</span><span class="p">(</span><span class="n">test</span><span class="p">):][</span><span class="s">'yhat'</span><span class="p">]</span>
<span class="n">actuals</span> <span class="o">=</span> <span class="n">test</span><span class="p">[</span><span class="s">'y'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"RMSE: </span><span class="si">{</span><span class="nb">round</span><span class="p">(</span><span class="n">rmse</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="n">actuals</span><span class="p">))</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
RMSE: 32969.0
</pre>
<p>However, this probably under-represents the general model performance, because our data has a drastic change in the middle of the test period, a pattern the model has never seen before. If our data had only gone up to 2019, the error would have been much lower.</p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_31_0.png" />
</div>
<p><br /></p>
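<p>To see how much that unseen pattern inflates the error, we can recompute the RMSE on only the part of the test period before the change. This is just a sketch: the 2020 cutoff is a hypothetical choice based on the plot above, and it assumes the <code class="language-plaintext highlighter-rouge">ds</code> column holds datetimes.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># recompute RMSE excluding the drastic change (hypothetical 2020 cutoff)
mask = test['ds'].lt('2020-01-01').to_numpy()
print(f"RMSE (pre-2020 only): {round(rmse(predictions.to_numpy()[mask], actuals.to_numpy()[mask]))}")
</code></pre></div></div>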
<h2 id="42-cross-validation">4.2. Cross validation</h2>
<p>Alternatively, we can perform cross-validation. As previously discussed, time-series analysis strictly uses train data whose time range is earlier than that of the test data. Below is an example where we use 5 years of train data to predict 1 year of test data. The cut-off points are equally spaced with a 1-year gap.</p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/prophet_cv.png" alt="cv" />
<figcaption>Time-series cross validation
</figcaption>
</div>
<p><br /></p>
<p>Prophet also provides built-in model diagnostics tools to make it easy to perform this cross-validation. You just
need to define three parameters: horizon, initial, and period. The latter two are optional.</p>
<ul>
<li>horizon: test period of each fold</li>
<li>initial: minimum training period to start with</li>
<li>period: time gap between cut-off dates</li>
</ul>
<p>Make sure to define these parameters as strings in the format ‘X unit’, where X is a number and unit is ‘days’, ‘hours’, or any other unit compatible with <code class="language-plaintext highlighter-rouge">pd.Timedelta</code>. For example, <code class="language-plaintext highlighter-rouge">10 days</code>.</p>
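<p>If you are unsure whether a string is valid, you can test it directly with <code class="language-plaintext highlighter-rouge">pd.Timedelta</code> (a quick sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># any string pd.Timedelta accepts works as a horizon/initial/period value
import pandas as pd
pd.Timedelta('365 days')
</code></pre></div></div>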
<p>You can also define <code class="language-plaintext highlighter-rouge">parallel</code> to make the cross validation faster.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet.diagnostics</span> <span class="kn">import</span> <span class="n">cross_validation</span>
<span class="c1"># test period
</span><span class="n">horizon</span> <span class="o">=</span> <span class="s">'365 days'</span>
<span class="c1"># training period (optional; default is 3x the horizon)
</span><span class="n">initial</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="mi">365</span> <span class="o">*</span> <span class="mi">5</span><span class="p">)</span> <span class="o">+</span> <span class="s">' days'</span>
<span class="c1"># spacing between cutoff dates (optional. default is 0.5x of horizon)
</span><span class="n">period</span> <span class="o">=</span> <span class="s">'365 days'</span>
<span class="n">df_cv</span> <span class="o">=</span> <span class="n">cross_validation</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">initial</span><span class="o">=</span><span class="n">initial</span><span class="p">,</span> <span class="n">period</span><span class="o">=</span><span class="n">period</span><span class="p">,</span> <span class="n">horizon</span><span class="o">=</span><span class="n">horizon</span><span class="p">,</span> <span class="n">parallel</span><span class="o">=</span><span class="s">'processes'</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
INFO:fbprophet:Making 43 forecasts with cutoffs between 1975-12-12 00:00:00 and 2017-12-01 00:00:00
INFO:fbprophet:Applying in parallel with <concurrent.futures.process.ProcessPoolExecutor object at 0x12fb4d3c8>
</pre>
<p>This is the predicted output using cross-validation. There can be many predictions for the same timestamp if <code class="language-plaintext highlighter-rouge">period</code>
is smaller than <code class="language-plaintext highlighter-rouge">horizon</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># predicted output using cross validation
</span><span class="n">df_cv</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ds</th>
<th>yhat</th>
<th>yhat_lower</th>
<th>yhat_upper</th>
<th>y</th>
<th>cutoff</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1976-01-01</td>
<td>102282.737592</td>
<td>100862.769604</td>
<td>103589.684840</td>
<td>102460.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>1</th>
<td>1976-02-01</td>
<td>96811.141761</td>
<td>95360.095284</td>
<td>98247.364027</td>
<td>98528.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>2</th>
<td>1976-03-01</td>
<td>112360.483572</td>
<td>110908.136982</td>
<td>113775.264669</td>
<td>114284.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>3</th>
<td>1976-04-01</td>
<td>112029.016859</td>
<td>110622.916037</td>
<td>113458.999123</td>
<td>117014.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>4</th>
<td>1976-05-01</td>
<td>119161.998160</td>
<td>117645.653475</td>
<td>120579.267732</td>
<td>123278.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>511</th>
<td>2018-08-01</td>
<td>279835.003826</td>
<td>274439.830747</td>
<td>285259.974314</td>
<td>284989.0</td>
<td>2017-12-01</td>
</tr>
<tr>
<th>512</th>
<td>2018-09-01</td>
<td>261911.246557</td>
<td>256328.677902</td>
<td>267687.122886</td>
<td>267434.0</td>
<td>2017-12-01</td>
</tr>
<tr>
<th>513</th>
<td>2018-10-01</td>
<td>268979.448383</td>
<td>263001.411543</td>
<td>274742.978202</td>
<td>281382.0</td>
<td>2017-12-01</td>
</tr>
<tr>
<th>514</th>
<td>2018-11-01</td>
<td>255612.520483</td>
<td>249813.339845</td>
<td>261179.979649</td>
<td>260473.0</td>
<td>2017-12-01</td>
</tr>
<tr>
<th>515</th>
<td>2018-12-01</td>
<td>257049.510224</td>
<td>251164.508448</td>
<td>263062.671327</td>
<td>270370.0</td>
<td>2017-12-01</td>
</tr>
</tbody>
</table>
<p>516 rows × 6 columns</p>
</div>
<p>Below are different performance metrics for different rolling windows. As we did not define a rolling window, Prophet calculated the metrics over many different window lengths and stacked them up in rows (e.g. 53 days, …, 365 days). Each metric is first calculated within each rolling window and then averaged across the available windows.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet.diagnostics</span> <span class="kn">import</span> <span class="n">cross_validation</span><span class="p">,</span> <span class="n">performance_metrics</span>
<span class="c1"># performance metrics
</span><span class="n">df_metrics</span> <span class="o">=</span> <span class="n">performance_metrics</span><span class="p">(</span><span class="n">df_cv</span><span class="p">)</span> <span class="c1"># can define window size, e.g. rolling_window=365
</span><span class="n">df_metrics</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>horizon</th>
<th>mse</th>
<th>rmse</th>
<th>mae</th>
<th>mape</th>
<th>mdape</th>
<th>coverage</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>53 days</td>
<td>3.886562e+07</td>
<td>6234.229883</td>
<td>5143.348348</td>
<td>0.030813</td>
<td>0.027799</td>
<td>0.352941</td>
</tr>
<tr>
<th>1</th>
<td>54 days</td>
<td>3.983610e+07</td>
<td>6311.584390</td>
<td>5172.484468</td>
<td>0.030702</td>
<td>0.027799</td>
<td>0.372549</td>
</tr>
<tr>
<th>2</th>
<td>55 days</td>
<td>4.272605e+07</td>
<td>6536.516453</td>
<td>5413.997433</td>
<td>0.031607</td>
<td>0.030305</td>
<td>0.352941</td>
</tr>
<tr>
<th>3</th>
<td>56 days</td>
<td>4.459609e+07</td>
<td>6678.030078</td>
<td>5662.344846</td>
<td>0.032630</td>
<td>0.031911</td>
<td>0.313725</td>
</tr>
<tr>
<th>4</th>
<td>57 days</td>
<td>4.341828e+07</td>
<td>6589.254589</td>
<td>5650.202377</td>
<td>0.032133</td>
<td>0.031481</td>
<td>0.313725</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>115</th>
<td>361 days</td>
<td>2.880647e+07</td>
<td>5367.165528</td>
<td>3960.025025</td>
<td>0.020118</td>
<td>0.015177</td>
<td>0.607843</td>
</tr>
<tr>
<th>116</th>
<td>362 days</td>
<td>3.158472e+07</td>
<td>5620.028791</td>
<td>4158.035261</td>
<td>0.020836</td>
<td>0.015177</td>
<td>0.588235</td>
</tr>
<tr>
<th>117</th>
<td>363 days</td>
<td>3.798731e+07</td>
<td>6163.384773</td>
<td>4603.360382</td>
<td>0.022653</td>
<td>0.017921</td>
<td>0.549020</td>
</tr>
<tr>
<th>118</th>
<td>364 days</td>
<td>4.615621e+07</td>
<td>6793.836092</td>
<td>4952.443173</td>
<td>0.023973</td>
<td>0.018660</td>
<td>0.529412</td>
</tr>
<tr>
<th>119</th>
<td>365 days</td>
<td>5.428934e+07</td>
<td>7368.129817</td>
<td>5262.131511</td>
<td>0.024816</td>
<td>0.018660</td>
<td>0.529412</td>
</tr>
</tbody>
</table>
<p>120 rows × 7 columns</p>
</div>
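<p>If you would rather have a single summary row instead of one row per horizon length, you can aggregate over the whole horizon by setting <code class="language-plaintext highlighter-rouge">rolling_window=1</code> (a sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># one aggregate row of metrics over the entire horizon
df_agg = performance_metrics(df_cv, rolling_window=1)
df_agg
</code></pre></div></div>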
<p><a id="trend"></a></p>
<h1 id="5-trend-change-points">5. Trend Change Points</h1>
<p>Another interesting functionality of <code class="language-plaintext highlighter-rouge">Prophet</code> is <code class="language-plaintext highlighter-rouge">add_changepoints_to_plot</code>. As we discussed in the earlier sections, there are a couple of points where the growth rate changes. Prophet can find those points automatically and plot them!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet.plot</span> <span class="kn">import</span> <span class="n">add_changepoints_to_plot</span>
<span class="c1"># plot change points
</span><span class="n">fig</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">forecast</span><span class="p">)</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">add_changepoints_to_plot</span><span class="p">(</span><span class="n">fig</span><span class="p">.</span><span class="n">gca</span><span class="p">(),</span> <span class="n">m</span><span class="p">,</span> <span class="n">forecast</span><span class="p">)</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_42_0.png" />
</div>
<p><br /></p>
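<p>If the detected trend looks too rigid or too flexible, Prophet exposes a knob for it. Below is a sketch; <code class="language-plaintext highlighter-rouge">changepoint_prior_scale</code> defaults to 0.05, and <code class="language-plaintext highlighter-rouge">m_flex</code> is just an illustrative name.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># a larger changepoint_prior_scale allows a more flexible trend
m_flex = Prophet(changepoint_prior_scale=0.5)
# after fitting, the detected changepoint dates are available as m_flex.changepoints
</code></pre></div></div>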
<p><a id="season"></a></p>
<h1 id="6-seasonality-mode">6. Seasonality Mode</h1>
<p>Seasonality can be additive (its amplitude stays constant over time) or multiplicative (its amplitude changes with the trend). When you look at the original data, the amplitude of the seasonality changes: smaller in the early years and bigger in the later years. So this would be a <code class="language-plaintext highlighter-rouge">multiplicative</code> seasonality case rather than an <code class="language-plaintext highlighter-rouge">additive</code> one. We can adjust the <code class="language-plaintext highlighter-rouge">seasonality_mode</code> parameter to take this effect into account.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># additive mode
</span><span class="n">m</span> <span class="o">=</span> <span class="n">Prophet</span><span class="p">(</span><span class="n">seasonality_mode</span><span class="o">=</span><span class="s">'additive'</span><span class="p">)</span>
<span class="c1"># multiplicative mode
</span><span class="n">m</span> <span class="o">=</span> <span class="n">Prophet</span><span class="p">(</span><span class="n">seasonality_mode</span><span class="o">=</span><span class="s">'multiplicative'</span><span class="p">)</span>
</code></pre></div></div>
<p>You can see that the blue lines (predictions) are more in line with the black dots (actuals) when in multiplicative
seasonality mode.</p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_44_1.png" />
</div>
<p><br /></p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_45_1.png" />
</div>
<p><br /></p>
<p><a id="save"></a></p>
<h1 id="7-saving-model">7. Saving Model</h1>
<p>We can also easily export and load the trained model as JSON.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">fbprophet.serialize</span> <span class="kn">import</span> <span class="n">model_to_json</span><span class="p">,</span> <span class="n">model_from_json</span>
<span class="c1"># Save model
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'serialized_model.json'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
<span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">model_to_json</span><span class="p">(</span><span class="n">m</span><span class="p">),</span> <span class="n">fout</span><span class="p">)</span>
<span class="c1"># Load model
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'serialized_model.json'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">:</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">model_from_json</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">fin</span><span class="p">))</span>
</code></pre></div></div>
<p><a id="ref"></a></p>
<h1 id="8-references">8. References</h1>
<ul>
<li><a href="https://facebook.github.io/prophet/docs/quick_start.html#python-api">Prophet documentation</a></li>
<li><a href="https://github.com/facebook/prophet">Prophet GitHub repository</a></li>
<li><a href="https://peerj.com/preprints/3190/">Prophet paper: Forecasting at scale</a></li>
<li><a href="https://cran.r-project.org/web/packages/prophet/prophet.pdf">Prophet in R</a></li>
<li><a href="https://fred.stlouisfed.org/series/TRFVOLUSM227NFWA">U.S. traffic volume data</a></li>
<li><a href="https://www.udemy.com/course/python-for-time-series-data-analysis/">Python for Time Series Data Analysis</a></li>
</ul>Getting “Failed to build gem native extension” Error After Upgrading to Mac OS Big Sur2020-12-14T05:00:00+00:002020-12-14T05:00:00+00:00/2020/12/14/ruby-big-sur<h2 id="problem">Problem</h2>
<p>After upgrading to Mac OS Big Sur, I saw an error I hadn’t seen before:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>➜ bundle <span class="nb">exec </span>jekyll serve
Could not find commonmarker-0.17.13 <span class="k">in </span>any of the sources
Run <span class="sb">`</span>bundle <span class="nb">install</span><span class="sb">`</span> to <span class="nb">install </span>missing gems.
</code></pre></div></div>
<p>So I ran <code class="language-plaintext highlighter-rouge">bundle install</code> but it didn’t seem to work either. An excerpt from the terminal output:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>➜ bundle <span class="nb">install
</span>Gem::Ext::BuildError: ERROR: Failed to build gem native extension.
An error occurred <span class="k">while </span>installing commonmarker <span class="o">(</span>0.17.13<span class="o">)</span>, and Bundler cannot <span class="k">continue</span><span class="nb">.</span>
Make sure that <span class="sb">`</span>gem <span class="nb">install </span>commonmarker <span class="nt">-v</span> <span class="s1">'0.17.13'</span> <span class="nt">--source</span> <span class="s1">'https://rubygems.org/'</span><span class="sb">`</span> succeeds before bundling.
An error occurred <span class="k">while </span>installing unf_ext <span class="o">(</span>0.0.7.7<span class="o">)</span>, and Bundler cannot <span class="k">continue</span><span class="nb">.</span>
Make sure that <span class="sb">`</span>gem <span class="nb">install </span>unf_ext <span class="nt">-v</span> <span class="s1">'0.0.7.7'</span> <span class="nt">--source</span> <span class="s1">'https://rubygems.org/'</span><span class="sb">`</span> succeeds before bundling.
An error occurred <span class="k">while </span>installing rdiscount <span class="o">(</span>2.2.0.2<span class="o">)</span>, and Bundler cannot <span class="k">continue</span><span class="nb">.</span>
Make sure that <span class="sb">`</span>gem <span class="nb">install </span>rdiscount <span class="nt">-v</span> <span class="s1">'2.2.0.2'</span> <span class="nt">--source</span> <span class="s1">'https://rubygems.org/'</span><span class="sb">`</span> succeeds before bundling.
</code></pre></div></div>
<h2 id="solution">Solution</h2>
<p>After some research and trial and error, I found <a href="https://stackoverflow.com/a/65017115/9449085">this precious answer</a> on Stack Overflow explaining that the error is due to a Ruby version that is not compatible with Big Sur; it should be at least 2.7. So I checked the <a href="https://www.ruby-lang.org/en/downloads/releases/">Ruby releases</a> and decided to go with one of the most recent releases: 2.7.2.</p>
<p>Anyway, here are the steps that I believe worked:</p>
<ol>
<li>Check Ruby version
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ ruby <span class="nt">-v</span>
ruby 2.6.3p62 <span class="o">(</span>2019-04-16 revision 67580<span class="o">)</span> <span class="o">[</span>universal.x86_64-darwin20]
</code></pre></div> </div>
</li>
<li>Install Ruby Version Manager (rvm)
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/rvm/rvm/master/binscripts/rvm-installer | bash <span class="nt">-s</span> stable
</code></pre></div> </div>
</li>
<li>Install 2.7.2 version using rvm
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ rvm <span class="nb">install</span> <span class="s2">"ruby-2.7.2"</span>
</code></pre></div> </div>
</li>
<li>Check Ruby version again
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ ruby <span class="nt">-v</span>
ruby 2.7.2p137 <span class="o">(</span>2020-10-01 revision 5445e04352<span class="o">)</span> <span class="o">[</span>x86_64-darwin20]
</code></pre></div> </div>
</li>
<li>Bundle install
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ bundle <span class="nb">install</span>
</code></pre></div> </div>
</li>
<li>Run bundle
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ bundle <span class="nb">exec </span>jekyll serve
</code></pre></div> </div>
</li>
</ol>
<p>And it worked!</p>
<h2 id="some-other-things-i-tried">Some other things I tried</h2>
<ol>
<li>
<p>Installing Ruby through Homebrew: this didn’t solve the issue on its own, though I don’t know whether it eventually helped.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ brew <span class="nb">install </span>ruby
</code></pre></div> </div>
</li>
<li>
<p>Installing the latest version of Ruby <a href="https://stackoverflow.com/a/38194139/9449085">(ref)</a> using rvm: didn’t update the version for some reason.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ rvm <span class="nb">install </span>ruby@latest
</code></pre></div> </div>
</li>
</ol>Finding The Best Feature Engineering Strategy Using sklearn GridSearchCV2020-12-06T03:49:00+00:002020-12-06T03:49:00+00:00/2020/12/06/missing-imputation-with-gridsearchcv<p><br /></p>
<div style="text-align:center">
<img src="/images/search.jpg" alt="search" />
<figcaption>Photo by <a href="https://unsplash.com/@laughayette?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marten Newhall</a> on <a href="https://unsplash.com/s/photos/search?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></figcaption>
</div>
<p><br /></p>
<p>We previously reviewed a few missing data imputation strategies using sklearn in <a href="/python/2020/11/21/Missing-data-imputation-using-sklearn.html">this post</a>, but which one should we use? How do we know which one works best for our data? Should we manually write a script to fit a model for each strategy and track the model performance? We could, but it would be a headache to track many different models, especially if we use cross validation to get more reliable experiment results.</p>
<p>Fortunately, sklearn offers great tools to streamline and optimize the process, which are <code class="language-plaintext highlighter-rouge">GridSearchCV</code> and <code class="language-plaintext highlighter-rouge">Pipeline</code>! You might be already familiar with using <code class="language-plaintext highlighter-rouge">GridSearchCV</code> for finding optimal hyperparameters of a model, but you might not be familiar with using it for finding optimal feature engineering strategies.</p>
<p>In this post, I would like to walk you through how <code class="language-plaintext highlighter-rouge">GridSearchCV</code> and <code class="language-plaintext highlighter-rouge">Pipeline</code> can be used to find the best feature engineering strategies for the given data. We will focus on missing data imputation strategies here but it can be used for any other feature engineering steps or combinations.</p>
<h1 id="table-of-conents">Table of Contents</h1>
<ol>
<li><a href="#prep">Prepare Data</a></li>
<li><a href="#base">Setup a Base Pipeline</a></li>
<li><a href="#find_best">Finding The Best Imputation Technique Using GridSearchCV</a></li>
<li><a href="#ref">References</a></li>
</ol>
<p><a id="prep"></a></p>
<h1 id="1-prepare-data">1. Prepare Data</h1>
<p>First, import necessary libraries and prepare data. We will use the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">house price data from Kaggle</a> in this post.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="c1"># preparing data
</span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># feature scaling, encoding
</span><span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">SimpleImputer</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span><span class="p">,</span> <span class="n">OneHotEncoder</span>
<span class="c1"># putting together in pipeline
</span><span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.compose</span> <span class="kn">import</span> <span class="n">ColumnTransformer</span>
<span class="c1"># model selection
</span><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Lasso</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># import house price data
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'../data/house_price/train.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'Id'</span><span class="p">)</span>
<span class="c1"># find numerical columns vs. categorical columns, except for the target ('SalePrice')
</span><span class="n">num_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'number'</span><span class="p">).</span><span class="n">columns</span>
<span class="n">cat_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'object'</span><span class="p">).</span><span class="n">columns</span>
<span class="c1"># define X and y for GridSearchCV
</span><span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'SalePrice'</span><span class="p">]</span>
<span class="c1"># split train and test dataset
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">df</span><span class="p">[</span><span class="s">'SalePrice'</span><span class="p">],</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p><a id="base"></a></p>
<h1 id="2-setup-a-base-pipeline">2. Setup a Base Pipeline</h1>
<h2 id="21-define-pipelines">2.1. Define Pipelines</h2>
<p>The next step is defining a base <code class="language-plaintext highlighter-rouge">Pipeline</code> for our model as below.</p>
<ol>
<li>
<p>Define two feature preprocessing pipelines; one for numerical variables (<code class="language-plaintext highlighter-rouge">num_pipe</code>) and the other for categorical variables (<code class="language-plaintext highlighter-rouge">cat_pipe</code>). <code class="language-plaintext highlighter-rouge">num_pipe</code> has <code class="language-plaintext highlighter-rouge">SimpleImputer</code> for missing data imputation and <code class="language-plaintext highlighter-rouge">StandardScaler</code> for scaling data. <code class="language-plaintext highlighter-rouge">cat_pipe</code> has <code class="language-plaintext highlighter-rouge">SimpleImputer</code> for missing data imputation and <code class="language-plaintext highlighter-rouge">OneHotEncoder</code> for encoding categorical data as numerical data.</p>
</li>
<li>
<p>Combine those two pipelines together using <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> to apply them to a different set of columns.</p>
</li>
<li>
<p>Define the final pipeline called <code class="language-plaintext highlighter-rouge">pipe</code> by putting the <code class="language-plaintext highlighter-rouge">preprocess</code> pipeline together with an estimator, which is <code class="language-plaintext highlighter-rouge">Lasso</code> regression in this example.</p>
</li>
</ol>
<p>For details of this pipeline, please check out the previous post <a href="/python/2020/11/28/pipeline_columntransformer.html">Combining Feature Engineering and Model Fitting
(Pipeline vs. ColumnTransformer)</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># feature engineering pipeline for numerical variables
</span><span class="n">num_pipe</span><span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">,</span> <span class="n">add_indicator</span><span class="o">=</span><span class="bp">False</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">())])</span>
<span class="c1"># feature engineering pipeline for categorical variables
# Note: fill_value='Missing' is not used for strategy='most_frequent' but defined here for later use
</span><span class="n">cat_pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">'Missing'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'encoder'</span><span class="p">,</span> <span class="n">OneHotEncoder</span><span class="p">(</span><span class="n">handle_unknown</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">))])</span>
<span class="c1"># put numerical and categorical feature engineering pipelines together
</span><span class="n">preprocess</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">([(</span><span class="s">"num_pipe"</span><span class="p">,</span> <span class="n">num_pipe</span><span class="p">,</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">"cat_pipe"</span><span class="p">,</span> <span class="n">cat_pipe</span><span class="p">,</span> <span class="n">cat_cols</span><span class="p">)])</span>
<span class="c1"># put transformers and an estimator together
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'preprocess'</span><span class="p">,</span> <span class="n">preprocess</span><span class="p">),</span>
<span class="p">(</span><span class="s">'lasso'</span><span class="p">,</span> <span class="n">Lasso</span><span class="p">(</span><span class="n">max_iter</span><span class="o">=</span><span class="mi">10000</span><span class="p">))])</span>
</code></pre></div></div>
<h2 id="22-fit-pipeline">2.2. Fit Pipeline</h2>
<p>Okay, so let’s fit the model with our train data and test with the test data. Here, we get 0.63 for the score, which is $R^2$ of the prediction in this case (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso.score">sklearn Lasso</a>).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fit model
</span><span class="n">pipe</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
0.6308258188969262
</pre>
<p>We could also cross validate the model using <code class="language-plaintext highlighter-rouge">cross_val_score</code>. It splits the whole data into 5 sets and calculates the score 5 times by fitting and testing with different sets each time.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cross validate
</span><span class="n">cross_val_score</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
array([0.85570392, 0.8228412 , 0.80381056, 0.88846653, 0.63236809])
</pre>
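<p>To reduce those five scores to a single number, we can take their mean and standard deviation (a quick sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># summarize the cross-validation scores
scores = cross_val_score(pipe, X, y, cv=5)
print(f"R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
</code></pre></div></div>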
<h2 id="23-diffferent-parameters-to-test">2.3. Different Parameters To Test</h2>
<p>Let’s say we want to try different combinations of missing data imputation strategies for <code class="language-plaintext highlighter-rouge">SimpleImputer</code>, such as both <code class="language-plaintext highlighter-rouge">'median'</code> and <code class="language-plaintext highlighter-rouge">'mean'</code> for <code class="language-plaintext highlighter-rouge">strategy</code> and both <code class="language-plaintext highlighter-rouge">True</code> and <code class="language-plaintext highlighter-rouge">False</code> for <code class="language-plaintext highlighter-rouge">add_indicator</code>. To compare all of the cases, we need to test 4 different models with the following numerical variable imputation methods:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SimpleImputer(strategy='mean', add_indicator=False)
SimpleImputer(strategy='median', add_indicator=False)
SimpleImputer(strategy='mean', add_indicator=True)
SimpleImputer(strategy='median', add_indicator=True)
</code></pre></div></div>
<p>We could copy and paste the script we wrote above, replace the corresponding step, and compare the performance of each case. It would not be too bad for the 4 combinations. But what if we want to test more combinations such as <code class="language-plaintext highlighter-rouge">strategy='constant'</code> and <code class="language-plaintext highlighter-rouge">strategy='most_frequent'</code> for categorical variables? Now it becomes 8 combinations ($2 \times 2 \times 2 = 8$).</p>
<p>The more parameters we add, the more cases we have to test and track (exponentially growing cases!). But don’t worry! We have <code class="language-plaintext highlighter-rouge">GridSearchCV</code>.</p>
<p><a id="find_best"></a></p>
<h1 id="3-finding-the-best-imputation-technique-using-gridsearchcv">3. Finding The Best Imputation Technique Using GridSearchCV</h1>
<h2 id="31-what-is-gridsearchcv">3.1. What Is GridSearchCV?</h2>
<p><code class="language-plaintext highlighter-rouge">GridSearchCV</code> is a sklearn class used to find the parameters with the best cross-validation score within a given search space (the set of parameter combinations). It can be used not only for hyperparameter tuning of estimators (e.g. <code class="language-plaintext highlighter-rouge">alpha</code> for Lasso), but also for parameters in any preprocessing step. We just need to define the parameters we want to optimize and pass them to <code class="language-plaintext highlighter-rouge">GridSearchCV()</code> as a dictionary.</p>
<p>The rule for defining the grid search parameter key-value pair is the following:</p>
<ol>
<li>Key: a string that combines the name of the step with the name of the parameter, joined by two underscores</li>
<li>Value: a list of parameter values to test</li>
</ol>
<p>In short, it’s <code class="language-plaintext highlighter-rouge">{'step_name__parameter_name': a list of values}</code>. For example, if the step name is <code class="language-plaintext highlighter-rouge">lasso</code> and the parameter name is <code class="language-plaintext highlighter-rouge">alpha</code>, your grid search param becomes:
<code class="language-plaintext highlighter-rouge">{'lasso__alpha': [1, 5, 10]}</code></p>
<h2 id="32-defining-nested-parameters">3.2. Defining Nested Parameters</h2>
<p>What about nested parameters that we have in our case? For example, our missing data imputation strategy for numerical variables is a few steps away from the final pipeline such as <code class="language-plaintext highlighter-rouge">preprocess</code> –> <code class="language-plaintext highlighter-rouge">num_pipe</code> –> <code class="language-plaintext highlighter-rouge">imputer</code>.</p>
<p>Even for those cases, we can simply expand the key by chaining the step names together, each joined with two underscores:</p>
<p><code class="language-plaintext highlighter-rouge">{'preprocess__num_pipe__imputer__strategy': ['mean', 'median', 'most_frequent']}</code></p>
<h2 id="33-defining-and-fitting-gridsearchcv">3.3. Defining and Fitting GridSearchCV</h2>
<p>With the basics of <code class="language-plaintext highlighter-rouge">GridSearchCV</code>, let’s define <code class="language-plaintext highlighter-rouge">GridSearchCV</code> and its parameters for our problem.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define the GridSearchCV parameters
</span><span class="n">param</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">preprocess__num_pipe__imputer__strategy</span><span class="o">=</span><span class="p">[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'median'</span><span class="p">,</span> <span class="s">'most_frequent'</span><span class="p">],</span>
<span class="n">preprocess__num_pipe__imputer__add_indicator</span><span class="o">=</span><span class="p">[</span><span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">],</span>
<span class="n">preprocess__cat_pipe__imputer__strategy</span><span class="o">=</span><span class="p">[</span><span class="s">'most_frequent'</span><span class="p">,</span> <span class="s">'constant'</span><span class="p">])</span>
<span class="c1"># define GridSearchCV
</span><span class="n">grid_search</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">param</span><span class="p">)</span>
</code></pre></div></div>
<p>Now it’s time to find the best parameters by simply running <code class="language-plaintext highlighter-rouge">fit</code>!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># search the best parameters by fitting the GridSearchCV
</span><span class="n">grid_search</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
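<p>Note that <code class="language-plaintext highlighter-rouge">fit</code> here trains one model per parameter combination per cross-validation fold, so it can take a while. If it is slow, <code class="language-plaintext highlighter-rouge">n_jobs=-1</code> runs the work in parallel (a sketch; <code class="language-plaintext highlighter-rouge">refit=True</code> is the default, so the best pipeline is automatically retrained on the whole data at the end):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># run the grid search in parallel across all CPU cores
grid_search = GridSearchCV(pipe, param, cv=5, n_jobs=-1)
grid_search.fit(X, y)
</code></pre></div></div>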
<h2 id="34-checking-the-results">3.4. Checking the results</h2>
<p>To check the combinations of parameters we tested and their performance on each cross-validation set, in terms of both score and time, we can use the attribute <code class="language-plaintext highlighter-rouge">.cv_results_</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># check out the results
</span><span class="n">grid_search</span><span class="p">.</span><span class="n">cv_results_</span>
</code></pre></div></div>
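<p>The raw dictionary is a bit hard to read, so a common trick is to load it into a DataFrame and sort by rank (a sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cv_results_ as a sorted DataFrame
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score')
</code></pre></div></div>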
<p>So, which model did the <code class="language-plaintext highlighter-rouge">GridSearchCV</code> find to be most effective, and what’s its score? Let’s check out the <code class="language-plaintext highlighter-rouge">.best_params_</code> and <code class="language-plaintext highlighter-rouge">.best_score_</code> attributes for that.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># check out the best parameter combination found
</span><span class="n">grid_search</span><span class="p">.</span><span class="n">best_params_</span>
</code></pre></div></div>
<pre class="output">
{'preprocess__cat_pipe__imputer__strategy': 'constant',
'preprocess__num_pipe__imputer__add_indicator': False,
'preprocess__num_pipe__imputer__strategy': 'most_frequent'}
</pre>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># score
</span><span class="n">grid_search</span><span class="p">.</span><span class="n">best_score_</span>
</code></pre></div></div>
<pre class="output">
0.8058139542143075
</pre>
<p>Awesome! It seems that imputing categorical variables with a <code class="language-plaintext highlighter-rouge">constant</code> value and numerical variables with their <code class="language-plaintext highlighter-rouge">most_frequent</code> values, without a missing indicator, was most effective in this case. Again, the best missing data imputation strategy depends on the data and the model. Try it out with your data and see what works best for yours!</p>
<p><a id="ref"></a></p>
<h1 id="4-references">4. References</h1>
<ul>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">sklearn GridSearchCV</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">sklearn Pipeline</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html">sklearn ColumnTransformer</a></li>
<li><a href="https://www.udemy.com/course/feature-engineering-for-machine-learning/">Feature Engineering for Machine Learning</a></li>
</ul>Combining Feature Engineering and Model Fitting (Pipeline vs. ColumnTransformer)2020-11-28T04:00:00+00:002020-11-28T04:00:00+00:00/python/2020/11/28/pipeline_columntransformer<p><br /></p>
<div style="text-align:center">
<img src="/images/pipeline_columntransformer/pipeline_unsplash.jpg" alt="drawing" />
<figcaption>Photo by <a href="https://unsplash.com/@spacexuan?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Crystal Kwok</a> on <a href="https://unsplash.com/s/photos/pipes?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></figcaption>
</div>
<p><br /></p>
<p>In the <a href="/python/2020/11/21/Missing-data-imputation-using-sklearn.html">previous post</a>, we learned about
various missing data imputation strategies using
scikit-learn. Before diving
into finding the best imputation method for a given problem, I would like to first introduce two scikit-learn
classes, <code class="language-plaintext highlighter-rouge">Pipeline</code> and <code class="language-plaintext highlighter-rouge">ColumnTransformer</code>.</p>
<p>Both <code class="language-plaintext highlighter-rouge">Pipeline</code> and <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> are used to combine different transformers (i.e. feature engineering steps such
as
<code class="language-plaintext highlighter-rouge">SimpleImputer</code> and <code class="language-plaintext highlighter-rouge">OneHotEncoder</code>) to transform data. However, there are two major differences between them:</p>
<p><strong>1. <code class="language-plaintext highlighter-rouge">Pipeline</code> can hold transformers and/or an estimator (model), whereas <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> is only for transformers</strong> <br />
<strong>2. <code class="language-plaintext highlighter-rouge">Pipeline</code> runs its steps sequentially, whereas <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> runs its transformers in parallel and independently</strong></p>
<p>Don’t worry if this sounds too complicated! I will walk you through what I mean by the above statements with code examples. I had a lot of fun while digging into these two classes, so I hope you enjoy and find it useful at the end as well!</p>
<h1 id="table-of-conents">Table of Contents</h1>
<ol>
<li><a href="#prep">Prepare Data</a></li>
<li><a href="#pipeline">Put Transformers and an Estimator Together: Pipeline</a></li>
<li><a href="#columntransformer">Apply Transformers to Different Columns: ColumnTransformer</a></li>
<li><a href="#separate">Separate Feature Engineering Pipelines for Numerical and Categorical Variables</a></li>
<li><a href="#final">Final Pipeline</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ol>
<p><a id="prep"></a></p>
<h1 id="0-prepare-data">0. Prepare Data</h1>
<p>Let’s first prepare the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">house price data from Kaggle</a> we will be using in this post. The data is preprocessed by replacing <code class="language-plaintext highlighter-rouge">'?'</code> with <code class="language-plaintext highlighter-rouge">NaN</code>. Do not forget to split the data into train and test sets before performing any feature engineering steps to avoid data leakage!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="c1"># preparing data
</span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># feature engineering: imputation, scaling, encoding
</span><span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">SimpleImputer</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span><span class="p">,</span> <span class="n">OneHotEncoder</span>
<span class="c1"># putting together in pipeline
</span><span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.compose</span> <span class="kn">import</span> <span class="n">ColumnTransformer</span>
<span class="c1"># model to use
</span><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Lasso</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># import house price data
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'../data/house_price/train.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'Id'</span><span class="p">)</span>
<span class="c1"># numerical columns vs. categorical columns
</span><span class="n">num_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'number'</span><span class="p">).</span><span class="n">columns</span>
<span class="n">cat_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'object'</span><span class="p">).</span><span class="n">columns</span>
<span class="c1"># split train and test dataset
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">df</span><span class="p">[</span><span class="s">'SalePrice'</span><span class="p">],</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># check the size of train and test data
</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">X_test</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((1022, 79), (438, 79))
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div class="table-wrapper">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>MSSubClass</th>
<th>MSZoning</th>
<th>LotFrontage</th>
<th>LotArea</th>
<th>Street</th>
<th>Alley</th>
<th>LotShape</th>
<th>LandContour</th>
<th>Utilities</th>
<th>LotConfig</th>
<th>...</th>
<th>ScreenPorch</th>
<th>PoolArea</th>
<th>PoolQC</th>
<th>Fence</th>
<th>MiscFeature</th>
<th>MiscVal</th>
<th>MoSold</th>
<th>YrSold</th>
<th>SaleType</th>
<th>SaleCondition</th>
</tr>
<tr>
<th>Id</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>65</th>
<td>60</td>
<td>RL</td>
<td>NaN</td>
<td>9375</td>
<td>Pave</td>
<td>NaN</td>
<td>Reg</td>
<td>Lvl</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>GdPrv</td>
<td>NaN</td>
<td>0</td>
<td>2</td>
<td>2009</td>
<td>WD</td>
<td>Normal</td>
</tr>
<tr>
<th>683</th>
<td>120</td>
<td>RL</td>
<td>NaN</td>
<td>2887</td>
<td>Pave</td>
<td>NaN</td>
<td>Reg</td>
<td>HLS</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>11</td>
<td>2008</td>
<td>WD</td>
<td>Normal</td>
</tr>
<tr>
<th>961</th>
<td>20</td>
<td>RL</td>
<td>50.0</td>
<td>7207</td>
<td>Pave</td>
<td>NaN</td>
<td>IR1</td>
<td>Lvl</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>2</td>
<td>2010</td>
<td>WD</td>
<td>Normal</td>
</tr>
<tr>
<th>1385</th>
<td>50</td>
<td>RL</td>
<td>60.0</td>
<td>9060</td>
<td>Pave</td>
<td>NaN</td>
<td>Reg</td>
<td>Lvl</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>MnPrv</td>
<td>NaN</td>
<td>0</td>
<td>10</td>
<td>2009</td>
<td>WD</td>
<td>Normal</td>
</tr>
<tr>
<th>1101</th>
<td>30</td>
<td>RL</td>
<td>60.0</td>
<td>8400</td>
<td>Pave</td>
<td>NaN</td>
<td>Reg</td>
<td>Bnk</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>1</td>
<td>2009</td>
<td>WD</td>
<td>Normal</td>
</tr>
</tbody>
</table>
<p>5 rows × 79 columns</p>
</div>
</div>
<p><a id="pipeline"></a></p>
<h1 id="1-put-transformers-and-an-estimator-together-pipeline">1. Put Transformers and an Estimator Together: Pipeline</h1>
<p>Let’s say we want to train a Lasso regression model that predicts <code class="language-plaintext highlighter-rouge">SalePrice</code>. Instead of using all of the 79 variables we have, let’s use only numerical variables this time.</p>
<p>I already know there is plenty of missing data in some columns (e.g. <code class="language-plaintext highlighter-rouge">LotFrontage</code>, <code class="language-plaintext highlighter-rouge">MasVnrArea</code>, and <code class="language-plaintext highlighter-rouge">GarageYrBlt</code> among numerical columns), so we want to perform missing data imputation before fitting a model. Also, let’s say we also want to scale the data using <code class="language-plaintext highlighter-rouge">StandardScaler</code> because the scale of variables is all different.</p>
<p>This is what we would do normally to fit a model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># take only numerical data
</span><span class="n">X_temp</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">num_cols</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="c1"># missing data imputation
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)</span>
<span class="n">X_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_temp</span><span class="p">)</span> <span class="c1"># np.ndarray
</span><span class="n">X_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_temp</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span> <span class="c1"># pd.DataFrame
</span>
<span class="c1"># scale data
</span><span class="n">scaler</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">()</span>
<span class="n">X_scale</span> <span class="o">=</span> <span class="n">scaler</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_impute</span><span class="p">)</span> <span class="c1"># np.ndarray
</span><span class="n">X_scale</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_scale</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_temp</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span> <span class="c1"># pd.DataFrame
</span>
<span class="c1"># fit model
</span><span class="n">lasso</span> <span class="o">=</span> <span class="n">Lasso</span><span class="p">()</span>
<span class="n">lasso</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_scale</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">lasso</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_scale</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.8419801151434141
</code></pre></div></div>
<p>This is great but we have to manually move data from one step to another: we pass the output of the first step (<code class="language-plaintext highlighter-rouge">SimpleImputer</code>) to the second step (<code class="language-plaintext highlighter-rouge">StandardScaler</code>) as an input (<code class="language-plaintext highlighter-rouge">X_impute</code>). And then, the output of the second step (<code class="language-plaintext highlighter-rouge">StandardScaler</code>) is passed to the third step (<code class="language-plaintext highlighter-rouge">Lasso</code>) as an input (<code class="language-plaintext highlighter-rouge">X_scale</code>). If we have more feature engineering steps, it will be more complex to handle different inputs and outputs. So, here <code class="language-plaintext highlighter-rouge">Pipeline</code> comes to the rescue!</p>
<p><strong>With <code class="language-plaintext highlighter-rouge">Pipeline</code>, you can combine transformers and an estimator (model) together</strong>. You can transform your data and then fit a model with the transformed data. You just need to pass a list of tuples defining the steps in order: (step_name, transformer or estimator object). Let’s rewrite the same logic using <code class="language-plaintext highlighter-rouge">Pipeline</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define feature engineering and model together
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'lasso'</span><span class="p">,</span> <span class="n">Lasso</span><span class="p">())])</span>
<span class="c1"># fit model
</span><span class="n">pipe</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_temp</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_temp</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.8419801151434141
</code></pre></div></div>
<p>Awesome! We saved a lot of lines and it looks much cleaner and more understandable! As you can see, <strong>Pipeline passes the first step’s output to the next step as its input, meaning Pipeline is sequential</strong>.</p>
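<p>As a side note, once the pipeline is fitted you can still inspect each fitted step through the <code class="language-plaintext highlighter-rouge">named_steps</code> attribute. A minimal sketch, assuming the <code class="language-plaintext highlighter-rouge">pipe</code> fitted above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># look up fitted steps by the names we gave them
fitted_imputer = pipe.named_steps['imputer']
print(fitted_imputer.statistics_)  # the column means learned by SimpleImputer
fitted_lasso = pipe.named_steps['lasso']
print(fitted_lasso.coef_[:5])      # a few learned Lasso coefficients
</code></pre></div></div>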
<p><a id="columntransformer"></a></p>
<h1 id="2-apply-transformers-to-different-columns-columntransformer">2. Apply Transformers to Different Columns: ColumnTransformer</h1>
<p>Let’s go back to our original dataset where we had both numerical and categorical variables. Because we cannot apply
mean imputation to categorical variables (there is no ‘mean’ in categories!), we would want to use something different. One of the commonly used techniques is mode imputation (filling with the most frequent category), so let’s use that.</p>
<p>Mean imputation for numerical variables and mode imputation for categorical variables - can we do this in Pipeline as below?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Can we do this?
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'num_imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'cat_imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'lasso'</span><span class="p">,</span> <span class="n">Lasso</span><span class="p">())])</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<p>Unfortunately, no! If you run the above code, it will throw an error like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'RL'
</code></pre></div></div>
<p>The error happens because <code class="language-plaintext highlighter-rouge">Pipeline</code> attempts to apply mean imputation to all of the columns, including a categorical variable that contains a string category called <code class="language-plaintext highlighter-rouge">'RL'</code>. Remember, mean imputation can only be applied to numerical variables, so our <code class="language-plaintext highlighter-rouge">SimpleImputer(strategy='mean')</code> freaked out!</p>
<p><strong>We need to let our <code class="language-plaintext highlighter-rouge">Pipeline</code> know which columns to apply which transformer. How do we do that? We do it with <code class="language-plaintext highlighter-rouge">ColumnTransformer</code>!</strong></p>
<p><code class="language-plaintext highlighter-rouge">ColumnTransformer</code> is similar to <code class="language-plaintext highlighter-rouge">Pipeline</code> in the sense that you put transformers together as a list of tuples, but this time, you pass one more argument: a list of the column names you want to apply each transformer to.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># applying different transformers to different columns
</span><span class="n">transformer</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">(</span>
<span class="p">[(</span><span class="s">'numerical'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">),</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">'categorical'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">),</span> <span class="n">cat_cols</span><span class="p">)])</span>
<span class="c1"># fit transformer with out train data
</span><span class="n">transformer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform the train data and create a DataFrame with the transformed data
</span><span class="n">X_train_transformed</span> <span class="o">=</span> <span class="n">transformer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_train_transformed</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_transformed</span><span class="p">,</span>
<span class="n">columns</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">num_cols</span><span class="p">)</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">cat_cols</span><span class="p">))</span>
</code></pre></div></div>
<p>You may have noticed we defined the output columns to be <code class="language-plaintext highlighter-rouge">list(num_cols) + list(cat_cols)</code>, not <code class="language-plaintext highlighter-rouge">X_train.columns</code>. This is because <strong><code class="language-plaintext highlighter-rouge">ColumnTransformer</code> fits each transformer independently in parallel and concatenates all of the outputs at the end</strong>.</p>
<p>That is, <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> takes <strong>only</strong> the numerical columns (<code class="language-plaintext highlighter-rouge">num_cols</code>), fits and transforms them using <code class="language-plaintext highlighter-rouge">SimpleImputer(strategy='mean')</code>, and sets the output aside. At the same time, it does the same thing for the categorical columns (<code class="language-plaintext highlighter-rouge">cat_cols</code>) with <code class="language-plaintext highlighter-rouge">SimpleImputer(strategy='most_frequent')</code>. When every step is done, it concatenates the two outputs in the order the transformers are defined. Therefore, <strong>be aware of the column order because the final output may be different from your original DataFrame!</strong></p>
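<p>One detail worth knowing: columns that are not listed in any transformer are <em>dropped</em> from the output by default. If you want to keep them untouched, you can pass <code class="language-plaintext highlighter-rouge">remainder='passthrough'</code>. A minimal sketch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># by default remainder='drop': unlisted columns are removed from the output.
# remainder='passthrough' appends them unchanged after the transformed columns.
transformer = ColumnTransformer(
    [('numerical', SimpleImputer(strategy='mean'), num_cols)],
    remainder='passthrough')
</code></pre></div></div>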
<p>Note that <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> can only be used for transformers, not estimators. We cannot include <code class="language-plaintext highlighter-rouge">Lasso()</code> and fit the model as we did with <code class="language-plaintext highlighter-rouge">Pipeline</code>. <strong><code class="language-plaintext highlighter-rouge">ColumnTransformer</code> is only used for data pre-processing, so there is no <code class="language-plaintext highlighter-rouge">predict</code> or <code class="language-plaintext highlighter-rouge">score</code> as in <code class="language-plaintext highlighter-rouge">Pipeline</code></strong>. To train a model and calculate a performance score, we will need <code class="language-plaintext highlighter-rouge">Pipeline</code> again.</p>
<p><a id="separate"></a></p>
<h1 id="3-separate-feature-engineering-pipelines-for-numerical-and-categorical-variables">3. Separate Feature Engineering Pipelines for Numerical and Categorical Variables</h1>
<p>Let’s go one step further and include more feature engineering steps. In addition to the missing data imputation, we
also want to scale our numerical variables using <code class="language-plaintext highlighter-rouge">StandardScaler</code> and encode the categorical variables using
<code class="language-plaintext highlighter-rouge">OneHotEncoder</code>. Can we do something like this then?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Can we do this?
</span><span class="n">transformer</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">(</span>
<span class="p">[(</span><span class="s">'numerical_imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">),</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">'numerical_scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">(),</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">'categorical_imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">),</span> <span class="n">cat_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">'categorical_encoder'</span><span class="p">,</span> <span class="n">OneHotEncoder</span><span class="p">(</span><span class="n">handle_unknown</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">),</span> <span class="n">cat_cols</span><span class="p">)])</span>
<span class="n">transformer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
</code></pre></div></div>
<p>No!</p>
<p>As we saw in the previous section, each step in <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> is independent. Therefore, the input for the <code class="language-plaintext highlighter-rouge">OneHotEncoder()</code> is not the output of the <code class="language-plaintext highlighter-rouge">SimpleImputer(strategy='most_frequent')</code> but just a subset of the original DataFrame (<code class="language-plaintext highlighter-rouge">cat_cols</code>), which has not been imputed. You cannot one-hot-encode a categorical variable that has missing data.</p>
<p>We need something that can sequentially pass data through multiple feature engineering steps. Sequentially moving data… sounds familiar, right? Yes, you can do this with <code class="language-plaintext highlighter-rouge">Pipeline</code>!</p>
<p>However, we need to create a feature engineering pipeline for numerical variables and categorical variables separately. So, we can come up with something like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># feature engineering pipeline for numerical variables
</span><span class="n">num_pipeline</span><span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">())])</span>
<span class="c1"># feature engineering pipeline for categorical variables
</span><span class="n">cat_pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'encoder'</span><span class="p">,</span> <span class="n">OneHotEncoder</span><span class="p">(</span><span class="n">handle_unknown</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">))])</span>
</code></pre></div></div>
<p>You can think of it as creating a ‘new transformer’ that combines multiple transformers for each type of variable. Doesn’t it sound cool?</p>
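<p>Because these mini-pipelines are transformers themselves, you can also try one out on its own. A minimal sketch, assuming the column lists defined earlier:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># each pipeline exposes the usual fit/transform interface
num_transformed = num_pipeline.fit_transform(X_train[num_cols])
print(num_transformed.shape)  # imputed and scaled numerical columns
</code></pre></div></div>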
<p><a id="final"></a></p>
<h1 id="4-final-pipeline">4. Final Pipeline</h1>
<p>Okay. Now that we have feature engineering pipelines defined for both numerical variables and categorical variables, we can put things together to train a Lasso model using <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> and <code class="language-plaintext highlighter-rouge">Pipeline</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># put numerical and categorical feature engineering pipelines together
</span><span class="n">preprocessor</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">([(</span><span class="s">"num_pipeline"</span><span class="p">,</span> <span class="n">num_pipeline</span><span class="p">,</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">"cat_pipeline"</span><span class="p">,</span> <span class="n">cat_pipeline</span><span class="p">,</span> <span class="n">cat_cols</span><span class="p">)])</span>
<span class="c1"># put transformers and an estimator together
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'preprocessing'</span><span class="p">,</span> <span class="n">preprocessor</span><span class="p">),</span>
<span class="p">(</span><span class="s">'lasso'</span><span class="p">,</span> <span class="n">Lasso</span><span class="p">(</span><span class="n">max_iter</span><span class="o">=</span><span class="mi">10000</span><span class="p">))])</span> <span class="c1"># increased max_iter to converge
</span>
<span class="c1"># fit model
</span><span class="n">pipe</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.9483539967729575
</code></pre></div></div>
<p>This is very neat! We applied different sets of feature engineering steps to numerical and categorical variables and then trained a model in only a few lines of code.</p>
<p>Thinking of how long and complex the code would be without <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> and <code class="language-plaintext highlighter-rouge">Pipeline</code>, aren’t you tempted to try this out right now?</p>
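<p>As a bonus, the fitted pipeline applies the exact same preprocessing (with the statistics learned from the train set) to any new raw data. A minimal sketch, assuming a held-out <code class="language-plaintext highlighter-rouge">X_test</code> and <code class="language-plaintext highlighter-rouge">y_test</code> from the same split:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># imputes, scales, and encodes X_test using train-set statistics, then predicts
y_pred = pipe.predict(X_test)
print(pipe.score(X_test, y_test))  # R^2 on the held-out set
</code></pre></div></div>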
<p><a id="summary"></a></p>
<h1 id="summary">Summary</h1>
<p>In this post, we looked at how to combine feature engineering steps and a model fitting step together using <code class="language-plaintext highlighter-rouge">Pipeline</code> and <code class="language-plaintext highlighter-rouge">ColumnTransformer</code>. In particular, we learned that we can use</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Pipeline</code> for combining transformers and an estimator</li>
<li><code class="language-plaintext highlighter-rouge">ColumnTransformer</code> for applying different transformers to different columns</li>
<li><code class="language-plaintext highlighter-rouge">Pipeline</code> for creating different feature engineering pipelines for numerical and categorical variables that sequentially apply a different set of transformers</li>
</ul>
<p>Also, check out the table below to recap the differences between <code class="language-plaintext highlighter-rouge">Pipeline</code> vs. <code class="language-plaintext highlighter-rouge">ColumnTransformer</code>:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Pipeline</th>
<th>ColumnTransformer</th>
</tr>
</thead>
<tbody>
<tr>
<td>Used for</td>
<td>Transformers and/or an estimator</td>
<td>Transformers only</td>
</tr>
<tr>
<td>Main methods</td>
<td>fit, transform, predict, and score</td>
<td>fit and transform (no predict or score)</td>
</tr>
<tr>
<td>Can pick columns to apply</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>Each step is performed</td>
<td>Sequentially</td>
<td>Independently</td>
</tr>
<tr>
<td>Transformed output columns</td>
<td>Same as input</td>
<td>May differ depending on the defined steps</td>
</tr>
</tbody>
</table>
<p><a id="References"></a></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://towardsdatascience.com/pipeline-columntransformer-and-featureunion-explained-f5491f815f">Pipeline, ColumnTransformer and FeatureUnion explained</a></li>
<li><a href="https://www.udemy.com/course/feature-engineering-for-machine-learning/">Feature Engineering for Machine Learning</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">sklearn Pipeline</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html">sklearn ColumnTransformer</a></li>
</ul>
<h1>Missing Data Imputation Using sklearn</h1>
<h1 id="contents">Contents</h1>
<ul>
<li><a href="#why_missing_matter">Why does missing data matter?</a></li>
<li><a href="#options">What are the options for missing data imputation?</a></li>
<li><a href="#sklearn">Missing data imputation using scikit-learn</a>
<ul>
<li><a href="#prep">(0) Prepare data</a></li>
<li><a href="#mean">(1) Mean/median</a></li>
<li><a href="#mode">(2) Mode (most frequent category)</a></li>
<li><a href="#arbitrary">(3) Arbitrary value</a></li>
<li><a href="#knn">(4) KNN imputer</a></li>
<li><a href="#indicator">(5) Adding Missing Indicator</a></li>
</ul>
</li>
<li><a href="#what-do-use">What to use?</a></li>
<li><a href="#References">References</a></li>
</ul>
<p><a id="why_missing_matter"></a></p>
<h1 id="why-does-missing-data-matter">Why does missing data matter?</h1>
<p>If you have ever worked on raw, uncleaned data collected from a survey or a sensor, you have probably faced missing data. Let’s think about a dataset of age, gender, and height as below. You want to use both age and gender to predict height, but some data points have only age or only gender. What would you do in this case?</p>
<table>
<thead>
<tr>
<th> </th>
<th>age</th>
<th>gender</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>20</td>
<td>F</td>
<td>5’4”</td>
</tr>
<tr>
<td>1</td>
<td>31</td>
<td>M</td>
<td>6’1”</td>
</tr>
<tr>
<td>2</td>
<td>40</td>
<td> </td>
<td>5’0”</td>
</tr>
<tr>
<td>3</td>
<td> </td>
<td>M</td>
<td>5’6”</td>
</tr>
</tbody>
</table>
<p>When certain fields are missing in an observation, you either 1) remove the entire observation or 2) keep the observation and replace the missing values with some estimation. Analyzing only the complete data after removing any missing data is called <strong>Complete Case Analysis (CCA)</strong>, and replacing missing values with estimation is called <strong>missing data imputation</strong>.</p>
<p>Normally, you don’t want to remove the entire observation because the rest of the fields can still be informative. Also, when you have lots of variables that are missing in different observations, the chances are you will have to remove the majority of data points and end up being left with limited data to train a model. Even if you manage to build a model, the model will have to know how to handle missing data in production; otherwise, it will freak out and refuse to make any prediction for new data with a missing field!</p>
<p>Therefore, we would want to perform missing data imputation and this post is about how we can do that in Python.</p>
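<p>To see how costly CCA can be, you can check what fraction of rows survives after dropping every observation with any missing field. A minimal sketch, assuming a DataFrame <code class="language-plaintext highlighter-rouge">df</code> with missing values:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Complete Case Analysis: keep only rows without any NaN
complete_cases = df.dropna()
print(len(complete_cases) / len(df))  # fraction of rows that survive
</code></pre></div></div>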
<p><a id="options"></a></p>
<h1 id="what-are-the-options-for-missing-data-imputation">What are the options for missing data imputation?</h1>
<p>There are many imputation methods available, and each has pros and cons:</p>
<ol>
<li>Univariate methods (use values in one variable)
<ul>
<li>Numerical
<ul>
<li>mean, median, mode (most frequent value), arbitrary value (out of distribution)</li>
<li>For time series: linear interpolation, last observation carried forward, next observation carried backward (see the short pandas sketch after this list)</li>
</ul>
</li>
<li>Categorical
<ul>
<li>mode (most frequent category), arbitrary value (e.g. “missing” category)</li>
</ul>
</li>
<li>Both
<ul>
<li>random value selected from train data separately for each missing data</li>
</ul>
</li>
</ul>
</li>
<li>Multi-variable methods (use values in other variables as well)
<ul>
<li>KNN</li>
<li>Regression</li>
<li>Chained equation</li>
</ul>
</li>
<li>Adding missing indicator
<ul>
<li>Adding a boolean value to indicate whether the observation has missing data or not. It is used together with one of the above methods.</li>
</ul>
</li>
</ol>
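<p>For the time-series options listed above, pandas offers convenient one-liners. A minimal sketch on a toy series:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])
print(s.interpolate())  # linear interpolation
print(s.ffill())        # last observation carried forward
print(s.bfill())        # next observation carried backward
</code></pre></div></div>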
<p>Although they are all useful in one way or another, in this post, we will focus on 6 major imputation techniques available in sklearn: mean, median, mode, arbitrary value, KNN, and adding a missing indicator. I will cover why we choose sklearn for our missing data imputation in the next post.</p>
<p><a id="sklearn"></a></p>
<h1 id="missing-data-imputation-using-scikit-learn">Missing data imputation using scikit-learn</h1>
<p><a id="prep"></a></p>
<h2 id="0-prepare-data">(0) Prepare data</h2>
<p>In this post, we will use the train set from the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">house price data from Kaggle</a>. The data is preprocessed so that the string value <code class="language-plaintext highlighter-rouge">?</code> is transformed into <code class="language-plaintext highlighter-rouge">NaN</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="c1"># prep dataset
</span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># imputer
</span><span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">SimpleImputer</span><span class="p">,</span> <span class="n">KNNImputer</span>
<span class="c1"># plot for comparison
</span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'../data/house_price/train.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'Id'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1460, 80)
</code></pre></div></div>
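<p>By the way, one way to do the <code class="language-plaintext highlighter-rouge">?</code>-to-<code class="language-plaintext highlighter-rouge">NaN</code> preprocessing at load time is pandas’ <code class="language-plaintext highlighter-rouge">na_values</code> argument. A minimal sketch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># treat the string '?' as NaN while reading the CSV
df = pd.read_csv('../data/house_price/train.csv', index_col='Id', na_values='?')
</code></pre></div></div>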
<p>There are 80 columns, where 79 are features and 1 is our target variable <code class="language-plaintext highlighter-rouge">SalePrice</code>. Let’s check how many are numerical and categorical, as we will apply different imputation strategies to different data types. <code class="language-plaintext highlighter-rouge">.select_dtypes()</code> in pandas is a handy way to filter data types.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># numerical columns vs. categorical columns
</span><span class="n">num_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'number'</span><span class="p">).</span><span class="n">columns</span>
<span class="n">cat_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'object'</span><span class="p">).</span><span class="n">columns</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of numerical columns: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">num_cols</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of categorical columns: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">cat_cols</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Number of numerical columns: 36
Number of categorical columns: 43
</code></pre></div></div>
<p>Next is splitting the data. <strong>It is important to split the data into train and test sets BEFORE, not after, applying any feature engineering or feature selection steps</strong> in order to avoid data leakage. Data leakage means using information during training that is not available in production, which leads to inflated model performance. As we want our model performance score to be as close to the real performance in production as possible, we want to split the data as early as possible, even before feature engineering steps.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">df</span><span class="p">[</span><span class="s">'SalePrice'</span><span class="p">],</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">X_test</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((1022, 79), (438, 79))
</code></pre></div></div>
<p>Now let’s check which columns have missing data (<code class="language-plaintext highlighter-rouge">NaN</code>). <code class="language-plaintext highlighter-rouge">.isna()</code> will give you a True/False indicator of whether each element is <code class="language-plaintext highlighter-rouge">NaN</code>, and <code class="language-plaintext highlighter-rouge">.mean()</code> will calculate what percentage of True values there are in each column. We will filter columns with a mean greater than 0, which means there is at least one missing value.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># number of numerical columns and categorical columns that contain missing data
</span><span class="n">num_cols_with_na</span> <span class="o">=</span> <span class="n">num_cols</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols</span><span class="p">].</span><span class="n">isna</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span>
<span class="n">cat_cols_with_na</span> <span class="o">=</span> <span class="n">cat_cols</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols</span><span class="p">].</span><span class="n">isna</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"*** numerical columns that have NaN's (</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">num_cols_with_na</span><span class="p">)</span><span class="si">}</span><span class="s">): </span><span class="se">\n</span><span class="si">{</span><span class="n">num_cols_with_na</span><span class="si">}</span><span class="se">\n\n</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"*** categorical columns that have NaN's (</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">cat_cols_with_na</span><span class="p">)</span><span class="si">}</span><span class="s">): </span><span class="se">\n</span><span class="si">{</span><span class="n">cat_cols_with_na</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*** numerical columns that have NaN's (3):
Index(['LotFrontage', 'MasVnrArea', 'GarageYrBlt'], dtype='object')
*** categorical columns that have NaN's (16):
Index(['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC',
'Fence', 'MiscFeature'],
dtype='object')
</code></pre></div></div>
<p>Let’s check how much missing data each feature has. It seems that there are three features (<code class="language-plaintext highlighter-rouge">PoolQC</code>, <code class="language-plaintext highlighter-rouge">MiscFeature</code>, <code class="language-plaintext highlighter-rouge">Alley</code>) that have more than 90% of their data missing. In such cases, it might be better to remove these features entirely because they do not provide much information when predicting house price. We could perform feature selection to see whether they are worth including or not. However, that is beyond the scope of this post, so we will include all of them for now.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># percentage of missing data in numerical features
</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">].</span><span class="n">isna</span><span class="p">().</span><span class="n">mean</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LotFrontage 0.184932
GarageYrBlt 0.052838
MasVnrArea 0.004892
dtype: float64
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># percentage of missing data in categorical features
</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">].</span><span class="n">isna</span><span class="p">().</span><span class="n">mean</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PoolQC 0.997065
MiscFeature 0.956947
Alley 0.939335
Fence 0.813112
FireplaceQu 0.467710
GarageCond 0.052838
GarageQual 0.052838
GarageFinish 0.052838
GarageType 0.052838
BsmtFinType2 0.024462
BsmtFinType1 0.023483
BsmtExposure 0.023483
BsmtCond 0.023483
BsmtQual 0.023483
MasVnrType 0.004892
Electrical 0.000978
dtype: float64
</code></pre></div></div>
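<p>If we did decide to drop the features with more than 90% missing data, a minimal pandas sketch would look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># drop features whose missing fraction exceeds a threshold
missing_frac = X_train.isna().mean()
high_na_cols = missing_frac[missing_frac &gt; 0.9].index
X_train_reduced = X_train.drop(columns=high_na_cols)
print(list(high_na_cols))  # the three features flagged in the output above
</code></pre></div></div>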
<p><a id="mean"></a></p>
<h2 id="1-meanmedian">(1) Mean/median</h2>
<p>The first missing data imputation method we will look at is mean/median imputation. As the name implies, it fills missing data with the mean or the median of each variable.</p>
<p>When should we use the mean vs. the median? If the variable is normally distributed, the mean and the median do not differ a lot. However, if the distribution is skewed, the mean is affected by outliers and can deviate a lot from the center of the data, so the median is a better representation for skewed data. Therefore, <strong>use the mean for normal distributions and the median for skewed distributions.</strong></p>
<div style="text-align:center">
<img src="/images/impute_sklearn/skew_dist.png" alt="drawing" width="300" />
<figcaption>Fig 1. Skewness
<a href="https://en.wikipedia.org/wiki/Skewness">(Skewness)</a>
</figcaption>
</div>
<h3 id="assumptions">Assumptions</h3>
<ul>
<li>Missing data most likely look like the majority of the data</li>
<li>Data is missing at random</li>
</ul>
<h3 id="pros">Pros</h3>
<ul>
<li>Easy and fast</li>
<li>Can easily be integrated into production</li>
</ul>
<h3 id="cons">Cons</h3>
<ul>
<li>It distorts the original variable distribution (more values around the mean will create more outliers)</li>
<li>It ignores and distorts the correlation with other variables</li>
</ul>
<p><strong>A common practice is to use mean/median imputation in combination with a ‘missing indicator’ that we will learn about in a later section. This is the top choice in data science competitions</strong>.</p>
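<p>In sklearn, this combination is a one-liner: <code class="language-plaintext highlighter-rouge">SimpleImputer</code> accepts an <code class="language-plaintext highlighter-rouge">add_indicator</code> argument. A minimal sketch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># add_indicator=True appends a binary is-missing column
# for each feature that had missing values during fit
imputer = SimpleImputer(strategy='mean', add_indicator=True)
</code></pre></div></div>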
<p>Below is how we use the mean/median imputation. <strong>It only works for numerical data</strong>. To keep it simple, we use only the columns with NA’s here (<code class="language-plaintext highlighter-rouge">X_train[num_cols_with_na]</code>).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer. use strategy='median' for median imputation
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)</span>
<span class="c1"># fit the imputer on X_train. we pass only numeric columns with NA's here.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_mean_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="n">X_test_mean_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_mean_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_mean_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">num_cols_with_na</span><span class="p">)</span>
<span class="n">X_test_mean_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_test_mean_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">num_cols_with_na</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># check statistics
</span><span class="k">print</span><span class="p">(</span><span class="s">"Imputer statistics (mean values):"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">imputer</span><span class="p">.</span><span class="n">statistics_</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Imputer statistics (mean values):
[ 69.66866747 103.55358899 1978.01239669]
</code></pre></div></div>
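<p>These fitted statistics should simply be the column means of the train data, which you can verify directly:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># should match imputer.statistics_ above
print(X_train[num_cols_with_na].mean())
</code></pre></div></div>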
<p>Previously, we mentioned the mean imputation can distort the original distribution. Let’s see how much it changed the data distribution by checking the density plot.</p>
<p>As you can see in <code class="language-plaintext highlighter-rouge">LotFrontage</code>, you have a lot more values around the mean after the imputation. The more missing data a variable has, the bigger the distortion is (<code class="language-plaintext highlighter-rouge">LotFrontage</code> has 18%, <code class="language-plaintext highlighter-rouge">GarageYrBlt</code> has 5%, and <code class="language-plaintext highlighter-rouge">MasVnrArea</code> has 0.5% of missing data).</p>
<p>One way to avoid this side effect is to use random data imputation. However, I excluded it from this post as it is not available in sklearn and it is not very production-friendly. It requires the whole population of the train data to be available to impute each missing data point.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compare the distribution before/after mean imputation
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">num_cols_with_na</span><span class="p">)):</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">col</span> <span class="o">=</span> <span class="n">num_cols_with_na</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">X_train_mean_impute</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'mean imputation'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_18_0.png" />
</div>
<p><br /></p>
<p><a id="mode"></a></p>
<h2 id="2-mode-most-frequent-category">(2) Mode (most frequent category)</h2>
<p>The second method is mode imputation. It replaces missing values with the most frequent value in a variable. <strong>It can be used for both numerical and categorical variables</strong>.</p>
<h3 id="assumptions-1">Assumptions</h3>
<ul>
<li>Missing data most likely look like the majority of the data</li>
<li>Data is missing at random</li>
</ul>
<h3 id="pros-1">Pros</h3>
<ul>
<li>Easy and fast</li>
<li>Can easily be integrated into production</li>
</ul>
<h3 id="cons-1">Cons</h3>
<ul>
<li>It distorts the original variable distribution (the most frequent value becomes even more dominant)</li>
<li>It ignores and distorts the correlation with other variables</li>
<li>The most frequent label might be over-represented while it is not the most representative value of a variable</li>
</ul>
<p>This time, let’s try it to our categorical variables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">)</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_mode_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="n">X_test_mode_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_mode_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_mode_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cat_cols_with_na</span><span class="p">)</span>
<span class="n">X_test_mode_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_test_mode_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cat_cols_with_na</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># check statistics
</span><span class="k">print</span><span class="p">(</span><span class="s">"Imputer statistics (the most frequent values in each variable):"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">imputer</span><span class="p">.</span><span class="n">statistics_</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Imputer statistics (the most frequent values in each variable):
['Pave' 'None' 'TA' 'TA' 'No' 'Unf' 'Unf' 'SBrkr' 'Gd' 'Attchd' 'Unf' 'TA'
'TA' 'Gd' 'MnPrv' 'Shed']
</code></pre></div></div>
<p>Like the mean/median imputation, mode imputation can also distort the original distribution of a variable. In order to check the difference before/after the mode imputation, we use bar plots this time as these are categorical variables.</p>
<p>Let’s take a look at the first variable in the graph, <code class="language-plaintext highlighter-rouge">Alley</code>. As you can see, the distribution of the original data and that of the imputed data are very different, and the <code class="language-plaintext highlighter-rouge">Pave</code> category is over-represented in the imputed data. Ideally, the shape of the distribution should be preserved after imputation, just like <code class="language-plaintext highlighter-rouge">MasVnrType</code>. However, if the majority of the observations are missing, the distribution of a variable can change significantly, as can its correlation with other variables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compare the distribution before/after mode imputation
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">15</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">cat_cols_with_na</span><span class="p">)):</span>
<span class="n">col_name</span> <span class="o">=</span> <span class="n">cat_cols_with_na</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">original</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">col_name</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">imputed</span> <span class="o">=</span> <span class="n">X_train_mode_impute</span><span class="p">[</span><span class="n">col_name</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">combined</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">original</span><span class="p">,</span> <span class="n">imputed</span><span class="p">],</span> <span class="n">keys</span><span class="o">=</span><span class="p">[</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'mode imputation'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="o">//</span><span class="mi">4</span><span class="p">,</span> <span class="n">i</span><span class="o">%</span><span class="mi">4</span><span class="p">]</span>
<span class="n">combined</span><span class="p">.</span><span class="n">plot</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col_name</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_23_0.png" />
</div>
<p><br /></p>
<p><a id="arbitrary"></a></p>
<h2 id="3-arbitrary-value">(3) Arbitrary value</h2>
<p>The third method is filling missing values with an arbitrary value outside of the training data distribution. For example, if the values in the ‘age’ variable range from 0 to 80 in the training set, fill missing data with 100 (or use a value at the ‘end of distribution’ such as mean +- 3*std). For categorical data, use ‘Missing’ as a new category for missing data. It can be counter-intuitive to fill data with a value outside of the original distribution as it will create outliers or unseen data. Indeed, it is not meant to be used for models that require certain assumptions about the data distribution, such as linear regression. It is good for tree-based models, which will separate missing data in an earlier/upper node and take the missingness into account when building a model.</p>
<h3 id="assumptions-2">Assumptions</h3>
<ul>
<li>If data is not missing at random, we would want to flag the missing values with a very different value than other observations and have them treated differently by a model</li>
</ul>
<h3 id="pros-2">Pros</h3>
<ul>
<li>Easy and fast</li>
<li>It captures the importance of ‘missingness’</li>
</ul>
<h3 id="cons-2">Cons</h3>
<ul>
<li>It distorts the original data distribution</li>
<li>It distorts the correlation between variables</li>
<li>It may mask or create outliers in numerical variables</li>
<li>It is not for linear models. Only use it for tree-based models.</li>
</ul>
<p>It can be used for both numerical and categorical variables, though the numerical case is more involved if we need to determine the fill value automatically.</p>
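<p>Before moving on, here is what the numerical case can look like: a minimal sketch using an ‘end of distribution’ fill for a single column, assuming we pick <code class="language-plaintext highlighter-rouge">LotFrontage</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mean + 3 standard deviations puts the fill value outside the usual range
fill = X_train['LotFrontage'].mean() + 3 * X_train['LotFrontage'].std()
imputer = SimpleImputer(strategy='constant', fill_value=fill)
</code></pre></div></div>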
<p>Let’s see how we do for categorical variables first.</p>
<h3 id="categorical">Categorical</h3>
<p>Filling missing values with a new category called ‘missing’ or ‘Missing’ is a very common strategy for imputing missing data in categorical variables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'constant'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">'Missing'</span><span class="p">)</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_arb_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="n">X_test_arb_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_arb_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_arb_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cat_cols_with_na</span><span class="p">)</span>
<span class="n">X_test_arb_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_test_arb_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cat_cols_with_na</span><span class="p">)</span>
</code></pre></div></div>
<p>You can see there is now a new category ‘Missing’ in the imputed dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compare the distribution before/after mode imputation
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">15</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">cat_cols_with_na</span><span class="p">)):</span>
<span class="n">col_name</span> <span class="o">=</span> <span class="n">cat_cols_with_na</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">original</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">col_name</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">imputed</span> <span class="o">=</span> <span class="n">X_train_arb_impute</span><span class="p">[</span><span class="n">col_name</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">combined</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">original</span><span class="p">,</span> <span class="n">imputed</span><span class="p">],</span>
<span class="n">keys</span><span class="o">=</span><span class="p">[</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'Arbitrary value imputation'</span><span class="p">],</span>
<span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="o">//</span><span class="mi">4</span><span class="p">,</span> <span class="n">i</span><span class="o">%</span><span class="mi">4</span><span class="p">]</span>
<span class="n">combined</span><span class="p">.</span><span class="n">plot</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col_name</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_28_0.png" />
</div>
<p><br /></p>
<h3 id="numerical">Numerical</h3>
<p>When determining what value to use for numerical variables, one common approach is the ‘end of distribution’ method.</p>
<p>For a normal distribution: mean $\pm 3\times$ std<br />
For a skewed distribution, use either the upper or the lower limit:</p>
<ul>
<li>Upper limit = 75th quantile + $3\times$ IQR</li>
<li>Lower limit = 25th quantile - $3\times$ IQR</li>
</ul>
<p>where IQR = 75th quantile - 25th quantile. (The $3\times$ multiplier matches the code below; a stricter $1.5\times$ fence is also common.)</p>
<p>Let’s compute this value for a skewed variable and pass it to <code class="language-plaintext highlighter-rouge">SimpleImputer</code> as the fill value.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># first find the value to use
</span><span class="k">def</span> <span class="nf">get_end_of_dist</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">col</span><span class="p">):</span>
<span class="n">q1</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.25</span><span class="p">)</span>
<span class="n">q3</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.75</span><span class="p">)</span>
<span class="n">iqr</span> <span class="o">=</span> <span class="n">q3</span><span class="o">-</span><span class="n">q1</span>
<span class="n">new_val</span> <span class="o">=</span> <span class="n">q3</span> <span class="o">+</span> <span class="n">iqr</span> <span class="o">*</span> <span class="mi">3</span>
<span class="k">return</span> <span class="n">new_val</span>
</code></pre></div></div>
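<p>For a roughly normal variable, the analogous helper would use mean $+ 3\times$ std instead. A minimal sketch (the name <code class="language-plaintext highlighter-rouge">get_end_of_dist_normal</code> is mine, for illustration only):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># end of distribution for roughly normal variables: mean + 3 standard deviations
def get_end_of_dist_normal(X_train, col):
    return X_train[col].mean() + 3 * X_train[col].std()
</code></pre></div></div>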
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># determine which column to impute
</span><span class="n">col</span> <span class="o">=</span> <span class="s">'LotFrontage'</span>
<span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'constant'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="n">get_end_of_dist</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">col</span><span class="p">))</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[[</span><span class="n">col</span><span class="p">]])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_arb_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[[</span><span class="n">col</span><span class="p">]])</span>
<span class="n">X_test_arb_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[[</span><span class="n">col</span><span class="p">]])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_arb_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_arb_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="n">col</span><span class="p">])</span>
<span class="n">X_test_arb_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_test_arb_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="n">col</span><span class="p">])</span>
</code></pre></div></div>
<p>When we check the plot below, we now have a small peak at around 150, which is the value determined by our <code class="language-plaintext highlighter-rouge">get_end_of_dist</code> function. This method definitely distorts the original data distribution, so use it carefully and only with appropriate models (e.g., tree-based models).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compare the distribution before/after mean imputation
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">X_test_arb_impute</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'End of tail imputation'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_33_0.png" />
</div>
<p><br /></p>
<p><a id="knn"></a></p>
<h2 id="4-knn-imputer">(4) KNN imputer</h2>
<p>KNN imputer is much more sophisticated and nuanced than the imputation methods described so far because it uses other data points and variables, not just the variable the missing data comes from. KNN imputer calculates the distance between points (usually Euclidean distance), finds the K closest (most similar) points, and then estimates the missing value from the values those neighbors have for the variable. Note that it can only be used for numerical variables.</p>
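<p>To see the mechanics on a toy example (a minimal sketch; the array values are made up purely for illustration, using sklearn’s defaults):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])
# distances are computed on the coordinates that are not missing;
# the 2 nearest rows to [nan, 6.0] are [3.0, 4.0] and [8.0, 8.0],
# so the nan is replaced by their mean in that column: (3.0 + 8.0) / 2 = 5.5
>>> KNNImputer(n_neighbors=2).fit_transform(X)
array([[1. , 2. ],
       [3. , 4. ],
       [5.5, 6. ],
       [8. , 8. ]])
</code></pre></div></div>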
<h3 id="pros-3">Pros</h3>
<ul>
<li>More accurate than univariate imputation</li>
<li>More likely to preserve the original distribution hence the covariance</li>
</ul>
<h3 id="cons-3">Cons</h3>
<ul>
<li>Computationally more expensive than univariate imputation</li>
<li>Can be sensitive to outliers (the imputed value depends on the quality of the neighboring points)</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">KNNImputer</span><span class="p">()</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_knn_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="n">X_test_knn_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_knn_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_knn_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">num_cols_with_na</span><span class="p">)</span>
<span class="n">X_test_knn_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">num_cols_with_na</span><span class="p">)):</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">col</span> <span class="o">=</span> <span class="n">num_cols_with_na</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">X_train_knn_impute</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'mean imputation'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_36_0.png" />
</div>
<p><br /></p>
<p><a id="indicator"></a></p>
<h2 id="5-adding-missing-indicator">(5) Adding Missing Indicator</h2>
<p>Adding a binary missing indicator is another common practice when it comes to missing data imputation. It is typically used together with other imputation techniques, such as mean, median, or mode imputation.</p>
<h3 id="assumptions-3">Assumptions</h3>
<ul>
<li>Data is not missing at random</li>
<li>Missingness provides information</li>
</ul>
<h3 id="pros-4">Pros</h3>
<ul>
<li>Easy and fast</li>
<li>It captures the importance of missingness</li>
</ul>
<h3 id="cons-4">Cons</h3>
<ul>
<li>It can expand the feature space pretty quickly if there are a lot of features</li>
</ul>
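<p>In sklearn, indicator columns can be added by any imputer through the <code class="language-plaintext highlighter-rouge">add_indicator=True</code> parameter. For instance, combined with mean imputation (a minimal sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.impute import SimpleImputer

# mean imputation plus one binary indicator column per feature
# that contained missing values in the fit data
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_train_mean_ind = imputer.fit_transform(X_train[num_cols_with_na])
</code></pre></div></div>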
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_cols_with_na</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">num_cols_with_na</span> <span class="o">+</span> <span class="s">'_NA'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['LotFrontage', 'MasVnrArea', 'GarageYrBlt', 'LotFrontage_NA',
'MasVnrArea_NA', 'GarageYrBlt_NA'],
dtype='object')
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">KNNImputer</span><span class="p">(</span><span class="n">add_indicator</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_knn_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_knn_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_knn_impute</span><span class="p">,</span>
<span class="n">columns</span><span class="o">=</span><span class="n">num_cols_with_na</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">num_cols_with_na</span> <span class="o">+</span> <span class="s">'_NA'</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train_knn_impute</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>LotFrontage</th>
<th>MasVnrArea</th>
<th>GarageYrBlt</th>
<th>LotFrontage_NA</th>
<th>MasVnrArea_NA</th>
<th>GarageYrBlt_NA</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>71.6</td>
<td>573.0</td>
<td>1998.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>1</th>
<td>56.2</td>
<td>0.0</td>
<td>1996.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2</th>
<td>50.0</td>
<td>0.0</td>
<td>1979.2</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<th>3</th>
<td>60.0</td>
<td>0.0</td>
<td>1939.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>60.0</td>
<td>0.0</td>
<td>1930.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
</div>
<p><a id="what-do-use"></a></p>
<h1 id="what-to-use">What to use?</h1>
<p>The obvious question to follow next is then, “What method should we use?” The answer is tricky: there is no single method that works best for every case.</p>
<p>The most commonly used technique, as described above, is mean/median imputation combined with a missing data indicator for numerical variables, and filling missing data with a new ‘Missing’ category for categorical variables.</p>
<p>However, it is still wise to investigate different methods by cross-validating different combinations and seeing which one is most effective for your problem. In the next post, we will learn how to do this with sklearn and why sklearn is more useful for imputation than plain Pandas functions.</p>
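<p>As a preview, such a comparison can be set up by putting an imputer and a model in one pipeline and cross-validating the imputation strategy as a hyperparameter. A minimal sketch, assuming numerical features and a target <code class="language-plaintext highlighter-rouge">y_train</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# treat the imputation strategy as a hyperparameter of the pipeline
pipe = Pipeline([('imputer', SimpleImputer()),
                 ('model', RandomForestRegressor(random_state=0))])
param_grid = {'imputer__strategy': ['mean', 'median', 'constant']}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train[num_cols_with_na], y_train)
search.best_params_  # the strategy with the best cross-validated score
</code></pre></div></div>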
<p>See you in the next post!</p>
<p><a id="References"></a></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://www.udemy.com/course/feature-engineering-for-machine-learning/">Feature Engineering for Machine Learning</a></li>
<li><a href="https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779">6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples)</a></li>
<li><a href="https://www.kaggle.com/juejuewang/handle-missing-values-in-time-series-for-beginners">Handle Missing Values in Time Series For Beginners</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">sklearn SimpleImputer</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html">sklearn KNNImputer</a></li>
</ul>Adding a Horizontal Scroll to Overflowing Markdown Table in HTML2020-11-15T23:58:00+00:002020-11-15T23:58:00+00:00/2020/11/15/Adding-scroll-to-overflowing-table<h2 id="overflowing-table">Overflowing table</h2>
<p>If you have a wide table, it might overflow your normal post width and look really ugly. This is the case for a lot of data science projects, as there are many feature columns to analyze! For example, a table like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>col_name1</th>
<th>col_name2</th>
<th>col_name3</th>
<th>col_name4</th>
<th>col_name5</th>
<th>col_name6</th>
<th>col_name7</th>
<th>col_name8</th>
<th>col_name9</th>
<th>col_name10</th>
</tr>
</thead>
<tbody>
<tr>
<td>row1</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>row2</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
<p>Fortunately, I found a way to make such a table fit nicely into my Jekyll page layout, like this. (<a href="https://stackoverflow.com/questions/41076390/is-there-a-way-to-overflow-a-markdown-table-using-html">Is there a way to overflow a markdown table using HTML?</a>)</p>
<div class="table-wrapper">
<table>
<thead>
<tr>
<th> </th>
<th>col_name1</th>
<th>col_name2</th>
<th>col_name3</th>
<th>col_name4</th>
<th>col_name5</th>
<th>col_name6</th>
<th>col_name7</th>
<th>col_name8</th>
<th>col_name9</th>
<th>col_name10</th>
</tr>
</thead>
<tbody>
<tr>
<td>row1</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>row2</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
</div>
<p><br /></p>
<h2 id="how-to-add-a-horizontal-scroll">How to add a horizontal scroll</h2>
<p>First, add the following wrapper rule to the css file.</p>
<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.table-wrapper</span> <span class="p">{</span>
<span class="nl">overflow-x</span><span class="p">:</span> <span class="nb">scroll</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And then add this before and after your table. Make sure you have a blank line between your table and the closing <code class="language-plaintext highlighter-rouge"></div></code> to see it in effect.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><div</span> <span class="na">class=</span><span class="s">"table-wrapper"</span> <span class="na">markdown=</span><span class="s">"block"</span><span class="nt">></span>
<span class="nt"></div></span>
</code></pre></div></div>
<p>Applying this to the above example:</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><div</span> <span class="na">class=</span><span class="s">"table-wrapper"</span> <span class="na">markdown=</span><span class="s">"block"</span><span class="nt">></span>
| | col_name1 | col_name2 | col_name3 | col_name4 | col_name5 | col_name6 | col_name7 | col_name8 | col_name9 | col_name10 |
|------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| row1 | | | | | | | | | | |
| row2 | | | | | | | | | | |
<span class="nt"></div></span>
</code></pre></div></div>What Should I Use for Dot Product and Matrix Multiplication?: NumPy multiply vs. dot vs. matmul vs. @2020-08-30T04:00:00+00:002020-08-30T04:00:00+00:00/python/2020/08/30/numpy-matmul<p>When I first implemented gradient descent from scratch a few years ago, I was very confused about which method to use for dot product and matrix multiplication - <code class="language-plaintext highlighter-rouge">np.multiply</code>, <code class="language-plaintext highlighter-rouge">np.dot</code>, or <code class="language-plaintext highlighter-rouge">np.matmul</code>? And a few years later, it turns out that… I am still confused! So, I decided to investigate all the options in Python and NumPy (<code class="language-plaintext highlighter-rouge">*</code>, <code class="language-plaintext highlighter-rouge">np.multiply</code>, <code class="language-plaintext highlighter-rouge">np.dot</code>, <code class="language-plaintext highlighter-rouge">np.matmul</code>, and <code class="language-plaintext highlighter-rouge">@</code>), come up with the best approach to take, and document the findings here.</p>
<p>TL;DR: Use <code class="language-plaintext highlighter-rouge">np.dot</code> for dot product. For matrix multiplication, use <code class="language-plaintext highlighter-rouge">@</code> for Python 3.5 or above, and <code class="language-plaintext highlighter-rouge">np.matmul</code> for earlier versions.</p>
<h1 id="table-of-contents">Table of contents</h1>
<ol>
<li><a href="#dot_product">What are dot product and matrix multiplications?</a></li>
<li><a href="#numpy_array">What is available for NumPy arrays?</a><br />
(1) <a href="#asterisk">element-wise multiplication: * and sum</a><br />
(2) <a href="#np.multiply">element-wise multiplication: np.multiply and sum</a><br />
(3) <a href="#np.dot">dot product: np.dot</a><br />
(4) <a href="#np.matmul">matrix multiplication: np.matmul</a><br />
(5) <a href="#@">matrix multiplication: @</a></li>
<li><a href="#dot_vs_matmul">So.. what’s with np.not vs. np.matmul (@)?</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#reference">Reference</a></li>
</ol>
<p><a id="dot_product"></a></p>
<h1 id="1-what-are-dot-product-and-matrix-multiplication">1. What are dot product and matrix multiplication?</h1>
<p>If you are not familiar with dot product or matrix multiplication yet or if you need a quick recap, check out the
previous blog post: <a href="/python/2020/08/23/dot-product.html">What are dot product and
matrix multiplication?</a></p>
<p>In short, the dot product is the sum of products of values in two same-sized vectors, and matrix multiplication is a matrix version of the dot product with two matrices. The output of the dot product is a scalar, whereas the output of matrix multiplication is a matrix whose elements are the dot products of pairs of vectors from each matrix.</p>
<p>Dot product:</p>
\[[a_1 \ a_2]
\begin{bmatrix}
b_1 \\
b_2
\end{bmatrix}
=a_1b_1 + a_2b_2\]
<p>Matrix multiplication:</p>
\[\begin{bmatrix}
a_{11} \ \ a_{12} \\
a_{21} \ \ a_{22} \\
\end{bmatrix}
\begin{bmatrix}
b_{11} \ \ b_{12} \\
b_{21} \ \ b_{22} \\
\end{bmatrix}
=
\begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} \ \ \ a_{11}b_{12} + a_{12}b_{22}\\
a_{21}b_{11} + a_{22}b_{21} \ \ \ a_{21}b_{12} + a_{22}b_{22}\\
\end{bmatrix}\]
<p><br /></p>
<p><a id="numpy_array"></a></p>
<h1 id="2-whats-available-for-numpy-arrays">2. What’s available for NumPy arrays?</h1>
<p>So, there are multiple options you can use to perform dot product or matrix multiplication:</p>
<ol>
<li>basic element-wise multiplication: <code class="language-plaintext highlighter-rouge">*</code> or <code class="language-plaintext highlighter-rouge">np.multiply</code> along with <code class="language-plaintext highlighter-rouge">np.sum</code></li>
<li>dot product: <code class="language-plaintext highlighter-rouge">np.dot</code></li>
<li>matrix multiplication: <code class="language-plaintext highlighter-rouge">np.matmul</code>, <code class="language-plaintext highlighter-rouge">@</code></li>
</ol>
<p>We will go through different scenarios depending on the dimensions of vectors/matrices and understand the pros and cons
of each method. To run the code in the following sections, We first need to import numpy.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
<p><a id="asterisk"></a></p>
<h2 id="1-element-wise-multiplication--and-sum">(1) element-wise multiplication: * and sum</h2>
<p>First, we can try the fundamental approach using element-wise multiplication based on the definition of dot product:
multiply corresponding
elements in two vectors and then sum all the output values. The downside of this approach is that you need
<strong>separate operations
for product and sum</strong> and it is <strong>slower</strong> than other methods we will discuss later.</p>
<p>Here is an example of dot product with two 1D arrays.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">a</span><span class="o">*</span><span class="n">b</span>
<span class="n">array</span><span class="p">([</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">18</span><span class="p">])</span>
<span class="o">>>></span> <span class="nb">sum</span><span class="p">(</span><span class="n">a</span><span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="mi">32</span>
</code></pre></div></div>
<p>Can we use the same <code class="language-plaintext highlighter-rouge">*</code> and <code class="language-plaintext highlighter-rouge">sum</code> operation for matrix multiplication? Let’s check it out.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">c</span><span class="o">*</span><span class="n">d</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span>
<span class="o">>>></span> <span class="nb">sum</span><span class="p">(</span><span class="n">c</span><span class="o">*</span><span class="n">d</span><span class="p">)</span>
<span class="n">array</span><span class="p">([</span><span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">9</span><span class="p">])</span>
</code></pre></div></div>
<p>Wait, it looks different from what we would get from our own calculation below!</p>
\[\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6
\end{bmatrix}
\begin{bmatrix}
1 \\
1 \\
1 \\
\end{bmatrix} =
\begin{bmatrix}
1 \times 1 + 2 \times 1 + 3 \times 1 \\
4 \times 1 + 5 \times 1 + 6 \times 1
\end{bmatrix}
=\begin{bmatrix}
6 \\
15
\end{bmatrix}\]
<p>So, it turns out that we need to be careful when we apply <code class="language-plaintext highlighter-rouge">sum</code> after the <code class="language-plaintext highlighter-rouge">*</code> operation.</p>
<p>Let’s look at it step by step. Here is what happened at <code class="language-plaintext highlighter-rouge">c*d</code>: each row of the 2D array $c$ is paired with the second array $d$ for element-wise multiplication (NumPy broadcasts $d$ across the rows of $c$).</p>
\[\begin{bmatrix}
[1 & 2 & 3] * [1 & 1 & 1] \\
[4 & 5 & 6] * [1 & 1 & 1]
\end{bmatrix} =
\begin{bmatrix}
[1 \times 1 & 2 \times 1 & 3 \times 1] \\
[4 \times 1 & 5 \times 1 & 6 \times 1]
\end{bmatrix}
=\begin{bmatrix}
1 \ 2 \ 3 \\
4 \ 5 \ 6
\end{bmatrix}\]
<p>And then, when we apply <code class="language-plaintext highlighter-rouge">sum</code>, Python’s built-in <code class="language-plaintext highlighter-rouge">sum</code> function iterates over the first axis and adds the rows together: $[1, 2, 3] + [4, 5, 6] = [5, 7, 9]$. But what we want is to sum the elements within each row. So we need to find an alternative to <code class="language-plaintext highlighter-rouge">sum</code>.</p>
<p>Here comes <code class="language-plaintext highlighter-rouge">np.sum</code> to the rescue. When we pass the parameter <code class="language-plaintext highlighter-rouge">axis=1</code>, it sums elements across columns within the same row. Note that the default is <code class="language-plaintext highlighter-rouge">axis=None</code>, which sums all the elements in the array (here, $1+2+3+4+5+6=21$), so we need to make sure we pass the <code class="language-plaintext highlighter-rouge">axis=1</code> parameter.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">c</span><span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">array</span><span class="p">([</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">15</span><span class="p">])</span>
</code></pre></div></div>
<p>Yes! This is what we expected.</p>
<p><a id="np.multiply"></a></p>
<h2 id="2-element-wise-multiplication-npmultiply-and-sum">(2) element-wise multiplication: np.multiply and sum</h2>
<p>Okay, then what about <code class="language-plaintext highlighter-rouge">np.multiply</code>? What does it do and is it different from <code class="language-plaintext highlighter-rouge">*</code>?</p>
<p><code class="language-plaintext highlighter-rouge">np.multiply</code> is basically the same as <code class="language-plaintext highlighter-rouge">*</code>. It is a <code class="language-plaintext highlighter-rouge">NumPy</code>’s version of element-wise
multiplication instead of Python’s native operator.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">c</span>
<span class="n">array</span><span class="p">([</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">18</span><span class="p">])</span>
<span class="o">>></span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">array</span><span class="p">([</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">15</span><span class="p">])</span>
</code></pre></div></div>
<p><a id="np.dot"></a></p>
<h2 id="3-dot-product-npdot">(3) dot product: np.dot</h2>
<p>Is there any option that we can avoid the additional line of <code class="language-plaintext highlighter-rouge">np.sum</code>? Yes, <code class="language-plaintext highlighter-rouge">np.dot</code> in NumPy! You
can use either <code class="language-plaintext highlighter-rouge">np.dot(a, b)</code> or <code class="language-plaintext highlighter-rouge">a.dot(b)</code> and
it <strong>takes care of both element multiplication and sum</strong>. Simple and easy.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="mi">32</span>
</code></pre></div></div>
<p>Great! Dot product in just one line of code. If
the
dimension
of the array is 2D
or higher, make sure the number of columns of the first array matches up with the number of rows in the second array.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="c1"># ValueError: shapes (1,3) and (1,3) not aligned: 3 (dim 1) != 1 (dim 0)
</span></code></pre></div></div>
<p>To make the above example work, you need to transpose the second array so that the shapes are aligned: (1, 3) x (3,
1). Note that this will return (1, 1), which is a 2D array.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">32</span><span class="p">]])</span>
</code></pre></div></div>
<p>As a side note, if you transpose the first array instead (i.e., <code class="language-plaintext highlighter-rouge">np.dot(a.T, b)</code>), you will get a (3, 3) array, which is the outer product instead of the inner product (dot product). So, make sure you transpose the right one.</p>
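<p>Continuing with the same <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> from above, here is what that mistake looks like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># (3, 1) x (1, 3) -> (3, 3): the outer product, not the dot product
>>> np.dot(a.T, b)
array([[ 4,  5,  6],
       [ 8, 10, 12],
       [12, 15, 18]])
</code></pre></div></div>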
<p>Now let’s try a 2D x 2D example as well with the following example. Will it work even if it’s called <code class="language-plaintext highlighter-rouge">dot</code> product?</p>
\[\begin{bmatrix}
1, 2, 3 \\
4, 5, 6
\end{bmatrix}
\begin{bmatrix}
1 \\
1 \\
1 \\
\end{bmatrix} =
\begin{bmatrix}
1 \times 1 + 2 \times 1 + 3 \times 1 \\
4 \times 1 + 5 \times 1 + 6 \times 1 \\
\end{bmatrix} =
\begin{bmatrix}
6 \\
15 \\
\end{bmatrix}\]
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (2, 3)
</span><span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">]])</span> <span class="c1"># shape (3, 1)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span>
<span class="n">array</span><span class="p">([[</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">15</span><span class="p">]])</span>
</code></pre></div></div>
<p>It works! Even though it is called <code class="language-plaintext highlighter-rouge">dot</code>, which by definition indicates that the inputs are 1D vectors and the output is a scalar, it works for 2D or higher dimensional matrices as if it were a matrix multiplication.</p>
<p>So, should we use <code class="language-plaintext highlighter-rouge">np.dot</code> for both dot product and matrix multiplication?</p>
<p>Technically yes, but it is not recommended to use <code class="language-plaintext highlighter-rouge">np.dot</code> for matrix multiplication because the name dot product has a specific meaning and it can be confusing to readers, especially mathematicians! <a href="https://blog.finxter.com/numpy-matmul-operator/#Python_@_Operator">(Reference)</a> Also, it is not recommended for high dimensional matrices (3D or above) because <code class="language-plaintext highlighter-rouge">np.dot</code> behaves differently from normal matrix multiplication. We will discuss this later in this post.</p>
<p>So, <strong><code class="language-plaintext highlighter-rouge">np.dot</code> works for both dot product and matrix multiplication but is recommended for dot product only.</strong></p>
<p><a id="np.matmul"></a></p>
<h2 id="4-matrix-multiplication-npmatmul">(4) matrix multiplication: np.matmul</h2>
<p>The next option is <code class="language-plaintext highlighter-rouge">np.matmul</code>. <strong>It is designed for matrix multiplication</strong> and even the name comes from it (<strong>MAT</strong>rix <strong>MUL</strong>tiplication). Although the name says matrix multiplication, it also works on 1D arrays and can do the dot product just like <code class="language-plaintext highlighter-rouge">np.dot</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 1D array
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="mi">32</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 2D array with values in 1 axis
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">32</span><span class="p">]])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># two 2D arrays
</span><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (2, 3)
</span><span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">]])</span> <span class="c1"># shape (3, 1)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span>
<span class="n">array</span><span class="p">([[</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">15</span><span class="p">]])</span>
</code></pre></div></div>
<p>Nice! So, this means both <code class="language-plaintext highlighter-rouge">np.dot</code> and <code class="language-plaintext highlighter-rouge">np.matmul</code> work perfectly for dot product and matrix multiplication. However,
as we said before, <strong>it is recommended to use <code class="language-plaintext highlighter-rouge">np.dot</code> for dot product and <code class="language-plaintext highlighter-rouge">np.matmul</code> for 2D or higher matrix
multiplication</strong>.</p>
<p><a id="@"></a></p>
<h2 id="5-matrix-multiplication-">(5) matrix multiplication: @</h2>
<p>Here comes our last but not least option, <code class="language-plaintext highlighter-rouge">@</code>! <code class="language-plaintext highlighter-rouge">@</code>, pronounced as [at], is a new Python operator introduced in Python 3.5, whose name comes from m<strong>AT</strong>rices. <strong>It is basically the same as <code class="language-plaintext highlighter-rouge">np.matmul</code> and designed to perform matrix multiplication</strong>. But why do we need a new infix operator if we already have <code class="language-plaintext highlighter-rouge">np.matmul</code> that works perfectly fine?</p>
<p>The major motivation for adding a new operator to the language was that matrix multiplication is such a common operation that it deserves its own infix. For example, the operator <code class="language-plaintext highlighter-rouge">//</code> is much less common than matrix multiplication but still has its own infix. To learn more about the background of this addition, check out <a href="https://www.python.org/dev/peps/pep-0465/">PEP 465</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 1D array
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">@</span> <span class="n">b</span>
<span class="mi">32</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 2D array with values in 1 axis
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">@</span> <span class="n">b</span><span class="p">.</span><span class="n">T</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">32</span><span class="p">]])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 2D arrays
</span><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape: (2, 3)
</span><span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">]])</span> <span class="c1"># shape: (3, 1)
</span>
<span class="o">>>></span> <span class="n">c</span> <span class="o">@</span> <span class="n">d</span>
<span class="n">array</span><span class="p">([[</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">15</span><span class="p">]])</span>
</code></pre></div></div>
<p>So, <strong><code class="language-plaintext highlighter-rouge">@</code> works exactly the same as <code class="language-plaintext highlighter-rouge">np.matmul</code></strong>. But which one should you use between <code class="language-plaintext highlighter-rouge">np.matmul</code> and <code class="language-plaintext highlighter-rouge">@</code> then? Although it is your preference, <code class="language-plaintext highlighter-rouge">@</code> looks cleaner than <code class="language-plaintext highlighter-rouge">np.matmul</code> in code. Let us see a case where we have three matrices $x, y, z$ to multiply.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># `np.matmul` version
</span><span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="n">z</span><span class="p">)</span>
<span class="c1"># `@` version
</span><span class="n">x</span> <span class="o">@</span> <span class="n">y</span> <span class="o">@</span> <span class="n">z</span>
</code></pre></div></div>
<p>As you can see, <strong><code class="language-plaintext highlighter-rouge">@</code> is much cleaner and more readable. However, as it is available only in Python 3.5+, you have to use <code class="language-plaintext highlighter-rouge">np.matmul</code> if you use an earlier Python version</strong>. Note that <code class="language-plaintext highlighter-rouge">@</code> is left-associative, so <code class="language-plaintext highlighter-rouge">x @ y @ z</code> evaluates as <code class="language-plaintext highlighter-rouge">(x @ y) @ z</code>.</p>
<p><a id="dot_vs_matmul"></a></p>
<h2 id="3-so-whats-with-npdot-vs-npmatmul-">3. So.. what’s with np.dot vs. np.matmul (@)?</h2>
<p>In the above section, I mentioned that <code class="language-plaintext highlighter-rouge">np.dot</code> is not recommended for high-dimensional arrays. What do I mean by that?</p>
<p>There was an interesting <a href="https://stackoverflow.com/questions/34142485/difference-between-numpy-dot-and-python-3-5-matrix-multiplication">question</a> on Stack Overflow about the different behaviors of <code class="language-plaintext highlighter-rouge">np.dot</code> and <code class="language-plaintext highlighter-rouge">@</code>. Let's look at it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define input arrays
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># 2 rows, 2 columns, in 3 layers
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># 2 rows, 2 columns, in 3 layers
</span>
<span class="c1"># perform matrix multiplication
</span><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">a</span> <span class="o">@</span> <span class="n">b</span> <span class="c1"># Python 3.5+
</span>
<span class="o">>>></span> <span class="n">c</span><span class="p">.</span><span class="n">shape</span> <span class="c1"># np.dot
</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">d</span><span class="p">.</span><span class="n">shape</span> <span class="c1"># @
</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<p>With the same inputs, we get completely different outputs - a 4-D array for <code class="language-plaintext highlighter-rouge">np.dot</code> and a 3-D array for <code class="language-plaintext highlighter-rouge">@</code>. What happened? This is because of the way <code class="language-plaintext highlighter-rouge">np.dot</code> and <code class="language-plaintext highlighter-rouge">@</code> are designed. Based on their definitions:</p>
<p>For <code class="language-plaintext highlighter-rouge">matmul</code>:</p>
<blockquote>
<p>If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.</p>
</blockquote>
<p>For <code class="language-plaintext highlighter-rouge">np.dot</code>:</p>
<blockquote>
<p>For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b</p>
</blockquote>
<blockquote>
<p>If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:</p>
</blockquote>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])</code></p>
</blockquote>
<p><strong>Long story short, in the normal matrix multiplication situation, where we want each array to be treated as a stack of matrices residing in the last two indexes, we should use <code class="language-plaintext highlighter-rouge">matmul</code></strong>.</p>
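<p>To make this concrete, here is a small sanity check (my own sketch, not from the NumPy docs) showing that for 3-D inputs <code class="language-plaintext highlighter-rouge">matmul</code> multiplies the stacked matrices layer by layer, while <code class="language-plaintext highlighter-rouge">np.dot</code> pairs every layer of the first array with every layer of the second:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

a = np.random.rand(3, 2, 2)  # a stack of three 2x2 matrices
b = np.random.rand(3, 2, 2)  # another stack of three 2x2 matrices

# build the layer-by-layer 2-D matrix products by hand
stacked = np.array([a[i] @ b[i] for i in range(3)])

print(np.allclose(a @ b, stacked))                        # True: matmul broadcasts over the stack
print(np.allclose(np.dot(a, b)[0, :, 0, :], (a @ b)[0]))  # True: np.dot pairs layers explicitly
</code></pre></div></div>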
<p><a id="summary"></a></p>
<h1 id="4-summary">4. Summary</h1>
<ul>
<li><code class="language-plaintext highlighter-rouge">*</code> == <code class="language-plaintext highlighter-rouge">np.multiply</code> != <code class="language-plaintext highlighter-rouge">np.dot</code> != <code class="language-plaintext highlighter-rouge">np.matmul</code> == <code class="language-plaintext highlighter-rouge">@</code></li>
<li><code class="language-plaintext highlighter-rouge">*</code> and <code class="language-plaintext highlighter-rouge">np.multiply</code> need <code class="language-plaintext highlighter-rouge">np.sum</code> to perform a dot product. Not recommended for the dot product or matrix multiplication.</li>
<li><code class="language-plaintext highlighter-rouge">np.dot</code> works for both the dot product and matrix multiplication. However, it is best avoided for matrix multiplication because its name suggests the dot product.</li>
<li><code class="language-plaintext highlighter-rouge">np.matmul</code> and <code class="language-plaintext highlighter-rouge">@</code> are the same thing, designed to perform matrix multiplication. <code class="language-plaintext highlighter-rouge">@</code> was added in Python 3.5+ to give matrix multiplication its own infix operator.</li>
<li><code class="language-plaintext highlighter-rouge">np.dot</code> and <code class="language-plaintext highlighter-rouge">np.matmul</code> generally behave similarly apart from two exceptions: 1) <code class="language-plaintext highlighter-rouge">matmul</code> doesn't allow multiplication by a scalar, and 2) the calculation is done differently for N > 2 dimensions. Check the documentation for whichever one you intend to use.</li>
</ul>
<p>One line summary:</p>
<ul>
<li><strong>For dot product, use <code class="language-plaintext highlighter-rouge">np.dot</code>. For matrix multiplication, use <code class="language-plaintext highlighter-rouge">@</code> for Python 3.5 or above, and <code class="language-plaintext highlighter-rouge">np.matmul</code> for earlier Python versions.</strong></li>
</ul>
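<p>To make the summary concrete, here is a minimal side-by-side sketch (the values are made up purely for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

x * y              # element-wise: array([[ 5, 12], [21, 32]])
np.multiply(x, y)  # same as `*`
np.sum(x * y)      # 70 -- element-wise product plus np.sum mimics a dot product
x @ y              # matrix product: array([[19, 22], [43, 50]])
np.matmul(x, y)    # same as `@`
np.dot(x, y)       # same result, but only because the inputs are 2-D
</code></pre></div></div>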
<p><a id="reference"></a></p>
<h1 id="5-reference">5. Reference</h1>
<ul>
<li><a href="https://blog.finxter.com/numpy-matmul-operator/">NumPy Matrix Multiplication — np.matmul() and @</a></li>
<li><a href="https://numpy.org/doc/stable/reference/generated/numpy.dot.html">numpy.dot official document</a></li>
<li><a href="https://www.python.org/dev/peps/pep-0465/">PEP 465 – A dedicated infix operator for matrix multiplication</a></li>
<li><a href="https://stackoverflow.com/questions/34142485/difference-between-numpy-dot-and-python-3-5-matrix-multiplication">Difference between numpy dot() and Python 3.5+ matrix multiplication @</a></li>
<li><a href="https://en.wikipedia.org/wiki/Dot_product">Wikipedia</a></li>
</ul>What Are Dot Product and Matrix Multiplication?2020-08-23T19:56:00+00:002020-08-23T19:56:00+00:00/python/2020/08/23/dot-product<p><a id="dot_product"></a></p>
<h1 id="1-what-is-dot-prodcut">1. What is dot prodcut?</h1>
<p>The dot product is an algebraic operation that takes <strong>two same-sized vectors</strong> and returns <strong>a single number</strong>.</p>
<p><strong>Algebraic definition</strong><br />
For two sequences of numbers, the dot product is the sum of the products of their corresponding components. Consider two sequences $a$ and $b$ as below.</p>
\[a =
\begin{bmatrix}
a_1 & a_2 & \dots & a_n
\end{bmatrix} \\
b =
\begin{bmatrix}
b_1 & b_2 & \dots & b_n
\end{bmatrix}\]
<p>Then, the dot product of $a$ and $b$ becomes</p>
\[a \cdot b = \sum_{i=1}^{n} a_i b_i\]
<p>If $a$ and $b$ are row vectors, the dot product can be written as a matrix product.
\(a \cdot b = ab^\intercal\)</p>
<p>For example, if $a = [a_1 \ a_2 \ a_3]$ and $b = [b_1 \ b_2 \ b_3]$, it becomes</p>
\[[a_1 \ a_2 \ a_3]
\begin{bmatrix}
b_1 \\
b_2 \\
b_3
\end{bmatrix}
=a_1b_1 + a_2b_2 + a_3b_3\]
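<p>We can verify this definition numerically with a quick NumPy sketch (the vectors are made up for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# the summation formula, written out by hand
sum(a_i * b_i for a_i, b_i in zip(a, b))  # 1*4 + 2*5 + 3*6 = 32

np.dot(a, b)  # 32, the NumPy equivalent
</code></pre></div></div>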
<p><strong>Geometric definition</strong><br />
Geometrically, the dot product is the product of the Euclidean magnitudes of two vectors and the cosine of the angle between them.</p>
\[a \cdot b = \vert a \vert \vert b \vert \cos \theta\]
<p>Note that it is based on how much of one vector lies in the direction of the other (projection). For example, in the figure below, the component of $A$ that is in the $B$ direction is $\vert A \vert \cos \theta$. Here, the magnitude of $A$ can be calculated by $\vert A \vert = \sqrt{x^2 + y^2}$ if $A = (x, y)$ and the initial point is the origin.</p>
<div style="text-align:center"><img src="/images/python/Dot_Product.svg" />
<figcaption>Fig 1. Projection of A onto B<a href="https://en.wikipedia.org/wiki/Dot_product"> (Wikipedia)
</a></figcaption>
</div>
<p><br /></p>
<p>Also note that if the two vectors are in the same direction, $\cos \theta = \cos 0^{\circ} = 1$, so the dot product simply becomes the product of the magnitudes of the two vectors, $a \cdot b = \vert a \vert \vert b \vert$. On the other hand, if the two vectors are perpendicular, the whole dot product becomes 0 because $\cos \theta = \cos 90^{\circ} = 0$.</p>
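<p>As a quick sanity check of the geometric definition, we can recover $\cos \theta$ from the algebraic dot product (a sketch with made-up vectors):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

a = np.array([3.0, 0.0])  # lies on the x-axis
b = np.array([2.0, 2.0])  # 45 degrees above the x-axis

cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)  # ~0.7071, i.e. cos 45 degrees

# |a| * |b| * cos(theta) recovers the dot product, as the formula says
print(np.linalg.norm(a) * np.linalg.norm(b) * cos_theta)  # 6.0 == np.dot(a, b)
</code></pre></div></div>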
<p><strong>Real world example</strong><br />
So what does the dot product really mean to us? How can we use it in real life?<br />
Imagine you are in a grocery store. You want to buy 1 apple, 2 oranges, and 3 bananas. The unit prices are \$1, \$2, and \$0.5, respectively.</p>
<div style="text-align:center">
<img src="/images/python/apple_orange_banana.jpg" alt="drawing" width="300" />
<figcaption>Fig 2. Apple, orange, and banana
<a href="https://www.thestar
.com/life/food_wine/2013/11/04/apples_oranges_or_bananas_which_fruit_is_nutritionally_the_best.html">(image source)</a>
</figcaption>
</div>
<p><br /></p>
<p>You can define a number-of-items vector ($a$) and a unit-price vector ($b$).</p>
\[a = \begin{bmatrix}1 & 2 & 3 \end{bmatrix}\\
b = \begin{bmatrix}\$1 & \$2 & \$0.5\end{bmatrix}\]
<p>The total cost will be the dot product of the two vectors:</p>
\[ab^\intercal =
\begin{bmatrix}
1 & 2 & 3
\end{bmatrix}
\begin{bmatrix}
\$1 \\
\$2 \\
\$0.5
\end{bmatrix}
=1 \times \$1 + 2 \times \$2 + 3 \times \$0.5 = \$6.5 \\\]
<p>Ta-da! Your total is \$6.5! Now we can see that the dot product is actually useful in everyday life, right?</p>
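<p>In NumPy, this is a one-liner (a tiny sketch of the same calculation):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

quantities = np.array([1, 2, 3])         # apples, oranges, bananas
unit_prices = np.array([1.0, 2.0, 0.5])  # dollars per fruit

np.dot(quantities, unit_prices)  # 6.5 -- the total bill
</code></pre></div></div>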
<p><a id="matrix_multiplication"></a></p>
<h1 id="2-what-is-matrix-multiplication">2. What is matrix multiplication?</h1>
<p>Now that we know what the dot product is, let's talk about matrix multiplication. How is it different from the dot product?</p>
<p>Matrix multiplication is basically a matrix version of the dot product. Remember that the result of a dot product is a scalar. The <strong>result of matrix multiplication is a matrix</strong>, whose elements are the dot products of the rows of the first matrix and the columns of the second.</p>
<div style="text-align:center">
<img src="/images/python/khan_academy_matrix_product.png" />
<figcaption>
Fig 3. Matrix multiplication
<a href="https://ml-cheatsheet.readthedocs.io/en/latest/linear_algebra.html">(image source)</a>
</figcaption>
</div>
<p><br /></p>
<p>Note that the number of columns in $A$ and the number of rows in $B$ should match: if $A$ is $(m \times n)$ and $B$ is $(n \times k)$, the result is $(m \times k)$.</p>
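<p>NumPy enforces this dimension rule for us; here is a quick sketch (the shapes are chosen only for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

A = np.ones((2, 3))  # m x n
B = np.ones((3, 4))  # n x k

print((A @ B).shape)  # (2, 4): inner dimensions match, result is m x k

np.ones((2, 3)) @ np.ones((4, 5))  # raises ValueError: inner dimensions 3 and 4 do not match
</code></pre></div></div>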
<p><strong>Grocery example</strong><br />
Let’s go back to the previous grocery store example. Now there are two people who want to buy different numbers of apples, oranges, and bananas.</p>
<p>Person 1 wants 1 of each fruit: $a_1 = [1 \ \ 1 \ \ 1]$<br />
Person 2 wants 10 of each fruit: $a_2 = [10 \ \ 10 \ \ 10]$</p>
<p>How much should each person pay? Can we repeat the dot product? Absolutely! But instead of doing the dot product twice, we can stack the vectors into a matrix, and that is simply matrix multiplication!</p>
<p>The number of apples, oranges, and bananas to buy:</p>
\[A=
\begin{bmatrix}
a_1\\
a_2
\end{bmatrix}=
\begin{bmatrix}
1 & 1 & 1\\
10 & 10 & 10\\
\end{bmatrix}\]
<p>Now, for the unit price vector $b$, we need to transpose it to make it a column vector.</p>
\[B =
\begin{bmatrix}
\$1\\
\$2\\
\$0.5
\end{bmatrix}\]
<p>Now the total price each person has to pay is:</p>
\[A \cdot B =
\begin{bmatrix}
1 & 1 & 1\\
10 & 10 & 10
\end{bmatrix}
\begin{bmatrix}
\$1\\
\$2\\
\$0.5
\end{bmatrix} =
\begin{bmatrix}
1 \times \$1 + 1 \times \$2 + 1 \times \$0.5 \\
10 \times \$1 + 10 \times \$2 + 10 \times \$0.5
\end{bmatrix} =
\begin{bmatrix}
\$3.5 \\
\$35
\end{bmatrix}\]
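<p>Here is the same calculation in NumPy (a minimal sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

A = np.array([[1, 1, 1],
              [10, 10, 10]])  # quantities per person (one row each)
B = np.array([[1.0],
              [2.0],
              [0.5]])         # unit prices as a column vector

A @ B  # array([[ 3.5], [35. ]]) -- person 1 pays $3.5, person 2 pays $35
</code></pre></div></div>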
<p>YAY :tada:! With just one simple matrix multiplication, we found that person 1 should pay \$3.5 and person 2 should pay \$35! You will now use matrix multiplication when you go grocery shopping, right? :wink:</p>