Combining Feature Engineering and Model Fitting (Pipeline vs. ColumnTransformer)
In the previous post, we learned about various missing data imputation strategies using scikit-learn. Before diving into finding the best imputation method for a given problem, I would like to first introduce two scikit-learn classes: Pipeline and ColumnTransformer.
Both Pipeline and ColumnTransformer are used to combine different transformers (i.e. feature engineering steps such as SimpleImputer and OneHotEncoder) to transform data. However, there are two major differences between them:
1. Pipeline can be used for both/either of transformers and an estimator (model) vs. ColumnTransformer is for transformers only
2. Pipeline is sequential vs. ColumnTransformer is parallel/independent
Don’t worry if this sounds too complicated! I will walk you through what I mean by the above statements with code examples. I had a lot of fun digging into these two classes, so I hope you enjoy and find this post useful as well!
Table of Contents
- Prepare Data
- Put Transformers and an Estimator Together: Pipeline
- Apply Transformers to Different Columns: ColumnTransformer
- Separate Feature Engineering Pipelines for Numerical and Categorical Variables
- Final Pipeline
- Summary
- References
0. Prepare Data
Let’s first prepare the house price data from Kaggle that we will be using in this post. The data is preprocessed by replacing '?' with NaN. Do not forget to split the data into train and test sets before performing any feature engineering steps, to avoid data leakage!
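In case you are curious, that preprocessing step is a one-liner in pandas. Here is a minimal sketch of how the file we load below could have been prepared; the raw file name is hypothetical, and I am assuming '?' is the only missing-value marker:
import pandas as pd
# na_values='?' tells pandas to parse every '?' cell as NaN at read time,
# which keeps numerical columns numerical
raw = pd.read_csv('../data/house_price/train_raw.csv',  # hypothetical raw file
                  index_col='Id', na_values='?')
raw.to_csv('../data/house_price/train.csv')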
import pandas as pd
# preparing data
from sklearn.model_selection import train_test_split
# feature engineering: imputation, scaling, encoding
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# putting together in pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# model to use
from sklearn.linear_model import Lasso
# import house price data
df = pd.read_csv('../data/house_price/train.csv', index_col='Id')
# numerical columns vs. categorical columns
num_cols = df.drop('SalePrice', axis=1).select_dtypes('number').columns
cat_cols = df.drop('SalePrice', axis=1).select_dtypes('object').columns
# split train and test dataset
X_train, X_test, y_train, y_test = train_test_split(df.drop('SalePrice', axis=1),
                                                    df['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
# check the size of train and test data
X_train.shape, X_test.shape
((1022, 79), (438, 79))
X_train.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65 | 60 | RL | NaN | 9375 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | GdPrv | NaN | 0 | 2 | 2009 | WD | Normal |
| 683 | 120 | RL | NaN | 2887 | Pave | NaN | Reg | HLS | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 11 | 2008 | WD | Normal |
| 961 | 20 | RL | 50.0 | 7207 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2010 | WD | Normal |
| 1385 | 50 | RL | 60.0 | 9060 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 10 | 2009 | WD | Normal |
| 1101 | 30 | RL | 60.0 | 8400 | Pave | NaN | Reg | Bnk | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2009 | WD | Normal |
5 rows × 79 columns
1. Put Transformers and an Estimator Together: Pipeline
Let’s say we want to train a Lasso regression model that predicts SalePrice. Instead of using all of the 79 variables we have, let’s use only the numerical variables this time.
I already know there is plenty of missing data in some columns (e.g. LotFrontage, MasVnrArea, and GarageYrBlt among the numerical columns), so we want to perform missing data imputation before fitting a model. Let’s say we also want to scale the data using StandardScaler because the scales of the variables are all different.
This is what we would do normally to fit a model:
# take only numerical data
X_temp = X_train[num_cols].copy()
# missing data imputation
imputer = SimpleImputer(strategy='mean')
X_impute = imputer.fit_transform(X_temp) # np.ndarray
X_impute = pd.DataFrame(X_impute, columns=X_temp.columns) # pd.DataFrame
# scale data
scaler = StandardScaler()
X_scale = scaler.fit_transform(X_impute) # np.ndarray
X_scale = pd.DataFrame(X_scale, columns=X_temp.columns) # pd.DataFrame
# fit model
lasso = Lasso()
lasso.fit(X_scale, y_train)
lasso.score(X_scale, y_train)
0.8419801151434141
This is great, but we have to manually move data from one step to another: we pass the output of the first step (SimpleImputer) to the second step (StandardScaler) as an input (X_impute). Then the output of the second step (StandardScaler) is passed to the third step (Lasso) as an input (X_scale). If we had more feature engineering steps, handling all the different inputs and outputs would become even more complex. So, here Pipeline comes to the rescue!
With Pipeline, you can combine transformers and an estimator (model) together. You can transform your data and then fit a model with the transformed data. You just need to pass a list of tuples defining the steps in order: (step_name, transformer or estimator object). Let’s rewrite the same logic using Pipeline.
# define feature engineering and model together
pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                 ('scaler', StandardScaler()),
                 ('lasso', Lasso())])
# fit model
pipe.fit(X_temp, y_train)
pipe.score(X_temp, y_train)
0.8419801151434141
Awesome! We saved a lot of lines, and it looks much cleaner and more understandable! As you can see, Pipeline passes each step’s output to the next step as its input, meaning Pipeline is sequential.
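If you ever want to peek at the data flowing between the steps, a fitted Pipeline lets you inspect them. Here is a minimal sketch; the slicing syntax assumes scikit-learn >= 0.21, where pipelines became sliceable:
# the steps run in order: imputer -> scaler -> lasso
# pipe[:-1] is a sub-pipeline of the transformers only, so this
# reproduces exactly the data that the Lasso step sees
X_before_lasso = pipe[:-1].transform(X_temp)
# individual fitted steps are also accessible by name
pipe.named_steps['imputer']  # the fitted SimpleImputer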
2. Apply Transformers to Different Columns: ColumnTransformer
Let’s go back to our original dataset where we had both numerical and categorical variables. Because we cannot apply mean imputation to categorical variables (there is no ‘mean’ in categories!), we would want to use something different. One of the commonly used techniques is mode imputation (filling with the most frequent category), so let’s use that.
Mean imputation for numerical variables and mode imputation for categorical variables - can we do this in Pipeline as below?
# Can we do this?
pipe = Pipeline([('num_imputer', SimpleImputer(strategy='mean')),
('cat_imputer', SimpleImputer(strategy='most_frequent')),
('lasso', Lasso())])
pipe.fit(X_train, y_train)
Unfortunately, no! If you run the above code, it will throw an error like this:
ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'RL'
The error happens when Pipeline attempts to apply mean imputation to all of the columns, including a categorical variable that contains a string category called 'RL'. Remember, mean imputation can only be applied to numerical variables, so our SimpleImputer(strategy='mean') freaked out!
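You can reproduce the same error in isolation. A minimal sketch, using MSZoning as an example of a string-valued column (it is the one containing 'RL'):
# mean imputation on a string column raises the same ValueError
SimpleImputer(strategy='mean').fit(X_train[['MSZoning']])
# ValueError: Cannot use mean strategy with non-numeric data: ...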
We need to let our Pipeline know which transformer to apply to which columns. How do we do that? We do it with ColumnTransformer!
ColumnTransformer is similar to Pipeline in the sense that you put transformers together as a list of tuples, but this time each tuple takes one more element: the list of the column names you want the transformer applied to.
# applying different transformers to different columns
transformer = ColumnTransformer(
    [('numerical', SimpleImputer(strategy='mean'), num_cols),
     ('categorical', SimpleImputer(strategy='most_frequent'), cat_cols)])
# fit transformer with our train data
transformer.fit(X_train)
# transform the train data and create a DataFrame with the transformed data
X_train_transformed = transformer.transform(X_train)
X_train_transformed = pd.DataFrame(X_train_transformed,
                                   columns=list(num_cols) + list(cat_cols))
You may have noticed that we defined the output columns to be list(num_cols) + list(cat_cols), not X_train.columns. This is because ColumnTransformer fits each transformer independently, in parallel, and concatenates all of the outputs at the end.
That is, ColumnTransformer takes only the numerical columns (num_cols), fits and transforms them using SimpleImputer(strategy='mean'), and sets the output aside. At the same time, it does the same thing for the categorical columns (cat_cols) with SimpleImputer(strategy='most_frequent'). When every step is done, it concatenates the outputs in the order in which the transformers are listed. Therefore, be aware of the column order, because the final output may be arranged differently from your original DataFrame!
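You can check the resulting column order for yourself. A minimal sketch; the get_feature_names_out call assumes scikit-learn >= 1.1, where SimpleImputer gained that method:
# all numerical columns come first, then the categorical ones,
# which is generally not the original column order
list(X_train.columns) == list(num_cols) + list(cat_cols)  # likely False
# newer scikit-learn versions can report the output columns directly
transformer.get_feature_names_out()  # e.g. 'numerical__MSSubClass', ...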
Note that ColumnTransformer can only be used for transformers, not estimators. We cannot include Lasso() and fit the model as we did with Pipeline. ColumnTransformer is only used for data pre-processing, so there is no predict or score as in Pipeline. To train a model and calculate a performance score, we will need Pipeline again.
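A minimal sketch to verify this difference in interfaces:
# ColumnTransformer implements the transformer interface only
hasattr(transformer, 'transform')  # True
hasattr(transformer, 'predict')    # False
hasattr(transformer, 'score')      # False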
3. Separate Feature Engineering Pipelines for Numerical and Categorical Variables
Let’s go one step further and include more feature engineering steps. In addition to the missing data imputation, we also want to scale our numerical variables using StandardScaler and encode the categorical variables using OneHotEncoder. Can we do something like this, then?
# Can we do this?
transformer = ColumnTransformer(
    [('numerical_imputer', SimpleImputer(strategy='mean'), num_cols),
     ('numerical_scaler', StandardScaler(), num_cols),
     ('categorical_imputer', SimpleImputer(strategy='most_frequent'), cat_cols),
     ('categorical_encoder', OneHotEncoder(handle_unknown='ignore'), cat_cols)])
transformer.fit(X_train)
No!
As we saw in the previous section, each step in ColumnTransformer is independent. Therefore, the input to OneHotEncoder() is not the output of SimpleImputer(strategy='most_frequent') but just a subset of the original DataFrame (cat_cols), which has not been imputed. You cannot one-hot-encode a categorical variable that still has missing data.
We need something that can sequentially pass data through multiple feature engineering steps. Sequentially moving data… sounds familiar, right? Yes, you can do this with Pipeline!
However, we need to create a feature engineering pipeline for numerical variables and categorical variables separately. So, we can come up with something like this:
# feature engineering pipeline for numerical variables
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                         ('scaler', StandardScaler())])
# feature engineering pipeline for categorical variables
cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                         ('encoder', OneHotEncoder(handle_unknown='ignore'))])
You can think of it as creating a ‘new transformer’ that combines multiple transformers for each type of variable. Doesn’t it sound cool?
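Because each of these pipelines contains only transformers, it really does behave like a single transformer itself. A minimal sketch of using the categorical one on its own:
# impute first, then one-hot encode, in one fit_transform call
cat_encoded = cat_pipeline.fit_transform(X_train[cat_cols])
cat_encoded.shape  # (rows, number of one-hot encoded columns)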
4. Final Pipeline
Okay. Now that we have feature engineering pipelines defined for both numerical and categorical variables, we can put things together to train a Lasso model using ColumnTransformer and Pipeline.
# put numerical and categorical feature engineering pipelines together
preprocessor = ColumnTransformer([("num_pipeline", num_pipeline, num_cols),
                                  ("cat_pipeline", cat_pipeline, cat_cols)])
# put transformers and an estimator together
pipe = Pipeline([('preprocessing', preprocessor),
                 ('lasso', Lasso(max_iter=10000))])  # increased max_iter to converge
# fit model
pipe.fit(X_train, y_train)
pipe.score(X_train, y_train)
0.9483539967729575
This is very neat! We applied different sets of feature engineering steps to numerical and categorical variables and then trained a model, all in only a few lines of code. Thinking of how long and complex the code would be without ColumnTransformer and Pipeline, aren’t you tempted to try this out right now?
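One nice property of the final pipeline: the imputation statistics, scaling parameters, and one-hot categories learned on the train set are reused on any new data automatically. A minimal sketch of evaluating on the held-out test set:
# the preprocessing fitted on X_train is applied to X_test automatically
test_r2 = pipe.score(X_test, y_test)  # R^2 on unseen data
predictions = pipe.predict(X_test)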
Summary
In this post, we looked at how to combine feature engineering steps and a model fitting step together using Pipeline and ColumnTransformer. In particular, we learned that we can use:
- Pipeline for combining transformers and an estimator
- ColumnTransformer for applying different transformers to different columns
- Pipeline for creating separate feature engineering pipelines for numerical and categorical variables, each of which sequentially applies its own set of transformers
Also, check out the table below to recap the differences between Pipeline and ColumnTransformer:
| | Pipeline | ColumnTransformer |
|---|---|---|
| Used for | Both/either of transformers and an estimator | Transformers only |
| Main methods | fit, transform, predict, and score | fit and transform (no predict or score) |
| Can pick columns to apply to | No | Yes |
| Each step is performed | Sequentially | Independently |
| Transformed output columns | Same as input | May differ depending on the defined steps |