<h1 id="table-of-contents">Table of Contents</h1>
<ul>
<li><a href="#why">Why Do We Need Scaling?</a></li>
<li><a href="#std_vs_norm">Standardization vs. Normalization</a></li>
<li><a href="#deepdive">Scalers Deep Dive</a>
<ul>
<li><a href="#org">Original Data</a></li>
<li><a href="#std">1. Standardization</a></li>
<li><a href="#norm">2. Normalization</a>
<ul>
<li><a href="#minmax">2.1. Min-max Normalization</a></li>
<li><a href="#maxabs">2.2. Maximum absolute normalization</a></li>
<li><a href="#mean">2.3. Mean Normalization</a></li>
<li><a href="#robust">2.4. Median-quantile Normalization</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#summary">Summary</a></li>
<li><a href="#ref">References</a></li>
</ul>
<p><a id="why"></a></p>
<h1 id="why-do-we-need-scaling">Why Do We Need Scaling?</h1>
<p>Features on different scales are a common issue data scientists encounter. Some algorithms can handle them, but others cannot, and if features are not scaled properly beforehand, we will have a hard time finding the optimal solution. So, why does it matter and how can we solve it?</p>
<p>First, let’s think about KNN, which uses Euclidean distance to determine the similarity between points. When calculating the distance, features with a bigger magnitude influence the result far more than those with a smaller magnitude, leading to a solution dominated by the bigger features. Another example is algorithms that use gradient descent: features on different scales have different step sizes, so it takes longer to converge, as shown below.</p>
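<p>As a quick sketch (toy values assumed), consider two points described by age in years and income in dollars. Without scaling, the income axis dominates the Euclidean distance almost entirely:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# two people: (age in years, income in dollars) -- hypothetical values
a = np.array([25, 50_000])
b = np.array([55, 52_000])

# unscaled distance: ~2000.2, driven almost entirely by the income gap;
# the 30-year age difference barely registers
print(np.linalg.norm(a - b))

# after min-max scaling each feature to [0, 1]
# (assumed ranges: age 18-80, income 0-100k)
a_scaled = np.array([(25 - 18) / (80 - 18), 50_000 / 100_000])
b_scaled = np.array([(55 - 18) / (80 - 18), 52_000 / 100_000])

# now both features contribute comparably (~0.48)
print(np.linalg.norm(a_scaled - b_scaled))
</code></pre></div></div>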
<p><br /></p>
<div style="text-align:center">
<img src="/images/2020-12-27-feature-scaling/scaling.png" width="60%" />
<figcaption> Gradient descent without scaling (left) and with scaling (right) (<a href="https://stackoverflow.com/a/46688787/9449085">Image source</a>) </figcaption>
</div>
<p><br /></p>
<p>The main exception is tree-based algorithms, which use Gini impurity or information gain and are therefore not influenced by feature scale. Here are some examples of machine learning models that are sensitive and not sensitive to feature scale:</p>
<p><strong>ML Models sensitive to feature scale</strong></p>
<ul>
<li>Algorithms that use gradient descent as an optimization technique
<ul>
<li>Linear and Logistic Regression (although they may be fit without gradient descent)</li>
<li>Neural Networks</li>
</ul>
</li>
<li>Distance-based algorithms
<ul>
<li>Support Vector Machines</li>
<li>KNN</li>
<li>K-means clustering</li>
</ul>
</li>
<li>Algorithms that find directions that maximize the variance
<ul>
<li>Principal Component Analysis (PCA)</li>
<li>Linear Discriminant Analysis (LDA)</li>
</ul>
</li>
</ul>
<p><strong>ML models not sensitive to feature scale</strong></p>
<ul>
<li>Tree-based algorithms
<ul>
<li>Decision Tree</li>
<li>Random Forest</li>
<li>Gradient Boosted Trees</li>
</ul>
</li>
</ul>
<p><a id="std_vs_norm"></a></p>
<h1 id="standardization-vs-normalization">Standardization vs. Normalization</h1>
<p>How can we scale features then? There are two types of scaling techniques depending on their focus: 1) standardization and 2) normalization.</p>
<p><strong>Standardization</strong> focuses on scaling the <strong><em>variance</em></strong> in addition to shifting the center to 0. It comes from standardization in statistics, which converts a variable into a $z$-score that represents the number of standard deviations away from the mean, no matter what the original value is.</p>
<p><strong>Normalization</strong> focuses on scaling the <strong><em>min-max range</em></strong> rather than the variance. For example, an original value range of [100, 200] is simply scaled to [0, 1] by subtracting the minimum value and dividing by the range. There are a few variations of normalization depending on whether it centers the data and what min/max values it uses: 1) min-max normalization, 2) max-abs normalization, 3) mean normalization, and 4) median-quantile normalization.</p>
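<p>As a minimal sketch of the difference (toy values assumed), here is the same array transformed both ways:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

x = np.array([100.0, 125.0, 150.0, 175.0, 200.0])

# standardization: center at 0, unit variance -- no fixed bounds
z = (x - x.mean()) / x.std()
print(z)  # [-1.41 -0.71  0.    0.71  1.41]

# min-max normalization: squeeze into [0, 1] -- fixed bounds, variance not controlled
n = (x - x.min()) / (x.max() - x.min())
print(n)  # [0.   0.25 0.5  0.75 1.  ]
</code></pre></div></div>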
<p>Each scaling method has its own advantages and limitations and there is no method that works for every situation. We should understand each method, implement them, and see which one works best for a specific problem. In the remaining sections of this post, I will explain the definition, advantages, limitations, and Python implementation of all of the mentioned scaling methods.</p>
<p><a id="deepdive"></a></p>
<h1 id="scalers-deep-dive">Scalers Deep Dive</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define functions used in this post
</span>
<span class="k">def</span> <span class="nf">kdeplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">scaler_name</span><span class="p">):</span>
<span class="n">fix</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
<span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">feature</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">scaler_name</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Density'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Feature value'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">kdeplot_with_zoom</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">scaler_name</span><span class="p">,</span> <span class="n">xlim</span><span class="p">):</span>
<span class="n">fix</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="c1"># original
</span> <span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
<span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">feature</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">scaler_name</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Density'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Feature value'</span><span class="p">)</span>
<span class="c1"># zoomed
</span> <span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="n">features</span><span class="p">:</span>
<span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">feature</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax2</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="n">xlim</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">scaler_name</span> <span class="o">+</span> <span class="s">' (zoomed)'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Density'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Feature value'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<p><a id="org"></a></p>
<h2 id="original-data">Original Data</h2>
<p>We will use the Boston housing prices dataset available in the scikit-learn library to demonstrate the effect of each scaler. Among the 13 variables in total, we will focus on 6 for easier visualization: ‘RM’, ‘LSTAT’, ‘CRIM’, ‘AGE’, ‘DIS’, ‘NOX’. As always, we split the data into train and test sets and use only the train set for feature engineering to prevent data leakage, although we will not cover testing in this post.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># import modules
</span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_boston</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># load data
</span><span class="n">boston_dataset</span> <span class="o">=</span> <span class="n">load_boston</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">boston_dataset</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">boston_dataset</span><span class="p">.</span><span class="n">feature_names</span><span class="p">)</span>
<span class="c1"># take only variables we will experiment
</span><span class="n">features</span> <span class="o">=</span> <span class="p">[</span><span class="s">'RM'</span><span class="p">,</span> <span class="s">'LSTAT'</span><span class="p">,</span> <span class="s">'CRIM'</span><span class="p">,</span> <span class="s">'AGE'</span><span class="p">,</span> <span class="s">'DIS'</span><span class="p">,</span> <span class="s">'NOX'</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">features</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s">'MEDV'</span><span class="p">]</span> <span class="o">=</span> <span class="n">boston_dataset</span><span class="p">.</span><span class="n">target</span> <span class="c1"># add target
</span>
<span class="c1"># split data
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">features</span><span class="p">],</span>
<span class="n">df</span><span class="p">[</span><span class="s">'MEDV'</span><span class="p">],</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p>When the original distributions of all features are displayed in one plot, we can quickly tell that they are not on the same scale. Some features seem to be clustered in a smaller range, such as ‘NOX’ or ‘RM’, and some are spread across a wider range, such as ‘LSTAT’.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="s">'Original'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_10_0.png" alt="png" /></p>
<p>To quantify the difference in scale between features, we can check statistics such as the mean, standard deviation, minimum, and maximum of the observations within each feature. Indeed, they are all very different in scale, and this will be a problem when training the types of models that require data to be on the same scale.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train</span><span class="p">.</span><span class="n">describe</span><span class="p">().</span><span class="n">loc</span><span class="p">[[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'std'</span><span class="p">,</span> <span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">],</span> <span class="p">:]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>RM</th>
<th>LSTAT</th>
<th>CRIM</th>
<th>AGE</th>
<th>DIS</th>
<th>NOX</th>
</tr>
</thead>
<tbody>
<tr>
<th>mean</th>
<td>6.308427</td>
<td>12.440650</td>
<td>3.358284</td>
<td>68.994068</td>
<td>3.762459</td>
<td>0.556098</td>
</tr>
<tr>
<th>std</th>
<td>0.702009</td>
<td>7.078485</td>
<td>8.353223</td>
<td>28.038429</td>
<td>2.067661</td>
<td>0.115601</td>
</tr>
<tr>
<th>min</th>
<td>3.561000</td>
<td>1.730000</td>
<td>0.006320</td>
<td>2.900000</td>
<td>1.174200</td>
<td>0.385000</td>
</tr>
<tr>
<th>max</th>
<td>8.780000</td>
<td>36.980000</td>
<td>88.976200</td>
<td>100.000000</td>
<td>12.126500</td>
<td>0.871000</td>
</tr>
</tbody>
</table>
</div>
<p><a id="std"></a></p>
<h2 id="1-standardization">1. Standardization</h2>
<p>One of the most commonly used techniques is standardization, which scales data so different features have the same mean and standard deviation.</p>
<p><strong>Definition</strong></p>
<ul>
<li>Center data at 0 and set the standard deviation to 1 (variance=1)</li>
</ul>
\[X' = \frac{X - \mu}{\sigma}\]
<p>where $\mu$ is the mean of the feature and $\sigma$ is the standard deviation of the feature</p>
<ul>
<li>The output value is also called Z-score which represents how many standard deviations a value is away from the mean of the feature</li>
</ul>
<p><strong>Advantages</strong></p>
<ul>
<li>All features have the same mean and variance, making it easier to compare</li>
<li>It is less sensitive to extreme outliers than min-max normalizer</li>
<li>It preserves the shape of the original distribution (if the original distribution is normal, the transformed data will also be normal; the same holds for skewed distributions)</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>It works best when the original distribution is normal (it is recommended to transform the data toward normality beforehand)</li>
<li>It is still affected by outliers, as the mean and standard deviation used in the formula are themselves affected by extreme values</li>
<li>There is no fixed bounding range and features have different ranges</li>
<li>It preserves outliers</li>
</ul>
<p>Let’s look at the standardization output for our data below. Every feature is centered around 0 with the same variance (note the width of the main curves). However, the x-axis range of each variable differs; features with extreme outliers, such as ‘CRIM’, have a much longer tail reaching about 10. These extreme outliers might work adversely when training a model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="c1"># fit the scaler to the train set
</span><span class="n">scaler_std</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform data
</span><span class="n">X_train_scaled_std</span> <span class="o">=</span> <span class="n">scaler_std</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># put them in dataframe
</span><span class="n">X_train_scaled_std</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled_std</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># plot
</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train_scaled_std</span><span class="p">,</span> <span class="s">'StandardScaler'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_15_0.png" alt="png" /></p>
<p><a id="norm"></a></p>
<h2 id="2-normalization">2. Normalization</h2>
<p>Normalization overcomes standardization’s limitation of varying ranges across features by limiting the bounding range. The main idea is to divide the values by the maximum or by the total range of the variable so that every value lies within a fixed range.</p>
<p><a id="minmax"></a></p>
<h3 id="21-min-max-normalization">2.1. Min-max Normalization</h3>
<p><strong>Definition</strong></p>
<ul>
<li>Scale the feature so it has a fixed range such as [0, 1]</li>
</ul>
\[X' = \frac{X - \text{min}(X)}{\text{max}(X) - \text{min}(X)}\]
<p><strong>Advantages</strong></p>
<ul>
<li>Every feature has the same range of [0, 1], removing potentially negative impacts of extreme values</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>The mean and variance vary between features</li>
<li>It may alter the shape of the original distribution</li>
<li>It is sensitive to extreme outliers</li>
<li>The majority of data will be centered within a small range if there are extreme outliers</li>
</ul>
<p>When applying this to our data, we can see that every feature now falls within the same range. Note that the graph appears to show values below 0 and above 1, but that is only because we are estimating a smooth density function from non-smooth data; the actual values lie between 0 and 1, as the summary table below shows.</p>
<p>You can also see that ‘CRIM’ has the majority of its observations near 0 and quickly fades out after that. This is caused by extreme outliers. It would be wise to remove those outliers beforehand so the values are spread more evenly, which will help training.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="c1"># fit the scaler to the train set
</span><span class="n">scaler_minmax</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform data
</span><span class="n">X_train_scaled_minmax</span> <span class="o">=</span> <span class="n">scaler_minmax</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># put them in dataframe
</span><span class="n">X_train_scaled_minmax</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled_minmax</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># plot
</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train_scaled_minmax</span><span class="p">,</span> <span class="s">'Min-max Normalization'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_20_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># summary table shows min 0 and max 1 for every feature
</span><span class="n">X_train_scaled_minmax</span><span class="p">.</span><span class="n">describe</span><span class="p">().</span><span class="n">loc</span><span class="p">[[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'std'</span><span class="p">,</span> <span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">],</span> <span class="p">:]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>RM</th>
<th>LSTAT</th>
<th>CRIM</th>
<th>AGE</th>
<th>DIS</th>
<th>NOX</th>
</tr>
</thead>
<tbody>
<tr>
<th>mean</th>
<td>0.526428</td>
<td>0.303848</td>
<td>0.037675</td>
<td>0.680680</td>
<td>0.236321</td>
<td>0.352054</td>
</tr>
<tr>
<th>std</th>
<td>0.134510</td>
<td>0.200808</td>
<td>0.093888</td>
<td>0.288758</td>
<td>0.188788</td>
<td>0.237861</td>
</tr>
<tr>
<th>min</th>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>max</th>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
</tr>
</tbody>
</table>
</div>
<p><a id="maxabs"></a></p>
<h3 id="22-maximum-absolute-normalization">2.2. Maximum absolute normalization</h3>
<p>When there are both positive and negative values, it might be wise to keep the sign and scale only the magnitude, so that the range becomes roughly [-1, 1]. For instance, if the original feature range is [-50, 50], we can map it to [-1, 1] by simply dividing the values by the maximum absolute value. This is where the max-abs normalizer comes in.</p>
<p><strong>Definition</strong></p>
<ul>
<li>Scale the feature so it has a fixed range such as [-1, 1]</li>
</ul>
\[X' = \frac{X}{\text{max}(\lvert X \rvert)}\]
<ul>
<li>This is the same as min-max normalizer if the minimum value is 0 and all values are positive</li>
</ul>
<p><strong>Advantages</strong></p>
<ul>
<li>It is handy for features with both positive and negative values as it keeps the sign of values</li>
<li>It does not shift or center the data, so it does not destroy any sparsity. This technique is often used in sparse data.</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>The mean and variance vary between features</li>
<li>It may alter the shape of the original distribution</li>
<li>It is sensitive to extreme outliers</li>
<li>The majority of data will be centered within a small range if there are extreme outliers</li>
</ul>
<p>The graph below is almost the same as the result of the min-max normalizer, as all of the features are positive values.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MaxAbsScaler</span>
<span class="n">scaler_maxabs</span> <span class="o">=</span> <span class="n">MaxAbsScaler</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform data
</span><span class="n">X_train_scaled_maxabs</span> <span class="o">=</span> <span class="n">scaler_maxabs</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># put them in dataframe
</span><span class="n">X_train_scaled_maxabs</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled_maxabs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># plot
</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train_scaled_maxabs</span><span class="p">,</span> <span class="s">'Maximum Absolute Scaler'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_24_0.png" alt="png" /></p>
<p><a id="mean"></a></p>
<h3 id="23-mean-normalization">2.3. Mean Normalization</h3>
<p>The mean normalizer is the same as the min-max normalizer but, instead of setting the minimum to 0, it sets the mean to 0.</p>
<p><strong>Definition</strong></p>
<ul>
<li>Center the feature at 0 and rescale the feature to [-1, 1]</li>
</ul>
\[X' = \frac{X-\mu}{\text{max}(X) - \text{min}(X)}\]
<p><strong>Advantages</strong></p>
<ul>
<li>Every feature has the same range of [-1, 1], removing potentially negative impacts of extreme values</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>The mean and variance vary between features</li>
<li>It may alter the shape of the original distribution</li>
<li>It is sensitive to extreme outliers</li>
<li>The majority of data will be centered within a small range if there are extreme outliers</li>
</ul>
<p>Unfortunately, there is no dedicated class for mean normalization in scikit-learn. Instead, we can combine StandardScaler, to remove the mean, with RobustScaler, to divide the values by the total value range.</p>
<p>You can see that all features are now centered around 0 while keeping the min-max range the same across them. This will be handy when applying machine learning models. However, the variance still varies between features, leaving the ones with extreme outliers (e.g. ‘CRIM’) mostly clustered near 0.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">RobustScaler</span>
<span class="c1"># StandardScaler to remove the mean but not scale
</span><span class="n">scaler_mean</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">(</span><span class="n">with_mean</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">with_std</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># RobustScaler to divide values by max-min
# Important to keep the quantile range to 0 to 100 (min and max values)
</span><span class="n">scaler_minmax</span> <span class="o">=</span> <span class="n">RobustScaler</span><span class="p">(</span><span class="n">with_centering</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">with_scaling</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">quantile_range</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">))</span>
<span class="c1"># fit the scaler to the train set
</span><span class="n">scaler_mean</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">scaler_minmax</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform train and test sets
</span><span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">scaler_minmax</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">scaler_mean</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">))</span>
<span class="c1"># put them in dataframe
</span><span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># plot
</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="s">'Mean Normalization'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_27_0.png" alt="png" /></p>
<p><a id="robust"></a></p>
<h3 id="24-median-quantile-normalization">2.4. Median-quantile Normalization</h3>
<p>The final method is median-quantile normalization, which is also called a robust scaler. It is called robust because it is robust to extreme outliers.</p>
<p><strong>Definition</strong></p>
<ul>
<li>Set the median to 0 and scale by the interquartile range (the range between the 25th and 75th percentiles)</li>
</ul>
\[X' = \frac{X-\text{median}(X)}{\text{75th percentile}(X) - \text{25th percentile}(X)}\]
<p><strong>Advantages</strong></p>
<ul>
<li>It is robust to outliers so it is used for data with outliers</li>
<li>It produces a better spread of data for skewed distribution</li>
</ul>
<p><strong>Limitations</strong></p>
<ul>
<li>The variance and value range differs between features</li>
<li>It may not preserve the shape of the original distribution</li>
<li>It preserves outliers</li>
</ul>
<p>As you can see in the zoomed graph on the right below, the transformed data has a better spread, and none of the features shows a high concentration within a small range, unlike the other scaling techniques we have reviewed so far. However, as the graph on the left shows, features with extreme outliers (e.g. ‘CRIM’) still span a very wide range of values. This method does not impose a fixed value range, so the extreme values remain in the data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">RobustScaler</span>
<span class="n">scaler_rbs</span> <span class="o">=</span> <span class="n">RobustScaler</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_train_scaled_rbs</span> <span class="o">=</span> <span class="n">scaler_rbs</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_train_scaled_rbs</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_scaled_rbs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="n">kdeplot_with_zoom</span><span class="p">(</span><span class="n">X_train_scaled_rbs</span><span class="p">,</span> <span class="s">'Median-quantile Normalization'</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
</code></pre></div></div>
<p><img src="/images/2020-12-27-feature-scaling/output_30_0.png" alt="png" /></p>
<p><a id="summary"></a></p>
<h1 id="summary">Summary</h1>
<p>We have reviewed five scaling methods: one standardization and four normalization methods. As mentioned earlier, no method works for every problem; we need to try different scalers and find the one that works best for a specific application. As a rule of thumb, start with standardization or min-max normalization and then see whether other methods or tweaks are needed. Some criteria to consider: 1) does the algorithm prefer data centered at 0? 2) does the algorithm prefer data in a fixed range? Also, it is wise to handle outliers beforehand if necessary.</p>
<p>Here is the summary of the scaling methods reviewed in this post:</p>
<table>
<thead>
<tr>
<th>Category</th>
<th>Standardization</th>
<th>Min-max Normalization</th>
<th>Max-abs Normalization</th>
<th>Mean Normalization</th>
<th>Median-quantile Normalization</th>
</tr>
</thead>
<tbody>
<tr>
<td>Concepts</td>
<td>Centering + Unit Variance</td>
<td>Fixed Range</td>
<td>Fixed Range</td>
<td>Centering + Fixed Range</td>
<td>Centering + Fixed Quantile Range</td>
</tr>
<tr>
<td>Definition</td>
<td>Convert data to have zero mean and unit variance</td>
<td>Convert data to be within fixed range (e.g. [0, 1])</td>
<td>Convert data to be within fixed range (e.g. [-1, 1])</td>
<td>Convert data to have zero mean and be within fixed range (e.g. [-1, 1])</td>
<td>Convert data to have zero median and unit interquantile range</td>
</tr>
<tr>
<td>Sklearn class</td>
<td>StandardScaler</td>
<td>MinMaxScaler</td>
<td>MaxAbsScaler</td>
<td>StandardScaler + RobustScaler<br /><br />* <em>StandardScaler only for mean removal and RobustScaler for scaling</em><br /></td>
<td>RobustScaler</td>
</tr>
<tr>
<td>Benefits</td>
<td>- Less sensitive to outliers <br />- Easier to compare and learn <br />- Preserves original distribution</td>
<td>- Features in the same range</td>
<td>- Features in the same range <br />- Preserves the sign (good for pos/neg mix)</td>
<td>- Features in the same range</td>
<td>- Least sensitive to outliers <br />- Better spread of values for skewed distribution</td>
</tr>
<tr>
<td>Limitations</td>
<td>- Range varies between variables <br />- Preserves outliers</td>
<td>- Sensitive to outliers <br />- Mean and variance vary between features</td>
<td>- Sensitive to outliers <br />- Mean and variance vary between features</td>
<td>- Sensitive to outliers <br />- Variance varies between features</td>
<td>- Range varies between features <br />- Variance varies between features</td>
</tr>
</tbody>
</table>
<p><a id="ref"></a></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#maxabsscaler">sciki-learn: Compare the effect of different scalers on data with outliers</a></li>
<li><a href="https://en.wikipedia.org/wiki/Feature_scaling">Wikipedia: Feature Scaling</a></li>
<li><a href="https://statisticsbyjim.com/glossary/standardization/">Statistics By Jim: Standardization</a></li>
<li><a href="https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/">Analytics Vidhya: Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization</a></li>
<li><a href="https://sebastianraschka.com/Articles/2014_about_feature_scaling.html#training-a-naive-bayes-classifier">Sebastian Raschka: About Feature Scaling and Normalization</a></li>
<li><a href="https://kharshit.github.io/blog/2018/03/23/scaling-vs-normalization">Technical Fridays: Scaling vs Normalization</a></li>
<li><a href="https://www.udemy.com/course/feature-engineering-for-machine-learning/">Feature Engineering for Machine Learning</a></li>
</ul>
<h1 id="how-to-use-fb-prophet-for-time-series-forecasting-vehicle-traffic-volume">How To Use FB Prophet for Time-series Forecasting: Vehicle Traffic Volume</h1>
<p>Recently, I came across a few articles mentioning Facebook’s Prophet library that looked interesting (although the
initial release was almost 3 years ago!), so I decided to dig more into it.</p>
<p>Prophet is an open-source library developed by Facebook that aims to make time-series forecasting easier and more scalable. It is a type of generalized additive model (GAM): a regression model with potentially non-linear smoothers. It is called additive because it adds up multiple decomposed parts to explain the overall trend. Prophet uses the following components:</p>
\[y(t) = g(t) + s(t) + h(t) + e(t)\]
<p>where,<br />
$g(t)$: Growth. Big trend. Non-periodic changes. <br />
$s(t)$: Seasonality. Periodic changes (e.g. weekly, yearly, etc.) represented by Fourier Series.<br />
$h(t)$: Holiday effect that represents irregular schedules. <br />
$e(t)$: Error. Any idiosyncratic changes not explained by the model.</p>
<p>In this post, I will explore the main concepts and API endpoints of the Prophet library.</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ol>
<li><a href="#prep">Prepare Data</a></li>
<li><a href="#train">Train And Predict</a></li>
<li><a href="#components">Check Components</a></li>
<li><a href="#eval">Evaluate</a></li>
<li><a href="#trend">Trend Change Points</a></li>
<li><a href="#season">Seasonality Mode</a></li>
<li><a href="#save">Saving Model</a></li>
<li><a href="#ref">References</a></li>
</ol>
<p><a id="prep"></a></p>
<h1 id="1-prepare-data">1. Prepare Data</h1>
<p>In this post, we will use the U.S. traffic volume data available <a href="https://fred.stlouisfed.org/series/TRFVOLUSM227NFWA">here</a>: the monthly traffic volume (miles traveled) on public roadways from January 1970 through September 2020, in units of millions of miles.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="c1"># to mute Pandas warnings Prophet needs to fix
</span><span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="p">.</span><span class="n">simplefilter</span><span class="p">(</span><span class="n">action</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">,</span> <span class="n">category</span><span class="o">=</span><span class="nb">FutureWarning</span><span class="p">)</span>

# load the data downloaded from FRED (the local filename here is an assumption)
df = pd.read_csv('TRFVOLUSM227NFWA.csv')
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>DATE</th>
<th>TRFVOLUSM227NFWA</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1970-01-01</td>
<td>80173.0</td>
</tr>
<tr>
<th>1</th>
<td>1970-02-01</td>
<td>77442.0</td>
</tr>
<tr>
<th>2</th>
<td>1970-03-01</td>
<td>90223.0</td>
</tr>
<tr>
<th>3</th>
<td>1970-04-01</td>
<td>89956.0</td>
</tr>
<tr>
<th>4</th>
<td>1970-05-01</td>
<td>97972.0</td>
</tr>
</tbody>
</table>
</div>
<p>Prophet is hard-coded to use specific column names; <code class="language-plaintext highlighter-rouge">ds</code> for dates and <code class="language-plaintext highlighter-rouge">y</code> for the target variable we want to predict.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Prophet requires column names to be 'ds' and 'y'
</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'ds'</span><span class="p">,</span> <span class="s">'y'</span><span class="p">]</span>
<span class="c1"># 'ds' needs to be datetime object
</span><span class="n">df</span><span class="p">[</span><span class="s">'ds'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'ds'</span><span class="p">])</span>
</code></pre></div></div>
<p>When plotting the original data, we can see a <strong>big, growing trend</strong> in the traffic volume, although there are some stagnant or even decreasing stretches (<strong>changes of rate</strong>) around 1980, 2008, and, most strikingly, 2020. Checking how Prophet handles these changes will be interesting. There is also a <strong>seasonal, periodic trend</strong> that repeats each year: volume rises until the middle of the year and falls again. Will Prophet capture this as well?</p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_7_0.png" />
</div>
<p><br /></p>
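<p>The plot above can be reproduced with a couple of lines (a minimal sketch; the original plotting code is not shown in this post):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ax = df.plot(x='ds', y='y', figsize=(12, 5), legend=False)
ax.set_ylabel('Traffic volume (millions of miles)')
plt.show()
</code></pre></div></div>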
<p>For the train/test split, remember that we cannot do a random split with time-series data. Given a cut-off point, we use ONLY the earlier part of the data for training and the later part for testing. Here, we use 2019-01-01 as our cut-off point.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># split data
</span><span class="n">train</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'ds'</span><span class="p">]</span> <span class="o"><</span> <span class="n">pd</span><span class="p">.</span><span class="n">Timestamp</span><span class="p">(</span><span class="s">'2019-01-01'</span><span class="p">)]</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'ds'</span><span class="p">]</span> <span class="o">>=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Timestamp</span><span class="p">(</span><span class="s">'2019-01-01'</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of months in train data: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">train</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of months in test data: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">test</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
Number of months in train data: 588
Number of months in test data: 21
</pre>
<p><a id="train"></a></p>
<h1 id="2-train-and-predict">2. Train And Predict</h1>
<p>Let’s train a Prophet model. You just initialize an object and <code class="language-plaintext highlighter-rouge">fit</code>! That’s all.</p>
<p>Prophet warns that it disabled weekly and daily seasonality. That’s fine because our data set is monthly so there is no weekly or daily seasonality.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet</span> <span class="kn">import</span> <span class="n">Prophet</span>
<span class="c1"># fit model - ignore train/test split for now
</span><span class="n">m</span> <span class="o">=</span> <span class="n">Prophet</span><span class="p">()</span>
<span class="n">m</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
<fbprophet.forecaster.Prophet at 0x121b8dc88>
</pre>
<p>When making predictions with Prophet, we need to prepare a special object called the future dataframe. It is a Pandas DataFrame with a single column <code class="language-plaintext highlighter-rouge">ds</code> that includes all datetimes within the training data plus the additional periods specified by the user.</p>
<p>The parameter <code class="language-plaintext highlighter-rouge">periods</code> is the number of points (rows) to predict after the end of the training data. The interval (parameter <code class="language-plaintext highlighter-rouge">freq</code>) defaults to ‘D’ (day), so we need to change it to ‘MS’ (month start) as our data is monthly. I set <code class="language-plaintext highlighter-rouge">periods=21</code>, the number of points in the test data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># future dataframe - placeholder object
</span><span class="n">future</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">make_future_dataframe</span><span class="p">(</span><span class="n">periods</span><span class="o">=</span><span class="mi">21</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'MS'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># start of the future df is same as the original data
</span><span class="n">future</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ds</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1970-01-01</td>
</tr>
<tr>
<th>1</th>
<td>1970-02-01</td>
</tr>
<tr>
<th>2</th>
<td>1970-03-01</td>
</tr>
<tr>
<th>3</th>
<td>1970-04-01</td>
</tr>
<tr>
<th>4</th>
<td>1970-05-01</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># end of the future df is original + 21 periods (21 months)
</span><span class="n">future</span><span class="p">.</span><span class="n">tail</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ds</th>
</tr>
</thead>
<tbody>
<tr>
<th>604</th>
<td>2020-05-01</td>
</tr>
<tr>
<th>605</th>
<td>2020-06-01</td>
</tr>
<tr>
<th>606</th>
<td>2020-07-01</td>
</tr>
<tr>
<th>607</th>
<td>2020-08-01</td>
</tr>
<tr>
<th>608</th>
<td>2020-09-01</td>
</tr>
</tbody>
</table>
</div>
<p>It’s time to make actual predictions. It’s simple - just <code class="language-plaintext highlighter-rouge">predict</code> with the placeholder DataFrame <code class="language-plaintext highlighter-rouge">future</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># predict the future
</span><span class="n">forecast</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">future</span><span class="p">)</span>
</code></pre></div></div>
<p>Prophet has a nice built-in plotting function to visualize the forecast. The black dots are the actual data and the blue line is the prediction. You can also use matplotlib functions to adjust the figure, such as adding a legend or setting xlim or ylim.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Prophet's own plotting tool to see
</span><span class="n">fig</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">forecast</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'Actual'</span><span class="p">,</span> <span class="s">'Prediction'</span><span class="p">,</span> <span class="s">'Uncertainty interval'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_20_0.png" />
</div>
<p><br /></p>
<p><a id="components"></a></p>
<h1 id="3-check-components">3. Check Components</h1>
<p>So, what is in the forecast DataFrame? Let’s take a look.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">forecast</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ds</th>
<th>trend</th>
<th>yhat_lower</th>
<th>yhat_upper</th>
<th>trend_lower</th>
<th>trend_upper</th>
<th>additive_terms</th>
<th>additive_terms_lower</th>
<th>additive_terms_upper</th>
<th>yearly</th>
<th>yearly_lower</th>
<th>yearly_upper</th>
<th>multiplicative_terms</th>
<th>multiplicative_terms_lower</th>
<th>multiplicative_terms_upper</th>
<th>yhat</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1970-01-01</td>
<td>94281.848744</td>
<td>69838.269924</td>
<td>81366.107613</td>
<td>94281.848744</td>
<td>94281.848744</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>-18700.514310</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>75581.334434</td>
</tr>
<tr>
<th>1</th>
<td>1970-02-01</td>
<td>94590.609819</td>
<td>61661.016554</td>
<td>73066.758942</td>
<td>94590.609819</td>
<td>94590.609819</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>-27382.307301</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>67208.302517</td>
</tr>
<tr>
<th>2</th>
<td>1970-03-01</td>
<td>94869.490789</td>
<td>89121.298723</td>
<td>99797.427717</td>
<td>94869.490789</td>
<td>94869.490789</td>
<td>37.306077</td>
<td>37.306077</td>
<td>37.306077</td>
<td>37.306077</td>
<td>37.306077</td>
<td>37.306077</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>94906.796867</td>
</tr>
<tr>
<th>3</th>
<td>1970-04-01</td>
<td>95178.251864</td>
<td>89987.904019</td>
<td>101154.016322</td>
<td>95178.251864</td>
<td>95178.251864</td>
<td>166.278079</td>
<td>166.278079</td>
<td>166.278079</td>
<td>166.278079</td>
<td>166.278079</td>
<td>166.278079</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>95344.529943</td>
</tr>
<tr>
<th>4</th>
<td>1970-05-01</td>
<td>95477.052904</td>
<td>99601.487207</td>
<td>110506.849617</td>
<td>95477.052904</td>
<td>95477.052904</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>9672.619044</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>105149.671948</td>
</tr>
</tbody>
</table>
</div>
<p>There are many columns in it, but the main one you care about is <code class="language-plaintext highlighter-rouge">yhat</code>, which holds the final predictions. The <code class="language-plaintext highlighter-rouge">_lower</code> and <code class="language-plaintext highlighter-rouge">_upper</code> suffixes mark the bounds of the uncertainty intervals.</p>
<ul>
<li>Final predictions: <code class="language-plaintext highlighter-rouge">yhat</code>, <code class="language-plaintext highlighter-rouge">yhat_lower</code>, and <code class="language-plaintext highlighter-rouge">yhat_upper</code></li>
</ul>
<p>Other columns are components that comprise the final prediction as we discussed in the introduction. Let’s compare Prophet’s components and what we see in our forecast DataFrame.</p>
\[y(t) = g(t) + s(t) + h(t) + e(t)\]
<ul>
<li>Growth ($g(t)$): <code class="language-plaintext highlighter-rouge">trend</code>, <code class="language-plaintext highlighter-rouge">trend_lower</code>, and <code class="language-plaintext highlighter-rouge">trend_upper</code></li>
<li>Seasonality ($s(t)$): <code class="language-plaintext highlighter-rouge">additive_terms</code>, <code class="language-plaintext highlighter-rouge">additive_terms_lower</code>, and <code class="language-plaintext highlighter-rouge">additive_terms_upper</code>
<ul>
<li>Yearly seasonality: <code class="language-plaintext highlighter-rouge">yearly</code>, <code class="language-plaintext highlighter-rouge">yearly_lower</code>, and<code class="language-plaintext highlighter-rouge">yearly_upper</code></li>
</ul>
</li>
</ul>
<p>The <code class="language-plaintext highlighter-rouge">additive_terms</code> represent the total seasonality effect, which here equals the yearly seasonality because we disabled weekly and daily seasonalities. All <code class="language-plaintext highlighter-rouge">multiplicative_terms</code> are zero because we used the default additive seasonality mode rather than the multiplicative mode, which I will explain later.</p>
<p>The holiday effect ($h(t)$) is not present here because we did not specify any holidays for the model, and our monthly data is too coarse to capture day-level holiday effects anyway.</p>
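<p>For instance, to pull out just the final predictions with their uncertainty bounds (a quick sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># final predictions and their uncertainty interval only
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
</code></pre></div></div>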
<p>Prophet also has a nice built-in function for plotting each component. When we plot our forecast data, we see two components: the general growth trend and the yearly seasonality that repeats throughout the years. If we had more components, such as weekly or daily seasonality, they would be presented here as well.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># plot components
</span><span class="n">fig</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">plot_components</span><span class="p">(</span><span class="n">forecast</span><span class="p">)</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_25_0.png" />
</div>
<p><br /></p>
<p><a id="eval"></a></p>
<h1 id="4-evaluate">4. Evaluate</h1>
<h2 id="41-evaluate-the-model-on-one-test-set">4.1. Evaluate the model on one test set</h2>
<p>How good is our model? One way to understand the model performance in this case is to simply calculate the root mean squared error (RMSE) between the actual and predicted values over the test period above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">statsmodels.tools.eval_measures</span> <span class="kn">import</span> <span class="n">rmse</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">forecast</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="o">-</span><span class="nb">len</span><span class="p">(</span><span class="n">test</span><span class="p">):][</span><span class="s">'yhat'</span><span class="p">]</span>
<span class="n">actuals</span> <span class="o">=</span> <span class="n">test</span><span class="p">[</span><span class="s">'y'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"RMSE: </span><span class="si">{</span><span class="nb">round</span><span class="p">(</span><span class="n">rmse</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="n">actuals</span><span class="p">))</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
RMSE: 32969.0
</pre>
<p>However, this probably under-represents the general model performance, because our data has a drastic change in the middle of the test period, a pattern the model has never seen before. If our data had only gone up to 2019, the error would have been much lower.</p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_31_0.png" />
</div>
<p><br /></p>
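<p>To see how much that unseen pattern inflates the error, we can recompute the RMSE on only the part of the test period before the change. This is just a sketch: the 2020 cutoff is a hypothetical choice based on the plot above, and it assumes the <code class="language-plaintext highlighter-rouge">ds</code> column holds datetimes.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># recompute RMSE excluding the drastic change (hypothetical 2020 cutoff)
mask = test['ds'].lt('2020-01-01').to_numpy()
print(f"RMSE (pre-2020 only): {round(rmse(predictions.to_numpy()[mask], actuals.to_numpy()[mask]))}")
</code></pre></div></div>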
<h2 id="42-cross-validation">4.2. Cross validation</h2>
<p>Alternatively, we can perform cross-validation. As previously discussed, time-series analysis strictly uses train data whose time range is earlier than that of the test data. Below is an example where we use 5 years of train data to predict 1 year of test data. The cut-off points are equally spaced with a 1-year gap.</p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/prophet_cv.png" alt="cv" />
<figcaption>Time-series cross validation
</figcaption>
</div>
<p><br /></p>
<p>Prophet also provides built-in model diagnostics tools to make it easy to perform this cross-validation. You just
need to define three parameters: horizon, initial, and period. The latter two are optional.</p>
<ul>
<li>horizon: test period of each fold</li>
<li>initial: minimum training period to start with</li>
<li>period: time gap between cut-off dates</li>
</ul>
<p>Make sure to define these parameters as strings in the format ‘X unit’, where X is a number and unit is ‘days’, ‘hours’, or any other unit compatible with <code class="language-plaintext highlighter-rouge">pd.Timedelta</code>. For example, <code class="language-plaintext highlighter-rouge">10 days</code>.</p>
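<p>If you are unsure whether a string is valid, you can test it directly with <code class="language-plaintext highlighter-rouge">pd.Timedelta</code> (a quick sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># any string pd.Timedelta accepts works as a horizon/initial/period value
import pandas as pd
pd.Timedelta('365 days')
</code></pre></div></div>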
<p>You can also define <code class="language-plaintext highlighter-rouge">parallel</code> to make the cross validation faster.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet.diagnostics</span> <span class="kn">import</span> <span class="n">cross_validation</span>
<span class="c1"># test period
</span><span class="n">horizon</span> <span class="o">=</span> <span class="s">'365 days'</span>
<span class="c1"># training period (optional; default is 3x the horizon)
</span><span class="n">initial</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="mi">365</span> <span class="o">*</span> <span class="mi">5</span><span class="p">)</span> <span class="o">+</span> <span class="s">' days'</span>
<span class="c1"># spacing between cutoff dates (optional. default is 0.5x of horizon)
</span><span class="n">period</span> <span class="o">=</span> <span class="s">'365 days'</span>
<span class="n">df_cv</span> <span class="o">=</span> <span class="n">cross_validation</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">initial</span><span class="o">=</span><span class="n">initial</span><span class="p">,</span> <span class="n">period</span><span class="o">=</span><span class="n">period</span><span class="p">,</span> <span class="n">horizon</span><span class="o">=</span><span class="n">horizon</span><span class="p">,</span> <span class="n">parallel</span><span class="o">=</span><span class="s">'processes'</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
INFO:fbprophet:Making 43 forecasts with cutoffs between 1975-12-12 00:00:00 and 2017-12-01 00:00:00
INFO:fbprophet:Applying in parallel with <concurrent.futures.process.ProcessPoolExecutor object at 0x12fb4d3c8>
</pre>
<p>This is the predicted output using cross-validation. There can be many predictions for the same timestamp if <code class="language-plaintext highlighter-rouge">period</code>
is smaller than <code class="language-plaintext highlighter-rouge">horizon</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># predicted output using cross validation
</span><span class="n">df_cv</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ds</th>
<th>yhat</th>
<th>yhat_lower</th>
<th>yhat_upper</th>
<th>y</th>
<th>cutoff</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1976-01-01</td>
<td>102282.737592</td>
<td>100862.769604</td>
<td>103589.684840</td>
<td>102460.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>1</th>
<td>1976-02-01</td>
<td>96811.141761</td>
<td>95360.095284</td>
<td>98247.364027</td>
<td>98528.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>2</th>
<td>1976-03-01</td>
<td>112360.483572</td>
<td>110908.136982</td>
<td>113775.264669</td>
<td>114284.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>3</th>
<td>1976-04-01</td>
<td>112029.016859</td>
<td>110622.916037</td>
<td>113458.999123</td>
<td>117014.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>4</th>
<td>1976-05-01</td>
<td>119161.998160</td>
<td>117645.653475</td>
<td>120579.267732</td>
<td>123278.0</td>
<td>1975-12-12</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>511</th>
<td>2018-08-01</td>
<td>279835.003826</td>
<td>274439.830747</td>
<td>285259.974314</td>
<td>284989.0</td>
<td>2017-12-01</td>
</tr>
<tr>
<th>512</th>
<td>2018-09-01</td>
<td>261911.246557</td>
<td>256328.677902</td>
<td>267687.122886</td>
<td>267434.0</td>
<td>2017-12-01</td>
</tr>
<tr>
<th>513</th>
<td>2018-10-01</td>
<td>268979.448383</td>
<td>263001.411543</td>
<td>274742.978202</td>
<td>281382.0</td>
<td>2017-12-01</td>
</tr>
<tr>
<th>514</th>
<td>2018-11-01</td>
<td>255612.520483</td>
<td>249813.339845</td>
<td>261179.979649</td>
<td>260473.0</td>
<td>2017-12-01</td>
</tr>
<tr>
<th>515</th>
<td>2018-12-01</td>
<td>257049.510224</td>
<td>251164.508448</td>
<td>263062.671327</td>
<td>270370.0</td>
<td>2017-12-01</td>
</tr>
</tbody>
</table>
<p>516 rows × 6 columns</p>
</div>
<p>Below are different performance metrics for different rolling windows. As we did not define a rolling window, Prophet calculated the metrics over many different window lengths and stacked them up in rows (e.g. 53 days, …, 365 days). Each metric is first calculated within each rolling window and then averaged across the available windows.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet.diagnostics</span> <span class="kn">import</span> <span class="n">cross_validation</span><span class="p">,</span> <span class="n">performance_metrics</span>
<span class="c1"># performance metrics
</span><span class="n">df_metrics</span> <span class="o">=</span> <span class="n">performance_metrics</span><span class="p">(</span><span class="n">df_cv</span><span class="p">)</span> <span class="c1"># can define window size, e.g. rolling_window=365
</span><span class="n">df_metrics</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>horizon</th>
<th>mse</th>
<th>rmse</th>
<th>mae</th>
<th>mape</th>
<th>mdape</th>
<th>coverage</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>53 days</td>
<td>3.886562e+07</td>
<td>6234.229883</td>
<td>5143.348348</td>
<td>0.030813</td>
<td>0.027799</td>
<td>0.352941</td>
</tr>
<tr>
<th>1</th>
<td>54 days</td>
<td>3.983610e+07</td>
<td>6311.584390</td>
<td>5172.484468</td>
<td>0.030702</td>
<td>0.027799</td>
<td>0.372549</td>
</tr>
<tr>
<th>2</th>
<td>55 days</td>
<td>4.272605e+07</td>
<td>6536.516453</td>
<td>5413.997433</td>
<td>0.031607</td>
<td>0.030305</td>
<td>0.352941</td>
</tr>
<tr>
<th>3</th>
<td>56 days</td>
<td>4.459609e+07</td>
<td>6678.030078</td>
<td>5662.344846</td>
<td>0.032630</td>
<td>0.031911</td>
<td>0.313725</td>
</tr>
<tr>
<th>4</th>
<td>57 days</td>
<td>4.341828e+07</td>
<td>6589.254589</td>
<td>5650.202377</td>
<td>0.032133</td>
<td>0.031481</td>
<td>0.313725</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>115</th>
<td>361 days</td>
<td>2.880647e+07</td>
<td>5367.165528</td>
<td>3960.025025</td>
<td>0.020118</td>
<td>0.015177</td>
<td>0.607843</td>
</tr>
<tr>
<th>116</th>
<td>362 days</td>
<td>3.158472e+07</td>
<td>5620.028791</td>
<td>4158.035261</td>
<td>0.020836</td>
<td>0.015177</td>
<td>0.588235</td>
</tr>
<tr>
<th>117</th>
<td>363 days</td>
<td>3.798731e+07</td>
<td>6163.384773</td>
<td>4603.360382</td>
<td>0.022653</td>
<td>0.017921</td>
<td>0.549020</td>
</tr>
<tr>
<th>118</th>
<td>364 days</td>
<td>4.615621e+07</td>
<td>6793.836092</td>
<td>4952.443173</td>
<td>0.023973</td>
<td>0.018660</td>
<td>0.529412</td>
</tr>
<tr>
<th>119</th>
<td>365 days</td>
<td>5.428934e+07</td>
<td>7368.129817</td>
<td>5262.131511</td>
<td>0.024816</td>
<td>0.018660</td>
<td>0.529412</td>
</tr>
</tbody>
</table>
<p>120 rows × 7 columns</p>
</div>
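<p>If you would rather have a single summary row instead of one row per horizon length, you can aggregate over the whole horizon by setting <code class="language-plaintext highlighter-rouge">rolling_window=1</code> (a sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># one aggregate row of metrics over the entire horizon
df_agg = performance_metrics(df_cv, rolling_window=1)
df_agg
</code></pre></div></div>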
<p><a id="trend"></a></p>
<h1 id="5-trend-change-points">5. Trend Change Points</h1>
<p>Another interesting functionality of <code class="language-plaintext highlighter-rouge">Prophet</code> is <code class="language-plaintext highlighter-rouge">add_changepoints_to_plot</code>. As we discussed in the earlier sections, there are a couple of points where the growth rate changes. Prophet can find those points automatically and plot them!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet.plot</span> <span class="kn">import</span> <span class="n">add_changepoints_to_plot</span>
<span class="c1"># plot change points
</span><span class="n">fig</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">forecast</span><span class="p">)</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">add_changepoints_to_plot</span><span class="p">(</span><span class="n">fig</span><span class="p">.</span><span class="n">gca</span><span class="p">(),</span> <span class="n">m</span><span class="p">,</span> <span class="n">forecast</span><span class="p">)</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_42_0.png" />
</div>
<p><br /></p>
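<p>If the detected trend looks too rigid or too flexible, Prophet exposes a knob for it. Below is a sketch; <code class="language-plaintext highlighter-rouge">changepoint_prior_scale</code> defaults to 0.05, and <code class="language-plaintext highlighter-rouge">m_flex</code> is just an illustrative name.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># a larger changepoint_prior_scale allows a more flexible trend
m_flex = Prophet(changepoint_prior_scale=0.5)
# after fitting, the detected changepoint dates are available as m_flex.changepoints
</code></pre></div></div>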
<p><a id="season"></a></p>
<h1 id="6-seasonality-mode">6. Seasonality Mode</h1>
<p>Seasonality can be additive (its amplitude stays constant over time) or multiplicative (its amplitude changes with the trend). When you look at the original data, the amplitude of the seasonality changes: smaller in the early years and bigger in the later years. So this would be a <code class="language-plaintext highlighter-rouge">multiplicative</code> seasonality case rather than an <code class="language-plaintext highlighter-rouge">additive</code> one. We can adjust the <code class="language-plaintext highlighter-rouge">seasonality_mode</code> parameter to take this effect into account.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># additive mode
</span><span class="n">m</span> <span class="o">=</span> <span class="n">Prophet</span><span class="p">(</span><span class="n">seasonality_mode</span><span class="o">=</span><span class="s">'additive'</span><span class="p">)</span>
<span class="c1"># multiplicative mode
</span><span class="n">m</span> <span class="o">=</span> <span class="n">Prophet</span><span class="p">(</span><span class="n">seasonality_mode</span><span class="o">=</span><span class="s">'multiplicative'</span><span class="p">)</span>
</code></pre></div></div>
<p>You can see that the blue lines (predictions) are more in line with the black dots (actuals) when in multiplicative
seasonality mode.</p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_44_1.png" />
</div>
<p><br /></p>
<div style="text-align:center">
<img src="/images/2020-12-15-prophet-intro/output_45_1.png" />
</div>
<p><br /></p>
<p><a id="save"></a></p>
<h1 id="7-saving-model">7. Saving Model</h1>
<p>We can also easily export and load the trained model as JSON.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">fbprophet.serialize</span> <span class="kn">import</span> <span class="n">model_to_json</span><span class="p">,</span> <span class="n">model_from_json</span>
<span class="c1"># Save model
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'serialized_model.json'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
<span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">model_to_json</span><span class="p">(</span><span class="n">m</span><span class="p">),</span> <span class="n">fout</span><span class="p">)</span>
<span class="c1"># Load model
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'serialized_model.json'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">:</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">model_from_json</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">fin</span><span class="p">))</span>
</code></pre></div></div>
<p><a id="ref"></a></p>
<h1 id="8-references">8. References</h1>
<ul>
<li><a href="https://facebook.github.io/prophet/docs/quick_start.html#python-api">Prophet documentation</a></li>
<li><a href="https://github.com/facebook/prophet">Prophet GitHub repository</a></li>
<li><a href="https://peerj.com/preprints/3190/">Prophet paper: Forecasting at scale</a></li>
<li><a href="https://cran.r-project.org/web/packages/prophet/prophet.pdf">Prophet in R</a></li>
<li><a href="https://fred.stlouisfed.org/series/TRFVOLUSM227NFWA">U.S. traffic volume data</a></li>
<li><a href="https://www.udemy.com/course/python-for-time-series-data-analysis/">Python for Time Series Data Analysis</a></li>
</ul>Getting “Failed to build gem native extension” Error After Upgrading to Mac OS Big Sur2020-12-14T05:00:00+00:002020-12-14T05:00:00+00:00/2020/12/14/ruby-big-sur<h2 id="problem">Problem</h2>
<p>After upgrading to Mac OS Big Sur, I saw an error I hadn’t seen before:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>➜ bundle <span class="nb">exec </span>jekyll serve
Could not find commonmarker-0.17.13 <span class="k">in </span>any of the sources
Run <span class="sb">`</span>bundle <span class="nb">install</span><span class="sb">`</span> to <span class="nb">install </span>missing gems.
</code></pre></div></div>
<p>So I ran <code class="language-plaintext highlighter-rouge">bundle install</code> but it didn’t seem to work either. An excerpt from the terminal output:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>➜ bundle <span class="nb">install
</span>Gem::Ext::BuildError: ERROR: Failed to build gem native extension.
An error occurred <span class="k">while </span>installing commonmarker <span class="o">(</span>0.17.13<span class="o">)</span>, and Bundler cannot <span class="k">continue</span><span class="nb">.</span>
Make sure that <span class="sb">`</span>gem <span class="nb">install </span>commonmarker <span class="nt">-v</span> <span class="s1">'0.17.13'</span> <span class="nt">--source</span> <span class="s1">'https://rubygems.org/'</span><span class="sb">`</span> succeeds before bundling.
An error occurred <span class="k">while </span>installing unf_ext <span class="o">(</span>0.0.7.7<span class="o">)</span>, and Bundler cannot <span class="k">continue</span><span class="nb">.</span>
Make sure that <span class="sb">`</span>gem <span class="nb">install </span>unf_ext <span class="nt">-v</span> <span class="s1">'0.0.7.7'</span> <span class="nt">--source</span> <span class="s1">'https://rubygems.org/'</span><span class="sb">`</span> succeeds before bundling.
An error occurred <span class="k">while </span>installing rdiscount <span class="o">(</span>2.2.0.2<span class="o">)</span>, and Bundler cannot <span class="k">continue</span><span class="nb">.</span>
Make sure that <span class="sb">`</span>gem <span class="nb">install </span>rdiscount <span class="nt">-v</span> <span class="s1">'2.2.0.2'</span> <span class="nt">--source</span> <span class="s1">'https://rubygems.org/'</span><span class="sb">`</span> succeeds before bundling.
</code></pre></div></div>
<h2 id="solution">Solution</h2>
<p>After some research and trial and error, I found <a href="https://stackoverflow.com/a/65017115/9449085">this precious answer</a> on Stack Overflow explaining that the error is due to a Ruby version that is not compatible with Big Sur; it should be at least 2.7. So I checked the <a href="https://www.ruby-lang.org/en/downloads/releases/">Ruby releases</a> and decided to go with one of the most recent releases: 2.7.2.</p>
<p>Anyway, here are the steps that I believe worked:</p>
<ol>
<li>Check Ruby version
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ ruby <span class="nt">-v</span>
ruby 2.6.3p62 <span class="o">(</span>2019-04-16 revision 67580<span class="o">)</span> <span class="o">[</span>universal.x86_64-darwin20]
</code></pre></div> </div>
</li>
<li>Install Ruby Version Manager (rvm)
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/rvm/rvm/master/binscripts/rvm-installer | bash <span class="nt">-s</span> stable
</code></pre></div> </div>
</li>
<li>Install 2.7.2 version using rvm
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ rvm <span class="nb">install</span> <span class="s2">"ruby-2.7.2"</span>
</code></pre></div> </div>
</li>
<li>Check Ruby version again
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ ruby <span class="nt">-v</span>
ruby 2.7.2p137 <span class="o">(</span>2020-10-01 revision 5445e04352<span class="o">)</span> <span class="o">[</span>x86_64-darwin20]
</code></pre></div> </div>
</li>
<li>Bundle install
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ bundle <span class="nb">install</span>
</code></pre></div> </div>
</li>
<li>Run bundle
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ bundle <span class="nb">exec </span>jekyll serve
</code></pre></div> </div>
</li>
</ol>
<p>And it worked!</p>
<h2 id="some-other-things-i-tried">Some other things I tried</h2>
<ol>
<li>
<p>Installing Ruby through Homebrew: this didn’t solve the issue on its own, though I don’t know whether it eventually helped.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ brew <span class="nb">install </span>ruby
</code></pre></div> </div>
</li>
<li>
<p>Installing the latest version of Ruby <a href="https://stackoverflow.com/a/38194139/9449085">(ref)</a> using rvm: didn’t update the version for some reason.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ➜ rvm <span class="nb">install </span>ruby@latest
</code></pre></div> </div>
</li>
</ol>Finding The Best Feature Engineering Strategy Using sklearn GridSearchCV2020-12-06T03:49:00+00:002020-12-06T03:49:00+00:00/2020/12/06/missing-imputation-with-gridsearchcv<p><br /></p>
<div style="text-align:center">
<img src="/images/search.jpg" alt="search" />
<figcaption>Photo by <a href="https://unsplash.com/@laughayette?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marten Newhall</a> on <a href="https://unsplash.com/s/photos/search?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></figcaption>
</div>
<p><br /></p>
<p>We previously reviewed a few missing data imputation strategies using sklearn in <a href="/python/2020/11/21/Missing-data-imputation-using-sklearn.html">this post</a>, but which one should we use? How do we know which one works best for our data? Should we manually write a script to fit a model for each strategy and track the model performance? We could, but it would be a headache to track many different models, especially if we use cross validation to get more reliable experiment results.</p>
<p>Fortunately, sklearn offers great tools to streamline and optimize the process, which are <code class="language-plaintext highlighter-rouge">GridSearchCV</code> and <code class="language-plaintext highlighter-rouge">Pipeline</code>! You might be already familiar with using <code class="language-plaintext highlighter-rouge">GridSearchCV</code> for finding optimal hyperparameters of a model, but you might not be familiar with using it for finding optimal feature engineering strategies.</p>
<p>In this post, I would like to walk you through how <code class="language-plaintext highlighter-rouge">GridSearchCV</code> and <code class="language-plaintext highlighter-rouge">Pipeline</code> can be used to find the best feature engineering strategies for the given data. We will focus on missing data imputation strategies here but it can be used for any other feature engineering steps or combinations.</p>
<h1 id="table-of-conents">Table of Contents</h1>
<ol>
<li><a href="#prep">Prepare Data</a></li>
<li><a href="#base">Setup a Base Pipeline</a></li>
<li><a href="#find_best">Finding The Best Imputation Technique Using GridSearchCV</a></li>
<li><a href="#ref">References</a></li>
</ol>
<p><a id="prep"></a></p>
<h1 id="1-prepare-data">1. Prepare Data</h1>
<p>First, import necessary libraries and prepare data. We will use the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">house price data from Kaggle</a> in this post.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="c1"># preparing data
</span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># feature scaling, encoding
</span><span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">SimpleImputer</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span><span class="p">,</span> <span class="n">OneHotEncoder</span>
<span class="c1"># putting together in pipeline
</span><span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.compose</span> <span class="kn">import</span> <span class="n">ColumnTransformer</span>
<span class="c1"># model selection
</span><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Lasso</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># import house price data
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'../data/house_price/train.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'Id'</span><span class="p">)</span>
<span class="c1"># find numerical columns vs. categorical columns, except for the target ('SalePrice')
</span><span class="n">num_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'number'</span><span class="p">).</span><span class="n">columns</span>
<span class="n">cat_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'object'</span><span class="p">).</span><span class="n">columns</span>
<span class="c1"># define X and y for GridSearchCV
</span><span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'SalePrice'</span><span class="p">]</span>
<span class="c1"># split train and test dataset
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">df</span><span class="p">[</span><span class="s">'SalePrice'</span><span class="p">],</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p><a id="base"></a></p>
<h1 id="2-setup-a-base-pipeline">2. Setup a Base Pipeline</h1>
<h2 id="21-define-pipelines">2.1. Define Pipelines</h2>
<p>The next step is defining a base <code class="language-plaintext highlighter-rouge">Pipeline</code> for our model as below.</p>
<ol>
<li>
<p>Define two feature preprocessing pipelines; one for numerical variables (<code class="language-plaintext highlighter-rouge">num_pipe</code>) and the other for categorical variables (<code class="language-plaintext highlighter-rouge">cat_pipe</code>). <code class="language-plaintext highlighter-rouge">num_pipe</code> has <code class="language-plaintext highlighter-rouge">SimpleImputer</code> for missing data imputation and <code class="language-plaintext highlighter-rouge">StandardScaler</code> for scaling data. <code class="language-plaintext highlighter-rouge">cat_pipe</code> has <code class="language-plaintext highlighter-rouge">SimpleImputer</code> for missing data imputation and <code class="language-plaintext highlighter-rouge">OneHotEncoder</code> for encoding categorical data as numerical data.</p>
</li>
<li>
<p>Combine those two pipelines together using <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> to apply them to a different set of columns.</p>
</li>
<li>
<p>Define the final pipeline called <code class="language-plaintext highlighter-rouge">pipe</code> by putting the <code class="language-plaintext highlighter-rouge">preprocess</code> pipeline together with an estimator, which is <code class="language-plaintext highlighter-rouge">Lasso</code> regression in this example.</p>
</li>
</ol>
<p>For details of this pipeline, please check out the previous post <a href="/python/2020/11/28/pipeline_columntransformer.html">Combining Feature Engineering and Model Fitting
(Pipeline vs. ColumnTransformer)</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># feature engineering pipeline for numerical variables
</span><span class="n">num_pipe</span><span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">,</span> <span class="n">add_indicator</span><span class="o">=</span><span class="bp">False</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">())])</span>
<span class="c1"># feature engineering pipeline for categorical variables
# Note: fill_value='Missing' is not used for strategy='most_frequent' but defined here for later use
</span><span class="n">cat_pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">'Missing'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'encoder'</span><span class="p">,</span> <span class="n">OneHotEncoder</span><span class="p">(</span><span class="n">handle_unknown</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">))])</span>
<span class="c1"># put numerical and categorical feature engineering pipelines together
</span><span class="n">preprocess</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">([(</span><span class="s">"num_pipe"</span><span class="p">,</span> <span class="n">num_pipe</span><span class="p">,</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">"cat_pipe"</span><span class="p">,</span> <span class="n">cat_pipe</span><span class="p">,</span> <span class="n">cat_cols</span><span class="p">)])</span>
<span class="c1"># put transformers and an estimator together
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'preprocess'</span><span class="p">,</span> <span class="n">preprocess</span><span class="p">),</span>
<span class="p">(</span><span class="s">'lasso'</span><span class="p">,</span> <span class="n">Lasso</span><span class="p">(</span><span class="n">max_iter</span><span class="o">=</span><span class="mi">10000</span><span class="p">))])</span>
</code></pre></div></div>
<h2 id="22-fit-pipeline">2.2. Fit Pipeline</h2>
<p>Okay, so let’s fit the model with our train data and test with the test data. Here, we get 0.63 for the score, which is $R^2$ of the prediction in this case (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso.score">sklearn Lasso</a>).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fit model
</span><span class="n">pipe</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
0.6308258188969262
</pre>
<p>We could also cross validate the model using <code class="language-plaintext highlighter-rouge">cross_val_score</code>. It splits the whole data into 5 sets and calculates the score 5 times by fitting and testing with different sets each time.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cross validate
</span><span class="n">cross_val_score</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<pre class="output">
array([0.85570392, 0.8228412 , 0.80381056, 0.88846653, 0.63236809])
</pre>
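<p>To reduce those five scores to a single number, we can take their mean and standard deviation (a quick sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># summarize the cross-validation scores
scores = cross_val_score(pipe, X, y, cv=5)
print(f"R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
</code></pre></div></div>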
<h2 id="23-diffferent-parameters-to-test">2.3. Different Parameters To Test</h2>
<p>Let’s say we want to try different combinations of missing data imputation strategies for <code class="language-plaintext highlighter-rouge">SimpleImputer</code>, such as both <code class="language-plaintext highlighter-rouge">'median'</code> and <code class="language-plaintext highlighter-rouge">'mean'</code> for <code class="language-plaintext highlighter-rouge">strategy</code> and both <code class="language-plaintext highlighter-rouge">True</code> and <code class="language-plaintext highlighter-rouge">False</code> for <code class="language-plaintext highlighter-rouge">add_indicator</code>. To compare all of the cases, we need to test 4 different models with the following numerical variable imputation methods:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SimpleImputer(strategy='mean', add_indicator=False)
SimpleImputer(strategy='median', add_indicator=False)
SimpleImputer(strategy='mean', add_indicator=True)
SimpleImputer(strategy='median', add_indicator=True)
</code></pre></div></div>
<p>We could copy and paste the script we wrote above, replace the corresponding step, and compare the performance of each case. It would not be too bad for the 4 combinations. But what if we want to test more combinations such as <code class="language-plaintext highlighter-rouge">strategy='constant'</code> and <code class="language-plaintext highlighter-rouge">strategy='most_frequent'</code> for categorical variables? Now it becomes 8 combinations ($2 \times 2 \times 2 = 8$).</p>
<p>The more parameters we add, the more cases we have to test and track (exponentially growing cases!). But don’t worry! We have <code class="language-plaintext highlighter-rouge">GridSearchCV</code>.</p>
<p><a id="find_best"></a></p>
<h1 id="3-finding-the-best-imputation-technique-using-gridsearchcv">3. Finding The Best Imputation Technique Using GridSearchCV</h1>
<h2 id="31-what-is-gridsearchcv">3.1. What Is GridSearchCV?</h2>
<p><code class="language-plaintext highlighter-rouge">GridSearchCV</code> is a sklearn class used to find the parameters with the best cross-validation score within a given search space (the set of parameter combinations). It can be used not only for hyperparameter tuning of estimators (e.g. <code class="language-plaintext highlighter-rouge">alpha</code> for Lasso), but also for parameters in any preprocessing step. We just need to define the parameters we want to optimize and pass them to <code class="language-plaintext highlighter-rouge">GridSearchCV()</code> as a dictionary.</p>
<p>The rule for defining the grid search parameter key-value pair is the following:</p>
<ol>
<li>Key: a string that combines the name of the step with the name of the parameter, joined by two underscores</li>
<li>Value: a list of parameter values to test</li>
</ol>
<p>In short, it’s <code class="language-plaintext highlighter-rouge">{'step_name__parameter_name': a list of values}</code>. For example, if the step name is <code class="language-plaintext highlighter-rouge">lasso</code> and the parameter name is <code class="language-plaintext highlighter-rouge">alpha</code>, your grid search param becomes:
<code class="language-plaintext highlighter-rouge">{'lasso__alpha': [1, 5, 10]}</code></p>
<h2 id="32-defining-nested-parameters">3.2. Defining Nested Parameters</h2>
<p>What about nested parameters that we have in our case? For example, our missing data imputation strategy for numerical variables is a few steps away from the final pipeline such as <code class="language-plaintext highlighter-rouge">preprocess</code> –> <code class="language-plaintext highlighter-rouge">num_pipe</code> –> <code class="language-plaintext highlighter-rouge">imputer</code>.</p>
<p>Even for those cases, we can simply expand the key by chaining the step names together, each joined with two underscores:</p>
<p><code class="language-plaintext highlighter-rouge">{'preprocess__num_pipe__imputer__strategy': ['mean', 'median', 'most_frequent']}</code></p>
<h2 id="33-defining-and-fitting-gridsearchcv">3.3. Defining and Fitting GridSearchCV</h2>
<p>With the basics of <code class="language-plaintext highlighter-rouge">GridSearchCV</code>, let’s define <code class="language-plaintext highlighter-rouge">GridSearchCV</code> and its parameters for our problem.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define the GridSearchCV parameters
</span><span class="n">param</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">preprocess__num_pipe__imputer__strategy</span><span class="o">=</span><span class="p">[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'median'</span><span class="p">,</span> <span class="s">'most_frequent'</span><span class="p">],</span>
<span class="n">preprocess__num_pipe__imputer__add_indicator</span><span class="o">=</span><span class="p">[</span><span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">],</span>
<span class="n">preprocess__cat_pipe__imputer__strategy</span><span class="o">=</span><span class="p">[</span><span class="s">'most_frequent'</span><span class="p">,</span> <span class="s">'constant'</span><span class="p">])</span>
<span class="c1"># define GridSearchCV
</span><span class="n">grid_search</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">param</span><span class="p">)</span>
</code></pre></div></div>
<p>Now it’s time to find the best parameters by simply running <code class="language-plaintext highlighter-rouge">fit</code>!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># search the best parameters by fitting the GridSearchCV
</span><span class="n">grid_search</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
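<p>Note that <code class="language-plaintext highlighter-rouge">fit</code> here trains one model per parameter combination per cross-validation fold, so it can take a while. If it is slow, <code class="language-plaintext highlighter-rouge">n_jobs=-1</code> runs the work in parallel (a sketch; <code class="language-plaintext highlighter-rouge">refit=True</code> is the default, so the best pipeline is automatically retrained on the whole data at the end):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># run the grid search in parallel across all CPU cores
grid_search = GridSearchCV(pipe, param, cv=5, n_jobs=-1)
grid_search.fit(X, y)
</code></pre></div></div>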
<h2 id="34-checking-the-results">3.4. Checking the results</h2>
<p>To check the combinations of parameters we tested and their performance on each cross-validation set, in terms of both score and time, we can use the attribute <code class="language-plaintext highlighter-rouge">.cv_results_</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># check out the results
</span><span class="n">grid_search</span><span class="p">.</span><span class="n">cv_results_</span>
</code></pre></div></div>
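<p>The raw dictionary is a bit hard to read, so a common trick is to load it into a DataFrame and sort by rank (a sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cv_results_ as a sorted DataFrame
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score')
</code></pre></div></div>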
<p>So, which model did the <code class="language-plaintext highlighter-rouge">GridSearchCV</code> find to be most effective, and what’s its score? Let’s check out the <code class="language-plaintext highlighter-rouge">.best_params_</code> and <code class="language-plaintext highlighter-rouge">.best_score_</code> attributes for that.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># check out the best parameter combination found
</span><span class="n">grid_search</span><span class="p">.</span><span class="n">best_params_</span>
</code></pre></div></div>
<pre class="output">
{'preprocess__cat_pipe__imputer__strategy': 'constant',
'preprocess__num_pipe__imputer__add_indicator': False,
'preprocess__num_pipe__imputer__strategy': 'most_frequent'}
</pre>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># score
</span><span class="n">grid_search</span><span class="p">.</span><span class="n">best_score_</span>
</code></pre></div></div>
<pre class="output">
0.8058139542143075
</pre>
<p>Awesome! It seems that imputing categorical variables with a <code class="language-plaintext highlighter-rouge">constant</code> value and numerical variables with their <code class="language-plaintext highlighter-rouge">most_frequent</code> values, without a missing indicator, was most effective in this case. Again, the best missing data imputation strategy depends on the data and the model. Try it out with your data and see what works best for yours!</p>
<p><a id="ref"></a></p>
<h1 id="4-references">4. References</h1>
<ul>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">sklearn GridSearchCV</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">sklearn Pipeline</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html">sklearn ColumnTransformer</a></li>
<li><a href="https://www.udemy.com/course/feature-engineering-for-machine-learning/">Feature Engineering for Machine Learning</a></li>
</ul>Combining Feature Engineering and Model Fitting (Pipeline vs. ColumnTransformer)2020-11-28T04:00:00+00:002020-11-28T04:00:00+00:00/python/2020/11/28/pipeline_columntransformer<p><br /></p>
<div style="text-align:center">
<img src="/images/pipeline_columntransformer/pipeline_unsplash.jpg" alt="drawing" />
<figcaption>Photo by <a href="https://unsplash.com/@spacexuan?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Crystal Kwok</a> on <a href="https://unsplash.com/s/photos/pipes?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></figcaption>
</div>
<p><br /></p>
<p>In the <a href="/python/2020/11/21/Missing-data-imputation-using-sklearn.html">previous post</a>, we learned about
various missing data imputation strategies using
scikit-learn. Before diving
into finding the best imputation method for a given problem, I would like to first introduce two scikit-learn
classes, <code class="language-plaintext highlighter-rouge">Pipeline</code> and <code class="language-plaintext highlighter-rouge">ColumnTransformer</code>.</p>
<p>Both <code class="language-plaintext highlighter-rouge">Pipeline</code> and <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> are used to combine different transformers (i.e. feature engineering steps such
as
<code class="language-plaintext highlighter-rouge">SimpleImputer</code> and <code class="language-plaintext highlighter-rouge">OneHotEncoder</code>) to transform data. However, there are two major differences between them:</p>
<p><strong>1. <code class="language-plaintext highlighter-rouge">Pipeline</code> can hold transformers and/or an estimator (model), whereas <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> is only for transformers</strong> <br />
<strong>2. <code class="language-plaintext highlighter-rouge">Pipeline</code> runs its steps sequentially, whereas <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> runs its transformers in parallel and independently</strong></p>
<p>Don’t worry if this sounds too complicated! I will walk you through what I mean by the above statements with code examples. I had a lot of fun while digging into these two classes, so I hope you enjoy and find it useful at the end as well!</p>
<h1 id="table-of-conents">Table of Contents</h1>
<ol>
<li><a href="#prep">Prepare Data</a></li>
<li><a href="#pipeline">Put Transformers and an Estimator Together: Pipeline</a></li>
<li><a href="#columntransformer">Apply Transformers to Different Columns: ColumnTransformer</a></li>
<li><a href="#separate">Separate Feature Engineering Pipelines for Numerical and Categorical Variables</a></li>
<li><a href="#final">Final Pipeline</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ol>
<p><a id="prep"></a></p>
<h1 id="0-prepare-data">0. Prepare Data</h1>
<p>Let’s first prepare the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">house price data from Kaggle</a> we will be using in this post. The data is preprocessed by replacing <code class="language-plaintext highlighter-rouge">'?'</code> with <code class="language-plaintext highlighter-rouge">NaN</code>. Do not forget to split the data into train and test sets before performing any feature engineering steps to avoid data leakage!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="c1"># preparing data
</span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># feature engineering: imputation, scaling, encoding
</span><span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">SimpleImputer</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span><span class="p">,</span> <span class="n">OneHotEncoder</span>
<span class="c1"># putting together in pipeline
</span><span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.compose</span> <span class="kn">import</span> <span class="n">ColumnTransformer</span>
<span class="c1"># model to use
</span><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Lasso</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># import house price data
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'../data/house_price/train.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'Id'</span><span class="p">)</span>
<span class="c1"># numerical columns vs. categorical columns
</span><span class="n">num_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'number'</span><span class="p">).</span><span class="n">columns</span>
<span class="n">cat_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'object'</span><span class="p">).</span><span class="n">columns</span>
<span class="c1"># split train and test dataset
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">df</span><span class="p">[</span><span class="s">'SalePrice'</span><span class="p">],</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># check the size of train and test data
</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">X_test</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((1022, 79), (438, 79))
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div class="table-wrapper">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>MSSubClass</th>
<th>MSZoning</th>
<th>LotFrontage</th>
<th>LotArea</th>
<th>Street</th>
<th>Alley</th>
<th>LotShape</th>
<th>LandContour</th>
<th>Utilities</th>
<th>LotConfig</th>
<th>...</th>
<th>ScreenPorch</th>
<th>PoolArea</th>
<th>PoolQC</th>
<th>Fence</th>
<th>MiscFeature</th>
<th>MiscVal</th>
<th>MoSold</th>
<th>YrSold</th>
<th>SaleType</th>
<th>SaleCondition</th>
</tr>
<tr>
<th>Id</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>65</th>
<td>60</td>
<td>RL</td>
<td>NaN</td>
<td>9375</td>
<td>Pave</td>
<td>NaN</td>
<td>Reg</td>
<td>Lvl</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>GdPrv</td>
<td>NaN</td>
<td>0</td>
<td>2</td>
<td>2009</td>
<td>WD</td>
<td>Normal</td>
</tr>
<tr>
<th>683</th>
<td>120</td>
<td>RL</td>
<td>NaN</td>
<td>2887</td>
<td>Pave</td>
<td>NaN</td>
<td>Reg</td>
<td>HLS</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>11</td>
<td>2008</td>
<td>WD</td>
<td>Normal</td>
</tr>
<tr>
<th>961</th>
<td>20</td>
<td>RL</td>
<td>50.0</td>
<td>7207</td>
<td>Pave</td>
<td>NaN</td>
<td>IR1</td>
<td>Lvl</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>2</td>
<td>2010</td>
<td>WD</td>
<td>Normal</td>
</tr>
<tr>
<th>1385</th>
<td>50</td>
<td>RL</td>
<td>60.0</td>
<td>9060</td>
<td>Pave</td>
<td>NaN</td>
<td>Reg</td>
<td>Lvl</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>MnPrv</td>
<td>NaN</td>
<td>0</td>
<td>10</td>
<td>2009</td>
<td>WD</td>
<td>Normal</td>
</tr>
<tr>
<th>1101</th>
<td>30</td>
<td>RL</td>
<td>60.0</td>
<td>8400</td>
<td>Pave</td>
<td>NaN</td>
<td>Reg</td>
<td>Bnk</td>
<td>AllPub</td>
<td>Inside</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>1</td>
<td>2009</td>
<td>WD</td>
<td>Normal</td>
</tr>
</tbody>
</table>
<p>5 rows × 79 columns</p>
</div>
</div>
<p><a id="pipeline"></a></p>
<h1 id="1-put-transformers-and-an-estimator-together-pipeline">1. Put Transformers and an Estimator Together: Pipeline</h1>
<p>Let’s say we want to train a Lasso regression model that predicts <code class="language-plaintext highlighter-rouge">SalePrice</code>. Instead of using all of the 79 variables we have, let’s use only numerical variables this time.</p>
<p>I already know there is plenty of missing data in some columns (e.g. <code class="language-plaintext highlighter-rouge">LotFrontage</code>, <code class="language-plaintext highlighter-rouge">MasVnrArea</code>, and <code class="language-plaintext highlighter-rouge">GarageYrBlt</code> among numerical columns), so we want to perform missing data imputation before fitting a model. Also, let’s say we also want to scale the data using <code class="language-plaintext highlighter-rouge">StandardScaler</code> because the scale of variables is all different.</p>
<p>This is what we would do normally to fit a model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># take only numerical data
</span><span class="n">X_temp</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">num_cols</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="c1"># missing data imputation
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)</span>
<span class="n">X_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_temp</span><span class="p">)</span> <span class="c1"># np.ndarray
</span><span class="n">X_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_temp</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span> <span class="c1"># pd.DataFrame
</span>
<span class="c1"># scale data
</span><span class="n">scaler</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">()</span>
<span class="n">X_scale</span> <span class="o">=</span> <span class="n">scaler</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_impute</span><span class="p">)</span> <span class="c1"># np.ndarray
</span><span class="n">X_scale</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_scale</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">X_temp</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span> <span class="c1"># pd.DataFrame
</span>
<span class="c1"># fit model
</span><span class="n">lasso</span> <span class="o">=</span> <span class="n">Lasso</span><span class="p">()</span>
<span class="n">lasso</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_scale</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">lasso</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_scale</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.8419801151434141
</code></pre></div></div>
<p>This is great but we have to manually move data from one step to another: we pass the output of the first step (<code class="language-plaintext highlighter-rouge">SimpleImputer</code>) to the second step (<code class="language-plaintext highlighter-rouge">StandardScaler</code>) as an input (<code class="language-plaintext highlighter-rouge">X_impute</code>). And then, the output of the second step (<code class="language-plaintext highlighter-rouge">StandardScaler</code>) is passed to the third step (<code class="language-plaintext highlighter-rouge">Lasso</code>) as an input (<code class="language-plaintext highlighter-rouge">X_scale</code>). If we have more feature engineering steps, it will be more complex to handle different inputs and outputs. So, here <code class="language-plaintext highlighter-rouge">Pipeline</code> comes to the rescue!</p>
<p><strong>With <code class="language-plaintext highlighter-rouge">Pipeline</code>, you can combine transformers and an estimator (model) together</strong>. You can transform your data and then fit a model with the transformed data. You just need to pass a list of tuples defining the steps in order: (step_name, transformer or estimator object). Let’s rewrite the same logic using <code class="language-plaintext highlighter-rouge">Pipeline</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define feature engineering and model together
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'lasso'</span><span class="p">,</span> <span class="n">Lasso</span><span class="p">())])</span>
<span class="c1"># fit model
</span><span class="n">pipe</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_temp</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_temp</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.8419801151434141
</code></pre></div></div>
<p>Awesome! We saved a lot of lines and it looks much cleaner and more understandable! As you can see, <strong>Pipeline passes the first step’s output to the next step as its input, meaning Pipeline is sequential</strong>.</p>
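<p>As a side note, once the pipeline is fitted you can still inspect each fitted step through the <code class="language-plaintext highlighter-rouge">named_steps</code> attribute. A minimal sketch, assuming the <code class="language-plaintext highlighter-rouge">pipe</code> fitted above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># look up fitted steps by the names we gave them
fitted_imputer = pipe.named_steps['imputer']
print(fitted_imputer.statistics_)  # the column means learned by SimpleImputer
fitted_lasso = pipe.named_steps['lasso']
print(fitted_lasso.coef_[:5])      # a few learned Lasso coefficients
</code></pre></div></div>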
<p><a id="columntransformer"></a></p>
<h1 id="2-apply-transformers-to-different-columns-columntransformer">2. Apply Transformers to Different Columns: ColumnTransformer</h1>
<p>Let’s go back to our original dataset where we had both numerical and categorical variables. Because we cannot apply
mean imputation to categorical variables (there is no ‘mean’ in categories!), we would want to use something different. One of the commonly used techniques is mode imputation (filling with the most frequent category), so let’s use that.</p>
<p>Mean imputation for numerical variables and mode imputation for categorical variables - can we do this in Pipeline as below?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Can we do this?
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'num_imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'cat_imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'lasso'</span><span class="p">,</span> <span class="n">Lasso</span><span class="p">())])</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<p>Unfortunately, no! If you run the above code, it will throw an error like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'RL'
</code></pre></div></div>
<p>The error happens because <code class="language-plaintext highlighter-rouge">Pipeline</code> attempts to apply mean imputation to all of the columns, including a categorical variable that contains a string category called <code class="language-plaintext highlighter-rouge">'RL'</code>. Remember, mean imputation can only be applied to numerical variables, so our <code class="language-plaintext highlighter-rouge">SimpleImputer(strategy='mean')</code> freaked out!</p>
<p><strong>We need to let our <code class="language-plaintext highlighter-rouge">Pipeline</code> know which columns to apply which transformer. How do we do that? We do it with <code class="language-plaintext highlighter-rouge">ColumnTransformer</code>!</strong></p>
<p><code class="language-plaintext highlighter-rouge">ColumnTransformer</code> is similar to <code class="language-plaintext highlighter-rouge">Pipeline</code> in the sense that you put transformers together as a list of tuples, but this time, you pass one more argument: a list of the column names you want to apply each transformer to.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># applying different transformers to different columns
</span><span class="n">transformer</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">(</span>
<span class="p">[(</span><span class="s">'numerical'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">),</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">'categorical'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">),</span> <span class="n">cat_cols</span><span class="p">)])</span>
<span class="c1"># fit transformer with out train data
</span><span class="n">transformer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1"># transform the train data and create a DataFrame with the transformed data
</span><span class="n">X_train_transformed</span> <span class="o">=</span> <span class="n">transformer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_train_transformed</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_transformed</span><span class="p">,</span>
<span class="n">columns</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">num_cols</span><span class="p">)</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">cat_cols</span><span class="p">))</span>
</code></pre></div></div>
<p>You may have noticed we defined the output columns to be <code class="language-plaintext highlighter-rouge">list(num_cols) + list(cat_cols)</code>, not <code class="language-plaintext highlighter-rouge">X_train.columns</code>. This is because <strong><code class="language-plaintext highlighter-rouge">ColumnTransformer</code> fits each transformer independently in parallel and concatenates all of the outputs at the end</strong>.</p>
<p>That is, <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> takes <strong>only</strong> the numerical columns (<code class="language-plaintext highlighter-rouge">num_cols</code>), fits and transforms them using <code class="language-plaintext highlighter-rouge">SimpleImputer(strategy='mean')</code>, and sets the output aside. At the same time, it does the same thing for the categorical columns (<code class="language-plaintext highlighter-rouge">cat_cols</code>) with <code class="language-plaintext highlighter-rouge">SimpleImputer(strategy='most_frequent')</code>. When every step is done, it concatenates the two outputs in the order the transformers are defined. Therefore, <strong>be aware of the column order because the final output may be different from your original DataFrame!</strong></p>
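<p>One detail worth knowing: columns that are not listed in any transformer are <em>dropped</em> from the output by default. If you want to keep them untouched, you can pass <code class="language-plaintext highlighter-rouge">remainder='passthrough'</code>. A minimal sketch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># by default remainder='drop': unlisted columns are removed from the output.
# remainder='passthrough' appends them unchanged after the transformed columns.
transformer = ColumnTransformer(
    [('numerical', SimpleImputer(strategy='mean'), num_cols)],
    remainder='passthrough')
</code></pre></div></div>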
<p>Note that <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> can only be used for transformers, not estimators. We cannot include <code class="language-plaintext highlighter-rouge">Lasso()</code> and fit the model as we did with <code class="language-plaintext highlighter-rouge">Pipeline</code>. <strong><code class="language-plaintext highlighter-rouge">ColumnTransformer</code> is only used for data pre-processing, so there is no <code class="language-plaintext highlighter-rouge">predict</code> or <code class="language-plaintext highlighter-rouge">score</code> as in <code class="language-plaintext highlighter-rouge">Pipeline</code></strong>. To train a model and calculate a performance score, we will need <code class="language-plaintext highlighter-rouge">Pipeline</code> again.</p>
<p><a id="separate"></a></p>
<h1 id="3-separate-feature-engineering-pipelines-for-numerical-and-categorical-variables">3. Separate Feature Engineering Pipelines for Numerical and Categorical Variables</h1>
<p>Let’s go one step further and include more feature engineering steps. In addition to the missing data imputation, we
also want to scale our numerical variables using <code class="language-plaintext highlighter-rouge">StandardScaler</code> and encode the categorical variables using
<code class="language-plaintext highlighter-rouge">OneHotEncoder</code>. Can we do something like this then?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Can we do this?
</span><span class="n">transformer</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">(</span>
<span class="p">[(</span><span class="s">'numerical_imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">),</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">'numerical_scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">(),</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">'categorical_imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">),</span> <span class="n">cat_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">'categorical_encoder'</span><span class="p">,</span> <span class="n">OneHotEncoder</span><span class="p">(</span><span class="n">handle_unknown</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">),</span> <span class="n">cat_cols</span><span class="p">)])</span>
<span class="n">transformer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
</code></pre></div></div>
<p>No!</p>
<p>As we saw in the previous section, each step in <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> is independent. Therefore, the input for the <code class="language-plaintext highlighter-rouge">OneHotEncoder()</code> is not the output of the <code class="language-plaintext highlighter-rouge">SimpleImputer(strategy='most_frequent')</code> but just a subset of the original DataFrame (<code class="language-plaintext highlighter-rouge">cat_cols</code>), which has not been imputed. You cannot one-hot-encode a categorical variable that has missing data.</p>
<p>We need something that can sequentially pass data through multiple feature engineering steps. Sequentially moving data… sounds familiar, right? Yes, you can do this with <code class="language-plaintext highlighter-rouge">Pipeline</code>!</p>
<p>However, we need to create a feature engineering pipeline for numerical variables and categorical variables separately. So, we can come up with something like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># feature engineering pipeline for numerical variables
</span><span class="n">num_pipeline</span><span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">())])</span>
<span class="c1"># feature engineering pipeline for categorical variables
</span><span class="n">cat_pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'imputer'</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'encoder'</span><span class="p">,</span> <span class="n">OneHotEncoder</span><span class="p">(</span><span class="n">handle_unknown</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">))])</span>
</code></pre></div></div>
<p>You can think of it as creating a ‘new transformer’ that combines multiple transformers for each type of variable. Doesn’t it sound cool?</p>
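<p>Because these mini-pipelines are transformers themselves, you can also try one out on its own. A minimal sketch, assuming the column lists defined earlier:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># each pipeline exposes the usual fit/transform interface
num_transformed = num_pipeline.fit_transform(X_train[num_cols])
print(num_transformed.shape)  # imputed and scaled numerical columns
</code></pre></div></div>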
<p><a id="final"></a></p>
<h1 id="4-final-pipeline">4. Final Pipeline</h1>
<p>Okay. Now that we have feature engineering pipelines defined for both numerical variables and categorical variables, we can put things together to train a Lasso model using <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> and <code class="language-plaintext highlighter-rouge">Pipeline</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># put numerical and categorical feature engineering pipelines together
</span><span class="n">preprocessor</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">([(</span><span class="s">"num_pipeline"</span><span class="p">,</span> <span class="n">num_pipeline</span><span class="p">,</span> <span class="n">num_cols</span><span class="p">),</span>
<span class="p">(</span><span class="s">"cat_pipeline"</span><span class="p">,</span> <span class="n">cat_pipeline</span><span class="p">,</span> <span class="n">cat_cols</span><span class="p">)])</span>
<span class="c1"># put transformers and an estimator together
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'preprocessing'</span><span class="p">,</span> <span class="n">preprocessor</span><span class="p">),</span>
<span class="p">(</span><span class="s">'lasso'</span><span class="p">,</span> <span class="n">Lasso</span><span class="p">(</span><span class="n">max_iter</span><span class="o">=</span><span class="mi">10000</span><span class="p">))])</span> <span class="c1"># increased max_iter to converge
</span>
<span class="c1"># fit model
</span><span class="n">pipe</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.9483539967729575
</code></pre></div></div>
<p>This is very neat! We applied different sets of feature engineering steps to numerical and categorical variables and then trained a model in only a few lines of code.</p>
<p>Thinking of how long and complex the code would be without <code class="language-plaintext highlighter-rouge">ColumnTransformer</code> and <code class="language-plaintext highlighter-rouge">Pipeline</code>, aren’t you tempted to try this out right now?</p>
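<p>As a bonus, the fitted pipeline applies the exact same preprocessing (with the statistics learned from the train set) to any new raw data. A minimal sketch, assuming a held-out <code class="language-plaintext highlighter-rouge">X_test</code> and <code class="language-plaintext highlighter-rouge">y_test</code> from the same split:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># imputes, scales, and encodes X_test using train-set statistics, then predicts
y_pred = pipe.predict(X_test)
print(pipe.score(X_test, y_test))  # R^2 on the held-out set
</code></pre></div></div>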
<p><a id="summary"></a></p>
<h1 id="summary">Summary</h1>
<p>In this post, we looked at how to combine feature engineering steps and a model fitting step together using <code class="language-plaintext highlighter-rouge">Pipeline</code> and <code class="language-plaintext highlighter-rouge">ColumnTransformer</code>. In particular, we learned that we can use</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Pipeline</code> for combining transformers and an estimator</li>
<li><code class="language-plaintext highlighter-rouge">ColumnTransformer</code> for applying different transformers to different columns</li>
<li><code class="language-plaintext highlighter-rouge">Pipeline</code> for creating different feature engineering pipelines for numerical and categorical variables that sequentially apply a different set of transformers</li>
</ul>
<p>Also, check out the table below to recap the differences between <code class="language-plaintext highlighter-rouge">Pipeline</code> vs. <code class="language-plaintext highlighter-rouge">ColumnTransformer</code>:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Pipeline</th>
<th>ColumnTransformer</th>
</tr>
</thead>
<tbody>
<tr>
<td>Used for</td>
<td>Transformers and/or an estimator</td>
<td>Transformers only</td>
</tr>
<tr>
<td>Main methods</td>
<td>fit, transform, predict, and score</td>
<td>fit and transform (no predict or score)</td>
</tr>
<tr>
<td>Can pick columns to apply</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>Each step is performed</td>
<td>Sequentially</td>
<td>Independently</td>
</tr>
<tr>
<td>Transformed output columns</td>
<td>Same as input</td>
<td>May differ depending on the defined steps</td>
</tr>
</tbody>
</table>
<p><a id="References"></a></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://towardsdatascience.com/pipeline-columntransformer-and-featureunion-explained-f5491f815f">Pipeline, ColumnTransformer and FeatureUnion explained</a></li>
<li><a href="https://www.udemy.com/course/feature-engineering-for-machine-learning/">Feature Engineering for Machine Learning</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">sklearn Pipeline</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html">sklearn ColumnTransformer</a></li>
</ul>
<h1>Missing Data Imputation Using sklearn</h1>
<h1 id="contents">Contents</h1>
<ul>
<li><a href="#why_missing_matter">Why does missing data matter?</a></li>
<li><a href="#options">What are the options for missing data imputation?</a></li>
<li><a href="#sklearn">Missing data imputation using scikit-learn</a>
<ul>
<li><a href="#prep">(0) Prepare data</a></li>
<li><a href="#mean">(1) Mean/median</a></li>
<li><a href="#mode">(2) Mode (most frequent category)</a></li>
<li><a href="#arbitrary">(3) Arbitrary value</a></li>
<li><a href="#knn">(4) KNN imputer</a></li>
<li><a href="#indicator">(5) Adding Missing Indicator</a></li>
</ul>
</li>
<li><a href="#what-do-use">What to use?</a></li>
<li><a href="#References">References</a></li>
</ul>
<p><a id="why_missing_matter"></a></p>
<h1 id="why-does-missing-data-matter">Why does missing data matter?</h1>
<p>If you have ever worked on raw, uncleaned data collected from a survey or a sensor, you have probably faced missing data. Let’s think about a dataset of age, gender, and height as below. You want to use both age and gender to predict height, but some data points have only age or only gender. What would you do in this case?</p>
<table>
<thead>
<tr>
<th> </th>
<th>age</th>
<th>gender</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>20</td>
<td>F</td>
<td>5’4”</td>
</tr>
<tr>
<td>1</td>
<td>31</td>
<td>M</td>
<td>6’1”</td>
</tr>
<tr>
<td>2</td>
<td>40</td>
<td> </td>
<td>5’0”</td>
</tr>
<tr>
<td>3</td>
<td> </td>
<td>M</td>
<td>5’6”</td>
</tr>
</tbody>
</table>
<p>When certain fields are missing in an observation, you either 1) remove the entire observation or 2) keep the observation and replace the missing values with some estimation. Analyzing only the complete data after removing any missing data is called <strong>Complete Case Analysis (CCA)</strong>, and replacing missing values with estimation is called <strong>missing data imputation</strong>.</p>
<p>Normally, you don’t want to remove the entire observation because the rest of the fields can still be informative. Also, when you have lots of variables that are missing in different observations, the chances are you will have to remove the majority of data points and end up being left with limited data to train a model. Even if you manage to build a model, the model will have to know how to handle missing data in production; otherwise, it will freak out and refuse to make any prediction for new data with a missing field!</p>
<p>Therefore, we would want to perform missing data imputation and this post is about how we can do that in Python.</p>
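<p>To see how costly CCA can be, you can check what fraction of rows survives after dropping every observation with any missing field. A minimal sketch, assuming a DataFrame <code class="language-plaintext highlighter-rouge">df</code> with missing values:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Complete Case Analysis: keep only rows without any NaN
complete_cases = df.dropna()
print(len(complete_cases) / len(df))  # fraction of rows that survive
</code></pre></div></div>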
<p><a id="options"></a></p>
<h1 id="what-are-the-options-for-missing-data-imputation">What are the options for missing data imputation?</h1>
<p>There are many imputation methods available, and each has pros and cons:</p>
<ol>
<li>Univariate methods (use values in one variable)
<ul>
<li>Numerical
<ul>
<li>mean, median, mode (most frequent value), arbitrary value (out of distribution)</li>
<li>For time series: linear interpolation, last observation carried forward, next observation carried backward (see the short pandas sketch after this list)</li>
</ul>
</li>
<li>Categorical
<ul>
<li>mode (most frequent category), arbitrary value (e.g. “missing” category)</li>
</ul>
</li>
<li>Both
<ul>
<li>random value selected from train data separately for each missing data</li>
</ul>
</li>
</ul>
</li>
<li>Multi-variable methods (use values in other variables as well)
<ul>
<li>KNN</li>
<li>Regression</li>
<li>Chained equation</li>
</ul>
</li>
<li>Adding missing indicator
<ul>
<li>Adding a boolean value to indicate whether the observation has missing data or not. It is used together with one of the above methods.</li>
</ul>
</li>
</ol>
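<p>For the time-series options listed above, pandas offers convenient one-liners. A minimal sketch on a toy series:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])
print(s.interpolate())  # linear interpolation
print(s.ffill())        # last observation carried forward
print(s.bfill())        # next observation carried backward
</code></pre></div></div>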
<p>Although they are all useful in one way or another, in this post, we will focus on 6 major imputation techniques available in sklearn: mean, median, mode, arbitrary value, KNN, and adding a missing indicator. I will cover why we choose sklearn for our missing data imputation in the next post.</p>
<p><a id="sklearn"></a></p>
<h1 id="missing-data-imputation-using-scikit-learn">Missing data imputation using scikit-learn</h1>
<p><a id="prep"></a></p>
<h2 id="0-prepare-data">(0) Prepare data</h2>
<p>In this post, we will use the train set from the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">house price data from Kaggle</a>. The data is preprocessed so that the string value <code class="language-plaintext highlighter-rouge">?</code> is transformed into <code class="language-plaintext highlighter-rouge">NaN</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="c1"># prep dataset
</span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># imputer
</span><span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">SimpleImputer</span><span class="p">,</span> <span class="n">KNNImputer</span>
<span class="c1"># plot for comparison
</span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'../data/house_price/train.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'Id'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1460, 80)
</code></pre></div></div>
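<p>By the way, one way to do the <code class="language-plaintext highlighter-rouge">?</code>-to-<code class="language-plaintext highlighter-rouge">NaN</code> preprocessing at load time is pandas’ <code class="language-plaintext highlighter-rouge">na_values</code> argument. A minimal sketch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># treat the string '?' as NaN while reading the CSV
df = pd.read_csv('../data/house_price/train.csv', index_col='Id', na_values='?')
</code></pre></div></div>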
<p>There are 80 columns, where 79 are features and 1 is our target variable <code class="language-plaintext highlighter-rouge">SalePrice</code>. Let’s check how many are numerical and categorical, as we will apply different imputation strategies to different data types. <code class="language-plaintext highlighter-rouge">.select_dtypes()</code> in pandas is a handy way to filter data types.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># numerical columns vs. categorical columns
</span><span class="n">num_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'number'</span><span class="p">).</span><span class="n">columns</span>
<span class="n">cat_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s">'object'</span><span class="p">).</span><span class="n">columns</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of numerical columns: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">num_cols</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of categorical columns: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">cat_cols</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Number of numerical columns: 36
Number of categorical columns: 43
</code></pre></div></div>
<p>Next is splitting the data. <strong>It is important to split the data into train and test sets BEFORE, not after, applying any feature engineering or feature selection steps</strong> in order to avoid data leakage. Data leakage means using information during training that is not available in production, which leads to inflated model performance. As we want our model performance score to be as close to the real performance in production as possible, we want to split the data as early as possible, even before feature engineering steps.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'SalePrice'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">df</span><span class="p">[</span><span class="s">'SalePrice'</span><span class="p">],</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">X_test</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((1022, 79), (438, 79))
</code></pre></div></div>
<p>Now let’s check which columns have missing data (<code class="language-plaintext highlighter-rouge">NaN</code>). <code class="language-plaintext highlighter-rouge">.isna()</code> will give you a True/False indicator of whether each element is <code class="language-plaintext highlighter-rouge">NaN</code>, and <code class="language-plaintext highlighter-rouge">.mean()</code> will calculate what percentage of True values there are in each column. We will filter columns with a mean greater than 0, which means there is at least one missing value.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># number of numerical columns and categorical columns that contain missing data
</span><span class="n">num_cols_with_na</span> <span class="o">=</span> <span class="n">num_cols</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols</span><span class="p">].</span><span class="n">isna</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span>
<span class="n">cat_cols_with_na</span> <span class="o">=</span> <span class="n">cat_cols</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols</span><span class="p">].</span><span class="n">isna</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"*** numerical columns that have NaN's (</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">num_cols_with_na</span><span class="p">)</span><span class="si">}</span><span class="s">): </span><span class="se">\n</span><span class="si">{</span><span class="n">num_cols_with_na</span><span class="si">}</span><span class="se">\n\n</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"*** categorical columns that have NaN's (</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">cat_cols_with_na</span><span class="p">)</span><span class="si">}</span><span class="s">): </span><span class="se">\n</span><span class="si">{</span><span class="n">cat_cols_with_na</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*** numerical columns that have NaN's (3):
Index(['LotFrontage', 'MasVnrArea', 'GarageYrBlt'], dtype='object')
*** categorical columns that have NaN's (16):
Index(['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC',
'Fence', 'MiscFeature'],
dtype='object')
</code></pre></div></div>
<p>Let’s check how much missing data each feature has. It seems that there are three features (<code class="language-plaintext highlighter-rouge">PoolQC</code>, <code class="language-plaintext highlighter-rouge">MiscFeature</code>, <code class="language-plaintext highlighter-rouge">Alley</code>) that have more than 90% of their data missing. In such cases, it might be better to remove these features entirely because they do not provide much information when predicting house price. We could perform feature selection to see whether they are worth including or not. However, that is beyond the scope of this post, so we will include all of them for now.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># percentage of missing data in numerical features
</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">].</span><span class="n">isna</span><span class="p">().</span><span class="n">mean</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LotFrontage 0.184932
GarageYrBlt 0.052838
MasVnrArea 0.004892
dtype: float64
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># percentage of missing data in categorical features
</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">].</span><span class="n">isna</span><span class="p">().</span><span class="n">mean</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PoolQC 0.997065
MiscFeature 0.956947
Alley 0.939335
Fence 0.813112
FireplaceQu 0.467710
GarageCond 0.052838
GarageQual 0.052838
GarageFinish 0.052838
GarageType 0.052838
BsmtFinType2 0.024462
BsmtFinType1 0.023483
BsmtExposure 0.023483
BsmtCond 0.023483
BsmtQual 0.023483
MasVnrType 0.004892
Electrical 0.000978
dtype: float64
</code></pre></div></div>
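<p>If we did decide to drop the features with more than 90% missing data, a minimal pandas sketch would look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># drop features whose missing fraction exceeds a threshold
missing_frac = X_train.isna().mean()
high_na_cols = missing_frac[missing_frac &gt; 0.9].index
X_train_reduced = X_train.drop(columns=high_na_cols)
print(list(high_na_cols))  # the three features flagged in the output above
</code></pre></div></div>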
<p><a id="mean"></a></p>
<h2 id="1-meanmedian">(1) Mean/median</h2>
<p>The first missing data imputation method we will look at is mean/median imputation. As the name implies, it fills missing data with the mean or the median of each variable.</p>
<p>When should we use the mean vs. the median? If the variable is normally distributed, the mean and the median do not differ a lot. However, if the distribution is skewed, the mean is affected by outliers and can deviate a lot from the center of the data, so the median is a better representation for skewed data. Therefore, <strong>use the mean for normal distributions and the median for skewed distributions.</strong></p>
<div style="text-align:center">
<img src="/images/impute_sklearn/skew_dist.png" alt="drawing" width="300" />
<figcaption>Fig 1. Skewness
<a href="https://en.wikipedia.org/wiki/Skewness">(Skewness)</a>
</figcaption>
</div>
<h3 id="assumptions">Assumptions</h3>
<ul>
<li>Missing data most likely look like the majority of the data</li>
<li>Data is missing at random</li>
</ul>
<h3 id="pros">Pros</h3>
<ul>
<li>Easy and fast</li>
<li>Can easily be integrated into production</li>
</ul>
<h3 id="cons">Cons</h3>
<ul>
<li>It distorts the original variable distribution (more values around the mean will create more outliers)</li>
<li>It ignores and distorts the correlation with other variables</li>
</ul>
<p><strong>A common practice is to use mean/median imputation in combination with a ‘missing indicator’ that we will learn about in a later section. This is the top choice in data science competitions</strong>.</p>
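<p>In sklearn, this combination is a one-liner: <code class="language-plaintext highlighter-rouge">SimpleImputer</code> accepts an <code class="language-plaintext highlighter-rouge">add_indicator</code> argument. A minimal sketch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># add_indicator=True appends a binary is-missing column
# for each feature that had missing values during fit
imputer = SimpleImputer(strategy='mean', add_indicator=True)
</code></pre></div></div>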
<p>Below is how we use the mean/median imputation. <strong>It only works for numerical data</strong>. To keep it simple, we use only the columns with NA’s here (<code class="language-plaintext highlighter-rouge">X_train[num_cols_with_na]</code>).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer. use strategy='median' for median imputation
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'mean'</span><span class="p">)</span>
<span class="c1"># fit the imputer on X_train. we pass only numeric columns with NA's here.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_mean_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="n">X_test_mean_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_mean_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_mean_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">num_cols_with_na</span><span class="p">)</span>
<span class="n">X_test_mean_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_test_mean_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">num_cols_with_na</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># check statistics
</span><span class="k">print</span><span class="p">(</span><span class="s">"Imputer statistics (mean values):"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">imputer</span><span class="p">.</span><span class="n">statistics_</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Imputer statistics (mean values):
[ 69.66866747 103.55358899 1978.01239669]
</code></pre></div></div>
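<p>These fitted statistics should simply be the column means of the train data, which you can verify directly:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># should match imputer.statistics_ above
print(X_train[num_cols_with_na].mean())
</code></pre></div></div>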
<p>Previously, we mentioned the mean imputation can distort the original distribution. Let’s see how much it changed the data distribution by checking the density plot.</p>
<p>As you can see in <code class="language-plaintext highlighter-rouge">LotFrontage</code>, you have a lot more values around the mean after the imputation. The more missing data a variable has, the bigger the distortion is (<code class="language-plaintext highlighter-rouge">LotFrontage</code> has 18%, <code class="language-plaintext highlighter-rouge">GarageYrBlt</code> has 5%, and <code class="language-plaintext highlighter-rouge">MasVnrArea</code> has 0.5% of missing data).</p>
<p>One way to avoid this side effect is to use random data imputation. However, I excluded it from this post as it is not available in sklearn and it is not very production-friendly. It requires the whole population of the train data to be available to impute each missing data point.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compare the distribution before/after mean imputation
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">num_cols_with_na</span><span class="p">)):</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">col</span> <span class="o">=</span> <span class="n">num_cols_with_na</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">X_train_mean_impute</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'mean imputation'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_18_0.png" />
</div>
<p><br /></p>
<p><a id="mode"></a></p>
<h2 id="2-mode-most-frequent-category">(2) Mode (most frequent category)</h2>
<p>The second method is mode imputation. It replaces missing values with the most frequent value in a variable. <strong>It can be used for both numerical and categorical variables</strong>.</p>
<h3 id="assumptions-1">Assumptions</h3>
<ul>
<li>Missing data most likely look like the majority of the data</li>
<li>Data is missing at random</li>
</ul>
<h3 id="pros-1">Pros</h3>
<ul>
<li>Easy and fast</li>
<li>Can easily be integrated into production</li>
</ul>
<h3 id="cons-1">Cons</h3>
<ul>
<li>It distorts the original variable distribution (the most frequent value becomes even more dominant)</li>
<li>It ignores and distorts the correlation with other variables</li>
<li>The most frequent label might be over-represented while it is not the most representative value of a variable</li>
</ul>
<p>This time, let’s try it to our categorical variables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'most_frequent'</span><span class="p">)</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_mode_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="n">X_test_mode_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_mode_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_mode_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cat_cols_with_na</span><span class="p">)</span>
<span class="n">X_test_mode_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_test_mode_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cat_cols_with_na</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># check statistics
</span><span class="k">print</span><span class="p">(</span><span class="s">"Imputer statistics (the most frequent values in each variable):"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">imputer</span><span class="p">.</span><span class="n">statistics_</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Imputer statistics (the most frequent values in each variable):
['Pave' 'None' 'TA' 'TA' 'No' 'Unf' 'Unf' 'SBrkr' 'Gd' 'Attchd' 'Unf' 'TA'
'TA' 'Gd' 'MnPrv' 'Shed']
</code></pre></div></div>
<p>Like the mean/median imputation, mode imputation can also distort the original distribution of a variable. In order to check the difference before/after the mode imputation, we use bar plots this time as these are categorical variables.</p>
<p>Let’s take a look at the first variable in the graph, <code class="language-plaintext highlighter-rouge">Alley</code>. As you can see, the distribution of the original data and that of the imputed data are very different, and the <code class="language-plaintext highlighter-rouge">Pave</code> category is over-represented in the imputed data. Ideally, the shape of the distribution should be preserved after imputation, just like <code class="language-plaintext highlighter-rouge">MasVnrType</code>. However, if the majority of the observations are missing, the distribution of a variable can change significantly, as can its correlation with other variables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compare the distribution before/after mode imputation
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">15</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">cat_cols_with_na</span><span class="p">)):</span>
<span class="n">col_name</span> <span class="o">=</span> <span class="n">cat_cols_with_na</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">original</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">col_name</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">imputed</span> <span class="o">=</span> <span class="n">X_train_mode_impute</span><span class="p">[</span><span class="n">col_name</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">combined</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">original</span><span class="p">,</span> <span class="n">imputed</span><span class="p">],</span> <span class="n">keys</span><span class="o">=</span><span class="p">[</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'mode imputation'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="o">//</span><span class="mi">4</span><span class="p">,</span> <span class="n">i</span><span class="o">%</span><span class="mi">4</span><span class="p">]</span>
<span class="n">combined</span><span class="p">.</span><span class="n">plot</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col_name</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_23_0.png" />
</div>
<p><br /></p>
<p><a id="arbitrary"></a></p>
<h2 id="3-arbitrary-value">(3) Arbitrary value</h2>
<p>The third method is filling missing values with an arbitrary value outside of the training data distribution. For example, if the values in the ‘age’ variable range from 0 to 80 in the training set, fill missing data with 100 (or use a value at the ‘end of distribution’ such as mean +- 3*std). For categorical data, use ‘Missing’ as a new category for missing data. It can be counter-intuitive to fill data with a value outside of the original distribution as it will create outliers or unseen data. Indeed, it is not meant to be used for models that require certain assumptions about the data distribution, such as linear regression. It is good for tree-based models, which will separate missing data in an earlier/upper node and take the missingness into account when building a model.</p>
<h3 id="assumptions-2">Assumptions</h3>
<ul>
<li>If data is not missing at random, we would want to flag the missing values with a very different value than other observations and have them treated differently by a model</li>
</ul>
<h3 id="pros-2">Pros</h3>
<ul>
<li>Easy and fast</li>
<li>It captures the importance of ‘missingness’</li>
</ul>
<h3 id="cons-2">Cons</h3>
<ul>
<li>It distorts the original data distribution</li>
<li>It distorts the correlation between variables</li>
<li>It may mask or create outliers in numerical variables</li>
<li>It is not for linear models. Only use it for tree-based models.</li>
</ul>
<p>It can be used for both numerical and categorical variables, though the numerical case is more involved if we need to determine the fill value automatically.</p>
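<p>Before moving on, here is what the numerical case can look like: a minimal sketch using an ‘end of distribution’ fill for a single column, assuming we pick <code class="language-plaintext highlighter-rouge">LotFrontage</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mean + 3 standard deviations puts the fill value outside the usual range
fill = X_train['LotFrontage'].mean() + 3 * X_train['LotFrontage'].std()
imputer = SimpleImputer(strategy='constant', fill_value=fill)
</code></pre></div></div>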
<p>Let’s see how we do for categorical variables first.</p>
<h3 id="categorical">Categorical</h3>
<p>Filling missing values with a new category called ‘missing’ or ‘Missing’ is a very common strategy for imputing missing data in categorical variables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'constant'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">'Missing'</span><span class="p">)</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_arb_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="n">X_test_arb_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">cat_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_arb_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_arb_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cat_cols_with_na</span><span class="p">)</span>
<span class="n">X_test_arb_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_test_arb_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cat_cols_with_na</span><span class="p">)</span>
</code></pre></div></div>
<p>You can see there is now a new category ‘Missing’ in the imputed dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compare the distribution before/after mode imputation
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">15</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">cat_cols_with_na</span><span class="p">)):</span>
<span class="n">col_name</span> <span class="o">=</span> <span class="n">cat_cols_with_na</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">original</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">col_name</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">imputed</span> <span class="o">=</span> <span class="n">X_train_arb_impute</span><span class="p">[</span><span class="n">col_name</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">combined</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">original</span><span class="p">,</span> <span class="n">imputed</span><span class="p">],</span>
<span class="n">keys</span><span class="o">=</span><span class="p">[</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'Arbitrary value imputation'</span><span class="p">],</span>
<span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="o">//</span><span class="mi">4</span><span class="p">,</span> <span class="n">i</span><span class="o">%</span><span class="mi">4</span><span class="p">]</span>
<span class="n">combined</span><span class="p">.</span><span class="n">plot</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col_name</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_28_0.png" />
</div>
<p><br /></p>
<h3 id="numerical">Numerical</h3>
<p>When determining what value to use for numerical variables, one common approach is the ‘end of distribution’ method.</p>
<p>For a normal distribution: mean $\pm 3\times$ std<br />
For a skewed distribution, use either the upper or the lower limit:</p>
<ul>
<li>Upper limit = 75th quantile + $3\times$ IQR</li>
<li>Lower limit = 25th quantile - $3\times$ IQR</li>
</ul>
<p>where IQR = 75th quantile - 25th quantile. (The $3\times$ multiplier matches the code below; a stricter $1.5\times$ fence is also common.)</p>
<p>Let’s compute this value for a skewed variable and pass it to <code class="language-plaintext highlighter-rouge">SimpleImputer</code> as the fill value.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># first find the value to use
</span><span class="k">def</span> <span class="nf">get_end_of_dist</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">col</span><span class="p">):</span>
<span class="n">q1</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.25</span><span class="p">)</span>
<span class="n">q3</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.75</span><span class="p">)</span>
<span class="n">iqr</span> <span class="o">=</span> <span class="n">q3</span><span class="o">-</span><span class="n">q1</span>
<span class="n">new_val</span> <span class="o">=</span> <span class="n">q3</span> <span class="o">+</span> <span class="n">iqr</span> <span class="o">*</span> <span class="mi">3</span>
<span class="k">return</span> <span class="n">new_val</span>
</code></pre></div></div>
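<p>For a roughly normal variable, the analogous helper would use mean $+ 3\times$ std instead. A minimal sketch (the name <code class="language-plaintext highlighter-rouge">get_end_of_dist_normal</code> is mine, for illustration only):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># end of distribution for roughly normal variables: mean + 3 standard deviations
def get_end_of_dist_normal(X_train, col):
    return X_train[col].mean() + 3 * X_train[col].std()
</code></pre></div></div>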
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># determine which column to impute
</span><span class="n">col</span> <span class="o">=</span> <span class="s">'LotFrontage'</span>
<span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s">'constant'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="n">get_end_of_dist</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">col</span><span class="p">))</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[[</span><span class="n">col</span><span class="p">]])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_arb_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[[</span><span class="n">col</span><span class="p">]])</span>
<span class="n">X_test_arb_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[[</span><span class="n">col</span><span class="p">]])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_arb_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_arb_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="n">col</span><span class="p">])</span>
<span class="n">X_test_arb_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_test_arb_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="n">col</span><span class="p">])</span>
</code></pre></div></div>
<p>When we check the plot below, we now have a small peak at around 150, which is the value determined by our <code class="language-plaintext highlighter-rouge">get_end_of_dist</code> function. This method definitely distorts the original data distribution, so use it carefully and only with appropriate models (e.g., tree-based models).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compare the distribution before/after mean imputation
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">X_test_arb_impute</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'End of tail imputation'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_33_0.png" />
</div>
<p><br /></p>
<p><a id="knn"></a></p>
<h2 id="4-knn-imputer">(4) KNN imputer</h2>
<p>KNN imputer is much more sophisticated and nuanced than the imputation methods described so far because it uses other data points and variables, not just the variable the missing data comes from. KNN imputer calculates the distance between points (usually Euclidean distance), finds the K closest (most similar) points, and then estimates the missing value from the values those neighbors have for the variable. Note that it can only be used for numerical variables.</p>
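<p>To see the mechanics on a toy example (a minimal sketch; the array values are made up purely for illustration, using sklearn’s defaults):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])
# distances are computed on the coordinates that are not missing;
# the 2 nearest rows to [nan, 6.0] are [3.0, 4.0] and [8.0, 8.0],
# so the nan is replaced by their mean in that column: (3.0 + 8.0) / 2 = 5.5
>>> KNNImputer(n_neighbors=2).fit_transform(X)
array([[1. , 2. ],
       [3. , 4. ],
       [5.5, 6. ],
       [8. , 8. ]])
</code></pre></div></div>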
<h3 id="pros-3">Pros</h3>
<ul>
<li>More accurate than univariate imputation</li>
<li>More likely to preserve the original distribution hence the covariance</li>
</ul>
<h3 id="cons-3">Cons</h3>
<ul>
<li>Computationally more expensive than univariate imputation</li>
<li>Can be sensitive to outliers (the imputed value depends on the quality of the neighboring points)</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">KNNImputer</span><span class="p">()</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_knn_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="n">X_test_knn_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_knn_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_knn_impute</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">num_cols_with_na</span><span class="p">)</span>
<span class="n">X_test_knn_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">num_cols_with_na</span><span class="p">)):</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">col</span> <span class="o">=</span> <span class="n">num_cols_with_na</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">X_train</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">X_train_knn_impute</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">plot</span><span class="p">.</span><span class="n">kde</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'original'</span><span class="p">,</span> <span class="s">'mean imputation'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div style="text-align:center">
<img src="/images/impute_sklearn/output_36_0.png" />
</div>
<p><br /></p>
<p><a id="indicator"></a></p>
<h2 id="5-adding-missing-indicator">(5) Adding Missing Indicator</h2>
<p>Adding a binary missing indicator is another common practice when it comes to missing data imputation. It is typically used together with other imputation techniques, such as mean, median, or mode imputation.</p>
<h3 id="assumptions-3">Assumptions</h3>
<ul>
<li>Data is not missing at random</li>
<li>Missingness provides information</li>
</ul>
<h3 id="pros-4">Pros</h3>
<ul>
<li>Easy and fast</li>
<li>It captures the importance of missingness</li>
</ul>
<h3 id="cons-4">Cons</h3>
<ul>
<li>It can expand the feature space pretty quickly if there are a lot of features</li>
</ul>
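<p>In sklearn, indicator columns can be added by any imputer through the <code class="language-plaintext highlighter-rouge">add_indicator=True</code> parameter. For instance, combined with mean imputation (a minimal sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.impute import SimpleImputer

# mean imputation plus one binary indicator column per feature
# that contained missing values in the fit data
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_train_mean_ind = imputer.fit_transform(X_train[num_cols_with_na])
</code></pre></div></div>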
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_cols_with_na</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">num_cols_with_na</span> <span class="o">+</span> <span class="s">'_NA'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['LotFrontage', 'MasVnrArea', 'GarageYrBlt', 'LotFrontage_NA',
'MasVnrArea_NA', 'GarageYrBlt_NA'],
dtype='object')
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">KNNImputer</span><span class="p">(</span><span class="n">add_indicator</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># fit the imputer on X_train. pass only numeric columns.
</span><span class="n">imputer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># transform the data using the fitted imputer
</span><span class="n">X_train_knn_impute</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">num_cols_with_na</span><span class="p">])</span>
<span class="c1"># put the output into DataFrame. remember to pass columns used in fit/transform
</span><span class="n">X_train_knn_impute</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X_train_knn_impute</span><span class="p">,</span>
<span class="n">columns</span><span class="o">=</span><span class="n">num_cols_with_na</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">num_cols_with_na</span> <span class="o">+</span> <span class="s">'_NA'</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train_knn_impute</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>LotFrontage</th>
<th>MasVnrArea</th>
<th>GarageYrBlt</th>
<th>LotFrontage_NA</th>
<th>MasVnrArea_NA</th>
<th>GarageYrBlt_NA</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>71.6</td>
<td>573.0</td>
<td>1998.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>1</th>
<td>56.2</td>
<td>0.0</td>
<td>1996.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2</th>
<td>50.0</td>
<td>0.0</td>
<td>1979.2</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<th>3</th>
<td>60.0</td>
<td>0.0</td>
<td>1939.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>60.0</td>
<td>0.0</td>
<td>1930.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
</div>
<p><a id="what-do-use"></a></p>
<h1 id="what-to-use">What to use?</h1>
<p>The obvious question to follow next is then, “What method should we use?” The answer is tricky: there is no single method that works best for every case.</p>
<p>The most commonly used technique, as described above, is mean/median imputation combined with a missing data indicator for numerical variables, and filling missing data with a new ‘Missing’ category for categorical variables.</p>
<p>However, it is still wise to investigate different methods by cross-validating different combinations and seeing which one is most effective for your problem. In the next post, we will learn how to do this with sklearn and why sklearn is more useful for imputation than plain Pandas functions.</p>
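<p>As a preview, such a comparison can be set up by putting an imputer and a model in one pipeline and cross-validating the imputation strategy as a hyperparameter. A minimal sketch, assuming numerical features and a target <code class="language-plaintext highlighter-rouge">y_train</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# treat the imputation strategy as a hyperparameter of the pipeline
pipe = Pipeline([('imputer', SimpleImputer()),
                 ('model', RandomForestRegressor(random_state=0))])
param_grid = {'imputer__strategy': ['mean', 'median', 'constant']}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train[num_cols_with_na], y_train)
search.best_params_  # the strategy with the best cross-validated score
</code></pre></div></div>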
<p>See you in the next post!</p>
<p><a id="References"></a></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://www.udemy.com/course/feature-engineering-for-machine-learning/">Feature Engineering for Machine Learning</a></li>
<li><a href="https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779">6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples)</a></li>
<li><a href="https://www.kaggle.com/juejuewang/handle-missing-values-in-time-series-for-beginners">Handle Missing Values in Time Series For Beginners</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">sklearn SimpleImputer</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html">sklearn KNNImputer</a></li>
</ul>Adding a Horizontal Scroll to Overflowing Markdown Table in HTML2020-11-15T23:58:00+00:002020-11-15T23:58:00+00:00/2020/11/15/Adding-scroll-to-overflowing-table<h2 id="overflowing-table">Overflowing table</h2>
<p>If you have a wide table, it might overflow your normal post width and look really ugly. This is the case for a lot of data science projects, as there are many feature columns to analyze! For example, a table like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>col_name1</th>
<th>col_name2</th>
<th>col_name3</th>
<th>col_name4</th>
<th>col_name5</th>
<th>col_name6</th>
<th>col_name7</th>
<th>col_name8</th>
<th>col_name9</th>
<th>col_name10</th>
</tr>
</thead>
<tbody>
<tr>
<td>row1</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>row2</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
<p>Fortunately, I found a way to make such a table fit nicely into my Jekyll page layout, like this. (<a href="https://stackoverflow.com/questions/41076390/is-there-a-way-to-overflow-a-markdown-table-using-html">Is there a way to overflow a markdown table using HTML?</a>)</p>
<div class="table-wrapper">
<table>
<thead>
<tr>
<th> </th>
<th>col_name1</th>
<th>col_name2</th>
<th>col_name3</th>
<th>col_name4</th>
<th>col_name5</th>
<th>col_name6</th>
<th>col_name7</th>
<th>col_name8</th>
<th>col_name9</th>
<th>col_name10</th>
</tr>
</thead>
<tbody>
<tr>
<td>row1</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>row2</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
</div>
<p><br /></p>
<h2 id="how-to-add-a-horizontal-scroll">How to add a horizontal scroll</h2>
<p>First, add the following wrapper rule to the css file.</p>
<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.table-wrapper</span> <span class="p">{</span>
<span class="nl">overflow-x</span><span class="p">:</span> <span class="nb">scroll</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And then add this before and after your table. Make sure you have a blank line between your table and the closing <code class="language-plaintext highlighter-rouge"></div></code> to see it in effect.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><div</span> <span class="na">class=</span><span class="s">"table-wrapper"</span> <span class="na">markdown=</span><span class="s">"block"</span><span class="nt">></span>
<span class="nt"></div></span>
</code></pre></div></div>
<p>Applying this to the above example:</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><div</span> <span class="na">class=</span><span class="s">"table-wrapper"</span> <span class="na">markdown=</span><span class="s">"block"</span><span class="nt">></span>
| | col_name1 | col_name2 | col_name3 | col_name4 | col_name5 | col_name6 | col_name7 | col_name8 | col_name9 | col_name10 |
|------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| row1 | | | | | | | | | | |
| row2 | | | | | | | | | | |
<span class="nt"></div></span>
</code></pre></div></div>What Should I Use for Dot Product and Matrix Multiplication?: NumPy multiply vs. dot vs. matmul vs. @2020-08-30T04:00:00+00:002020-08-30T04:00:00+00:00/python/2020/08/30/numpy-matmul<p>When I first implemented gradient descent from scratch a few years ago, I was very confused about which method to use for dot product and matrix multiplication - <code class="language-plaintext highlighter-rouge">np.multiply</code>, <code class="language-plaintext highlighter-rouge">np.dot</code>, or <code class="language-plaintext highlighter-rouge">np.matmul</code>? And a few years later, it turns out that… I am still confused! So, I decided to investigate all the options in Python and NumPy (<code class="language-plaintext highlighter-rouge">*</code>, <code class="language-plaintext highlighter-rouge">np.multiply</code>, <code class="language-plaintext highlighter-rouge">np.dot</code>, <code class="language-plaintext highlighter-rouge">np.matmul</code>, and <code class="language-plaintext highlighter-rouge">@</code>), come up with the best approach to take, and document the findings here.</p>
<p>TL;DR: Use <code class="language-plaintext highlighter-rouge">np.dot</code> for dot product. For matrix multiplication, use <code class="language-plaintext highlighter-rouge">@</code> for Python 3.5 or above, and <code class="language-plaintext highlighter-rouge">np.matmul</code> for earlier versions.</p>
<h1 id="table-of-contents">Table of contents</h1>
<ol>
<li><a href="#dot_product">What are dot product and matrix multiplications?</a></li>
<li><a href="#numpy_array">What is available for NumPy arrays?</a><br />
(1) <a href="#asterisk">element-wise multiplication: * and sum</a><br />
(2) <a href="#np.multiply">element-wise multiplication: np.multiply and sum</a><br />
(3) <a href="#np.dot">dot product: np.dot</a><br />
(4) <a href="#np.matmul">matrix multiplication: np.matmul</a><br />
(5) <a href="#@">matrix multiplication: @</a></li>
<li><a href="#dot_vs_matmul">So.. what’s with np.not vs. np.matmul (@)?</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#reference">Reference</a></li>
</ol>
<p><a id="dot_product"></a></p>
<h1 id="1-what-are-dot-product-and-matrix-multiplication">1. What are dot product and matrix multiplication?</h1>
<p>If you are not familiar with dot product or matrix multiplication yet or if you need a quick recap, check out the
previous blog post: <a href="/python/2020/08/23/dot-product.html">What are dot product and
matrix multiplication?</a></p>
<p>In short, the dot product is the sum of products of values in two same-sized vectors, and matrix multiplication is a matrix version of the dot product with two matrices. The output of the dot product is a scalar, whereas the output of matrix multiplication is a matrix whose elements are the dot products of pairs of vectors from each matrix.</p>
<p>Dot product:</p>
\[[a_1 \ a_2]
\begin{bmatrix}
b_1 \\
b_2
\end{bmatrix}
=a_1b_1 + a_2b_2\]
<p>Matrix multiplication:</p>
\[\begin{bmatrix}
a_{11} \ \ a_{12} \\
a_{21} \ \ a_{22} \\
\end{bmatrix}
\begin{bmatrix}
b_{11} \ \ b_{12} \\
b_{21} \ \ b_{22} \\
\end{bmatrix}
=
\begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} \ \ \ a_{11}b_{12} + a_{12}b_{22}\\
a_{21}b_{11} + a_{22}b_{21} \ \ \ a_{21}b_{12} + a_{22}b_{22}\\
\end{bmatrix}\]
<p><br /></p>
<p><a id="numpy_array"></a></p>
<h1 id="2-whats-available-for-numpy-arrays">2. What’s available for NumPy arrays?</h1>
<p>So, there are multiple options you can use to perform dot product or matrix multiplication:</p>
<ol>
<li>basic element-wise multiplication: <code class="language-plaintext highlighter-rouge">*</code> or <code class="language-plaintext highlighter-rouge">np.multiply</code> along with <code class="language-plaintext highlighter-rouge">np.sum</code></li>
<li>dot product: <code class="language-plaintext highlighter-rouge">np.dot</code></li>
<li>matrix multiplication: <code class="language-plaintext highlighter-rouge">np.matmul</code>, <code class="language-plaintext highlighter-rouge">@</code></li>
</ol>
<p>We will go through different scenarios depending on the dimensions of vectors/matrices and understand the pros and cons
of each method. To run the code in the following sections, We first need to import numpy.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
<p><a id="asterisk"></a></p>
<h2 id="1-element-wise-multiplication--and-sum">(1) element-wise multiplication: * and sum</h2>
<p>First, we can try the fundamental approach using element-wise multiplication based on the definition of dot product:
multiply corresponding
elements in two vectors and then sum all the output values. The downside of this approach is that you need
<strong>separate operations
for product and sum</strong> and it is <strong>slower</strong> than other methods we will discuss later.</p>
<p>Here is an example of dot product with two 1D arrays.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">a</span><span class="o">*</span><span class="n">b</span>
<span class="n">array</span><span class="p">([</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">18</span><span class="p">])</span>
<span class="o">>>></span> <span class="nb">sum</span><span class="p">(</span><span class="n">a</span><span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="mi">32</span>
</code></pre></div></div>
<p>Can we use the same <code class="language-plaintext highlighter-rouge">*</code> and <code class="language-plaintext highlighter-rouge">sum</code> operation for matrix multiplication? Let’s check it out.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">c</span><span class="o">*</span><span class="n">d</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span>
<span class="o">>>></span> <span class="nb">sum</span><span class="p">(</span><span class="n">c</span><span class="o">*</span><span class="n">d</span><span class="p">)</span>
<span class="n">array</span><span class="p">([</span><span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">9</span><span class="p">])</span>
</code></pre></div></div>
<p>Wait, it looks different from what we would get from our own calculation below!</p>
\[\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6
\end{bmatrix}
\begin{bmatrix}
1 \\
1 \\
1 \\
\end{bmatrix} =
\begin{bmatrix}
1 \times 1 + 2 \times 1 + 3 \times 1 \\
4 \times 1 + 5 \times 1 + 6 \times 1
\end{bmatrix}
=\begin{bmatrix}
6 \\
15
\end{bmatrix}\]
<p>So, it turns out that we need to be careful when we apply <code class="language-plaintext highlighter-rouge">sum</code> after the <code class="language-plaintext highlighter-rouge">*</code> operation.</p>
<p>Let’s look at it step by step. Here is what happened at <code class="language-plaintext highlighter-rouge">c*d</code>: each row of the 2D array $c$ is paired with the second array $d$ for element-wise multiplication (NumPy broadcasts $d$ across the rows of $c$).</p>
\[\begin{bmatrix}
[1 & 2 & 3] * [1 & 1 & 1] \\
[4 & 5 & 6] * [1 & 1 & 1]
\end{bmatrix} =
\begin{bmatrix}
[1 \times 1 & 2 \times 1 & 3 \times 1] \\
[4 \times 1 & 5 \times 1 & 6 \times 1]
\end{bmatrix}
=\begin{bmatrix}
1 \ 2 \ 3 \\
4 \ 5 \ 6
\end{bmatrix}\]
<p>And then, when we apply <code class="language-plaintext highlighter-rouge">sum</code>, Python’s built-in <code class="language-plaintext highlighter-rouge">sum</code> function iterates over the first axis and adds the rows together: $[1, 2, 3] + [4, 5, 6] = [5, 7, 9]$. But what we want is to sum the elements within each row. So we need to find an alternative to <code class="language-plaintext highlighter-rouge">sum</code>.</p>
<p>Here comes <code class="language-plaintext highlighter-rouge">np.sum</code> to the rescue. When we pass the parameter <code class="language-plaintext highlighter-rouge">axis=1</code>, it sums elements across columns within the same row. Note that the default is <code class="language-plaintext highlighter-rouge">axis=None</code>, which sums all the elements in the array (here, $1+2+3+4+5+6=21$), so we need to make sure we pass the <code class="language-plaintext highlighter-rouge">axis=1</code> parameter.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">c</span><span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">array</span><span class="p">([</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">15</span><span class="p">])</span>
</code></pre></div></div>
<p>Yes! This is what we expected.</p>
<p><a id="np.multiply"></a></p>
<h2 id="2-element-wise-multiplication-npmultiply-and-sum">(2) element-wise multiplication: np.multiply and sum</h2>
<p>Okay, then what about <code class="language-plaintext highlighter-rouge">np.multiply</code>? What does it do and is it different from <code class="language-plaintext highlighter-rouge">*</code>?</p>
<p><code class="language-plaintext highlighter-rouge">np.multiply</code> is basically the same as <code class="language-plaintext highlighter-rouge">*</code>. It is a <code class="language-plaintext highlighter-rouge">NumPy</code>’s version of element-wise
multiplication instead of Python’s native operator.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">c</span>
<span class="n">array</span><span class="p">([</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">18</span><span class="p">])</span>
<span class="o">>></span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">array</span><span class="p">([</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">15</span><span class="p">])</span>
</code></pre></div></div>
<p><a id="np.dot"></a></p>
<h2 id="3-dot-product-npdot">(3) dot product: np.dot</h2>
<p>Is there any option that we can avoid the additional line of <code class="language-plaintext highlighter-rouge">np.sum</code>? Yes, <code class="language-plaintext highlighter-rouge">np.dot</code> in NumPy! You
can use either <code class="language-plaintext highlighter-rouge">np.dot(a, b)</code> or <code class="language-plaintext highlighter-rouge">a.dot(b)</code> and
it <strong>takes care of both element multiplication and sum</strong>. Simple and easy.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="mi">32</span>
</code></pre></div></div>
<p>Great! Dot product in just one line of code. If
the
dimension
of the array is 2D
or higher, make sure the number of columns of the first array matches up with the number of rows in the second array.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="c1"># ValueError: shapes (1,3) and (1,3) not aligned: 3 (dim 1) != 1 (dim 0)
</span></code></pre></div></div>
<p>To make the above example work, you need to transpose the second array so that the shapes are aligned: (1, 3) x (3,
1). Note that this will return (1, 1), which is a 2D array.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">32</span><span class="p">]])</span>
</code></pre></div></div>
<p>As a side note, if you transpose the first array instead (i.e., <code class="language-plaintext highlighter-rouge">np.dot(a.T, b)</code>), you will get a (3, 3) array, which is the outer product instead of the inner product (dot product). So, make sure you transpose the right one.</p>
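<p>Continuing with the same <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> from above, here is what that mistake looks like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># (3, 1) x (1, 3) -> (3, 3): the outer product, not the dot product
>>> np.dot(a.T, b)
array([[ 4,  5,  6],
       [ 8, 10, 12],
       [12, 15, 18]])
</code></pre></div></div>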
<p>Now let’s try a 2D x 2D example as well with the following example. Will it work even if it’s called <code class="language-plaintext highlighter-rouge">dot</code> product?</p>
\[\begin{bmatrix}
1, 2, 3 \\
4, 5, 6
\end{bmatrix}
\begin{bmatrix}
1 \\
1 \\
1 \\
\end{bmatrix} =
\begin{bmatrix}
1 \times 1 + 2 \times 1 + 3 \times 1 \\
4 \times 1 + 5 \times 1 + 6 \times 1 \\
\end{bmatrix} =
\begin{bmatrix}
6 \\
15 \\
\end{bmatrix}\]
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (2, 3)
</span><span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">]])</span> <span class="c1"># shape (3, 1)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span>
<span class="n">array</span><span class="p">([[</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">15</span><span class="p">]])</span>
</code></pre></div></div>
<p>It works! Even though it is called <code class="language-plaintext highlighter-rouge">dot</code>, which by definition indicates that the inputs are 1D vectors and the output is a scalar, it works for 2D or higher dimensional matrices as if it were a matrix multiplication.</p>
<p>So, should we use <code class="language-plaintext highlighter-rouge">np.dot</code> for both dot product and matrix multiplication?</p>
<p>Technically yes, but it is not recommended to use <code class="language-plaintext highlighter-rouge">np.dot</code> for matrix multiplication because the name dot product has a specific meaning and it can be confusing to readers, especially mathematicians! <a href="https://blog.finxter.com/numpy-matmul-operator/#Python_@_Operator">(Reference)</a> Also, it is not recommended for high dimensional matrices (3D or above) because <code class="language-plaintext highlighter-rouge">np.dot</code> behaves differently from normal matrix multiplication. We will discuss this later in this post.</p>
<p>So, <strong><code class="language-plaintext highlighter-rouge">np.dot</code> works for both dot product and matrix multiplication but is recommended for dot product only.</strong></p>
<p><a id="np.matmul"></a></p>
<h2 id="4-matrix-multiplication-npmatmul">(4) matrix multiplication: np.matmul</h2>
<p>The next option is <code class="language-plaintext highlighter-rouge">np.matmul</code>. <strong>It is designed for matrix multiplication</strong> and even the name comes from it (<strong>MAT</strong>rix <strong>MUL</strong>tiplication). Although the name says matrix multiplication, it also works on 1D arrays and can do the dot product just like <code class="language-plaintext highlighter-rouge">np.dot</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 1D array
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="mi">32</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 2D array with values in 1 axis
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">32</span><span class="p">]])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># two 2D arrays
</span><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (2, 3)
</span><span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">]])</span> <span class="c1"># shape (3, 1)
</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span>
<span class="n">array</span><span class="p">([[</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">15</span><span class="p">]])</span>
</code></pre></div></div>
<p>Nice! So, this means both <code class="language-plaintext highlighter-rouge">np.dot</code> and <code class="language-plaintext highlighter-rouge">np.matmul</code> work perfectly for dot product and matrix multiplication. However,
as we said before, <strong>it is recommended to use <code class="language-plaintext highlighter-rouge">np.dot</code> for dot product and <code class="language-plaintext highlighter-rouge">np.matmul</code> for 2D or higher matrix
multiplication</strong>.</p>
<p><a id="@"></a></p>
<h2 id="5-matrix-multiplication-">(5) matrix multiplication: @</h2>
<p>Here comes our last but not least option, <code class="language-plaintext highlighter-rouge">@</code>! <code class="language-plaintext highlighter-rouge">@</code>, pronounced as [at], is a new Python operator introduced in Python 3.5, whose name comes from m<strong>AT</strong>rices. <strong>It is basically the same as <code class="language-plaintext highlighter-rouge">np.matmul</code> and designed to perform matrix multiplication</strong>. But why do we need a new infix operator if we already have <code class="language-plaintext highlighter-rouge">np.matmul</code> that works perfectly fine?</p>
<p>The major motivation for adding a new operator to the language was that matrix multiplication is such a common operation that it deserves its own infix. For example, the operator <code class="language-plaintext highlighter-rouge">//</code> is much less common than matrix multiplication but still has its own infix. To learn more about the background of this addition, check out <a href="https://www.python.org/dev/peps/pep-0465/">PEP 465</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 1D array
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">@</span> <span class="n">b</span>
<span class="mi">32</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 2D array with values in 1 axis
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape (1, 3)
</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">@</span> <span class="n">b</span><span class="p">.</span><span class="n">T</span>
<span class="n">array</span><span class="p">([[</span><span class="mi">32</span><span class="p">]])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 2D arrays
</span><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="c1"># shape: (2, 3)
</span><span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">]])</span> <span class="c1"># shape: (3, 1)
</span>
<span class="o">>>></span> <span class="n">c</span> <span class="o">@</span> <span class="n">d</span>
<span class="n">array</span><span class="p">([[</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">15</span><span class="p">]])</span>
</code></pre></div></div>
<p>So, <strong><code class="language-plaintext highlighter-rouge">@</code> works exactly the same as <code class="language-plaintext highlighter-rouge">np.matmul</code></strong>. But which one should you use between <code class="language-plaintext highlighter-rouge">np.matmul</code> and <code class="language-plaintext highlighter-rouge">@</code> then? Although it is your preference, <code class="language-plaintext highlighter-rouge">@</code> looks cleaner than <code class="language-plaintext highlighter-rouge">np.matmul</code> in code. Let us see a case where we have three matrices $x, y, z$ to multiply.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># `np.matmul` version
</span><span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="n">z</span><span class="p">)</span>
<span class="c1"># `@` version
</span><span class="n">x</span> <span class="o">@</span> <span class="n">y</span> <span class="o">@</span> <span class="n">z</span>
</code></pre></div></div>
<p>As you can see, <strong><code class="language-plaintext highlighter-rouge">@</code> is much cleaner and more readable. However, as it is available only in Python 3.5+, you have to use <code class="language-plaintext highlighter-rouge">np.matmul</code> if you use an earlier Python version</strong>. Note that <code class="language-plaintext highlighter-rouge">@</code> is left-associative, so <code class="language-plaintext highlighter-rouge">x @ y @ z</code> evaluates as <code class="language-plaintext highlighter-rouge">(x @ y) @ z</code>.</p>
<p><a id="dot_vs_matmul"></a></p>
<h2 id="3-so-whats-with-npdot-vs-npmatmul-">3. So.. what’s with np.dot vs. np.matmul (@)?</h2>
<p>In the above section, I mentioned that <code class="language-plaintext highlighter-rouge">np.dot</code> is not recommended for high-dimensional arrays. What do I mean by that?</p>
<p>There was an interesting <a href="https://stackoverflow.com/questions/34142485/difference-between-numpy-dot-and-python-3-5-matrix-multiplication">question</a> on Stack Overflow about the different behaviors of <code class="language-plaintext highlighter-rouge">np.dot</code> and <code class="language-plaintext highlighter-rouge">@</code>. Let's look at it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define input arrays
</span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># 2 rows, 2 columns, in 3 layers
</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># 2 rows, 2 columns, in 3 layers
</span>
<span class="c1"># perform matrix multiplication
</span><span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">a</span> <span class="o">@</span> <span class="n">b</span> <span class="c1"># Python 3.5+
</span>
<span class="o">>>></span> <span class="n">c</span><span class="p">.</span><span class="n">shape</span> <span class="c1"># np.dot
</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">d</span><span class="p">.</span><span class="n">shape</span> <span class="c1"># @
</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<p>With the same inputs, we get completely different outputs - a 4-D array for <code class="language-plaintext highlighter-rouge">np.dot</code> and a 3-D array for <code class="language-plaintext highlighter-rouge">@</code>. What happened? This is because of the way <code class="language-plaintext highlighter-rouge">np.dot</code> and <code class="language-plaintext highlighter-rouge">@</code> are designed. Based on their definitions:</p>
<p>For <code class="language-plaintext highlighter-rouge">matmul</code>:</p>
<blockquote>
<p>If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.</p>
</blockquote>
<p>For <code class="language-plaintext highlighter-rouge">np.dot</code>:</p>
<blockquote>
<p>For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b</p>
</blockquote>
<blockquote>
<p>If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:</p>
</blockquote>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])</code></p>
</blockquote>
<p><strong>Long story short, in the normal matrix multiplication situation, where we want each array to be treated as a stack of matrices residing in the last two indexes, we should use <code class="language-plaintext highlighter-rouge">matmul</code></strong>.</p>
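<p>To make this concrete, here is a small sanity check (my own sketch, not from the NumPy docs) showing that for 3-D inputs <code class="language-plaintext highlighter-rouge">matmul</code> multiplies the stacked matrices layer by layer, while <code class="language-plaintext highlighter-rouge">np.dot</code> pairs every layer of the first array with every layer of the second:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

a = np.random.rand(3, 2, 2)  # a stack of three 2x2 matrices
b = np.random.rand(3, 2, 2)  # another stack of three 2x2 matrices

# build the layer-by-layer 2-D matrix products by hand
stacked = np.array([a[i] @ b[i] for i in range(3)])

print(np.allclose(a @ b, stacked))                        # True: matmul broadcasts over the stack
print(np.allclose(np.dot(a, b)[0, :, 0, :], (a @ b)[0]))  # True: np.dot pairs layers explicitly
</code></pre></div></div>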
<p><a id="summary"></a></p>
<h1 id="4-summary">4. Summary</h1>
<ul>
<li><code class="language-plaintext highlighter-rouge">*</code> == <code class="language-plaintext highlighter-rouge">np.multiply</code> != <code class="language-plaintext highlighter-rouge">np.dot</code> != <code class="language-plaintext highlighter-rouge">np.matmul</code> == <code class="language-plaintext highlighter-rouge">@</code></li>
<li><code class="language-plaintext highlighter-rouge">*</code> and <code class="language-plaintext highlighter-rouge">np.multiply</code> need <code class="language-plaintext highlighter-rouge">np.sum</code> to perform a dot product. Not recommended for the dot product or matrix multiplication.</li>
<li><code class="language-plaintext highlighter-rouge">np.dot</code> works for both the dot product and matrix multiplication. However, it is best avoided for matrix multiplication because its name suggests the dot product.</li>
<li><code class="language-plaintext highlighter-rouge">np.matmul</code> and <code class="language-plaintext highlighter-rouge">@</code> are the same thing, designed to perform matrix multiplication. <code class="language-plaintext highlighter-rouge">@</code> was added in Python 3.5+ to give matrix multiplication its own infix operator.</li>
<li><code class="language-plaintext highlighter-rouge">np.dot</code> and <code class="language-plaintext highlighter-rouge">np.matmul</code> generally behave similarly apart from two exceptions: 1) <code class="language-plaintext highlighter-rouge">matmul</code> doesn't allow multiplication by a scalar, and 2) the calculation is done differently for N > 2 dimensions. Check the documentation for whichever one you intend to use.</li>
</ul>
<p>One line summary:</p>
<ul>
<li><strong>For dot product, use <code class="language-plaintext highlighter-rouge">np.dot</code>. For matrix multiplication, use <code class="language-plaintext highlighter-rouge">@</code> for Python 3.5 or above, and <code class="language-plaintext highlighter-rouge">np.matmul</code> for earlier Python versions.</strong></li>
</ul>
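<p>To make the summary concrete, here is a minimal side-by-side sketch (the values are made up purely for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

x * y              # element-wise: array([[ 5, 12], [21, 32]])
np.multiply(x, y)  # same as `*`
np.sum(x * y)      # 70 -- element-wise product plus np.sum mimics a dot product
x @ y              # matrix product: array([[19, 22], [43, 50]])
np.matmul(x, y)    # same as `@`
np.dot(x, y)       # same result, but only because the inputs are 2-D
</code></pre></div></div>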
<p><a id="reference"></a></p>
<h1 id="5-reference">5. Reference</h1>
<ul>
<li><a href="https://blog.finxter.com/numpy-matmul-operator/">NumPy Matrix Multiplication — np.matmul() and @</a></li>
<li><a href="https://numpy.org/doc/stable/reference/generated/numpy.dot.html">numpy.dot official document</a></li>
<li><a href="https://www.python.org/dev/peps/pep-0465/">PEP 465 – A dedicated infix operator for matrix multiplication</a></li>
<li><a href="https://stackoverflow.com/questions/34142485/difference-between-numpy-dot-and-python-3-5-matrix-multiplication">Difference between numpy dot() and Python 3.5+ matrix multiplication @</a></li>
<li><a href="https://en.wikipedia.org/wiki/Dot_product">Wikipedia</a></li>
</ul>What Are Dot Product and Matrix Multiplication?2020-08-23T19:56:00+00:002020-08-23T19:56:00+00:00/python/2020/08/23/dot-product<p><a id="dot_product"></a></p>
<h1 id="1-what-is-dot-prodcut">1. What is dot prodcut?</h1>
<p>The dot product is an algebraic operation that takes <strong>two same-sized vectors</strong> and returns <strong>a single number</strong>.</p>
<p><strong>Algebraic definition</strong><br />
For two sequences of numbers, the dot product is the sum of the products of their corresponding components. Consider two sequences $a$ and $b$ as below.</p>
\[a =
\begin{bmatrix}
a_1 & a_2 & \dots & a_n
\end{bmatrix} \\
b =
\begin{bmatrix}
b_1 & b_2 & \dots & b_n
\end{bmatrix}\]
<p>Then, the dot product of $a$ and $b$ becomes</p>
\[a \cdot b = \sum_{i=1}^{n} a_i b_i\]
<p>If $a$ and $b$ are row vectors, the dot product can be written as a matrix product.
\(a \cdot b = ab^\intercal\)</p>
<p>For example, if $a = [a_1 \ a_2 \ a_3]$ and $b = [b_1 \ b_2 \ b_3]$, it becomes</p>
\[[a_1 \ a_2 \ a_3]
\begin{bmatrix}
b_1 \\
b_2 \\
b_3
\end{bmatrix}
=a_1b_1 + a_2b_2 + a_3b_3\]
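<p>We can verify this definition numerically with a quick NumPy sketch (the vectors are made up for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# the summation formula, written out by hand
sum(a_i * b_i for a_i, b_i in zip(a, b))  # 1*4 + 2*5 + 3*6 = 32

np.dot(a, b)  # 32, the NumPy equivalent
</code></pre></div></div>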
<p><strong>Geometric definition</strong><br />
Geometrically, the dot product is the product of the Euclidean magnitudes of two vectors and the cosine of the angle between them.</p>
\[a \cdot b = \vert a \vert \vert b \vert \cos \theta\]
<p>Note that it is based on how much of one vector lies in the direction of the other (projection). For example, in the figure below, the component of $A$ that is in the $B$ direction is $\vert A \vert \cos \theta$. Here, the magnitude of $A$ can be calculated by $\vert A \vert = \sqrt{x^2 + y^2}$ if $A = (x, y)$ and the initial point is the origin.</p>
<div style="text-align:center"><img src="/images/python/Dot_Product.svg" />
<figcaption>Fig 1. Projection of A onto B<a href="https://en.wikipedia.org/wiki/Dot_product"> (Wikipedia)
</a></figcaption>
</div>
<p><br /></p>
<p>Also note that if the two vectors are in the same direction, $\cos \theta = \cos 0^{\circ} = 1$, so the dot product simply becomes the product of the magnitudes of the two vectors, $a \cdot b = \vert a \vert \vert b \vert$. On the other hand, if the two vectors are perpendicular, the whole dot product becomes 0 because $\cos \theta = \cos 90^{\circ} = 0$.</p>
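<p>As a quick sanity check of the geometric definition, we can recover $\cos \theta$ from the algebraic dot product (a sketch with made-up vectors):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

a = np.array([3.0, 0.0])  # lies on the x-axis
b = np.array([2.0, 2.0])  # 45 degrees above the x-axis

cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)  # ~0.7071, i.e. cos 45 degrees

# |a| * |b| * cos(theta) recovers the dot product, as the formula says
print(np.linalg.norm(a) * np.linalg.norm(b) * cos_theta)  # 6.0 == np.dot(a, b)
</code></pre></div></div>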
<p><strong>Real world example</strong><br />
So what does the dot product really mean to us? How can we use it in real life?<br />
Imagine you are in a grocery store. You want to buy 1 apple, 2 oranges, and 3 bananas. The unit prices are \$1, \$2, and \$0.5, respectively.</p>
<div style="text-align:center">
<img src="/images/python/apple_orange_banana.jpg" alt="drawing" width="300" />
<figcaption>Fig 2. Apple, orange, and banana
<a href="https://www.thestar
.com/life/food_wine/2013/11/04/apples_oranges_or_bananas_which_fruit_is_nutritionally_the_best.html">(image source)</a>
</figcaption>
</div>
<p><br /></p>
<p>You can define a number-of-items vector ($a$) and a unit-price vector ($b$).</p>
\[a = \begin{bmatrix}1 & 2 & 3 \end{bmatrix}\\
b = \begin{bmatrix}\$1 & \$2 & \$0.5\end{bmatrix}\]
<p>The total cost will be the dot product of the two vectors:</p>
\[ab^\intercal =
\begin{bmatrix}
1 & 2 & 3
\end{bmatrix}
\begin{bmatrix}
\$1 \\
\$2 \\
\$0.5
\end{bmatrix}
=1 \times \$1 + 2 \times \$2 + 3 \times \$0.5 = \$6.5 \\\]
<p>Ta-da! Your total is \$6.5! Now we can see that the dot product is actually useful in everyday life, right?</p>
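<p>In NumPy, this is a one-liner (a tiny sketch of the same calculation):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

quantities = np.array([1, 2, 3])         # apples, oranges, bananas
unit_prices = np.array([1.0, 2.0, 0.5])  # dollars per fruit

np.dot(quantities, unit_prices)  # 6.5 -- the total bill
</code></pre></div></div>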
<p><a id="matrix_multiplication"></a></p>
<h1 id="2-what-is-matrix-multiplication">2. What is matrix multiplication?</h1>
<p>Now that we know what the dot product is, let's talk about matrix multiplication. How is it different from the dot product?</p>
<p>Matrix multiplication is basically a matrix version of the dot product. Remember that the result of a dot product is a scalar. The <strong>result of matrix multiplication is a matrix</strong>, whose elements are the dot products of the rows of the first matrix and the columns of the second.</p>
<div style="text-align:center">
<img src="/images/python/khan_academy_matrix_product.png" />
<figcaption>
Fig 3. Matrix multiplication
<a href="https://ml-cheatsheet.readthedocs.io/en/latest/linear_algebra.html">(image source)</a>
</figcaption>
</div>
<p><br /></p>
<p>Note that the number of columns in $A$ and the number of rows in $B$ should match: if $A$ is $(m \times n)$ and $B$ is $(n \times k)$, the result is $(m \times k)$.</p>
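<p>NumPy enforces this dimension rule for us; here is a quick sketch (the shapes are chosen only for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

A = np.ones((2, 3))  # m x n
B = np.ones((3, 4))  # n x k

print((A @ B).shape)  # (2, 4): inner dimensions match, result is m x k

np.ones((2, 3)) @ np.ones((4, 5))  # raises ValueError: inner dimensions 3 and 4 do not match
</code></pre></div></div>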
<p><strong>Grocery example</strong><br />
Let’s go back to the previous grocery store example. Now there are two people who want to buy different numbers of apples, oranges, and bananas.</p>
<p>Person 1 wants 1 of each fruit: $a_1 = [1 \ \ 1 \ \ 1]$<br />
Person 2 wants 10 of each fruit: $a_2 = [10 \ \ 10 \ \ 10]$</p>
<p>How much should each person pay? Can we repeat the dot product? Absolutely! But instead of doing the dot product twice, we can stack the vectors into a matrix, and that is simply matrix multiplication!</p>
<p>The number of apples, oranges, and bananas to buy:</p>
\[A=
\begin{bmatrix}
a_1\\
a_2
\end{bmatrix}=
\begin{bmatrix}
1 & 1 & 1\\
10 & 10 & 10\\
\end{bmatrix}\]
<p>Now, for the unit price vector $b$, we need to transpose it to make it a column vector.</p>
\[B =
\begin{bmatrix}
\$1\\
\$2\\
\$0.5
\end{bmatrix}\]
<p>Now the total price each person has to pay is:</p>
\[A \cdot B =
\begin{bmatrix}
1 & 1 & 1\\
10 & 10 & 10
\end{bmatrix}
\begin{bmatrix}
\$1\\
\$2\\
\$0.5
\end{bmatrix} =
\begin{bmatrix}
1 \times \$1 + 1 \times \$2 + 1 \times \$0.5 \\
10 \times \$1 + 10 \times \$2 + 10 \times \$0.5
\end{bmatrix} =
\begin{bmatrix}
\$3.5 \\
\$35
\end{bmatrix}\]
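<p>Here is the same calculation in NumPy (a minimal sketch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

A = np.array([[1, 1, 1],
              [10, 10, 10]])  # quantities per person (one row each)
B = np.array([[1.0],
              [2.0],
              [0.5]])         # unit prices as a column vector

A @ B  # array([[ 3.5], [35. ]]) -- person 1 pays $3.5, person 2 pays $35
</code></pre></div></div>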
<p>YAY :tada:! With just one simple matrix multiplication, we found that person 1 should pay \$3.5 and person 2 should pay \$35! You will now use matrix multiplication when you go grocery shopping, right? :wink:</p>