# Libraries
```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.impute import SimpleImputer, KNNImputer
```
# Load data
```python
df = sns.load_dataset('mpg')
df.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mpg</th>
<th>cylinders</th>
<th>displacement</th>
<th>horsepower</th>
<th>weight</th>
<th>acceleration</th>
<th>model_year</th>
<th>origin</th>
<th>name</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>18.0</td>
<td>8</td>
<td>307.0</td>
<td>130.0</td>
<td>3504</td>
<td>12.0</td>
<td>70</td>
<td>usa</td>
<td>chevrolet chevelle malibu</td>
</tr>
<tr>
<th>1</th>
<td>15.0</td>
<td>8</td>
<td>350.0</td>
<td>165.0</td>
<td>3693</td>
<td>11.5</td>
<td>70</td>
<td>usa</td>
<td>buick skylark 320</td>
</tr>
<tr>
<th>2</th>
<td>18.0</td>
<td>8</td>
<td>318.0</td>
<td>150.0</td>
<td>3436</td>
<td>11.0</td>
<td>70</td>
<td>usa</td>
<td>plymouth satellite</td>
</tr>
<tr>
<th>3</th>
<td>16.0</td>
<td>8</td>
<td>304.0</td>
<td>150.0</td>
<td>3433</td>
<td>12.0</td>
<td>70</td>
<td>usa</td>
<td>amc rebel sst</td>
</tr>
<tr>
<th>4</th>
<td>17.0</td>
<td>8</td>
<td>302.0</td>
<td>140.0</td>
<td>3449</td>
<td>10.5</td>
<td>70</td>
<td>usa</td>
<td>ford torino</td>
</tr>
</tbody>
</table>
</div>
# NaNs
```python
df[df.isna().any(axis=1)]
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mpg</th>
<th>cylinders</th>
<th>displacement</th>
<th>horsepower</th>
<th>weight</th>
<th>acceleration</th>
<th>model_year</th>
<th>origin</th>
<th>name</th>
</tr>
</thead>
<tbody>
<tr>
<th>32</th>
<td>25.0</td>
<td>4</td>
<td>98.0</td>
<td>NaN</td>
<td>2046</td>
<td>19.0</td>
<td>71</td>
<td>usa</td>
<td>ford pinto</td>
</tr>
<tr>
<th>126</th>
<td>21.0</td>
<td>6</td>
<td>200.0</td>
<td>NaN</td>
<td>2875</td>
<td>17.0</td>
<td>74</td>
<td>usa</td>
<td>ford maverick</td>
</tr>
<tr>
<th>330</th>
<td>40.9</td>
<td>4</td>
<td>85.0</td>
<td>NaN</td>
<td>1835</td>
<td>17.3</td>
<td>80</td>
<td>europe</td>
<td>renault lecar deluxe</td>
</tr>
<tr>
<th>336</th>
<td>23.6</td>
<td>4</td>
<td>140.0</td>
<td>NaN</td>
<td>2905</td>
<td>14.3</td>
<td>80</td>
<td>usa</td>
<td>ford mustang cobra</td>
</tr>
<tr>
<th>354</th>
<td>34.5</td>
<td>4</td>
<td>100.0</td>
<td>NaN</td>
<td>2320</td>
<td>15.8</td>
<td>81</td>
<td>europe</td>
<td>renault 18i</td>
</tr>
<tr>
<th>374</th>
<td>23.0</td>
<td>4</td>
<td>151.0</td>
<td>NaN</td>
<td>3035</td>
<td>20.5</td>
<td>82</td>
<td>usa</td>
<td>amc concord dl</td>
</tr>
</tbody>
</table>
</div>
The result shows 6 rows with missing values, all in the `horsepower` column.
```python
# Keep the index labels of the incomplete rows so they can be inspected after imputation
na_index = df[df.isna().any(axis=1)].index
```
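A quick way to confirm which columns hold the gaps is to count missing values per column. The sketch below uses a small hand-built frame (`toy`, a hypothetical name) with three rows copied from the tables above, so it runs without downloading the dataset; the same two calls should apply to `df` unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration: rows 0, 32 and 126 of the mpg data, three columns only
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": [130.0, np.nan, np.nan],
    "name": ["chevrolet chevelle malibu", "ford pinto", "ford maverick"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values
cols_with_missing = missing_per_column[missing_per_column > 0]
print(cols_with_missing)
```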
# SimpleImputer
```python
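# 'most_frequent' fills each column with its mode, so it also works on string columns
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(df)
# transform returns a NumPy array; wrap it to restore the column names
pd.DataFrame(imputer.transform(df), columns=df.columns)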
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mpg</th>
<th>cylinders</th>
<th>displacement</th>
<th>horsepower</th>
<th>weight</th>
<th>acceleration</th>
<th>model_year</th>
<th>origin</th>
<th>name</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>18.0</td>
<td>8</td>
<td>307.0</td>
<td>130.0</td>
<td>3504</td>
<td>12.0</td>
<td>70</td>
<td>usa</td>
<td>chevrolet chevelle malibu</td>
</tr>
<tr>
<th>1</th>
<td>15.0</td>
<td>8</td>
<td>350.0</td>
<td>165.0</td>
<td>3693</td>
<td>11.5</td>
<td>70</td>
<td>usa</td>
<td>buick skylark 320</td>
</tr>
<tr>
<th>2</th>
<td>18.0</td>
<td>8</td>
<td>318.0</td>
<td>150.0</td>
<td>3436</td>
<td>11.0</td>
<td>70</td>
<td>usa</td>
<td>plymouth satellite</td>
</tr>
<tr>
<th>3</th>
<td>16.0</td>
<td>8</td>
<td>304.0</td>
<td>150.0</td>
<td>3433</td>
<td>12.0</td>
<td>70</td>
<td>usa</td>
<td>amc rebel sst</td>
</tr>
<tr>
<th>4</th>
<td>17.0</td>
<td>8</td>
<td>302.0</td>
<td>140.0</td>
<td>3449</td>
<td>10.5</td>
<td>70</td>
<td>usa</td>
<td>ford torino</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>393</th>
<td>27.0</td>
<td>4</td>
<td>140.0</td>
<td>86.0</td>
<td>2790</td>
<td>15.6</td>
<td>82</td>
<td>usa</td>
<td>ford mustang gl</td>
</tr>
<tr>
<th>394</th>
<td>44.0</td>
<td>4</td>
<td>97.0</td>
<td>52.0</td>
<td>2130</td>
<td>24.6</td>
<td>82</td>
<td>europe</td>
<td>vw pickup</td>
</tr>
<tr>
<th>395</th>
<td>32.0</td>
<td>4</td>
<td>135.0</td>
<td>84.0</td>
<td>2295</td>
<td>11.6</td>
<td>82</td>
<td>usa</td>
<td>dodge rampage</td>
</tr>
<tr>
<th>396</th>
<td>28.0</td>
<td>4</td>
<td>120.0</td>
<td>79.0</td>
<td>2625</td>
<td>18.6</td>
<td>82</td>
<td>usa</td>
<td>ford ranger</td>
</tr>
<tr>
<th>397</th>
<td>31.0</td>
<td>4</td>
<td>119.0</td>
<td>82.0</td>
<td>2720</td>
<td>19.4</td>
<td>82</td>
<td>usa</td>
<td>chevy s-10</td>
</tr>
</tbody>
</table>
<p>398 rows × 9 columns</p>
</div>
Which values replaced the missing ones?
```python
pd.DataFrame(imputer.transform(df), columns=df.columns).iloc[na_index, :]
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mpg</th>
<th>cylinders</th>
<th>displacement</th>
<th>horsepower</th>
<th>weight</th>
<th>acceleration</th>
<th>model_year</th>
<th>origin</th>
<th>name</th>
</tr>
</thead>
<tbody>
<tr>
<th>32</th>
<td>25.0</td>
<td>4</td>
<td>98.0</td>
<td>150.0</td>
<td>2046</td>
<td>19.0</td>
<td>71</td>
<td>usa</td>
<td>ford pinto</td>
</tr>
<tr>
<th>126</th>
<td>21.0</td>
<td>6</td>
<td>200.0</td>
<td>150.0</td>
<td>2875</td>
<td>17.0</td>
<td>74</td>
<td>usa</td>
<td>ford maverick</td>
</tr>
<tr>
<th>330</th>
<td>40.9</td>
<td>4</td>
<td>85.0</td>
<td>150.0</td>
<td>1835</td>
<td>17.3</td>
<td>80</td>
<td>europe</td>
<td>renault lecar deluxe</td>
</tr>
<tr>
<th>336</th>
<td>23.6</td>
<td>4</td>
<td>140.0</td>
<td>150.0</td>
<td>2905</td>
<td>14.3</td>
<td>80</td>
<td>usa</td>
<td>ford mustang cobra</td>
</tr>
<tr>
<th>354</th>
<td>34.5</td>
<td>4</td>
<td>100.0</td>
<td>150.0</td>
<td>2320</td>
<td>15.8</td>
<td>81</td>
<td>europe</td>
<td>renault 18i</td>
</tr>
<tr>
<th>374</th>
<td>23.0</td>
<td>4</td>
<td>151.0</td>
<td>150.0</td>
<td>3035</td>
<td>20.5</td>
<td>82</td>
<td>usa</td>
<td>amc concord dl</td>
</tr>
</tbody>
</table>
</div>
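
Every missing `horsepower` entry received the same value because `strategy='most_frequent'` fills a column with its mode. A fitted `SimpleImputer` stores the per-column fill values in its `statistics_` attribute, which can be read directly instead of inspecting the transformed rows. The sketch below is a minimal illustration on a made-up column (`toy` and its values are assumptions, not the mpg data).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column for illustration: 150.0 appears twice, so it is the mode
toy = pd.DataFrame({"horsepower": [130.0, 150.0, 150.0, 140.0, np.nan]})

imputer = SimpleImputer(strategy="most_frequent").fit(toy)

# statistics_ holds one fill value per column, in column order
fill_value = imputer.statistics_[0]

# pandas computes the same mode (NaN is ignored by default)
mode_value = toy["horsepower"].mode().iloc[0]

print(fill_value, mode_value)
```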
To replace missing values with the mean, first select the numeric columns, because a mean cannot be computed for string columns such as `origin` and `name`.
```python
df_numeric = df.select_dtypes(include="number")
imputer = SimpleImputer(strategy='mean')
imputer.fit(df_numeric)
pd.DataFrame(imputer.transform(df_numeric), columns=df_numeric.columns).iloc[na_index, :]
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mpg</th>
<th>cylinders</th>
<th>displacement</th>
<th>horsepower</th>
<th>weight</th>
<th>acceleration</th>
<th>model_year</th>
</tr>
</thead>
<tbody>
<tr>
<th>32</th>
<td>25.0</td>
<td>4.0</td>
<td>98.0</td>
<td>104.469388</td>
<td>2046.0</td>
<td>19.0</td>
<td>71.0</td>
</tr>
<tr>
<th>126</th>
<td>21.0</td>
<td>6.0</td>
<td>200.0</td>
<td>104.469388</td>
<td>2875.0</td>
<td>17.0</td>
<td>74.0</td>
</tr>
<tr>
<th>330</th>
<td>40.9</td>
<td>4.0</td>
<td>85.0</td>
<td>104.469388</td>
<td>1835.0</td>
<td>17.3</td>
<td>80.0</td>
</tr>
<tr>
<th>336</th>
<td>23.6</td>
<td>4.0</td>
<td>140.0</td>
<td>104.469388</td>
<td>2905.0</td>
<td>14.3</td>
<td>80.0</td>
</tr>
<tr>
<th>354</th>
<td>34.5</td>
<td>4.0</td>
<td>100.0</td>
<td>104.469388</td>
<td>2320.0</td>
<td>15.8</td>
<td>81.0</td>
</tr>
<tr>
<th>374</th>
<td>23.0</td>
<td>4.0</td>
<td>151.0</td>
<td>104.469388</td>
<td>3035.0</td>
<td>20.5</td>
<td>82.0</td>
</tr>
</tbody>
</table>
</div>
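
Selecting only the numeric columns drops `origin` and `name` from the result. If those columns should be kept, one possible approach is scikit-learn's `ColumnTransformer`, which applies a different imputer to each group of columns. The sketch below is an assumption about how that could look, not part of the original notebook: `toy`, `numeric_cols` and `other_cols` are hypothetical names, and the missing `origin` value is invented so that both strategies have something to fill.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame for illustration; the NaN in 'origin' does not exist in the real data
toy = pd.DataFrame({
    "horsepower": [130.0, 165.0, 150.0, np.nan],
    "weight": [3504.0, 3693.0, 3436.0, 2046.0],
    "origin": pd.Series(["usa", "usa", "europe", np.nan], dtype=object),
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = [c for c in toy.columns if c not in numeric_cols]

# Mean for numeric columns, mode for everything else
ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), other_cols),
])

# The output columns follow the transformer order: numeric first, then the rest
filled = pd.DataFrame(
    ct.fit_transform(toy),
    columns=numeric_cols + other_cols,
    index=toy.index,
)
print(filled)
```

Because numeric and string results are stacked into one array, the resulting frame has `object` dtype; the numeric columns may need to be cast back with `astype(float)` before further computation.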
# KNNImputer
```python
imputer = KNNImputer(n_neighbors=5)
imputer.fit(df_numeric)
pd.DataFrame(imputer.transform(df_numeric), columns=df_numeric.columns).iloc[na_index, :]
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mpg</th>
<th>cylinders</th>
<th>displacement</th>
<th>horsepower</th>
<th>weight</th>
<th>acceleration</th>
<th>model_year</th>
</tr>
</thead>
<tbody>
<tr>
<th>32</th>
<td>25.0</td>
<td>4.0</td>
<td>98.0</td>
<td>62.0</td>
<td>2046.0</td>
<td>19.0</td>
<td>71.0</td>
</tr>
<tr>
<th>126</th>
<td>21.0</td>
<td>6.0</td>
<td>200.0</td>
<td>107.6</td>
<td>2875.0</td>
<td>17.0</td>
<td>74.0</td>
</tr>
<tr>
<th>330</th>
<td>40.9</td>
<td>4.0</td>
<td>85.0</td>
<td>64.6</td>
<td>1835.0</td>
<td>17.3</td>
<td>80.0</td>
</tr>
<tr>
<th>336</th>
<td>23.6</td>
<td>4.0</td>
<td>140.0</td>
<td>112.8</td>
<td>2905.0</td>
<td>14.3</td>
<td>80.0</td>
</tr>
<tr>
<th>354</th>
<td>34.5</td>
<td>4.0</td>
<td>100.0</td>
<td>76.0</td>
<td>2320.0</td>
<td>15.8</td>
<td>81.0</td>
</tr>
<tr>
<th>374</th>
<td>23.0</td>
<td>4.0</td>
<td>151.0</td>
<td>88.2</td>
<td>3035.0</td>
<td>20.5</td>
<td>82.0</td>
</tr>
</tbody>
</table>
</div>
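
Unlike the mean or the mode, the KNN result differs from row to row. With its default settings, `KNNImputer` fills a gap with the average of that feature over the `n_neighbors` closest rows, where distance is a NaN-aware Euclidean distance computed on the features both rows have observed. The sketch below illustrates this on a tiny made-up array (`X` and its values are assumptions chosen so the neighbours are easy to see).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: column 1 is 10x column 0, and the last row is missing column 1
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.1, np.nan],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

# On column 0, the two rows closest to 2.1 are 2.0 and 3.0,
# so the gap is filled with the mean of their column-1 values, 20.0 and 30.0
imputed_value = X_imputed[4, 1]
print(imputed_value)
```

The distance is computed on the raw feature values, so in the mpg data a large-scale column such as `weight` probably dominates the choice of neighbours. Standardising the features before imputing may be worth considering.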