imputers

# Libraries

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.impute import SimpleImputer, KNNImputer
```

# Load data

```python
df = sns.load_dataset('mpg')
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>mpg</th> <th>cylinders</th> <th>displacement</th> <th>horsepower</th> <th>weight</th> <th>acceleration</th> <th>model_year</th> <th>origin</th> <th>name</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>18.0</td> <td>8</td> <td>307.0</td> <td>130.0</td> <td>3504</td> <td>12.0</td> <td>70</td> <td>usa</td> <td>chevrolet chevelle malibu</td> </tr>
    <tr> <th>1</th> <td>15.0</td> <td>8</td> <td>350.0</td> <td>165.0</td> <td>3693</td> <td>11.5</td> <td>70</td> <td>usa</td> <td>buick skylark 320</td> </tr>
    <tr> <th>2</th> <td>18.0</td> <td>8</td> <td>318.0</td> <td>150.0</td> <td>3436</td> <td>11.0</td> <td>70</td> <td>usa</td> <td>plymouth satellite</td> </tr>
    <tr> <th>3</th> <td>16.0</td> <td>8</td> <td>304.0</td> <td>150.0</td> <td>3433</td> <td>12.0</td> <td>70</td> <td>usa</td> <td>amc rebel sst</td> </tr>
    <tr> <th>4</th> <td>17.0</td> <td>8</td> <td>302.0</td> <td>140.0</td> <td>3449</td> <td>10.5</td> <td>70</td> <td>usa</td> <td>ford torino</td> </tr>
  </tbody>
</table>
</div>

# NaNs

```python
# Rows that contain at least one missing value
df[df.isna().any(axis=1)]
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>mpg</th> <th>cylinders</th> <th>displacement</th> <th>horsepower</th> <th>weight</th> <th>acceleration</th> <th>model_year</th> <th>origin</th> <th>name</th> </tr>
  </thead>
  <tbody>
    <tr> <th>32</th> <td>25.0</td> <td>4</td> <td>98.0</td> <td>NaN</td> <td>2046</td> <td>19.0</td> <td>71</td> <td>usa</td> <td>ford pinto</td> </tr>
    <tr> <th>126</th> <td>21.0</td> <td>6</td> <td>200.0</td> <td>NaN</td> <td>2875</td> <td>17.0</td> <td>74</td> <td>usa</td> <td>ford maverick</td> </tr>
    <tr> <th>330</th> <td>40.9</td> <td>4</td> <td>85.0</td> <td>NaN</td> <td>1835</td> <td>17.3</td> <td>80</td> <td>europe</td> <td>renault lecar deluxe</td> </tr>
    <tr> <th>336</th> <td>23.6</td> <td>4</td> <td>140.0</td> <td>NaN</td> <td>2905</td> <td>14.3</td> <td>80</td> <td>usa</td> <td>ford mustang cobra</td> </tr>
    <tr> <th>354</th> <td>34.5</td> <td>4</td> <td>100.0</td> <td>NaN</td> <td>2320</td> <td>15.8</td> <td>81</td> <td>europe</td> <td>renault 18i</td> </tr>
    <tr> <th>374</th> <td>23.0</td> <td>4</td> <td>151.0</td> <td>NaN</td> <td>3035</td> <td>20.5</td> <td>82</td> <td>usa</td> <td>amc concord dl</td> </tr>
  </tbody>
</table>
</div>

--> 6 rows have missing values, all in the `horsepower` column.

```python
# Keep the index of the incomplete rows to inspect them after imputation
na_index = df[df.isna().any(axis=1)].index
```

# SimpleImputer

```python
# 'most_frequent' works on both numeric and string columns
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(df)
pd.DataFrame(imputer.transform(df), columns=df.columns)
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>mpg</th> <th>cylinders</th> <th>displacement</th> <th>horsepower</th> <th>weight</th> <th>acceleration</th> <th>model_year</th> <th>origin</th> <th>name</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>18.0</td> <td>8</td> <td>307.0</td> <td>130.0</td> <td>3504</td> <td>12.0</td> <td>70</td> <td>usa</td> <td>chevrolet chevelle malibu</td> </tr>
    <tr> <th>1</th> <td>15.0</td> <td>8</td> <td>350.0</td> <td>165.0</td> <td>3693</td> <td>11.5</td> <td>70</td> <td>usa</td> <td>buick skylark 320</td> </tr>
    <tr> <th>2</th> <td>18.0</td> <td>8</td> <td>318.0</td> <td>150.0</td> <td>3436</td> <td>11.0</td> <td>70</td> <td>usa</td> <td>plymouth satellite</td> </tr>
    <tr> <th>3</th> <td>16.0</td> <td>8</td> <td>304.0</td> <td>150.0</td> <td>3433</td> <td>12.0</td> <td>70</td> <td>usa</td> <td>amc rebel sst</td> </tr>
    <tr> <th>4</th> <td>17.0</td> <td>8</td> <td>302.0</td> <td>140.0</td> <td>3449</td> <td>10.5</td> <td>70</td> <td>usa</td> <td>ford torino</td> </tr>
    <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr>
    <tr> <th>393</th> <td>27.0</td> <td>4</td> <td>140.0</td> <td>86.0</td> <td>2790</td> <td>15.6</td> <td>82</td> <td>usa</td> <td>ford mustang gl</td> </tr>
    <tr> <th>394</th> <td>44.0</td> <td>4</td> <td>97.0</td> <td>52.0</td> <td>2130</td> <td>24.6</td> <td>82</td> <td>europe</td> <td>vw pickup</td> </tr>
    <tr> <th>395</th> <td>32.0</td> <td>4</td> <td>135.0</td> <td>84.0</td> <td>2295</td> <td>11.6</td> <td>82</td> <td>usa</td> <td>dodge rampage</td> </tr>
    <tr> <th>396</th> <td>28.0</td> <td>4</td> <td>120.0</td> <td>79.0</td> <td>2625</td> <td>18.6</td> <td>82</td> <td>usa</td> <td>ford ranger</td> </tr>
    <tr> <th>397</th> <td>31.0</td> <td>4</td> <td>119.0</td> <td>82.0</td> <td>2720</td> <td>19.4</td> <td>82</td> <td>usa</td> <td>chevy s-10</td> </tr>
  </tbody>
</table>
<p>398 rows × 9 columns</p>
</div>

Which values were used to replace the missing ones?

```python
# Show only the rows that originally contained a NaN
pd.DataFrame(imputer.transform(df), columns=df.columns).iloc[na_index, :]
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>mpg</th> <th>cylinders</th> <th>displacement</th> <th>horsepower</th> <th>weight</th> <th>acceleration</th> <th>model_year</th> <th>origin</th> <th>name</th> </tr>
  </thead>
  <tbody>
    <tr> <th>32</th> <td>25.0</td> <td>4</td> <td>98.0</td> <td>150.0</td> <td>2046</td> <td>19.0</td> <td>71</td> <td>usa</td> <td>ford pinto</td> </tr>
    <tr> <th>126</th> <td>21.0</td> <td>6</td> <td>200.0</td> <td>150.0</td> <td>2875</td> <td>17.0</td> <td>74</td> <td>usa</td> <td>ford maverick</td> </tr>
    <tr> <th>330</th> <td>40.9</td> <td>4</td> <td>85.0</td> <td>150.0</td> <td>1835</td> <td>17.3</td> <td>80</td> <td>europe</td> <td>renault lecar deluxe</td> </tr>
    <tr> <th>336</th> <td>23.6</td> <td>4</td> <td>140.0</td> <td>150.0</td> <td>2905</td> <td>14.3</td> <td>80</td> <td>usa</td> <td>ford mustang cobra</td> </tr>
    <tr> <th>354</th> <td>34.5</td> <td>4</td> <td>100.0</td> <td>150.0</td> <td>2320</td> <td>15.8</td> <td>81</td> <td>europe</td> <td>renault 18i</td> </tr>
    <tr> <th>374</th> <td>23.0</td> <td>4</td> <td>151.0</td> <td>150.0</td> <td>3035</td> <td>20.5</td> <td>82</td> <td>usa</td> <td>amc concord dl</td> </tr>
  </tbody>
</table>
</div>

To replace the missing values with the mean, we first have to select the numeric columns, because the mean is only defined for numeric data.

```python
# The 'mean' strategy only accepts numeric data, so drop the string columns first
df_numeric = df.select_dtypes(include="number")
imputer = SimpleImputer(strategy='mean')
imputer.fit(df_numeric)
pd.DataFrame(imputer.transform(df_numeric), columns=df_numeric.columns).iloc[na_index, :]
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>mpg</th> <th>cylinders</th> <th>displacement</th> <th>horsepower</th> <th>weight</th> <th>acceleration</th> <th>model_year</th> </tr>
  </thead>
  <tbody>
    <tr> <th>32</th> <td>25.0</td> <td>4.0</td> <td>98.0</td> <td>104.469388</td> <td>2046.0</td> <td>19.0</td> <td>71.0</td> </tr>
    <tr> <th>126</th> <td>21.0</td> <td>6.0</td> <td>200.0</td> <td>104.469388</td> <td>2875.0</td> <td>17.0</td> <td>74.0</td> </tr>
    <tr> <th>330</th> <td>40.9</td> <td>4.0</td> <td>85.0</td> <td>104.469388</td> <td>1835.0</td> <td>17.3</td> <td>80.0</td> </tr>
    <tr> <th>336</th> <td>23.6</td> <td>4.0</td> <td>140.0</td> <td>104.469388</td> <td>2905.0</td> <td>14.3</td> <td>80.0</td> </tr>
    <tr> <th>354</th> <td>34.5</td> <td>4.0</td> <td>100.0</td> <td>104.469388</td> <td>2320.0</td> <td>15.8</td> <td>81.0</td> </tr>
    <tr> <th>374</th> <td>23.0</td> <td>4.0</td> <td>151.0</td> <td>104.469388</td> <td>3035.0</td> <td>20.5</td> <td>82.0</td> </tr>
  </tbody>
</table>
</div>

# KNNImputer

```python
# Each missing value is filled from the 5 most similar rows
imputer = KNNImputer(n_neighbors=5)
imputer.fit(df_numeric)
pd.DataFrame(imputer.transform(df_numeric), columns=df_numeric.columns).iloc[na_index, :]
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>mpg</th> <th>cylinders</th> <th>displacement</th> <th>horsepower</th> <th>weight</th> <th>acceleration</th> <th>model_year</th> </tr>
  </thead>
  <tbody>
    <tr> <th>32</th> <td>25.0</td> <td>4.0</td> <td>98.0</td> <td>62.0</td> <td>2046.0</td> <td>19.0</td> <td>71.0</td> </tr>
    <tr> <th>126</th> <td>21.0</td> <td>6.0</td> <td>200.0</td> <td>107.6</td> <td>2875.0</td> <td>17.0</td> <td>74.0</td> </tr>
    <tr> <th>330</th> <td>40.9</td> <td>4.0</td> <td>85.0</td> <td>64.6</td> <td>1835.0</td> <td>17.3</td> <td>80.0</td> </tr>
    <tr> <th>336</th> <td>23.6</td> <td>4.0</td> <td>140.0</td> <td>112.8</td> <td>2905.0</td> <td>14.3</td> <td>80.0</td> </tr>
    <tr> <th>354</th> <td>34.5</td> <td>4.0</td> <td>100.0</td> <td>76.0</td> <td>2320.0</td> <td>15.8</td> <td>81.0</td> </tr>
    <tr> <th>374</th> <td>23.0</td> <td>4.0</td> <td>151.0</td> <td>88.2</td> <td>3035.0</td> <td>20.5</td> <td>82.0</td> </tr>
  </tbody>
</table>
</div>
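
To see why the two imputers give different results, here is a minimal sketch on a small made-up array. The array `X` and its values are illustrative assumptions, not taken from the mpg dataset. The mean strategy fills the gap with a single column-wide value, while `KNNImputer` (with its default uniform weights) averages the values of the nearest rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data: the second column is 10x the first,
# and one value is missing in the third row.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
    [9.0, 90.0],
])

# Mean imputation: the NaN becomes the mean of the observed
# values in its column, (10 + 20 + 80 + 90) / 4 = 50.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: the NaN becomes the average of the 2 rows
# closest to [3, NaN] on the observed feature, i.e. the rows
# starting with 2 and 1, so (20 + 10) / 2 = 15.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print("mean strategy:", mean_filled[2, 1])  # 50.0
print("KNN strategy: ", knn_filled[2, 1])   # 15.0
```

The same effect is visible in the tables above: the mean strategy writes 104.469388 into every incomplete row, whereas `KNNImputer` produces a different `horsepower` for each car. Note that `KNNImputer` computes distances on the raw values, so in `df_numeric` a large-scale column such as `weight` will likely dominate the choice of neighbors; scaling the columns first may be worth considering.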