exo_encoding

# Libraries

```python
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, LabelBinarizer, OneHotEncoder
from sklearn import set_config

set_config(transform_output="pandas")  # transformers return pandas DataFrames
```

# Exercises

Convert every dataset to a numeric format.

### 1. Tips

```python
df = sns.load_dataset("tips")
# Target: tip
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>total_bill</th> <th>tip</th> <th>sex</th> <th>smoker</th> <th>day</th> <th>time</th> <th>size</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>16.99</td> <td>1.01</td> <td>Female</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>2</td> </tr>
    <tr> <th>1</th> <td>10.34</td> <td>1.66</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>3</td> </tr>
    <tr> <th>2</th> <td>21.01</td> <td>3.50</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>3</td> </tr>
    <tr> <th>3</th> <td>23.68</td> <td>3.31</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>2</td> </tr>
    <tr> <th>4</th> <td>24.59</td> <td>3.61</td> <td>Female</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>4</td> </tr>
  </tbody>
</table>
</div>

```python
df["sex"].unique()
```

    ['Female', 'Male']
    Categories (2, object): ['Male', 'Female']

```python
df["smoker"].unique()
```

    ['No', 'Yes']
    Categories (2, object): ['Yes', 'No']

```python
df["day"].unique()
```

    ['Sun', 'Sat', 'Thur', 'Fri']
    Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']

```python
df["time"].unique()
```

    ['Dinner', 'Lunch']
    Categories (2, object): ['Lunch', 'Dinner']

All the categorical variables are nominal (not ordinal), so we use a OneHotEncoder.

```python
categories = ["sex", "smoker", "day", "time"]
encoder = OneHotEncoder(sparse_output=False, drop="first")
encoder.fit_transform(df[categories])
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>sex_Male</th> <th>smoker_Yes</th> <th>day_Sat</th> <th>day_Sun</th> <th>day_Thur</th> <th>time_Lunch</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>0.0</td> <td>0.0</td> <td>0.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>1</th> <td>1.0</td> <td>0.0</td> <td>0.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>2</th> <td>1.0</td> <td>0.0</td> <td>0.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>3</th> <td>1.0</td> <td>0.0</td> <td>0.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>4</th> <td>0.0</td> <td>0.0</td> <td>0.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr>
    <tr> <th>239</th> <td>1.0</td> <td>0.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>240</th> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>241</th> <td>1.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>242</th> <td>1.0</td> <td>0.0</td> <td>1.0</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>243</th> <td>0.0</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> <td>1.0</td> <td>0.0</td> </tr>
  </tbody>
</table>
<p>244 rows × 6 columns</p>
</div>

### 2. Penguins

```python
df = sns.load_dataset("penguins")
# Target: species
df.dropna(inplace=True)
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>species</th> <th>island</th> <th>bill_length_mm</th> <th>bill_depth_mm</th> <th>flipper_length_mm</th> <th>body_mass_g</th> <th>sex</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>Adelie</td> <td>Torgersen</td> <td>39.1</td> <td>18.7</td> <td>181.0</td> <td>3750.0</td> <td>Male</td> </tr>
    <tr> <th>1</th> <td>Adelie</td> <td>Torgersen</td> <td>39.5</td> <td>17.4</td> <td>186.0</td> <td>3800.0</td> <td>Female</td> </tr>
    <tr> <th>2</th> <td>Adelie</td> <td>Torgersen</td> <td>40.3</td> <td>18.0</td> <td>195.0</td> <td>3250.0</td> <td>Female</td> </tr>
    <tr> <th>4</th> <td>Adelie</td> <td>Torgersen</td> <td>36.7</td> <td>19.3</td> <td>193.0</td> <td>3450.0</td> <td>Female</td> </tr>
    <tr> <th>5</th> <td>Adelie</td> <td>Torgersen</td> <td>39.3</td> <td>20.6</td> <td>190.0</td> <td>3650.0</td> <td>Male</td> </tr>
  </tbody>
</table>
</div>

```python
df["species"].unique()
```

    array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

```python
df["island"].unique()
```

    array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

```python
df["sex"].unique()
```

    array(['Male', 'Female'], dtype=object)

- All the categorical variables are nominal
- OneHotEncoder for the features
- LabelEncoder or LabelBinarizer for the target (depending on the type of model we will use)

```python
categories = ["island", "sex"]
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoder.fit_transform(df[categories])
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>island_Dream</th> <th>island_Torgersen</th> <th>sex_Male</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>0.0</td> <td>1.0</td> <td>1.0</td> </tr>
    <tr> <th>1</th> <td>0.0</td> <td>1.0</td> <td>0.0</td> </tr>
    <tr> <th>2</th> <td>0.0</td> <td>1.0</td> <td>0.0</td> </tr>
    <tr> <th>4</th> <td>0.0</td> <td>1.0</td> <td>0.0</td> </tr>
    <tr> <th>5</th> <td>0.0</td> <td>1.0</td> <td>1.0</td> </tr>
    <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> </tr>
    <tr> <th>338</th> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>340</th> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>341</th> <td>0.0</td> <td>0.0</td> <td>1.0</td> </tr>
    <tr> <th>342</th> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr>
    <tr> <th>343</th> <td>0.0</td> <td>0.0</td> <td>1.0</td> </tr>
  </tbody>
</table>
<p>333 rows × 3 columns</p>
</div>

```python
# For the target: result with LabelBinarizer
LabelBinarizer().fit_transform(df["species"])[:10]
```

    array([[1, 0, 0],
           [1, 0, 0],
           [1, 0, 0],
           [1, 0, 0],
           [1, 0, 0],
           [1, 0, 0],
           [1, 0, 0],
           [1, 0, 0],
           [1, 0, 0],
           [1, 0, 0]])

```python
# For the target: result with LabelEncoder
LabelEncoder().fit_transform(df["species"])
```

    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### 3. Flights

```python
df = sns.load_dataset("flights")
# Target: passengers
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>year</th> <th>month</th> <th>passengers</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>1949</td> <td>Jan</td> <td>112</td> </tr>
    <tr> <th>1</th> <td>1949</td> <td>Feb</td> <td>118</td> </tr>
    <tr> <th>2</th> <td>1949</td> <td>Mar</td> <td>132</td> </tr>
    <tr> <th>3</th> <td>1949</td> <td>Apr</td> <td>129</td> </tr>
    <tr> <th>4</th> <td>1949</td> <td>May</td> <td>121</td> </tr>
  </tbody>
</table>
</div>

```python
df["month"].unique()
```

    ['Jan', 'Feb', 'Mar', 'Apr', 'May', ..., 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    Length: 12
    Categories (12, object): ['Jan', 'Feb', 'Mar', 'Apr', ..., 'Sep', 'Oct', 'Nov', 'Dec']

- The months of the year can generally be handled as an ordinal variable
- A one-hot encoding also works, but it adds 11 columns to the dataset (12 without `drop="first"`), which increases the risk of overfitting and can degrade model performance

```python
encoder = OneHotEncoder(sparse_output=False, drop="first")
sns.heatmap(encoder.fit_transform(df[["month"]]))
```

    <Axes: >

![png](exo_encoding_25_1.png)

This is not wrong, but it produces a lot of columns. We could instead apply an ordinal encoder to the "month" variable and end up with a single column.
```python
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
encoder = OrdinalEncoder(categories=[month_order])
encoder.fit_transform(df[["month"]])
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>month</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>0.0</td> </tr>
    <tr> <th>1</th> <td>1.0</td> </tr>
    <tr> <th>2</th> <td>2.0</td> </tr>
    <tr> <th>3</th> <td>3.0</td> </tr>
    <tr> <th>4</th> <td>4.0</td> </tr>
    <tr> <th>...</th> <td>...</td> </tr>
    <tr> <th>139</th> <td>7.0</td> </tr>
    <tr> <th>140</th> <td>8.0</td> </tr>
    <tr> <th>141</th> <td>9.0</td> </tr>
    <tr> <th>142</th> <td>10.0</td> </tr>
    <tr> <th>143</th> <td>11.0</td> </tr>
  </tbody>
</table>
<p>144 rows × 1 columns</p>
</div>

### 4. Exercise

```python
df = sns.load_dataset("exercise")
# Target: pulse
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>Unnamed: 0</th> <th>id</th> <th>diet</th> <th>pulse</th> <th>time</th> <th>kind</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>0</td> <td>1</td> <td>low fat</td> <td>85</td> <td>1 min</td> <td>rest</td> </tr>
    <tr> <th>1</th> <td>1</td> <td>1</td> <td>low fat</td> <td>85</td> <td>15 min</td> <td>rest</td> </tr>
    <tr> <th>2</th> <td>2</td> <td>1</td> <td>low fat</td> <td>88</td> <td>30 min</td> <td>rest</td> </tr>
    <tr> <th>3</th> <td>3</td> <td>2</td> <td>low fat</td> <td>90</td> <td>1 min</td> <td>rest</td> </tr>
    <tr> <th>4</th> <td>4</td> <td>2</td> <td>low fat</td> <td>92</td> <td>15 min</td> <td>rest</td> </tr>
  </tbody>
</table>
</div>

```python
df["diet"].unique()
```

    ['low fat', 'no fat']
    Categories (2, object): ['no fat', 'low fat']

```python
df["time"].unique()
```

    ['1 min', '15 min', '30 min']
    Categories (3, object): ['1 min', '15 min', '30 min']

```python
df["kind"].unique()
```

    ['rest', 'walking', 'running']
    Categories (3, object): ['rest', 'walking', 'running']

- *diet* is a binary variable, so any encoding method works
- *kind* is an ordinal variable
- *time* is an ordinal variable, but **be careful**: do not encode it as is; extract the numeric value it already contains. The spacing between values can matter (here the gaps are 14 and 15 minutes, so they happen to be almost equal)

All three variables can therefore be treated as ordinal.

```python
diet_order = ["no fat", "low fat"]
kind_order = ["rest", "walking", "running"]
encoder = OrdinalEncoder(categories=[diet_order, kind_order])
encoder.fit_transform(df[["diet", "kind"]])
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>diet</th> <th>kind</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>1.0</td> <td>0.0</td> </tr>
    <tr> <th>1</th> <td>1.0</td> <td>0.0</td> </tr>
    <tr> <th>2</th> <td>1.0</td> <td>0.0</td> </tr>
    <tr> <th>3</th> <td>1.0</td> <td>0.0</td> </tr>
    <tr> <th>4</th> <td>1.0</td> <td>0.0</td> </tr>
    <tr> <th>...</th> <td>...</td> <td>...</td> </tr>
    <tr> <th>85</th> <td>0.0</td> <td>2.0</td> </tr>
    <tr> <th>86</th> <td>0.0</td> <td>2.0</td> </tr>
    <tr> <th>87</th> <td>0.0</td> <td>2.0</td> </tr>
    <tr> <th>88</th> <td>0.0</td> <td>2.0</td> </tr>
    <tr> <th>89</th> <td>0.0</td> <td>2.0</td> </tr>
  </tbody>
</table>
<p>90 rows × 2 columns</p>
</div>

```python
# Series.replace on a categorical column is deprecated and emits FutureWarnings;
# renaming the categories is the approach pandas recommends instead.
df["time"].cat.rename_categories({'1 min': 1, '15 min': 15, '30 min': 30})
```

    0      1
    1     15
    2     30
    3      1
    4     15
          ..
    85    15
    86    30
    87     1
    88    15
    89    30
    Name: time, Length: 90, dtype: category
    Categories (3, int64): [1, 15, 30]

The result is still a categorical column; appending `.astype(int)` would turn it into a plain integer column.

### 5. Taxis

```python
df = sns.load_dataset("taxis")
# Target: total
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>pickup</th> <th>dropoff</th> <th>passengers</th> <th>distance</th> <th>fare</th> <th>tip</th> <th>tolls</th> <th>total</th> <th>color</th> <th>payment</th> <th>pickup_zone</th> <th>dropoff_zone</th> <th>pickup_borough</th> <th>dropoff_borough</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>2019-03-23 20:21:09</td> <td>2019-03-23 20:27:24</td> <td>1</td> <td>1.60</td> <td>7.0</td> <td>2.15</td> <td>0.0</td> <td>12.95</td> <td>yellow</td> <td>credit card</td> <td>Lenox Hill West</td> <td>UN/Turtle Bay South</td> <td>Manhattan</td> <td>Manhattan</td> </tr>
    <tr> <th>1</th> <td>2019-03-04 16:11:55</td> <td>2019-03-04 16:19:00</td> <td>1</td> <td>0.79</td> <td>5.0</td> <td>0.00</td> <td>0.0</td> <td>9.30</td> <td>yellow</td> <td>cash</td> <td>Upper West Side South</td> <td>Upper West Side South</td> <td>Manhattan</td> <td>Manhattan</td> </tr>
    <tr> <th>2</th> <td>2019-03-27 17:53:01</td> <td>2019-03-27 18:00:25</td> <td>1</td> <td>1.37</td> <td>7.5</td> <td>2.36</td> <td>0.0</td> <td>14.16</td> <td>yellow</td> <td>credit card</td> <td>Alphabet City</td> <td>West Village</td> <td>Manhattan</td> <td>Manhattan</td> </tr>
    <tr> <th>3</th> <td>2019-03-10 01:23:59</td> <td>2019-03-10 01:49:51</td> <td>1</td> <td>7.70</td> <td>27.0</td> <td>6.15</td> <td>0.0</td> <td>36.95</td> <td>yellow</td> <td>credit card</td> <td>Hudson Sq</td> <td>Yorkville West</td> <td>Manhattan</td> <td>Manhattan</td> </tr>
    <tr> <th>4</th> <td>2019-03-30 13:27:42</td> <td>2019-03-30 13:37:14</td> <td>3</td> <td>2.16</td> <td>9.0</td> <td>1.10</td> <td>0.0</td> <td>13.40</td> <td>yellow</td> <td>credit card</td> <td>Midtown East</td> <td>Yorkville West</td> <td>Manhattan</td> <td>Manhattan</td> </tr>
  </tbody>
</table>
</div>

```python
df["color"].unique()
```

    array(['yellow', 'green'], dtype=object)

```python
df["payment"].unique()
```

    array(['credit card', 'cash', nan], dtype=object)

```python
df["pickup_zone"].nunique()
```

    194

194 distinct pickup_zone values!

```python
df["dropoff_zone"].nunique()
```

    203

203 distinct dropoff_zone values!

```python
df["pickup_borough"].unique()
```

    array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)

```python
df["dropoff_borough"].unique()
```

    array(['Manhattan', 'Queens', 'Brooklyn', nan, 'Bronx', 'Staten Island'], dtype=object)

All these variables appear to be nominal, so a one-hot encoding is the natural choice. However, the large number of categories in *pickup_zone* and *dropoff_zone* would produce a wide, sparse matrix, which increases the risk of overfitting and has other drawbacks. In practice, other techniques are preferable, such as:

- target encoding
- GPS encoding
- feature engineering that groups the zones into clusters (proximity, suburbs, ...)
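As an illustration of the first option in the list above, here is a minimal target-encoding sketch. It is an assumption-laden toy, not the notebook's method: the `rides` frame, the `target_encode` helper and the `smoothing` parameter are all hypothetical, and the data is made up rather than taken from the taxis dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxis columns (not the real dataset).
rides = pd.DataFrame({
    "pickup_zone": ["A", "A", "B", "B", "B", "C"],
    "total": [10.0, 14.0, 20.0, 22.0, 24.0, 30.0],
})


def target_encode(frame, column, target, smoothing=2.0):
    """Replace each category with a smoothed mean of the target (hypothetical helper)."""
    global_mean = frame[target].mean()
    stats = frame.groupby(column)[target].agg(["mean", "count"])
    # Shrink categories with few rows toward the global mean.
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return frame[column].map(smoothed)


rides["pickup_zone_te"] = target_encode(rides, "pickup_zone", "total")
print(rides)
```

This turns a high-cardinality column into a single numeric column. In a real project the category means should be learned on the training split only, otherwise the target leaks into the features.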
```python
sns.heatmap(pd.crosstab(df["dropoff_zone"], df["dropoff_borough"]))
```

    <Axes: xlabel='dropoff_borough', ylabel='dropoff_zone'>

![png](exo_encoding_48_1.png)

```python
unique_pairs = df[["pickup_zone", "pickup_borough"]].drop_duplicates()
```

```python
unique_pairs.groupby("pickup_zone")["pickup_borough"].size().max()
```

    1

Each pickup zone therefore maps to exactly one borough.

The borough columns already group the zones into clusters, so we one-hot encode only the boroughs.
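The conclusion above can be sketched as follows. This is only an illustration under stated assumptions: the `trips` frame holds a few hypothetical rows that mimic the taxis columns, and `pd.get_dummies` is used in place of the notebook's `OneHotEncoder` so that the example runs with pandas alone.

```python
import pandas as pd

# Hypothetical rows mimicking the taxis columns discussed above.
trips = pd.DataFrame({
    "pickup_zone": ["Lenox Hill West", "Astoria", "Alphabet City"],
    "dropoff_zone": ["Yorkville West", "Astoria", "West Village"],
    "pickup_borough": ["Manhattan", "Queens", "Manhattan"],
    "dropoff_borough": ["Manhattan", "Queens", None],
})

# Drop the high-cardinality zone columns; the boroughs already group them.
features = trips.drop(columns=["pickup_zone", "dropoff_zone"])

# One indicator column per borough; a missing borough gives an all-zero row.
encoded = pd.get_dummies(
    features,
    columns=["pickup_borough", "dropoff_borough"],
    dtype=float,
)
print(encoded)
```

Missing boroughs, which do occur in the real dataset, end up as rows of zeros here; whether to impute them or give them their own indicator column is a separate modelling choice.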