ordinal_encoding

# Libraries

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder

import sklearn
print(sklearn.__version__)
```

    1.5.1

# Load Diamonds

```python
df = sns.load_dataset('diamonds')
df.head()
```

|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---:|---|---|---|---:|---:|---:|---:|---:|---:|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |

# Using OrdinalEncoder

- Reserved for encoding ordinal variables (categories with a natural ranking). Here, after checking, all the categorical columns (`cut`, `color`, `clarity`) turn out to be ordinal.
- The `categories` argument sets the order of the levels for each column.
- `handle_unknown="use_encoded_value"` combined with `unknown_value=-1` makes the code more robust to categories that were not seen during fitting.

```python
df["cut"].unique()
```

    ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair']
    Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']

```python
df["color"].unique()
```

    ['E', 'I', 'J', 'H', 'F', 'G', 'D']
    Categories (7, object): ['D', 'E', 'F', 'G', 'H', 'I', 'J']

```python
df["clarity"].unique()
```

    ['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF']
    Categories (8, object): ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1']

```python
# Each list runs from worst to best; a category's position in its list becomes its code.
cut_order = ['Very Bad', 'Bad', 'Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

encoder = OrdinalEncoder(
    categories=[cut_order, color_order, clarity_order],  # same order as the columns below
    handle_unknown='use_encoded_value',
    unknown_value=-1)

encoder.fit(df[['cut', 'color', 'clarity']])
encoder.transform(df[['cut', 'color', 'clarity']])
```

    array([[6., 5., 1.],
           [5., 5., 2.],
           [3., 5., 4.],
           ...,
           [4., 6., 2.],
           [5., 2., 1.],
           [6., 6., 1.]])

By default, the encoder raises an error when it encounters a category it has never seen. With the `handle_unknown` argument, a dedicated code can be reserved for unknown categories.
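For contrast, here is a minimal sketch of the default behaviour next to the tolerant configuration. It assumes a small hand-made `clarity` sample rather than the full diamonds dataset, and the names `train`, `new`, `strict_encoder` and `tolerant_encoder` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical mini-sample standing in for the diamonds 'clarity' column
train = pd.DataFrame({"clarity": ["I1", "SI2", "VS1", "IF"]})
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# 'A+' is not a clarity grade listed in clarity_order
new = pd.DataFrame({"clarity": ["A+"]})

# Default configuration: handle_unknown='error'
strict_encoder = OrdinalEncoder(categories=[clarity_order])
strict_encoder.fit(train)
try:
    strict_encoder.transform(new)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# Tolerant configuration: unknown categories are mapped to -1
tolerant_encoder = OrdinalEncoder(
    categories=[clarity_order],
    handle_unknown='use_encoded_value',
    unknown_value=-1)
tolerant_encoder.fit(train)
encoded = tolerant_encoder.transform(new)
print(encoded)
```
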
```python
encoder.transform([["Fair", "G", "A+"]])
```

    /home/steph/anaconda3/lib/python3.12/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but OrdinalEncoder was fitted with feature names
      warnings.warn(

    array([[ 2.,  3., -1.]])

Tip: categories that do not yet exist in the data can be declared in advance (e.g. 'Very Bad' and 'Bad' for the `cut` variable), as long as they respect the ranking.

```python
encoder.transform([["Bad", "G", "A+"]])
```

    /home/steph/anaconda3/lib/python3.12/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but OrdinalEncoder was fitted with feature names
      warnings.warn(

    array([[ 1.,  3., -1.]])
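The `UserWarning` in the outputs above appears because the encoder was fitted on a DataFrame with column names but then called on a plain list. One way to avoid it is to wrap new rows in a DataFrame with the same column names. The sketch below assumes a three-row hand-made sample in place of the diamonds dataset; `sample`, `new_rows` and `columns` are illustrative names.

```python
import warnings

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cut_order = ['Very Bad', 'Bad', 'Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
columns = ['cut', 'color', 'clarity']

# Hypothetical mini-sample with the same columns as the diamonds data
sample = pd.DataFrame(
    [["Ideal", "E", "SI2"], ["Premium", "E", "SI1"], ["Good", "E", "VS1"]],
    columns=columns)

encoder = OrdinalEncoder(
    categories=[cut_order, color_order, clarity_order],
    handle_unknown='use_encoded_value',
    unknown_value=-1)
encoder.fit(sample)

# New rows carry the same column names as the data used for fitting
new_rows = pd.DataFrame([["Bad", "G", "A+"]], columns=columns)

# Record any warnings emitted during transform so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    encoded = encoder.transform(new_rows)

print(encoded)
print([str(w.message) for w in caught if "feature names" in str(w.message)])
```

'Bad' is encoded even though it never appears in the fitted sample, because it was declared in `cut_order`, while 'A+' falls back to the reserved `-1`.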