# Libraries
```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
import sklearn
print(sklearn.__version__)
```
1.5.1
# Load Diamonds
```python
df = sns.load_dataset('diamonds')
df.head()
```
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|--:|------:|:----|:------|:--------|------:|------:|------:|-----:|-----:|-----:|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
# Using OrdinalEncoder
- Reserved for encoding ordinal variables (those with a natural hierarchy). Here, after some research, all of the categorical columns (`cut`, `color`, `clarity`) turn out to be ordinal.
- The `categories` argument sets the order of the categories.
- `handle_unknown="use_encoded_value"` together with `unknown_value=-1` makes the code more robust to categories that were not seen during fitting.
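To illustrate why the `categories` argument matters, here is a minimal sketch on a toy column (not the diamonds data). Without `categories`, the encoder infers the categories from the data and sorts them alphabetically, which generally does not match the real hierarchy. The toy values and variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy column (illustrative values, not the diamonds data)
X = np.array([["Ideal"], ["Very Good"], ["Fair"]], dtype=object)

# Default: categories inferred from the data and sorted alphabetically
# (Fair < Ideal < Very Good), which ignores the real quality ranking
default_codes = OrdinalEncoder().fit_transform(X)

# Explicit order, from worst to best
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
ordered_codes = OrdinalEncoder(categories=[cut_order]).fit_transform(X)

print(default_codes.ravel())  # alphabetical codes
print(ordered_codes.ravel())  # codes that follow the hierarchy
```

With the explicit order, 'Ideal' receives the highest code, as expected for the best cut.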
```python
df["cut"].unique()
```
['Ideal', 'Premium', 'Good', 'Very Good', 'Fair']
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
```python
df["color"].unique()
```
['E', 'I', 'J', 'H', 'F', 'G', 'D']
Categories (7, object): ['D', 'E', 'F', 'G', 'H', 'I', 'J']
```python
df["clarity"].unique()
```
['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF']
Categories (8, object): ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1']
```python
# Explicit orderings, from worst to best
# ('Very Bad' and 'Bad' are not in the data; they are declared in advance)
cut_order = ['Very Bad', 'Bad', 'Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

encoder = OrdinalEncoder(
    categories=[cut_order, color_order, clarity_order],
    handle_unknown='use_encoded_value',  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
encoder.fit(df[['cut', 'color', 'clarity']])
encoder.transform(df[['cut', 'color', 'clarity']])
```
array([[6., 5., 1.],
[5., 5., 2.],
[3., 5., 4.],
...,
[4., 6., 2.],
[5., 2., 1.],
[6., 6., 1.]])
By default, when the encoder comes across a category it has never seen, it raises an error.
With the `handle_unknown` argument, a dedicated code can be reserved for unknown categories.
```python
encoder.transform([["Fair", "G", "A+"]])
```
/home/steph/anaconda3/lib/python3.12/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but OrdinalEncoder was fitted with feature names
warnings.warn(
array([[ 2., 3., -1.]])
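For contrast, here is a small sketch of the default behavior (`handle_unknown="error"`), where the same kind of unseen value makes `transform` raise a `ValueError`. The encoder `strict` and its shortened category list are assumptions made for the example.

```python
from sklearn.preprocessing import OrdinalEncoder

# handle_unknown is left at its default value, "error"
strict = OrdinalEncoder(categories=[["I1", "SI2", "SI1"]])
strict.fit([["SI1"], ["I1"]])

try:
    strict.transform([["A+"]])  # "A+" is not a declared category
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```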
Tip: categories that do not exist in the data yet (e.g. 'Very Bad' and 'Bad' for the `cut` variable) can be declared in advance, as long as they respect the hierarchical order.
```python
encoder.transform([["Bad", "G", "A+"]])
```
/home/steph/anaconda3/lib/python3.12/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but OrdinalEncoder was fitted with feature names
warnings.warn(
array([[ 1., 3., -1.]])
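The `UserWarning` in the outputs above appears because the encoder was fitted on a DataFrame with column names but receives a plain list at transform time. One way to avoid it, sketched here under the assumption of a tiny hypothetical training frame `train`, is to pass a DataFrame with the same column names.

```python
import warnings

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cols = ["cut", "color", "clarity"]
cut_order = ["Very Bad", "Bad", "Fair", "Good", "Very Good", "Premium", "Ideal"]
color_order = ["J", "I", "H", "G", "F", "E", "D"]
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Hypothetical two-row training frame (stand-in for the diamonds columns)
train = pd.DataFrame([["Ideal", "E", "SI2"], ["Good", "J", "VS1"]], columns=cols)

enc = OrdinalEncoder(
    categories=[cut_order, color_order, clarity_order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
).fit(train)

# New rows wrapped in a DataFrame with the same column names as at fit time
new_rows = pd.DataFrame([["Bad", "G", "A+"]], columns=cols)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning raised in this block
    codes = enc.transform(new_rows)

print(codes)
print([str(w.message) for w in caught])  # no feature-name warning expected
```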