# Libraries
```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, LabelBinarizer, OneHotEncoder
from sklearn import set_config
set_config(transform_output="pandas")  # transformers return pandas DataFrames
```
# Exercises
Convert every dataset to a numeric format.
### 1. Tips
```python
df = sns.load_dataset("tips")
# Target: tip
df.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>total_bill</th>
<th>tip</th>
<th>sex</th>
<th>smoker</th>
<th>day</th>
<th>time</th>
<th>size</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>16.99</td>
<td>1.01</td>
<td>Female</td>
<td>No</td>
<td>Sun</td>
<td>Dinner</td>
<td>2</td>
</tr>
<tr>
<th>1</th>
<td>10.34</td>
<td>1.66</td>
<td>Male</td>
<td>No</td>
<td>Sun</td>
<td>Dinner</td>
<td>3</td>
</tr>
<tr>
<th>2</th>
<td>21.01</td>
<td>3.50</td>
<td>Male</td>
<td>No</td>
<td>Sun</td>
<td>Dinner</td>
<td>3</td>
</tr>
<tr>
<th>3</th>
<td>23.68</td>
<td>3.31</td>
<td>Male</td>
<td>No</td>
<td>Sun</td>
<td>Dinner</td>
<td>2</td>
</tr>
<tr>
<th>4</th>
<td>24.59</td>
<td>3.61</td>
<td>Female</td>
<td>No</td>
<td>Sun</td>
<td>Dinner</td>
<td>4</td>
</tr>
</tbody>
</table>
</div>
```python
df["sex"].unique()
```
['Female', 'Male']
Categories (2, object): ['Male', 'Female']
```python
df["smoker"].unique()
```
['No', 'Yes']
Categories (2, object): ['Yes', 'No']
```python
df["day"].unique()
```
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']
```python
df["time"].unique()
```
['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']
All the categorical variables are nominal (not ordinal), so we use OneHotEncoder.
```python
categories = ["sex", "smoker", "day", "time"]
encoder = OneHotEncoder(sparse_output=False, drop="first")
encoder.fit_transform(df[categories])
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sex_Male</th>
<th>smoker_Yes</th>
<th>day_Sat</th>
<th>day_Sun</th>
<th>day_Thur</th>
<th>time_Lunch</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>1</th>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2</th>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>3</th>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>239</th>
<td>1.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>240</th>
<td>0.0</td>
<td>1.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>241</th>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>242</th>
<td>1.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>243</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
<p>244 rows × 6 columns</p>
</div>
### 2. Penguins
```python
df = sns.load_dataset("penguins")
# Target: species
df.dropna(inplace=True)
df.head()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>species</th>
<th>island</th>
<th>bill_length_mm</th>
<th>bill_depth_mm</th>
<th>flipper_length_mm</th>
<th>body_mass_g</th>
<th>sex</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Adelie</td>
<td>Torgersen</td>
<td>39.1</td>
<td>18.7</td>
<td>181.0</td>
<td>3750.0</td>
<td>Male</td>
</tr>
<tr>
<th>1</th>
<td>Adelie</td>
<td>Torgersen</td>
<td>39.5</td>
<td>17.4</td>
<td>186.0</td>
<td>3800.0</td>
<td>Female</td>
</tr>
<tr>
<th>2</th>
<td>Adelie</td>
<td>Torgersen</td>
<td>40.3</td>
<td>18.0</td>
<td>195.0</td>
<td>3250.0</td>
<td>Female</td>
</tr>
<tr>
<th>4</th>
<td>Adelie</td>
<td>Torgersen</td>
<td>36.7</td>
<td>19.3</td>
<td>193.0</td>
<td>3450.0</td>
<td>Female</td>
</tr>
<tr>
<th>5</th>
<td>Adelie</td>
<td>Torgersen</td>
<td>39.3</td>
<td>20.6</td>
<td>190.0</td>
<td>3650.0</td>
<td>Male</td>
</tr>
</tbody>
</table>
</div>
```python
df["species"].unique()
```
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
```python
df["island"].unique()
```
array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)
```python
df["sex"].unique()
```
array(['Male', 'Female'], dtype=object)
- All the categorical variables are nominal
- OneHotEncoder for the features
- LabelEncoder or LabelBinarizer for the target (depending on the type of model we will use)
```python
categories = ["island", "sex"]
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoder.fit_transform(df[categories])
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>island_Dream</th>
<th>island_Torgersen</th>
<th>sex_Male</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.0</td>
<td>1.0</td>
<td>1.0</td>
</tr>
<tr>
<th>1</th>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2</th>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>5</th>
<td>0.0</td>
<td>1.0</td>
<td>1.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>338</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>340</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>341</th>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<th>342</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>343</th>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
<p>333 rows × 3 columns</p>
</div>
```python
# For the target: result with LabelBinarizer
LabelBinarizer().fit_transform(df["species"])[:10]
```
array([[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0]])
```python
# For the target: result with LabelEncoder
LabelEncoder().fit_transform(df["species"])
```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2])
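The integer codes returned by `LabelEncoder` are only useful if we can map them back to class names. A minimal sketch on a hand-made array (the `species` values below are an illustrative sample, not the full dataset) shows how `classes_` and `inverse_transform` recover that mapping.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative sample of target labels
species = np.array(["Adelie", "Gentoo", "Chinstrap", "Adelie"])

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(species)

# classes_ is sorted alphabetically; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}

# inverse_transform turns codes (e.g. model predictions) back into labels
decoded = label_encoder.inverse_transform(codes)

print(mapping)
print(decoded)
```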
### 3. Flights
```python
df = sns.load_dataset("flights")
# Target: passengers
df.head()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>month</th>
<th>passengers</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1949</td>
<td>Jan</td>
<td>112</td>
</tr>
<tr>
<th>1</th>
<td>1949</td>
<td>Feb</td>
<td>118</td>
</tr>
<tr>
<th>2</th>
<td>1949</td>
<td>Mar</td>
<td>132</td>
</tr>
<tr>
<th>3</th>
<td>1949</td>
<td>Apr</td>
<td>129</td>
</tr>
<tr>
<th>4</th>
<td>1949</td>
<td>May</td>
<td>121</td>
</tr>
</tbody>
</table>
</div>
```python
df["month"].unique()
```
['Jan', 'Feb', 'Mar', 'Apr', 'May', ..., 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
Length: 12
Categories (12, object): ['Jan', 'Feb', 'Mar', 'Apr', ..., 'Sep', 'Oct', 'Nov', 'Dec']
- Months of the year can generally be treated as ordinal
- One-hot encoding can also work, but it adds up to 12 columns to the dataset (11 with `drop="first"`), which increases the risk of overfitting and can degrade model performance
```python
encoder = OneHotEncoder(sparse_output=False, drop="first")
sns.heatmap(encoder.fit_transform(df[["month"]]))
```
<Axes: >

This is not wrong, but it produces many columns. We could instead apply an ordinal encoder to the "month" variable to end up with a single column.
```python
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
encoder = OrdinalEncoder(categories=[month_order])
encoder.fit_transform(df[["month"]])
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>month</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.0</td>
</tr>
<tr>
<th>1</th>
<td>1.0</td>
</tr>
<tr>
<th>2</th>
<td>2.0</td>
</tr>
<tr>
<th>3</th>
<td>3.0</td>
</tr>
<tr>
<th>4</th>
<td>4.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
</tr>
<tr>
<th>139</th>
<td>7.0</td>
</tr>
<tr>
<th>140</th>
<td>8.0</td>
</tr>
<tr>
<th>141</th>
<td>9.0</td>
</tr>
<tr>
<th>142</th>
<td>10.0</td>
</tr>
<tr>
<th>143</th>
<td>11.0</td>
</tr>
</tbody>
</table>
<p>144 rows × 1 columns</p>
</div>
### 4. Exercise
```python
df = sns.load_dataset("exercise")
# Target: pulse
df.head()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Unnamed: 0</th>
<th>id</th>
<th>diet</th>
<th>pulse</th>
<th>time</th>
<th>kind</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>1</td>
<td>low fat</td>
<td>85</td>
<td>1 min</td>
<td>rest</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>1</td>
<td>low fat</td>
<td>85</td>
<td>15 min</td>
<td>rest</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>1</td>
<td>low fat</td>
<td>88</td>
<td>30 min</td>
<td>rest</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>2</td>
<td>low fat</td>
<td>90</td>
<td>1 min</td>
<td>rest</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>2</td>
<td>low fat</td>
<td>92</td>
<td>15 min</td>
<td>rest</td>
</tr>
</tbody>
</table>
</div>
```python
df["diet"].unique()
```
['low fat', 'no fat']
Categories (2, object): ['no fat', 'low fat']
```python
df["time"].unique()
```
['1 min', '15 min', '30 min']
Categories (3, object): ['1 min', '15 min', '30 min']
```python
df["kind"].unique()
```
['rest', 'walking', 'running']
Categories (3, object): ['rest', 'walking', 'running']
- *diet* is a binary variable, so any encoding method works
- *kind* is an ordinal variable
- *time* is an ordinal variable, but **beware**: rather than encoding it as is, extract the numeric value it already contains. The gaps between values can matter (here they happen to be almost equal: 14 and 15 minutes)

All three variables can be treated as ordinal.
```python
diet_order = ["no fat", "low fat"]
kind_order = ["rest", "walking", "running"]
encoder = OrdinalEncoder(categories=[diet_order, kind_order])
encoder.fit_transform(df[["diet", "kind"]])
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>diet</th>
<th>kind</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>1</th>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2</th>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>3</th>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>85</th>
<td>0.0</td>
<td>2.0</td>
</tr>
<tr>
<th>86</th>
<td>0.0</td>
<td>2.0</td>
</tr>
<tr>
<th>87</th>
<td>0.0</td>
<td>2.0</td>
</tr>
<tr>
<th>88</th>
<td>0.0</td>
<td>2.0</td>
</tr>
<tr>
<th>89</th>
<td>0.0</td>
<td>2.0</td>
</tr>
</tbody>
</table>
<p>90 rows × 2 columns</p>
</div>
```python
df["time"].cat.rename_categories({'1 min': 1, '15 min': 15, '30 min': 30})
```
0 1
1 15
2 30
3 1
4 15
..
85 15
86 30
87 1
88 15
89 30
Name: time, Length: 90, dtype: category
Categories (3, int64): [1, 15, 30]
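The result above is still a categorical column. If a plain integer column is preferred, one possible approach is to extract the digits with a regular expression. The sketch below runs on a hand-made series (the `time` values are an assumption mirroring the format of the real column).

```python
import pandas as pd

# Illustrative categorical series in the same format as the "time" column
time = pd.Series(["1 min", "15 min", "30 min", "1 min"], dtype="category", name="time")

# Cast to string, capture the leading digits, then convert to integers
minutes = time.astype(str).str.extract(r"(\d+)", expand=False).astype(int)

print(minutes)
```

Unlike a hard-coded mapping, this keeps working if new durations such as "45 min" appear in the data.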
### 5. Taxis
```python
df = sns.load_dataset("taxis")
# Target: total
df.head()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>pickup</th>
<th>dropoff</th>
<th>passengers</th>
<th>distance</th>
<th>fare</th>
<th>tip</th>
<th>tolls</th>
<th>total</th>
<th>color</th>
<th>payment</th>
<th>pickup_zone</th>
<th>dropoff_zone</th>
<th>pickup_borough</th>
<th>dropoff_borough</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2019-03-23 20:21:09</td>
<td>2019-03-23 20:27:24</td>
<td>1</td>
<td>1.60</td>
<td>7.0</td>
<td>2.15</td>
<td>0.0</td>
<td>12.95</td>
<td>yellow</td>
<td>credit card</td>
<td>Lenox Hill West</td>
<td>UN/Turtle Bay South</td>
<td>Manhattan</td>
<td>Manhattan</td>
</tr>
<tr>
<th>1</th>
<td>2019-03-04 16:11:55</td>
<td>2019-03-04 16:19:00</td>
<td>1</td>
<td>0.79</td>
<td>5.0</td>
<td>0.00</td>
<td>0.0</td>
<td>9.30</td>
<td>yellow</td>
<td>cash</td>
<td>Upper West Side South</td>
<td>Upper West Side South</td>
<td>Manhattan</td>
<td>Manhattan</td>
</tr>
<tr>
<th>2</th>
<td>2019-03-27 17:53:01</td>
<td>2019-03-27 18:00:25</td>
<td>1</td>
<td>1.37</td>
<td>7.5</td>
<td>2.36</td>
<td>0.0</td>
<td>14.16</td>
<td>yellow</td>
<td>credit card</td>
<td>Alphabet City</td>
<td>West Village</td>
<td>Manhattan</td>
<td>Manhattan</td>
</tr>
<tr>
<th>3</th>
<td>2019-03-10 01:23:59</td>
<td>2019-03-10 01:49:51</td>
<td>1</td>
<td>7.70</td>
<td>27.0</td>
<td>6.15</td>
<td>0.0</td>
<td>36.95</td>
<td>yellow</td>
<td>credit card</td>
<td>Hudson Sq</td>
<td>Yorkville West</td>
<td>Manhattan</td>
<td>Manhattan</td>
</tr>
<tr>
<th>4</th>
<td>2019-03-30 13:27:42</td>
<td>2019-03-30 13:37:14</td>
<td>3</td>
<td>2.16</td>
<td>9.0</td>
<td>1.10</td>
<td>0.0</td>
<td>13.40</td>
<td>yellow</td>
<td>credit card</td>
<td>Midtown East</td>
<td>Yorkville West</td>
<td>Manhattan</td>
<td>Manhattan</td>
</tr>
</tbody>
</table>
</div>
```python
df["color"].unique()
```
array(['yellow', 'green'], dtype=object)
```python
df["payment"].unique()
```
array(['credit card', 'cash', nan], dtype=object)
```python
df["pickup_zone"].nunique()
```
194
194 distinct pickup_zone values!
```python
df["dropoff_zone"].nunique()
```
203
203 distinct dropoff_zone values!
```python
df["pickup_borough"].unique()
```
array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)
```python
df["dropoff_borough"].unique()
```
array(['Manhattan', 'Queens', 'Brooklyn', nan, 'Bronx', 'Staten Island'],
dtype=object)
All the variables appear to be nominal, so one-hot encoding is the natural choice. However, the large number of categories in *pickup_zone* and *dropoff_zone* is likely to produce a wide, sparse matrix, which can lead to overfitting and other drawbacks. In practice, other techniques are preferable, such as:
- target encoding
- GPS-coordinate encoding
- feature engineering that groups zones into clusters (by proximity, suburbs, ...)
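To make the target-encoding idea concrete, here is a deliberately naive sketch: each zone is replaced by the mean of the target observed for that zone, and unseen zones fall back to the global mean. The `rides` and `new_rides` frames and their values are made up for illustration. A real pipeline would fit the means on the training split only and would usually add smoothing or cross-fitting to limit target leakage.

```python
import pandas as pd

# Toy training data: a high-cardinality column and a numeric target
rides = pd.DataFrame({
    "pickup_zone": ["A", "A", "B", "B", "B", "C"],
    "total": [10.0, 14.0, 20.0, 22.0, 24.0, 30.0],
})

# Mean target per zone, learned from the training data
zone_means = rides.groupby("pickup_zone")["total"].mean()
global_mean = rides["total"].mean()

# Replace each zone by its mean target: a single numeric column
rides["pickup_zone_te"] = rides["pickup_zone"].map(zone_means)

# Zones never seen during training fall back to the global mean
new_rides = pd.DataFrame({"pickup_zone": ["A", "Z"]})
new_encoded = new_rides["pickup_zone"].map(zone_means).fillna(global_mean)

print(rides)
print(new_encoded.tolist())
```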
```python
sns.heatmap(pd.crosstab(df["dropoff_zone"], df["dropoff_borough"]))
```
<Axes: xlabel='dropoff_borough', ylabel='dropoff_zone'>

```python
unique_pairs = df[["pickup_zone", "pickup_borough"]].drop_duplicates()
```
```python
unique_pairs.groupby("pickup_zone")["pickup_borough"].size().max()
```
1
Each zone therefore maps to exactly one borough.
The dataset already contains variables that group (cluster) other variables, so we one-hot encode only the boroughs.