# Libraries
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
```
# Load data
```python
df = sns.load_dataset('titanic')
df
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>survived</th>
<th>pclass</th>
<th>sex</th>
<th>age</th>
<th>sibsp</th>
<th>parch</th>
<th>fare</th>
<th>embarked</th>
<th>class</th>
<th>who</th>
<th>adult_male</th>
<th>deck</th>
<th>embark_town</th>
<th>alive</th>
<th>alone</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>3</td>
<td>male</td>
<td>22.0</td>
<td>1</td>
<td>0</td>
<td>7.2500</td>
<td>S</td>
<td>Third</td>
<td>man</td>
<td>True</td>
<td>NaN</td>
<td>Southampton</td>
<td>no</td>
<td>False</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>38.0</td>
<td>1</td>
<td>0</td>
<td>71.2833</td>
<td>C</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>C</td>
<td>Cherbourg</td>
<td>yes</td>
<td>False</td>
</tr>
<tr>
<th>2</th>
<td>1</td>
<td>3</td>
<td>female</td>
<td>26.0</td>
<td>0</td>
<td>0</td>
<td>7.9250</td>
<td>S</td>
<td>Third</td>
<td>woman</td>
<td>False</td>
<td>NaN</td>
<td>Southampton</td>
<td>yes</td>
<td>True</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>35.0</td>
<td>1</td>
<td>0</td>
<td>53.1000</td>
<td>S</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>C</td>
<td>Southampton</td>
<td>yes</td>
<td>False</td>
</tr>
<tr>
<th>4</th>
<td>0</td>
<td>3</td>
<td>male</td>
<td>35.0</td>
<td>0</td>
<td>0</td>
<td>8.0500</td>
<td>S</td>
<td>Third</td>
<td>man</td>
<td>True</td>
<td>NaN</td>
<td>Southampton</td>
<td>no</td>
<td>True</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>886</th>
<td>0</td>
<td>2</td>
<td>male</td>
<td>27.0</td>
<td>0</td>
<td>0</td>
<td>13.0000</td>
<td>S</td>
<td>Second</td>
<td>man</td>
<td>True</td>
<td>NaN</td>
<td>Southampton</td>
<td>no</td>
<td>True</td>
</tr>
<tr>
<th>887</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>19.0</td>
<td>0</td>
<td>0</td>
<td>30.0000</td>
<td>S</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>B</td>
<td>Southampton</td>
<td>yes</td>
<td>True</td>
</tr>
<tr>
<th>888</th>
<td>0</td>
<td>3</td>
<td>female</td>
<td>NaN</td>
<td>1</td>
<td>2</td>
<td>23.4500</td>
<td>S</td>
<td>Third</td>
<td>woman</td>
<td>False</td>
<td>NaN</td>
<td>Southampton</td>
<td>no</td>
<td>False</td>
</tr>
<tr>
<th>889</th>
<td>1</td>
<td>1</td>
<td>male</td>
<td>26.0</td>
<td>0</td>
<td>0</td>
<td>30.0000</td>
<td>C</td>
<td>First</td>
<td>man</td>
<td>True</td>
<td>C</td>
<td>Cherbourg</td>
<td>yes</td>
<td>True</td>
</tr>
<tr>
<th>890</th>
<td>0</td>
<td>3</td>
<td>male</td>
<td>32.0</td>
<td>0</td>
<td>0</td>
<td>7.7500</td>
<td>Q</td>
<td>Third</td>
<td>man</td>
<td>True</td>
<td>NaN</td>
<td>Queenstown</td>
<td>no</td>
<td>True</td>
</tr>
</tbody>
</table>
<p>891 rows × 15 columns</p>
</div>
# The mistake to avoid
```python
df.dropna()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>survived</th>
<th>pclass</th>
<th>sex</th>
<th>age</th>
<th>sibsp</th>
<th>parch</th>
<th>fare</th>
<th>embarked</th>
<th>class</th>
<th>who</th>
<th>adult_male</th>
<th>deck</th>
<th>embark_town</th>
<th>alive</th>
<th>alone</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>38.0</td>
<td>1</td>
<td>0</td>
<td>71.2833</td>
<td>C</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>C</td>
<td>Cherbourg</td>
<td>yes</td>
<td>False</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>35.0</td>
<td>1</td>
<td>0</td>
<td>53.1000</td>
<td>S</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>C</td>
<td>Southampton</td>
<td>yes</td>
<td>False</td>
</tr>
<tr>
<th>6</th>
<td>0</td>
<td>1</td>
<td>male</td>
<td>54.0</td>
<td>0</td>
<td>0</td>
<td>51.8625</td>
<td>S</td>
<td>First</td>
<td>man</td>
<td>True</td>
<td>E</td>
<td>Southampton</td>
<td>no</td>
<td>True</td>
</tr>
<tr>
<th>10</th>
<td>1</td>
<td>3</td>
<td>female</td>
<td>4.0</td>
<td>1</td>
<td>1</td>
<td>16.7000</td>
<td>S</td>
<td>Third</td>
<td>child</td>
<td>False</td>
<td>G</td>
<td>Southampton</td>
<td>yes</td>
<td>False</td>
</tr>
<tr>
<th>11</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>58.0</td>
<td>0</td>
<td>0</td>
<td>26.5500</td>
<td>S</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>C</td>
<td>Southampton</td>
<td>yes</td>
<td>True</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>871</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>47.0</td>
<td>1</td>
<td>1</td>
<td>52.5542</td>
<td>S</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>D</td>
<td>Southampton</td>
<td>yes</td>
<td>False</td>
</tr>
<tr>
<th>872</th>
<td>0</td>
<td>1</td>
<td>male</td>
<td>33.0</td>
<td>0</td>
<td>0</td>
<td>5.0000</td>
<td>S</td>
<td>First</td>
<td>man</td>
<td>True</td>
<td>B</td>
<td>Southampton</td>
<td>no</td>
<td>True</td>
</tr>
<tr>
<th>879</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>56.0</td>
<td>0</td>
<td>1</td>
<td>83.1583</td>
<td>C</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>C</td>
<td>Cherbourg</td>
<td>yes</td>
<td>False</td>
</tr>
<tr>
<th>887</th>
<td>1</td>
<td>1</td>
<td>female</td>
<td>19.0</td>
<td>0</td>
<td>0</td>
<td>30.0000</td>
<td>S</td>
<td>First</td>
<td>woman</td>
<td>False</td>
<td>B</td>
<td>Southampton</td>
<td>yes</td>
<td>True</td>
</tr>
<tr>
<th>889</th>
<td>1</td>
<td>1</td>
<td>male</td>
<td>26.0</td>
<td>0</td>
<td>0</td>
<td>30.0000</td>
<td>C</td>
<td>First</td>
<td>man</td>
<td>True</td>
<td>C</td>
<td>Cherbourg</td>
<td>yes</td>
<td>True</td>
</tr>
</tbody>
</table>
<p>182 rows × 15 columns</p>
</div>
--> This removes a large share of the dataset: we go from 891 rows down to 182, because `dropna()` discards every row that contains at least one NaN.
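To make the loss concrete, the share of rows that a blanket `dropna()` discards can be measured before committing to it. The sketch below uses a small hypothetical DataFrame (`toy`, invented here for illustration) rather than the Titanic data, so it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic dataset
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, "C", np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

rows_before = len(toy)
rows_after = len(toy.dropna())  # default how="any": one NaN is enough to drop a row
lost_fraction = 1 - rows_after / rows_before

print(f"{rows_before} -> {rows_after} rows ({lost_fraction:.0%} lost)")
```

Only one row of the toy frame is complete in every column, so four rows out of five are lost even though `fare` has no missing value at all.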
# 1. Diagnose the dataset
1. Are there missing values?
2. How many?
3. How are they distributed?
4. Filter the dataset
5. Inspect the affected rows to try to understand why these values are NaN
```python
df.isna().sum(axis=0)
```
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64
```python
sns.heatmap(df.isna())
```

```python
na_index = df[df.isna().any(axis=1)].index
```
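The five diagnostic steps listed in this section can be sketched end to end as follows. This is only an illustration under assumptions: the `toy` frame and its values are hypothetical, and grouping by `pclass` is just one possible way to look for a pattern behind the NaNs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mimic the Titanic dataset
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3, 2],
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 27.0],
    "deck": [np.nan, "C", np.nan, "C", np.nan, np.nan],
})

# Steps 1-3: is anything missing, how much, and in which columns?
na_share = toy.isna().mean().sort_values(ascending=False)

# Step 4: keep only the rows that contain at least one NaN
na_rows = toy[toy.isna().any(axis=1)]

# Step 5: look for a pattern, here the share of missing "deck" per class
deck_na_by_class = toy["deck"].isna().groupby(toy["pclass"]).mean()

print(na_share)
print(na_rows)
print(deck_na_by_class)
```

Using `mean()` instead of `sum()` on the boolean mask gives a proportion per column, which is easier to compare against a threshold than a raw count.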
# 2. Remove the NaNs
Goal: keep as much data as possible.
You have to choose between dropping rows and dropping columns.
- Avoid dropping too large a share of the rows
- If more than 50% of a column is NaN and, in addition, it shows no correlation with the target, drop that column before dropping any rows
```python
df.drop(labels="deck", axis=1, inplace=True)
```
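The 50% rule above can also be written as a small reusable helper. The function `drop_sparse_columns`, its parameters and the 0.1 correlation cut-off are hypothetical choices made for this sketch, not something taken from the notebook. It also assumes that a non-numeric column that is mostly NaN counts as uncorrelated, since a Pearson correlation cannot be computed for it directly.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(data, target, max_na_share=0.5, min_abs_corr=0.1):
    """Drop columns that are mostly NaN and barely correlated with the target."""
    to_drop = []
    for col in data.columns.drop(target):
        if data[col].isna().mean() <= max_na_share:
            continue  # enough observed values: keep the column
        if pd.api.types.is_numeric_dtype(data[col]):
            # Pearson correlation on the rows where both values are present
            corr = data[col].corr(data[target])
            if pd.notna(corr) and abs(corr) >= min_abs_corr:
                continue  # sparse but informative: keep the column
        # Assumption: sparse non-numeric columns are treated as uncorrelated
        to_drop.append(col)
    return data.drop(columns=to_drop)

# Hypothetical toy data
toy = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0],
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 27.0],
    "deck": [np.nan, "C", np.nan, "C", np.nan, np.nan],
    "signal": [0.0, 1.0, np.nan, np.nan, np.nan, np.nan],
})

cleaned = drop_sparse_columns(toy, target="survived")
print(list(cleaned.columns))
```

The helper returns a new DataFrame instead of modifying its input, so the original data stays available for comparison.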
```python
sns.heatmap(df.isna())
```
