# Libraries

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
```

# Load data

```python
# Load seaborn's Titanic example dataset
df = sns.load_dataset('titanic')
df
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>survived</th> <th>pclass</th> <th>sex</th> <th>age</th> <th>sibsp</th> <th>parch</th> <th>fare</th> <th>embarked</th> <th>class</th> <th>who</th> <th>adult_male</th> <th>deck</th> <th>embark_town</th> <th>alive</th> <th>alone</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>0</td> <td>3</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>7.2500</td> <td>S</td> <td>Third</td> <td>man</td> <td>True</td> <td>NaN</td> <td>Southampton</td> <td>no</td> <td>False</td> </tr>
<tr> <th>1</th> <td>1</td> <td>1</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>71.2833</td> <td>C</td> <td>First</td> <td>woman</td> <td>False</td> <td>C</td> <td>Cherbourg</td> <td>yes</td> <td>False</td> </tr>
<tr> <th>2</th> <td>1</td> <td>3</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>7.9250</td> <td>S</td> <td>Third</td> <td>woman</td> <td>False</td> <td>NaN</td> <td>Southampton</td> <td>yes</td> <td>True</td> </tr>
<tr> <th>3</th> <td>1</td> <td>1</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>53.1000</td> <td>S</td> <td>First</td> <td>woman</td> <td>False</td> <td>C</td> <td>Southampton</td> <td>yes</td> <td>False</td> </tr>
<tr> <th>4</th> <td>0</td> <td>3</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>8.0500</td> <td>S</td> <td>Third</td> <td>man</td> <td>True</td> <td>NaN</td> <td>Southampton</td> <td>no</td> <td>True</td> </tr>
<tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td>
<td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr>
<tr> <th>886</th> <td>0</td> <td>2</td> <td>male</td> <td>27.0</td> <td>0</td> <td>0</td> <td>13.0000</td> <td>S</td> <td>Second</td> <td>man</td> <td>True</td> <td>NaN</td> <td>Southampton</td> <td>no</td> <td>True</td> </tr>
<tr> <th>887</th> <td>1</td> <td>1</td> <td>female</td> <td>19.0</td> <td>0</td> <td>0</td> <td>30.0000</td> <td>S</td> <td>First</td> <td>woman</td> <td>False</td> <td>B</td> <td>Southampton</td> <td>yes</td> <td>True</td> </tr>
<tr> <th>888</th> <td>0</td> <td>3</td> <td>female</td> <td>NaN</td> <td>1</td> <td>2</td> <td>23.4500</td> <td>S</td> <td>Third</td> <td>woman</td> <td>False</td> <td>NaN</td> <td>Southampton</td> <td>no</td> <td>False</td> </tr>
<tr> <th>889</th> <td>1</td> <td>1</td> <td>male</td> <td>26.0</td> <td>0</td> <td>0</td> <td>30.0000</td> <td>C</td> <td>First</td> <td>man</td> <td>True</td> <td>C</td> <td>Cherbourg</td> <td>yes</td> <td>True</td> </tr>
<tr> <th>890</th> <td>0</td> <td>3</td> <td>male</td> <td>32.0</td> <td>0</td> <td>0</td> <td>7.7500</td> <td>Q</td> <td>Third</td> <td>man</td> <td>True</td> <td>NaN</td> <td>Queenstown</td> <td>no</td> <td>True</td> </tr>
</tbody>
</table>
<p>891 rows × 15 columns</p>
</div>

# The mistake to avoid

```python
df.dropna()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>survived</th> <th>pclass</th> <th>sex</th> <th>age</th> <th>sibsp</th> <th>parch</th> <th>fare</th> <th>embarked</th> <th>class</th> <th>who</th> <th>adult_male</th> <th>deck</th> <th>embark_town</th> <th>alive</th> <th>alone</th> </tr>
</thead>
<tbody>
<tr> <th>1</th> <td>1</td> <td>1</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>71.2833</td> <td>C</td> <td>First</td> <td>woman</td>
<td>False</td> <td>C</td> <td>Cherbourg</td> <td>yes</td> <td>False</td> </tr>
<tr> <th>3</th> <td>1</td> <td>1</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>53.1000</td> <td>S</td> <td>First</td> <td>woman</td> <td>False</td> <td>C</td> <td>Southampton</td> <td>yes</td> <td>False</td> </tr>
<tr> <th>6</th> <td>0</td> <td>1</td> <td>male</td> <td>54.0</td> <td>0</td> <td>0</td> <td>51.8625</td> <td>S</td> <td>First</td> <td>man</td> <td>True</td> <td>E</td> <td>Southampton</td> <td>no</td> <td>True</td> </tr>
<tr> <th>10</th> <td>1</td> <td>3</td> <td>female</td> <td>4.0</td> <td>1</td> <td>1</td> <td>16.7000</td> <td>S</td> <td>Third</td> <td>child</td> <td>False</td> <td>G</td> <td>Southampton</td> <td>yes</td> <td>False</td> </tr>
<tr> <th>11</th> <td>1</td> <td>1</td> <td>female</td> <td>58.0</td> <td>0</td> <td>0</td> <td>26.5500</td> <td>S</td> <td>First</td> <td>woman</td> <td>False</td> <td>C</td> <td>Southampton</td> <td>yes</td> <td>True</td> </tr>
<tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr>
<tr> <th>871</th> <td>1</td> <td>1</td> <td>female</td> <td>47.0</td> <td>1</td> <td>1</td> <td>52.5542</td> <td>S</td> <td>First</td> <td>woman</td> <td>False</td> <td>D</td> <td>Southampton</td> <td>yes</td> <td>False</td> </tr>
<tr> <th>872</th> <td>0</td> <td>1</td> <td>male</td> <td>33.0</td> <td>0</td> <td>0</td> <td>5.0000</td> <td>S</td> <td>First</td> <td>man</td> <td>True</td> <td>B</td> <td>Southampton</td> <td>no</td> <td>True</td> </tr>
<tr> <th>879</th> <td>1</td> <td>1</td> <td>female</td> <td>56.0</td> <td>0</td> <td>1</td> <td>83.1583</td> <td>C</td> <td>First</td> <td>woman</td> <td>False</td> <td>C</td> <td>Cherbourg</td> <td>yes</td> <td>False</td> </tr>
<tr> <th>887</th> <td>1</td> <td>1</td> <td>female</td> <td>19.0</td> <td>0</td> <td>0</td> <td>30.0000</td>
<td>S</td> <td>First</td> <td>woman</td> <td>False</td> <td>B</td> <td>Southampton</td> <td>yes</td> <td>True</td> </tr>
<tr> <th>889</th> <td>1</td> <td>1</td> <td>male</td> <td>26.0</td> <td>0</td> <td>0</td> <td>30.0000</td> <td>C</td> <td>First</td> <td>man</td> <td>True</td> <td>C</td> <td>Cherbourg</td> <td>yes</td> <td>True</td> </tr>
</tbody>
</table>
<p>182 rows × 15 columns</p>
</div>

--> Called with no arguments, `dropna()` drops every row that contains at least one NaN, which removes most of our dataset (we go from 891 rows to 182 rows).

# 1. Diagnose the dataset

1. Are there missing values?
2. How many?
3. How are they distributed?
4. Filter the dataset to isolate the rows with missing values
5. Analyze those entries to try to understand why the values are NaN

```python
# Number of missing values per column
df.isna().sum(axis=0)
```

    survived         0
    pclass           0
    sex              0
    age            177
    sibsp            0
    parch            0
    fare             0
    embarked         2
    class            0
    who              0
    adult_male       0
    deck           688
    embark_town      2
    alive            0
    alone            0
    dtype: int64

```python
# Visualize where the missing values sit (rows x columns)
sns.heatmap(df.isna())
```

    <Axes: >

![png](dropna_9_1.png)

```python
# Index of the rows that contain at least one NaN
na_index = df[df.isna().any(axis=1)].index
```

# 2. Eliminate the NaNs

Goal: keep as much data as possible.

You have to choose between dropping rows and dropping columns.

- Avoid dropping too large a share of the rows
- If more than 50% of a column is NaN, and if on top of that it shows no correlation with the target, you might as well drop that column before dropping any rows

```python
# deck is about 77% NaN (688 of 891 rows), so drop the whole column
df.drop(labels="deck", axis=1, inplace=True)
```

```python
# Check which missing values remain once deck is gone
sns.heatmap(df.isna())
```

    <Axes: >

![png](dropna_13_1.png)
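
The 50% rule above can be turned into a small helper. The sketch below is one possible way to do it, not this notebook's own code: `drop_sparse_columns`, its `threshold` parameter, and the `toy` DataFrame are hypothetical names introduced for illustration. It only looks at the share of NaN per column; the check on correlation with the target is deliberately left out and would still have to be done by hand.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy of `frame` without the columns whose NaN share exceeds `threshold`."""
    nan_share = frame.isna().mean(axis=0)  # fraction of NaN in each column
    sparse_columns = nan_share[nan_share > threshold].index
    return frame.drop(columns=sparse_columns)


# Hypothetical toy data standing in for the Titanic dataset (assumption, not real rows).
toy = pd.DataFrame(
    {
        "survived": [0, 1, 1, 0, 1, 0],
        "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
        "deck": [np.nan, "C", np.nan, np.nan, "E", np.nan],
    }
)

# Step 1: drop columns that are mostly NaN ("deck" is 4/6 NaN here).
without_sparse = drop_sparse_columns(toy, threshold=0.5)

# Step 2: only then drop the few rows that still contain a NaN.
cleaned = without_sparse.dropna(axis=0)

print(list(without_sparse.columns))
print(len(toy), "->", len(cleaned), "rows")
```

On this toy frame, calling `toy.dropna()` directly would keep only 2 of the 6 rows, whereas dropping the sparse column first keeps 5. The same ordering is what the `deck` example above applies to the real dataset.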