Pandas – Missing Values
Introduction
When collecting data, it can happen that values will be missing. There can be several reasons for this, data was purposely not given, typos made by the data auditor, etc. etc.
Within Pandas, a null value is considered missing and is denoted by ‘NaN’. In this how-to we will see how to detect null values and give some alternatives.
Dataset
We will work a subset of the dataset the IBRD. Take a quick look in the console to familiarise yourself with it: df.head()
Detecting missing values
First, let’s find out which are the values that are missing buy using: ‘.isnull()
‘. We use ‘.head()
‘ to show only the first five rows as the dataset is large.
print (df.isnull().head())
.
The table resulting from this command shows all values that are NOT missing is “False” while the missing values are “True”.
A better way of investigating this, especially when the dataset is large, is to create a table that shows the number of missing values per column: df.isnull().sum()
.
In our case, we see that there are 30 rows with missing values. Let’s check them out further and see how we can deal with, by making a new dataset with only the missing values:
df1 = df[df.isnull().any(axis=1)]
print (len(df1))
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
df = df[['Project ID', 'Borrower','Original Amount','Interest Rate']]
df['Interest Rate'] = df['Interest Rate'].replace(4.5, np.nan)
print (df.isnull().head())
print (df.isnull().sum())
df1 = df[df.isnull().any(axis=1)]
print ("The length of dataframe 'df1' is " + (str(len(df1))) + " rows")
;