Pandas – Duplicates
Introduction
Sometimes we have a dataset with duplicate rows. Rather than filtering and removing them one by one, Pandas has a very handy method, '.drop_duplicates()',
that does this in one go.
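As a quick illustration, here is a minimal sketch on a small made-up dataframe (not the dataset used in this lesson): .drop_duplicates() removes every row that is an exact copy of an earlier row.
import pandas as pd

toy = pd.DataFrame({'name': ['A', 'B', 'A'], 'value': [1, 2, 1]})
print(len(toy))                    # 3 rows; the first and last are identical
print(len(toy.drop_duplicates()))  # 2 rows remain after dropping the duplicate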
Dataset
We will be using a subset of the IBRD loans dataset. The Pandas dataframe is called 'df'. Check out the dataset with df.head()
Remove duplicates
First, let's check the length of the dataframe with len(df)
. There are 30 rows in the dataframe. In the original data each 'Project ID' is unique, so check how many unique 'Project ID' values there are: len(df['Project ID'].unique())
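If you want to see the duplication directly rather than only counting unique values, pandas also provides .duplicated() and .value_counts(). This is an extra aside, not part of the lesson code below:
# number of rows whose 'Project ID' already occurred in an earlier row
print(df['Project ID'].duplicated().sum())
# how often each 'Project ID' appears
print(df['Project ID'].value_counts())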
Drop rows when the whole row is duplicated.
Make a new dataframe, called 'df_no_duplicates', by dropping duplicates of the whole row:
df_no_duplicates = df.drop_duplicates()
Check the length of the new dataframe to confirm that the duplicate rows have been dropped.
len(df_no_duplicates)
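Note that drop_duplicates() keeps the first occurrence of each row and preserves the original index labels. If you prefer a clean 0..n-1 index on the result, you can reset it; this is an optional step, not required for the rest of the lesson:
df_no_duplicates = df.drop_duplicates().reset_index(drop=True)
print(df_no_duplicates.index)  # RangeIndex starting at 0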
Drop rows when certain values in the row are duplicated.
Suppose we wanted to drop duplicates based on a particular column. In this dataset, certain countries appear more than once. We create a new dataframe, df_unique_countries:
df_unique_countries = df.drop_duplicates(['Country'], keep='first')
This keeps the first row of each country and drops the other rows for that country appearing after it. Check the result:
df_unique_countries
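The keep argument controls which of the duplicated rows survives. Besides 'first' (the default), pandas also accepts 'last' and False; a short sketch of the alternatives (the variable names here are just for illustration):
# keep the last row of each country instead of the first
df_last = df.drop_duplicates(['Country'], keep='last')
# drop every row whose country appears more than once
df_no_repeats = df.drop_duplicates(['Country'], keep=False)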
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
# keep two columns of the first 10 rows, then repeat them 3 times to create duplicate rows
df = df[['Project ID', 'Country']].head(10)
df = pd.concat([df]*3, ignore_index=True)

# DROP ROW WHEN WHOLE ROW IS DUPLICATED
# check the length of the dataset
print(len(df))
# check how many unique 'Project ID' are in the dataset
print(len(df['Project ID'].unique()))
# make a new dataframe by dropping duplicates
# call it df_no_duplicates
df_no_duplicates = df.drop_duplicates()
# check the length of the new dataframe
print(len(df_no_duplicates))

# DROP ROW BASED ON COLUMNS
# keep only the first row for each country
df_unique_countries = df.drop_duplicates(['Country'], keep='first')
df_unique_countries
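A quick way to confirm that the subset-based drop worked is to check that the 'Country' column of the result no longer contains repeats (a sanity check, not part of the original lesson code):
# True if every country now appears exactly once
print(df_unique_countries['Country'].is_unique)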
Try it out
Check the dataframe we prepared: df_new. There are a lot of duplicated rows.
- Make a new dataframe called df_new_Project_ID in which all duplicate 'Project ID' values are removed.
- Make a new dataframe called df_new_two_column by removing the duplicates of 'Project Name' and 'Interest Rate'.
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
# keep four columns for the exercise; this dataframe contains duplicated rows
df_new = df[['Project ID','Project Name', 'Agreement Signing Date','Interest Rate']]
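If you get stuck, one possible solution follows the same pattern as above (shown here as a sketch; try it yourself first):
# drop rows with a duplicated 'Project ID'
df_new_Project_ID = df_new.drop_duplicates(['Project ID'])
print(len(df_new_Project_ID))

# drop rows where both 'Project Name' and 'Interest Rate' are duplicated
df_new_two_column = df_new.drop_duplicates(['Project Name', 'Interest Rate'])
print(len(df_new_two_column))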