Pandas – Duplicates


Introduction

Sometimes a dataset contains duplicate rows. Rather than filtering and removing them one by one, Pandas has a very handy method, .drop_duplicates(), that does this in one go.
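For example, here is a minimal sketch on a made-up dataframe (the frame and its values are illustrative only, not part of the dataset used below):

import pandas as pd

# a tiny dataframe where the first two rows are identical
toy = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
print(toy.drop_duplicates())  # the duplicated row (1, 'x') is kept only once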

Dataset

We will be using a subset of the loans dataset from the IBRD (International Bank for Reconstruction and Development). The Pandas dataframe is called ‘df’. Check out the dataset with df.head()

Remove duplicates

First, let's check the length of the dataframe with len(df). There are 30 rows in the dataframe. In the original data each ‘Project ID’ is unique, so counting the distinct IDs shows how many of those rows are duplicates. Check how many unique ‘Project ID’ values there are: len(df['Project ID'].unique())

Drop rows where the whole row is duplicated.
Make a new dataframe, called ‘df_no_duplicates’, by dropping rows that are duplicated in their entirety:
df_no_duplicates = df.drop_duplicates()
Check the length of the new dataframe to confirm that the duplicated rows have been dropped:
len(df_no_duplicates)

Drop rows where certain values in the row are duplicated.
Suppose we want to drop duplicates based on a particular column. In this dataset, certain countries appear more than once. We create a new dataframe, df_unique_countries:
df_unique_countries = df.drop_duplicates(['Country'], keep='first'). This keeps the first row for each country and drops the other rows for that country appearing after it.
df_unique_countries
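The keep argument also accepts two other values, which can be useful depending on which occurrence you want to retain. A short sketch, assuming df has been loaded as in the code cell below:

# keep='last' retains the final occurrence of each country instead of the first
df_last_per_country = df.drop_duplicates(['Country'], keep='last')
# keep=False drops every row whose country appears more than once
df_single_countries = df.drop_duplicates(['Country'], keep=False)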


import pandas as pd
# load the IBRD loans dataset
df = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
# keep two columns and only the first 10 rows
df = df[['Project ID', 'Country']].head(10)
# stack three copies of those rows to create duplicates (30 rows in total)
df = pd.concat([df]*3, ignore_index=True)
 



# DROP ROWS WHEN THE WHOLE ROW IS DUPLICATED
# check the length of the dataset
print(len(df))
# check how many unique 'Project ID' values are in the dataset
print(len(df['Project ID'].unique()))
# make a new dataframe by dropping duplicate rows
# call it df_no_duplicates
df_no_duplicates = df.drop_duplicates()
# check the length of the new dataframe
print(len(df_no_duplicates))
# DROP ROWS BASED ON SELECTED COLUMNS
df_unique_countries = df.drop_duplicates(['Country'], keep='first')
df_unique_countries
 




Try it out

Check the dataframe we prepared: df_new. It contains a lot of duplicated rows.

  • Make a new dataframe called df_new_Project_ID in which all rows with a duplicate ‘Project ID’ are removed.
  • Make a new dataframe called df_new_two_column by removing rows that duplicate both ‘Project Name’ and ‘Interest Rate’. (A possible solution sketch follows the setup code below.)

 


import pandas as pd
# load the full IBRD dataset and keep four columns for the exercises
df = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
df_new = df[['Project ID','Project Name', 'Agreement Signing Date','Interest Rate']]
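One possible solution sketch, applying the same .drop_duplicates() pattern shown above (the dataframe names come from the exercise wording):

# keep only the first row for each Project ID
df_new_Project_ID = df_new.drop_duplicates(['Project ID'])
# keep only the first row for each ('Project Name', 'Interest Rate') pair
df_new_two_column = df_new.drop_duplicates(['Project Name', 'Interest Rate'])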
 


 



