Pandas – Merge
Intro
When performing data analysis we often have more than one dataset. These datasets are linked and we want to make it ‘merge’ them into one dataset for our data analysis.
Pandas has a “merge” method to bring two, or more, datasets together and create one dataset. You might need this more than you think so it is good to have a solid understanding of this. The merge() method in Pandas is similar to database “join” operation in SQL and Excel lookup functions. If you are not familiar with these, that is not a problem. We will give a full explanation.
Let’s explain several types of ‘joins’:
- Inner join: This type of join contains rows from the dataset that have matching values. In case of one the datasets have duplicate rows, that row will be showed twice in the result set.
- Left join: The Left join (also called Left Outer join ) gives all rows from both tables the same as an INNER join result set. Additionally, any row from the left dataset that does not have a matching row that exists in the right table will also be included in the result set. However, these will show “NaN” values.
- Right join: The Right join (also called Right Outer join ) gives all rows from both tables the same as an INNER join result set. Additionally, any rows from the right dataset that do not have a matching row that exists in the left dataset will also be included in the result set. However, these will show “NaN” values.
- Outer join: The Outer join combines the results of both left and right joins. The joined dataset will contain all records from both the tables and fill in “Nan” values for missing matches on either side.
Dataset
We have two simple datasets:
- List of countries with a Project ID.
- List of Projects and their status.
If you want to see how they look you can print them out.
print (loans_country)
print (loans_status)
Press the “run” button.
import pandas as pd
loans_country = pd.DataFrame({'Project ID':[6,2,3,4,5,8],'Country': ['Turkey','France', 'Denmark', 'Belgium', 'Germany', 'Italy'],})
loans_status = pd.DataFrame({'Status':['Cancelled','Outstanding','Reimbursed','Terminated','Paid','Signed'], 'Project ID':[1,2,3,4,5,9]})
How to merge in Pandas
In the console below we prepared some variables that join both datasets in several ways. Before pressing the run button, you might want to put some “#” sign for the code you do NOT wish to run. It makes it easier to see per join what Pandas is doing.
- Left merge:
left_merge = pd.merge(loans_country, loans_status, how='left')
Pandas, takes the dataset that is mentioned first, loans_country, and displays all rows in the new dataframe. Pandas automatically detects the columns that both datasets have in common: Project_ID. All Project ID and countries in the first dataset are shown. The Project ID’s that are not available in the dataset “loans_status” now show “NaN” in the status column. - Right merge:
pd.merge(loans_country, loans_status, how='right')
. Pandas, takes the dataset dat is mentioned second, loans_status, and displays all rows in the new dataset. Pandas automatically detects the columns that both datasets have in common: Project_ID. All “Project_ID” and countries in the second dataset are shown. The Project ID’s that are not available in “loans_country” now show “NaN”. - Inner merge:
pd.merge(loans_country, loans_status, how='inner')
, displays all the rows where there no “NaN” value in the “status” column in either dataset. - Outer merge
:pd.merge(loans_country, loans_status, how='outer')
, displays all the rows including where there is a “NaN” value in the “status” column.
import pandas as pd
loans_country = pd.DataFrame({'Project ID':[6,2,3,4,5,8],'Country': ['Turkey','France', 'Denmark', 'Belgium', 'Germany', 'Italy'],})
loans_status = pd.DataFrame({'Status':['Cancelled','Outstanding','Reimbursed','Terminated','Paid','Signed'], 'Project ID':[1,2,3,4,5,9]})
left_merge = pd.merge(loans_country, loans_status, how='left')
right_merge = pd.merge(loans_country, loans_status, how='right')
inner_merge = pd.merge(loans_country, loans_status, how='inner')
outer_merge = pd.merge(loans_country, loans_status, how='outer')
print ("Join on the left dataset. Shows all rows of loans_country")
print (left_merge)
print ("Join on the right dataset. Shows all rows of loans_status")
print (right_merge)
print ("Shows all rows of loans_country and loans_status")
print (inner_merge)
print ("Shows the rows of both datasets")
print (outer_merge)
Try it Yourself
We have made two small datasets: projects and project_status. Use the merge options we discussed above and see what you get.
import pandas as pd
projects = pd.DataFrame({'Project ID':[4,7,3,1,2,9],'Type': ['Water','Food', 'Agriculture', 'Education', 'Trade', 'Sanitation'],})
project_status = pd.DataFrame({'Project ID':[1,6,3,4,5,9], 'Status':['approved','waitlist','ended','unkown','waitlist','approved']})
print (projects)
print (project_status)