Pandas – Concatenate
Introduction
In this ‘how-to’, we will show several ways of combining data using Pandas. If you have some experience with SQL, this should be very simple. If not, that is OK. Just try to follow the examples and try it out in the “Try it out”- section.
We prepared two datasets and show how to ‘concatenate’ them.
Dataset
We have two datasets, df1 and df2, and we want to combine them into one dataset. These are two subset of a large loans dataset that we have been using earlier. They contain the same type of data.
Check out the length of each data set with the len()
and explore them further with head()
method.
import pandas as pd
df1 =pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD01.csv")
df2 =pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD02.csv")
print (df1.head())
print (df2.head())
Simple concatenation of two dataframes
Now that you are familiar with both data frames let’s make them into one dataframe. We will combine both datasets row-wise.
print (len(df1))
: Check how many rows the first dataset has.
print (df1.head())
: Show the first five rows of the first dataset.
print (len(df2))
: Check how many rows the second dataset has.
print (df2.tail())
: Show the last five rows of the second dataset.
df3 = pd.concat([df1, df2])
: Concatenate both datasets.
print (df3.head())
: Show the first five rows of the third dataset. This should be the same as the the first five rows of the first dataset.
print (df3.tail())
Show the last five rows of the third dataset. The last five rows of the second dataset.
print (len(df3))
: Show the number of rows of the third dataset.
We could have written this as.
df3 = pd.concat([df1, df2], axis=0)
, which have given the same result. Try it out!
As you might have noticed, the index shows some duplication. Let’ set the column “Project ID” as the index. df3.set_index('Project ID')
.
import pandas as pd
df1 =pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD01.csv")
df2 =pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD02.csv")
len(df1)
len(df2)
# Check how many rows the first dataset has
print (len(df1))
# Explore the first dataset
print (df1.head())
# Check how many rows the second dataset has
print (len(df2))
# Explore the first dataset
print (df2.tail())
# Concatenate both dataframes
df3 = pd.concat([df1, df2])
# check the first five rows
print (df3.head())
# check the last five rows
print (df3.tail())
# How many rows does the new dataset have
print (len(df3))
# Give the new dataset a more understandable index
df3.set_index('Project ID', inplace=True)
df3
Try it out
In this section you can try out some of the code yourself.
We have prepared two datasets.
- a dataset with Project ID, the Country and the Borrower: df_country_borrower
- a dataset with Project ID, Project name and interest rate: df_name_interest
After exploring them (tip: use head() only as they can be a bit large) try to do the following excercises.
Try to
- Concatenate both datesets column wise.
- Remove the duplicate column Project ID.
import pandas as pd
df1 =pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
df_country_borrower = df1[['Project ID', 'Country', 'Borrower']].head(15)
df_name_interest = df1[['Project ID', 'Project Name', 'Interest Rate']].head(15)
print (df_country_borrower )
print (df_name_interest)