Pandas – Group by

team-member-1

Rishi Sapra

Technical Community leader, speaker, trainer and evangelist specialising in Power BI and Azure. Formally recognised by Microsoft as a Most Valuable Professional (MVP), Fast Track Recognised Solution Architect (FTRSA) and Microsoft Certified Trainer (MCT).

Tags:

Intro

When working with large datasets it can be handy categorize the dataset. This is where Pandas’ groupby() can be used.

This method essentially splits the data into different groups depending on a variable of choice. After splitting the data and combining them, we can apply certain functions on them such as sum, count, or average. Of course, the dataset must have a datatype that calculation can be made. The new object will be much shorter can compact. It can also be used to further visualize the dataset. Again, if you are familiar with SQL, this should be easy. Also, if you want to refresh, please check the following Excel pivotable how to here.
Follow the examples and try it out in the “Try it out”-section.

Dataset

We will be using two subsets of the loan dataset of the IBRD. Familiarise yourself with them. df1.head() and df2.head().

Examples

We start with a simple dataset with all the loans amounts per country (try out “df1.head()“).

Simple Grouping
In our dataset, we have a column of countries. In this column, a country may appear multiple times if it has more than one loan outstanding, the amount of which can be seen in the second column. We now want to know the total amount of of loans per country. To do so we group by country, ‘Country’, and sum the loan amouunt: ‘Original Amount’
df1.groupby(['Country'])['Original Amount'].sum()
This this not look nice so let’s convert it to a pandas dataframe, .to_frame(), and give it an index, .reset_index(). We call this dataframe df1_country_loans.
df1_country_loans = df1.groupby(['Country'])['Original Amount'].sum().to_frame().reset_index()

Multiple Grouping
We can group by more than one column. We show this using another prepared a dataset, df2, with the status of the loan amount. We group by country and then show the status of the loan:
df2.groupby(['Country','Status'])['Original Amount']
We can now state how what information we want about the loans. Let’s count how many loans each country has outstanding and create a new object called df2_loans_status.
df2_loans_status= df2.groupby(['Country','Status'])['Original Amount'].count()
Again we make it look nice by turning it into a dataframe and giving it an index.
df2_loans_status = df2_loans_status.to_frame().reset_index()


import pandas as pd
df =pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
df1 = df[['Country','Original Amount']]
df2 = df[['Country','Status','Original Amount']]
df3 = df[['Project ID', 'Project Name', 'Borrower', 'Repaid to IBRD']]



# Check df1 datadframe
print (df.head(10))
# Group by country and perform sum calculation on loan amount.
print (df1.groupby(['Country'])['Original Amount'].sum())
# Convert it to a pandas dataframe, and give it an index. Name it df1_country_loans.
df1_country_loans = df1.groupby(['Country'])['Original Amount'].sum().to_frame().reset_index()
print (df1_country_loans)
# Multiple Grouping
# Group by country and status of the loan
df2.groupby(['Country','Status'])['Original Amount']
# Create a new object called df2_loans_status.
df2_loans_status = df2.groupby(['Country','Status'])['Original Amount'].count()
print (df2_loans_status)
# Turn it into a dataframe and give it an index.
df2_loans_status = df2_loans_status.to_frame().reset_index()
print (df2_loans_status)



Try it out

We have another dataset, df3, which contains the project ID, project name, borrower and repaid amount. Group the total amount repaid to IBRD by Project name and Borrower


import pandas as pd
df =pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
df3 = df[['Project ID', 'Project Name', 'Borrower', 'Repaid to IBRD']]






Leave a comment