Pandas – Group by

Rishi Sapra

Technical Community leader, speaker, trainer and evangelist specialising in Power BI and Azure. Formally recognised by Microsoft as a Most Valuable Professional (MVP), Fast Track Recognised Solution Architect (FTRSA) and Microsoft Certified Trainer (MCT).

Intro

When working with large datasets, it can be handy to categorize the data. This is where Pandas’ groupby() comes in.

This method splits the data into groups based on a variable of choice. We can then apply a function such as sum, count, or average to each group and combine the results into a new object. Of course, the column being aggregated must have a data type on which the calculation can be made. The new object will be much shorter and more compact than the original, and it can also be used to further visualize the dataset. If you are familiar with GROUP BY in SQL, this should feel easy. If you want a refresher on the idea, the Excel pivot table how-to is a good place to start.
Follow the examples and then try it yourself in the “Try it out” section.
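
The split, apply, combine idea described above can be sketched on a tiny table before touching the real data. The table below is invented for illustration (the country labels and amounts are not from the IBRD dataset); only the column names are borrowed from the examples in this post.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the IBRD dataset.
toy = pd.DataFrame({
    "Country": ["A", "B", "A", "C", "B"],
    "Original Amount": [100, 200, 50, 300, 25],
})

# Split: one group per distinct value in 'Country'.
grouped = toy.groupby("Country")["Original Amount"]

# Apply and combine: each call returns one value per group.
totals = grouped.sum()
counts = grouped.count()
means = grouped.mean()

# Several aggregations can also be requested in one call.
summary = grouped.agg(["sum", "count", "mean"])
print(summary)
```

Each result is indexed by the grouping column, so a single group can be looked up with, for example, totals["A"].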

Dataset

We will be using two subsets of the IBRD loan dataset. Familiarise yourself with them using df1.head() and df2.head().

Examples

We start with a simple dataset containing all the loan amounts per country (try out “df1.head()”).

Simple Grouping
In our dataset, we have a column of countries. A country may appear in this column multiple times if it has more than one loan; the amount of each loan is in the second column. We now want to know the total amount of loans per country. To do so, we group by the country column, ‘Country’, and sum the loan amount column, ‘Original Amount’:
df1.groupby(['Country'])['Original Amount'].sum()
This does not look nice, so let’s convert it to a pandas dataframe with .to_frame() and give it a fresh numeric index with .reset_index(), which also turns ‘Country’ back into a regular column. We call this dataframe df1_country_loans.
df1_country_loans = df1.groupby(['Country'])['Original Amount'].sum().to_frame().reset_index()
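
As a side note, the chain above can be shortened. The sketch below uses an invented toy table (not the IBRD data) to show two alternatives that should give the same table: calling .reset_index() directly on the grouped result, and passing as_index=False to groupby() so the grouping column is never moved into the index.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the IBRD dataset.
toy = pd.DataFrame({
    "Country": ["A", "B", "A", "C", "B"],
    "Original Amount": [100, 200, 50, 300, 25],
})

# The form used in this post: Series -> dataframe -> fresh index.
long_form = toy.groupby(["Country"])["Original Amount"].sum().to_frame().reset_index()

# reset_index() on a Series already returns a dataframe, so to_frame() is optional.
via_reset = toy.groupby("Country")["Original Amount"].sum().reset_index()

# as_index=False keeps 'Country' as a regular column from the start.
via_flag = toy.groupby("Country", as_index=False)["Original Amount"].sum()

print(via_flag)
```

Which form to use is mostly a matter of taste; the longer chain makes each step explicit, which is helpful when learning.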

Multiple Grouping
We can group by more than one column. We show this using another prepared dataset, df2, which also contains the status of each loan. We group by country and then by the status of the loan:
df2.groupby(['Country','Status'])['Original Amount']
The line above only defines the groups; no calculation has happened yet. We can now state what information we want about the loans. Let’s count how many loans each country has in each status and store the result in a new object called df2_loans_status.
df2_loans_status = df2.groupby(['Country','Status'])['Original Amount'].count()
Again, we make it look nice by turning it into a dataframe and giving it a fresh index.
df2_loans_status = df2_loans_status.to_frame().reset_index()
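
One detail worth knowing when counting: .count() skips missing values in the selected column, while .size() counts every row in the group. The sketch below illustrates the difference on an invented toy table; the status labels and amounts are hypothetical and not taken from the IBRD data. It also shows .unstack(), which spreads the second grouping level across columns, much like a pivot table.

```python
import pandas as pd

# Toy data invented for illustration; status labels are hypothetical.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Status": ["Repaid", "Repaid", "Disbursing", "Repaid", "Repaid"],
    "Original Amount": [100.0, float("nan"), 50.0, 200.0, 25.0],
})

by_status = toy.groupby(["Country", "Status"])["Original Amount"]

non_missing = by_status.count()  # ignores the missing amount
all_rows = by_status.size()      # counts every row in each group

# One row per country, one column per status; absent combinations become 0.
wide = all_rows.unstack(fill_value=0)
print(wide)
```

If the amount column has no missing values, the two counts agree and either method works.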



import pandas as pd

# Load the IBRD loan data and build the subsets used in this post
df = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
df1 = df[['Country', 'Original Amount']]
df2 = df[['Country', 'Status', 'Original Amount']]
df3 = df[['Project ID', 'Project Name', 'Borrower', 'Repaid to IBRD']]

# Check the df1 dataframe
print(df1.head(10))

# Group by country and sum the loan amount
print(df1.groupby(['Country'])['Original Amount'].sum())

# Convert the result to a pandas dataframe and give it an index. Name it df1_country_loans.
df1_country_loans = df1.groupby(['Country'])['Original Amount'].sum().to_frame().reset_index()
print(df1_country_loans)

# Multiple Grouping

# Group by country and status of the loan (nothing is calculated until we aggregate)
df2.groupby(['Country', 'Status'])['Original Amount']

# Count the loans per country and status and store the result in df2_loans_status
df2_loans_status = df2.groupby(['Country', 'Status'])['Original Amount'].count()
print(df2_loans_status)

# Turn it into a dataframe and give it an index
df2_loans_status = df2_loans_status.to_frame().reset_index()
print(df2_loans_status)

Try it out

We have another dataset, df3, which contains the project ID, project name, borrower and repaid amount. Calculate the total amount repaid to IBRD, grouped by project name and borrower.



import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
df3 = df[['Project ID', 'Project Name', 'Borrower', 'Repaid to IBRD']]

