Pandas – Data Structures

Rishi Sapra

Technical Community leader, speaker, trainer and evangelist specialising in Power BI and Azure. Formally recognised by Microsoft as a Most Valuable Professional (MVP), Fast Track Recognised Solution Architect (FTRSA) and Microsoft Certified Trainer (MCT).


Introduction

Often, when we have large sets of data, we do not need all of it to make calculations. Pandas has some very good functions for selecting only a subset of a dataset. In the following how-to we will use a shortened dataset from the World Bank.
To recap, a DataFrame is composed of three components: the index, the columns, and the data. The data is also known as the values. The index is the sequence of labels on the far left-hand side of the DataFrame.
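
As a minimal sketch of those three components, the snippet below builds a tiny DataFrame by hand and prints each part. The frame name `toy`, the country labels and the figures are made-up placeholders, not values from the World Bank dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {"country": ["A", "B", "C"], "interest_rate": [1.5, 2.0, 0.75]}
)

print(toy.index)    # row labels: a default index running from 0 to 2
print(toy.columns)  # column labels
print(toy.values)   # the underlying data as an array
```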

Dataset

We have a small dataset from the World Bank (www.worldbank.org), the organisation that provides (development) loans to countries. This dataset is already loaded into a Pandas DataFrame called ‘world’.

Explore

Let us use the basic functions of Pandas to explore our data.
world.shape outputs, in brackets, how many rows and columns the dataset has. This is good to
know so that you can decide whether or not to continue with the data project if the number of rows (observations)
is too small to do any sensible analysis.

world.head() displays the column labels and the first five rows of the dataset.
To see more or fewer than the default five rows, type the number in the brackets.

world.tail() displays the column labels and the last five rows of the dataset.
To see more or fewer than the default last five rows, type the number in the brackets.

The len(world) function tells the user how long the dataset is, i.e. the number of rows.
Can you see how many rows this dataset has?
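
The sketch below shows how shape, head(), tail() and len() relate to each other. It uses a small hand-made frame called `toy` (a hypothetical stand-in, not the World Bank data) so that it runs without a download.

```python
import pandas as pd

# Hypothetical toy data: 8 rows, 2 columns
toy = pd.DataFrame({"id": range(8), "rate": [0.5 * i for i in range(8)]})

rows, cols = toy.shape     # shape is an attribute, so no brackets are needed
first_three = toy.head(3)  # first 3 rows instead of the default 5
last_two = toy.tail(2)     # last 2 rows instead of the default 5

print(rows, cols)
print(first_three)
print(last_two)
print(len(toy) == rows)    # len() matches the first entry of shape
```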

world.describe() calculates some basic statistics for the columns of the dataset, as long as their datatype is float or integer.
In this case we see statistics for the interest rates. You should notice something strange here. Why is Pandas only
calculating these statistics for “interest rates”? Why not for the column “amount”? The “dtypes” attribute answers
this question.

world.dtypes displays the datatypes of the columns. As we can see, the column “amount” has datatype
object. For this reason calculations cannot be performed on it. See how you can change datatypes here [insert link to ..]
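
To illustrate why a text column is skipped by describe(), and one possible way to convert it, here is a sketch on made-up data. It assumes the text values contain thousands separators and an unparseable entry. The real “amount” column may be formatted differently, so treat the cleaning step as an assumption to adapt.

```python
import pandas as pd

# Hypothetical toy data: "amount" holds numbers stored as text
toy = pd.DataFrame(
    {"rate": [1.5, 2.0, 0.75], "amount": ["1,000", "250", "n/a"]}
)
print(toy.dtypes)      # rate is float64; amount is a text column
print(toy.describe())  # only "rate" is summarised

# Strip thousands separators, then convert; unparseable entries become NaN
cleaned = toy["amount"].str.replace(",", "", regex=False)
toy["amount"] = pd.to_numeric(cleaned, errors="coerce")

print(toy.dtypes)      # amount is now float64
print(toy.describe())  # both columns are summarised
```

Using errors="coerce" turns anything that cannot be parsed into a missing value instead of raising an error, so it is worth checking how many missing values the conversion introduces.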

world.columns.values outputs all the column names as an array. This can be handy if
your dataset has a large number of columns, because not all of them may be displayed when using the ‘.head()’ function.
It also helps when you need to select certain columns, since it shows the column names without the need to scroll to the right.
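
A short sketch of listing column names on a wide frame follows. The frame `wide` and its column names are hypothetical. The display.max_columns option at the end is an optional extra that is not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical wide frame with 30 columns named col_0 ... col_29
wide = pd.DataFrame([[0] * 30], columns=[f"col_{i}" for i in range(30)])

names = wide.columns.values         # array of column labels
names_list = wide.columns.tolist()  # the same labels as a plain Python list

print(names)
print(names_list[:5])

# Optional: ask pandas to print every column instead of truncating the middle
pd.set_option("display.max_columns", None)
print(wide.head())
```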

world.isnull() returns a table of the same shape as the dataset filled with Boolean values: True where a value is empty and False otherwise. This is
important to know so you can decide whether to continue with your data project in the first place. Perhaps a large number
of null values indicates that you should go back to the data collecting phase. Or maybe certain columns should not be used
and should be removed.

world.isnull().sum() calculates the number of null values per column.
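
The relationship between isnull() and isnull().sum() can be sketched on a small made-up frame. The frame `toy` and its values are hypothetical. Summing a second time gives the total number of missing values in the whole frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three missing values
toy = pd.DataFrame(
    {"rate": [1.5, np.nan, 0.75], "amount": [np.nan, np.nan, 100.0]}
)

mask = toy.isnull()            # same shape as toy, True where a value is missing
per_column = mask.sum()        # True counts as 1, so this counts nulls per column
total = int(per_column.sum())  # a second sum gives the total for the whole frame

print(mask)
print(per_column)
print(total)
```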



import pandas as pd

# load the World Bank dataset into a DataFrame
world = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD11.csv")

print('How many rows and columns does the dataset have?')
print(world.shape)

print('====================')
print('Display the column names and the first five rows')
print(world.head())

print('====================')
# display the column names and the last five rows
print(world.tail())

print('====================')
# display the number of rows in the dataset
print(len(world))

print('====================')
# basic statistics for the columns whose datatype is float or integer
print(world.describe())

print('====================')
# display the datatypes of the columns
print(world.dtypes)

print('====================')
# output all the column names as an array
print(world.columns.values)

print('====================')
# show, for the first five rows, which values are empty (True) or not (False)
print(world.isnull().head())

print('====================')
# calculate the total number of null values per column
print(world.isnull().sum())
print('====================')

Try it out - Cars

This is a chance to try out how to explore a dataset. A dataset called ‘cars’ has been uploaded. Explore it as explained above.

Try to answer the following questions.

Find out how many rows and columns the cars dataset has.
What are the column names?
Are there any missing values?
How many per column and how many in total?



import pandas as pd

cars = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/cars02.csv")





cars.head()

