Pandas – Data Structures
Introduction
Often, when we have large sets of data, we do not need all of it to make calculations. Pandas has some very good functions for selecting only a subset of a dataset. In the following how-to we will use a shortened dataset from the World Bank.
To recap, a DataFrame is composed of three components: the index, the columns, and the data. The data is also known as the values. The index is the sequence of labels on the far left-hand side of the DataFrame.
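As a minimal sketch of these three components, the snippet below builds a tiny DataFrame by hand and pulls out each part. The name `toy` and its contents are made up for illustration and are not the World Bank data.

```python
import pandas as pd

# Hypothetical toy data, invented for this example only.
toy = pd.DataFrame(
    {"country": ["Kenya", "Peru", "India"], "interest_rate": [1.5, 2.0, 0.75]}
)

index_labels = list(toy.index)     # row labels shown on the far left
column_labels = list(toy.columns)  # column names shown across the top
values = toy.to_numpy()            # the data itself, as a NumPy array

print(index_labels)
print(column_labels)
print(values.shape)
```

Because no index was given, Pandas numbers the rows from 0, which is what appears on the left-hand side when the DataFrame is printed.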
Dataset
We have a small dataset from the World Bank (www.worldbank.org), the organisation that provides (development) loans to countries. This dataset is already loaded into a Pandas DataFrame called ‘world’.
Explore
Let us use the basic functions of Pandas to explore our data.
world.shape
returns, in parentheses, how many rows and columns the dataset has, in the order (rows, columns). This is good to
know so that you can decide whether or not to continue with the data project if the number of rows/observations
is too small to do any sensible analysis.
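A small sketch of how the result of `.shape` can be unpacked and used for such a check. The DataFrame `toy` and the threshold `MIN_ROWS` are assumptions chosen for illustration; what counts as "too small" depends on your analysis.

```python
import pandas as pd

# Hypothetical toy data, invented for this example only.
toy = pd.DataFrame(
    {"country": ["Kenya", "Peru", "India"], "amount": ["100", "250", "75"]}
)

# .shape is a tuple, so it can be unpacked into two variables.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

MIN_ROWS = 30  # arbitrary threshold, an assumption for this sketch
enough_data = n_rows >= MIN_ROWS
if not enough_data:
    print("Too few rows for a sensible analysis")
```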
world.head()
displays the column labels and the first five rows of the dataset.
To see more or fewer than the default five rows, put the number in the brackets, for example world.head(10).
world.tail()
displays the column labels and the last five rows of the dataset.
To see more or fewer than the default last five rows, put the number in the brackets, for example world.tail(3).
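The following sketch shows `.head()` and `.tail()` with and without a number in the brackets. The DataFrame `toy` and its column `loan_id` are hypothetical and only serve to make the selected rows easy to recognise.

```python
import pandas as pd

# Hypothetical toy data: ten rows numbered 1 to 10.
toy = pd.DataFrame({"loan_id": range(1, 11)})

default_head = toy.head()  # no argument: first five rows
first_three = toy.head(3)  # first three rows
last_two = toy.tail(2)     # last two rows

print(first_three)
print(last_two)
```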
len(world)
returns the length of the dataset, i.e. the number of rows.
Can you see how many rows this dataset has?
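As a quick check, the number returned by `len()` is the same as the first entry of `.shape`. The DataFrame `toy` below is a made-up example.

```python
import pandas as pd

# Hypothetical toy data, invented for this example only.
toy = pd.DataFrame({"country": ["Kenya", "Peru", "India", "Chile"]})

n_rows = len(toy)
print(n_rows)
print(n_rows == toy.shape[0])
```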
world.describe()
calculates some basic summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the columns whose datatype is float or integer.
In this case we see the statistics for the interest rates. You should notice something strange here. Why is Pandas only
calculating these statistics for “interest rates”? Why not for the column “amount”? The “dtypes” attribute answers
this question.
world.dtypes
displays the datatypes of the columns. As we can see, the column “amount” has datatype
object, which typically means its values are stored as text rather than as numbers. For this reason the calculations cannot be performed on it. See how you can change datatypes here [insert link to ..]
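The effect can be reproduced on a small scale. In this sketch the DataFrame `toy` and its column names are assumptions that mimic the situation described above: one numeric column and one column where the numbers were stored as text.

```python
import pandas as pd

# Hypothetical toy data: "amount" holds numbers written as text.
toy = pd.DataFrame(
    {"interest_rate": [1.0, 2.0, 3.0], "amount": ["1,000", "250", "75"]}
)

summary = toy.describe()
described_columns = list(summary.columns)  # only the numeric column survives
mean_rate = summary.loc["mean", "interest_rate"]

print(described_columns)
print(toy.dtypes)  # shows why "amount" was skipped

# True only for integer/float-like columns.
amount_is_numeric = pd.api.types.is_numeric_dtype(toy["amount"])
print(amount_is_numeric)
```

By default `.describe()` silently leaves out the non-numeric column, so checking `.dtypes` is a useful habit whenever an expected column is missing from the summary.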
world.columns.values
outputs the names of all the columns as an array. This can be handy if
your dataset has a large number of columns, since not all of them may be displayed when using the ‘.head()’ function.
It also helps when you need to select certain columns, because it shows you the column names without the need to scroll to the right.
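A short sketch of listing the column names and then using them to select columns. The DataFrame `toy` and its column names are hypothetical; `.tolist()` is used here as one way to get a plain Python list.

```python
import pandas as pd

# Hypothetical toy data, invented for this example only.
toy = pd.DataFrame(
    {"country": ["Kenya"], "interest_rate": [1.5], "amount": ["100"]}
)

names_array = toy.columns.values   # array of column names
names_list = toy.columns.tolist()  # the same names as a plain Python list

print(names_array)
print(names_list)

# Once the names are known, a subset of columns can be selected by name.
subset = toy[["country", "amount"]]
print(subset)
```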
world.isnull()
returns a table of Boolean values with the same shape as the dataset: True where a value is empty (null) and False otherwise. This might be
important to know so you can decide whether to continue with your data project in the first place. Perhaps a large number
of null values indicates that you should go back to the data collecting phase. Or maybe certain columns should not be used
and should be removed.
world.isnull().sum()
calculates the number of null values per column.
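To see how the Boolean table and the per-column count relate, here is a minimal sketch on made-up data. The DataFrame `toy` is hypothetical; summing the per-column counts once more is one way to get the total number of missing values.

```python
import pandas as pd

# Hypothetical toy data with three missing values.
toy = pd.DataFrame(
    {
        "country": ["Kenya", None, "India"],
        "interest_rate": [1.5, None, None],
    }
)

mask = toy.isnull()              # True where a value is missing
per_column = mask.sum()          # True counts as 1, so this counts nulls per column
total = int(per_column.sum())    # total number of nulls in the whole DataFrame

print(mask)
print(per_column)
print(total)
```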
import pandas as pd
world = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD11.csv")
print('How many rows and columns does the dataset have?')
print(world.shape)
print('====================')
print('Display the column names and the first five rows')
print(world.head())
print('====================')
# display the column names and the last five rows
print(world.tail())
print('====================')
# display the number of rows
print(len(world))
print('====================')
# basic summary statistics for the columns whose datatype is float or integer
print(world.describe())
print('====================')
# display the datatypes of the columns
print(world.dtypes)
print('====================')
# output the names of all the columns as an array
print(world.columns.values)
print('====================')
# show, for the first five rows, which values are empty (True) or not (False)
print(world.isnull().head())
print('====================')
# calculate the total number of null values per column
print(world.isnull().sum())
print('====================')
Try it out - Cars
This is a chance to try out how to explore a dataset. A dataset called ‘cars’ has been uploaded. Explore it as explained above.
Try to answer the following questions.
- Find out how many rows and columns the cars dataset has.
- What are the column names?
- Are there any missing values?
- How many per column and how many in total?
import pandas as pd
cars = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/cars02.csv")
cars.head()