Pandas – sorting
Pandas - sorting
Sorting by rows
Pandas has good sorting capabilities. When datasets are very large you want to sort it even by multiple columns.
To do this Pandas, there are the following ways:
sort_values()
: sort by one or more columns.sort_index()
: to sort by row index.
Both of these functions have the following options
- sorting the data frame in specific order (ascending or descending)
- sorting in place, sorting with missing values
- what to do with missing values.
We will be working with the dataset ‘world_loans’ to give some examples.
world_loans.sort_values(['Country'])
– sorts the data frame alphabetically by Country ascending, which is the default.world_loans.sort_values(['Country'], ascending=False)
– sorts the data frame alphabetically by Country, descending.
In our dataset we have multiple columns. Check the columns in the dataset: world_loans.columns
. Suppose we want to sort by the columns ‘Country’ (descending) and ‘Original Amount’ (ascending).
world_loans.sort_values(['Country', 'Original Amount'], ascending=[False, True])
- – the dataset is now first ordered by country and within each country the original loan amount is ordered.
If we want to make a change to the dataset permanently we use ‘inplace = True’. Like so:
world_loans.sort_values(['Country'], ascending=False, inplace = True)
– this changes the dataset.
Missing values
Sometimes your dataset has values that are not available. What do you want to do with that? Pandas give you the option to choose whether you want the missing values, NaN, at the beginning or in the end of the column.
na_position : [{‘first’, ‘last’}, default ‘last’]
: ‘first’ puts ‘NaNs’ at the beginning while ‘last’ puts NaNs at the end.
An example
world_loans.sort_values(['Country', na_position : ‘first’])
– This will sort the dataset by the ‘Country’ column and will show all the rows that do not have country name defined first.
import pandas as pd
world_loans = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
world_loans = world_loans[['Project ID','Country', 'Guarantor','Original Amount','Type']]
world_loans = world_loans.set_index('Project ID')
world_loans['Original Amount'].astype(int)
sort_by_country =world_loans.sort_values(['Country'])
print ('Sort the dataset by Country and display the first 10 rows.')
print (sort_by_country[['Country', 'Guarantor', 'Type']].head(10))
print('')
print('')
sort_by_country_asc= world_loans.sort_values(['Country'], ascending=False)
print ('Sort the dataset in descending order by Country and display the first 10 rows.')
print (sort_by_country_asc.head(10))
print('')
print('')
sort_by_two_column_criteria = world_loans.sort_values(['Country', 'Original Amount'], ascending=[False, True])
print ('Sort the dataset in descending order by Country and ascending order by original amount. Then, display the first 10 rows.')
print(sort_by_two_column_criteria[['Country', 'Guarantor', 'Original Amount']].head(10))
Try Yourself
Try out some code yourself.
import pandas as pd
world_loans = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
world_loans = world_loans[['Project ID','Status', 'Borrower','Repaid to IBRD','Type']]
world_loans = world_loans.set_index('Project ID')
#check out the new dataframe
world_loans.head()
#Sort the world_loans dataset by 'Borrower' AND 'Status'. Give this dataframe a new name called 'status'