Pandas – Replacing Values
Introduction
We have already seen that detecting missing values and filling them are important steps in the data cleaning process. Just as important is correcting certain data points by replacing them with correct values.
With Pandas a user can use different techniques to replaces certain values. We will highlight a few of them in this section.
Dataset
We wil be using a subset of the loans dataset of the IBRD. The Pandas dataframe is called ‘df’. Check out the dataset df.head()
. In this dataset you will see that there are some values ‘test’.
Replacing Values
- Use
.replace()
to replace the same value in the whole dataframe
Suppose in our data set we have some values that do not make sense. For example, in our dataset “df”, in columns “Borrower”
and “Guarantor” we see the value “test”. In the shelll, type df.head()
and press enter to confirm this. It could be that these values were input for different reasons; typos or intentionally.
We replace these ‘test’ values as follows:
df.replace('test','missing value',inplace=True)
Press the “run” button to run the code to check if the changes have taken place.
- Replace values in a column using a dictionary.
Now we will replace the words within a specific column rather than the whole dataset. In the column ‘Borrower’ we will replace ‘missing value’ with ‘Borrower missing’.
df.replace({'Borrower':{'missing value':'Borrower missing'}}, inplace=True)
remove the ‘#’ sign on line 4 and line 5 thenpress the ‘run’ button. You might want to delete all the line above first or place ‘#’ in the beginning of line 1, 2 and 3.
df.head()
The shell now shows the new dataframe where the ‘missing values’ are replaced with ‘Borrower missing’.
- Replace multiple values using a dictionary
So far we only replaced one value with another. Pandas gives us the possibility to replace multiple values. In our example, we will replace in the ‘Borrower’ column ‘CREDIT NATIONAL’ with ‘CREDIT NAT.’ and ‘MINISTERE DES FINANCES’ with ‘MIN. DES FINANCES’:
df.replace({'Borrower':{'CREDIT NATIONAL':'CREDIT NAT.', 'MINISTERE DES FINANCES':'MIN. DES FINANCES'}}, inplace=True)
Replace the ‘#’ sign at the beginning of line 6 and line 7 then press the ‘run’ button. Can you see the changes.
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/IBRD06.csv")
# Data
# We will use a subset of the IBRD dataset
df = df[['Country','Borrower','Guarantor']]
df.replace(np.NaN,'test',inplace=True)
df.replace('test','missing value',inplace=True)
df.head()
#df.replace({'Borrower':{'missing value':'Borrower missing'}}, inplace=True)
#df.head()
#df.replace({'Borrower':{'CREDIT NATIONAL':'CREDIT NAT.', 'MINISTERE DES FINANCES':'MIN. DES FINANCES'}}, inplace=True)
#df.head()
Try it out
Try the above code yourself on a different dataset. The dataset includes a user set of mobile phones, their platform, and the customer segment. See if you can do the exercises below. Before doing this, take a glance of the dataset set by typing in the shell mobile.head()
.
- Replace all value that are ‘6’ with ‘6.2’.
- Replace all values ‘4.1’ in column ‘platform_version’ with values ‘deprecated’.
- Using a dictionary replace values in column ‘use_type_id’ as follows:
- ‘1’ replace by ‘good payer’
- ‘2’ replace by ‘need reminders’
- ‘3’ replace by ‘in default’
- ‘6.2’ replace by ‘previous customer’
mobile = pd.read_csv(“https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/user_device.csv”)
import pandas as pd
mobile = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/user_device.csv")
import pandas as pd
mobile = pd.read_csv("https://raw.githubusercontent.com/naveen1973/data-analysis-and-visualization-using-python/master/user_device.csv")