Clean and extract text using string manipulation functions

team-member-1

Rishi Sapra

Technical Community leader, speaker, trainer and evangelist specialising in Power BI and Azure. Formally recognised by Microsoft as a Most Valuable Professional (MVP), Fast Track Recognised Solution Architect (FTRSA) and Microsoft Certified Trainer (MCT).

Tags:

View/Download Topic Example Workbook

Instructions: You can type values or formulas into cells above but to access Excel functionality, or if it is easier to work directly in Excel, open the file by clicking on the Excel Icon at the bottom.

If the file does not display correctly try right clicking in the area above and selecting ‘Refresh’. If you are still unable to view the file properly, or it contains unsupported features, click here to download a copy.

Overview

This topic is about using some of the text-based functions in Excel to manipulate text so that it can be compared to other text values, looked up within a range or so that relevant portions of text can be extracted from a longer text string. For example, one text function removes spaces either side of text and an application of this is for troubleshooting lookups – finding a value which you know exists in the lookup range but isn’t being picked up because of the additional spaces.

Other text functions have even broader applications such as extracting part of text from a longer text value, or substituting all instances of some text with something else. This can then be used for ‘data cleansing’, for example manipulating data which may have come out of other systems formatted in a slightly different way, in order to get it in the format you need to use it in your Excel workbooks.

The functions we’ll look at in this topic are:

TRIM-Gets rid of spaces at the beginning or end of some text

CLEAN-Removes special characters from a text value

SUBSTITUTE – Replaces all instances of a character or text string with another one that you specify

SEARCH – Looks for a particular value within a text string and returns the position (a number indicating how many characters into the string) where the value is found

LEFT/RIGHT – Takes the first/last specified number of characters from a text value

TRIM – getting rid of spaces at the beginning or end of text

The syntax of the TRIM function is =TRIM(text) – it takes a text value and returns the same value without any spaces which occur either before or after the text.

This is most useful for fixing lookups where you can see the values you’re looking up in the range but the Lookup function still returns #N/A. A possible reason for this is because of spaces sneaking into the text value, particularly those at the end of the text which are easy to miss because you can’t see them!

When using this with lookups though its important to TRIM both the value being looked up and the values in the range you’re looking up in as spaces in either of these will cause the lookup to fail. To trim values in the lookup range of cells then use the TRIM function in another column, and then copy and paste special values from this back into the lookup range.

CLEAN – getting rid of special characters

Sometimes data that comes out of some source systems for example contains special characters such as ¬

In this case we can use the CLEAN function to remove the special characters from the text. The syntax of the CLEAN function is =CLEAN(text). So =CLEAN(abc¬) returns abc. Or if the text containing a special character is stored in say cell C4 then the function would be =CLEAN(C4).

The full list of special characters which are removed by the CLEAN function are the first 32 ASCII characters which can be found using the CHAR() function. See the ‘CLEAN function characters’ sheet of the Topic Example Workbook for a list of these).

SUBSTITUTE – replacing occurrences of some text within text

The syntax of the SUBSTITUTE function is =SUBSTITUTE( text, old_text, new_text, [instance_num])

The ‘text’ is the text/cell reference that you want to look in, the ‘old_text’ is the value you want to replace and the ‘new_text’ is the value you want to replace it with. ‘Instance_num’ is an optional parameter if you don’t want to replace all occurrences of the text you can specify which occurrence to replace.

For example the function =SUBSTITUTE(“Excel 2003″,”2003″,”2007”) returns ‘Excel 2007’ – it looks at the text ‘Excel 2003’, searches for any occurrence of the word ‘2003’ (as a literal text string in speech marks) and replaces it with ‘2007’.
Note that the SUBSTITUTE function is case-sensitive – to make it non case-sensitive ensure that the text you search for is in CAPITAL LETTERS and then reference the Upper case version of the text you’re looking in by using the UPPER function. For example if cell A2 contains the text “Excel 2007” Then the function ‘=SUBSTITUTE(A2,”excel”,”word”)’ won’t work but the function =SUBSTITUTE(UPPER(A2),”EXCEL”, “Word”) will.

SEARCH – find position of text within text

The SEARCH function returns the position of some text (or a character) within a longer text string. This is often an invaluable tool, particularly when used in conjunction with LEFT/RIGHT to extract part of a text value.

The syntax is =SEARCH(find_text, within_text, [start_num]) – the ‘find_text’ is the text/characters that you want to look for (included in double quotes or as a cell reference), the ‘within_text’ is the text/cell reference you want to look in and the optional ‘start_num’ parameter is the starting position (character number) – if omitted it will start looking from the beginning of the within_text value.

For example, to find the position of the letter o within the text “Nicola” contained in cell A6, our function would read =SEARCH(“o”,B6) which returns 4, indicating that the letter o is the 4th character in the text string “Nicola”.

Searching for “r” in the text string “Barry” returns 3 – even though there are two instances of the letter r, the search function will return the first one if the Start_num parameter is omitted. To pick up the second instance we could have the specify the start num as starting from Character 4 (the first character after the first occurrence of r). To make this dynamic, a generic way to always pick up the second occurrence would be to double up the formula. For example if the text “Barry” is contained in cell A5 then the formula would be =SEARCH(“r”,A5,SEARCH(“r”,A5)+1) which would search for “r” starting from the character after the first occurrence (Character 4) – this returns 4 as the second occurrence of “r” is as the 4th letter.

If the value cannot be found then the SEARCH function will return #VALUE – you can capture these by using IFERROR(so for example returning the second instance if it exists, otherwise returning the first instance). Also note that the SEARCH function isn’t case-sensitive but there is a very similar function called FIND which is identical apart from the fact that FIND is case sensitive.

LEFT/RIGHT - taking the first/last specified number of characters from a text

The LEFT function returns the first specified number of characters in a text value and the RIGHT function returns the last specified number of characters. For example =LEFT(“Andrew”,4) returns “Andr” – the first 4 characters of the word – and RIGHT(“Andrew”,4) returns “drew” – the last 4 characters.

This can be combined with the SEARCH function to obtain a part of a text value up until or from the occurrence of a particular character/text value. For example to get the part of the text string “Andrew” up until and including the occurrence of the letter “e” we could write =LEFT(“Andrew”, SEARCH(“e”, “Andrew”)) which gives us “Andre” – i.e. the first X letters of the word where X represents the position of the occurrence of the letter “e”.

To reverse this and find the end part of the text value starting from the occurrence of the letter “e”, we need to find out how many characters occur after the letter “e”. SEARCH will give us the number of characters before it, so to find the reverse we need to know how many characters are in the whole text value and then subtract the position of the first occurrence of “e” from this. Thankfully there is a function in Excel to return the length of a text value- it’s called the LEN function. = LEN(“Andrew”) returns the number of characters in the text string “Andrew”, in this case 7. The function =LEN(A2)-SEARCH(“e”, “Andrew”) then gives us the number of characters after the occurrence of “e”, in this case 1. (the letter “w”). Wrapping the RIGHT function around this gives us the last character of the text string – =RIGHT(“Andrew”, LEN(“Andrew”)-SEARCH(“e”, “Andrew”)).

Try it Out

In the Topic Example Workbook, fill in the required MATCH formulas into the yellow cells:

Using TRIM

In cells E3/E4 use the MATCH function to lookup the values in cells D3/D4 in Column A. Because the Values in Column A contain spaces, the formulas should return a #N/A error
Populate the range B2:B6 with the trimmed value of the corresponding cells in Column A using the TRIM function
Now In cells F3/F4 use the MATCH function to lookup the corresponding values in cells D3/D4 in Column B. The formula in cell F3 should work (it should return 2) but because the value in cell D4 has an invisible space at the end, the formula in cell F4 should return #N/A

In cells G3/G4 use the same MATCH function to look up the trimmed version of the values in Cells D3/D4 in Column B. Now both formulas should return a value.
A demonstration is shown in the video clip below.

Using CLEAN

In Cell E8 use the MATCH function to lookup the value in cell D8 within Column B. Because this value contains a special character the lookup formula should return #N/A.

In Cell F8 look up the value in cell D8 with any special characters removed using the CLEAN function, looking within Column B. This should now return a value

A demonstration is shown in the video clip below.

Using SUBSTITUTE

In Cell E12 use the MATCH function to lookup the value in cell D12 within Column B. Because this value contains slashes either side the lookup formula should return #N/A.

In Cell F12 substitute any occurrences of a slash (“/”) in cell D12 with an empty string (“”) and then look up this value within B. This should now return a value

Using LEFT/RIGHT and SEARCH

In Cell E16 use the MATCH function to lookup the value in cell D16 within Column B. Because there is no occurrence of the full name (“Andrew Jones”) in Column B this formula should return #N/A.

In Cell F16 use the SEARCH function to find the first occurrence of a space (“ “) within the value in cell E16. Then use the LEFT function to return the part of text in cell E16 up to the position of the first space (which will return the first name only). Then use the MATCH function to lookup the first name only within Column B. This should now return a value

A demonstration is shown in the video clip below. Resist the temptation to view the answers file below until you have tried this!

Answers

Once you have completed the ‘Try it Out’ exercise, the completed Excel file should look like this:

If the file does not display correctly try right clicking in the area above and selecting ‘Refresh’. If you are still unable to view the file properly, or it contains unsupported features, click here to download a copy.

Links and other resources

YouTube Video – Excel 2007 Tutorial 13: Advanced Functions Part 1

YouTube Video – Change Text with Excel SUBSTITUTE Function