Clean Data with Spreadsheets
Let’s take a look at some techniques to clean up data directly in your favorite spreadsheet tool. We will use Google Sheets as an example, but the same principles will apply to most other software packages, such as Excel, Calc, or Numbers.
Find and Replace with a blank
Find and Replace tool is one of the most powerful data clean-up tools in spreadsheets.
You can use it to remove thousands separators from numbers (to change
or to remove units of measure that sometimes reside in the same cells as numbers
321 kg ->
321). You can also use it to bulk-change spellings, for example
to shorten, expand, or translate country names (
Republic of India ->
Let’s look at Find and Replace in practice. A common problem with US census data is that geographic names contain unnecessary words. For example, your data can look something like that:
Hartford town New Haven town Stamford town
But you want a clean list of towns, either to display in a chart, or to merge with a different dataset:
Hartford New Haven Stamford
We can use Find and Replace tool to remove the unwanted “town” part. You can download our sample file, which contains 169 Connecticut towns and their population, for the exercise.
- Select the column you want to modify by clicking on the column header. If you don’t, you will be searching and replacing in the entire spreadsheet.
- From Edit menu, choose Find and replace item. You will see the window like is shown in Figure 4.2.
- In the Find field, type
town, without quotation marks and leave a space before the word. If you don’t leave the space, you will accidentally remove town from Newtown, and you will end up with trailing spaces which can cause troubles in the future.
- Leave the Replace with field blank.
- Search field should be set to the range you selected in step 1, or All sheets if you didn’t select anything.
- You have the option to match case. If checked,
tOwNwill be treated differently. For our purpose, you can leave match case unchecked.
- Press the Replace all button. Since this sample file contains 169 towns, the window will state that 169 instances of “town” have been replaced.
- Inspect the resulting sheet. Make sure town names such as Newtown remained untouched.
Split data into two or more columns
Sometimes multiple pieces of data appear in a single cell,
such as names (
John Doe), coordinate pairs (
or addresses (
300 Summit St, Hartford, CT, 06106). For your analysis,
you might want to split them into separate entities, so that your FullName
John Doe in it) becomes FirstName (
John) and LastName (
coordinates become Latitude and Longitude columns, and your FullAddress column becomes
4 columns, Street, City, State, and Zip (postcode).
Let’s begin with a simple example of coordinate pairs. You can use our sample file, which is a collection of latitude-longitude pairs separated by a comma, with each pair on a new line.
- Select the data you wish to split, either the full column or just several rows. Note that you can only split data from one column at a time.
- Make sure there is no data in the column to the right of the one you’re splitting, because all data there will be written over.
- Go to Data and select Split text to columns, as in Figure 4.3.
- Google Sheets will try to guess your separator automatically. You will see that your coordinates
are now split with the comma, and the Separator is set to Detect automatically in the dropdown.
You can manually change it to a comma (
,), a semicolon (
;), a period (
.), a space character, or any other custom character (or even a sequence of characters — more about that in Example 2 of this section).
- You can rename columns into Longitude (first number) and Latitude (second number).
Now, let’s look at a slightly more complicated example. Imagine your dataset is structured as follows:
| Location | | --------------------------------- | | 300 Summit St, Hartford CT--06106 | | 1012 Broad St, Hartford CT--06106 | | 37 Alden St, Hartford CT--06114 |
Each cell contains a full address, but you want to split it into four cells: street address (300 Summit St), city (Hartford), state (CT), and zipcode (06106). Notice that the separator between the street and the rest of the address is a comma, a separator between the city and state is a space, and there are two dash lines between state and zipcode.
- Start splitting left to right. So your first separator will be a comma. Select your column (or one or more cells within that column), and go to Data > Split text to columns.
- Google Sheets should automatically split your cell into two parts,
300 Summit Stand
Hartford CT--06106, using comma as a separator. (If it didn’t, just select Comma from the dropdown menu that appeared).
- Now, select only the second column and perform Split text to columns. You will see that the city is now separate from the state and zipcode, and Google Sheets chose space as a separator (if it didn’t, choose Space from the dropdown menu).
- Next, select only the third column and perform Split text to columns again. Google Sheets won’t recognize
--as a separator, so you will have to manually select Custom, type
--in the text field, and hit Enter. You should now have four columns.
Tip: Google Sheets will treat zipcodes as numbers and will delete leading zeros (so 06106 will become 6106). To fix that, select the column, and go to Format > Number > Plain text. Now you can manually re-add zeros. If your dataset is large, consider concatenating 0s using the formula introduced in the following section.
Combine separate columns into one
Now, let’s see how to perform the reverse action. Imagine you receive address data in separate columns, formatted like this:
| Street | City | State | Zip | | ------------- | ---------- | ------ | ----- | | 300 Summit St | Hartford | CT | 06106 |
The data comes is four columns: street address, city, state, and zipcode. Let’s say your mapping tool requires you to combine all of this terms into one location column, like that:
| Location | | --------------------------------- | | 300 Summit St, Hartford, CT 06106 |
You can write a simple formula to combine (or concatenate) terms
using ampersands (
&) as cells values connectors,
and quoted spaces (
" "), or spaces with commas (
or a dash with spaces on both sides (
" - "), or anything
else as term separators.
For example, imagine that a spreadsheet contains an address that is separated into
four columns—Address, City, State, and Zip—as shown in columns A-D in
Figure 4.4. In column E, you can add new header named
Location and insert a formula in this format, to combine the items using
&) and separating them with commas (
", ") or quoted spaces (
" "), like
=A2 & ", " & B2 & ", " & C2 & " " & D2.
Note: Lisa Charlotte Rost from Datawrapper has written a brilliant blog post talking about data preparation for charting and analysis in Google Sheets, which we recommend for further reading.
You are now able to split data to columns using custom separators, and concatenate values from different cells into one. But what if your table is trapped inside a PDF? In the next section, we will introduce Tabula and show you how to convert tables from PDF documents into tables that you can analyze in Google Sheets, Microsoft Excel, or similar packages.