Clean Data with Spreadsheets

Let’s take a look at some techniques to clean up data directly in your favorite spreadsheet tool. We will use Google Sheets as an example, but the same principles will apply to most other software packages, such as Excel, Calc, or Numbers.

Find and Replace with a blank

Find and Replace tool is one of the most powerful data clean-up tools in spreadsheets. You can use it to remove thousands separators from numbers (to change 1,234,567 to 1234567) or to remove units of measure that sometimes reside in the same cells as numbers (321 kg -> 321). You can also use it to bulk-change spellings, for example to shorten, expand, or translate country names (Republic of India -> India, US -> United States, Italy -> Italia).

Let’s look at Find and Replace in practice. A common problem with US census data is that geographic names contain unnecessary words. For example, your data can look something like that:

Hartford town
New Haven town
Stamford town

But you want a clean list of towns, either to display in a chart, or to merge with a different dataset:

Hartford
New Haven
Stamford

We can use Find and Replace tool to remove the unwanted “town” part. You can download our sample file, which contains 169 Connecticut towns and their population, for the exercise.

  1. Select the column you want to modify by clicking on the column header. If you don’t, you will be searching and replacing in the entire spreadsheet.
  2. From Edit menu, choose Find and replace item. You will see the window like is shown in Figure 4.2.
  3. In the Find field, type town, without quotation marks and leave a space before the word. If you don’t leave the space, you will accidentally remove town from Newtown, and you will end up with trailing spaces which can cause troubles in the future.
  4. Leave the Replace with field blank.
  5. Search field should be set to the range you selected in step 1, or All sheets if you didn’t select anything.
  6. You have the option to match case. If checked, town and Town and tOwN will be treated differently. For our purpose, you can leave match case unchecked.
  7. Press the Replace all button. Since this sample file contains 169 towns, the window will state that 169 instances of “town” have been replaced.
  8. Inspect the resulting sheet. Make sure town names such as Newtown remained untouched.
Find and Replace window in Google Sheets.

Figure 4.2: Find and Replace window in Google Sheets.

Split data into two or more columns

Sometimes multiple pieces of data appear in a single cell, such as names (John Doe), coordinate pairs (40.12,-72.12), or addresses (300 Summit St, Hartford, CT, 06106). For your analysis, you might want to split them into separate entities, so that your FullName column (with John Doe in it) becomes FirstName (John) and LastName (Doe) columns, coordinates become Latitude and Longitude columns, and your FullAddress column becomes 4 columns, Street, City, State, and Zip (postcode).

Example 1

Let’s begin with a simple example of coordinate pairs. You can use our sample file, which is a collection of latitude-longitude pairs separated by a comma, with each pair on a new line.

  1. Select the data you wish to split, either the full column or just several rows. Note that you can only split data from one column at a time.
  2. Make sure there is no data in the column to the right of the one you’re splitting, because all data there will be written over.
  3. Go to Data and select Split text to columns, as in Figure 4.3.
  4. Google Sheets will try to guess your separator automatically. You will see that your coordinates are now split with the comma, and the Separator is set to Detect automatically in the dropdown. You can manually change it to a comma (,), a semicolon (;), a period (.), a space character, or any other custom character (or even a sequence of characters — more about that in Example 2 of this section).
  5. You can rename columns into Longitude (first number) and Latitude (second number).
Select Data - Split text to columns to automatically separate data.

Figure 4.3: Select Data - Split text to columns to automatically separate data.

Example 2

Now, let’s look at a slightly more complicated example. Imagine your dataset is structured as follows:

| Location                          |
| --------------------------------- |
| 300 Summit St, Hartford CT--06106 |
| 1012 Broad St, Hartford CT--06106 |
| 37 Alden St, Hartford CT--06114   |

Each cell contains a full address, but you want to split it into four cells: street address (300 Summit St), city (Hartford), state (CT), and zipcode (06106). Notice that the separator between the street and the rest of the address is a comma, a separator between the city and state is a space, and there are two dash lines between state and zipcode.

  1. Start splitting left to right. So your first separator will be a comma. Select your column (or one or more cells within that column), and go to Data > Split text to columns.
  2. Google Sheets should automatically split your cell into two parts, 300 Summit St and Hartford CT--06106, using comma as a separator. (If it didn’t, just select Comma from the dropdown menu that appeared).
  3. Now, select only the second column and perform Split text to columns. You will see that the city is now separate from the state and zipcode, and Google Sheets chose space as a separator (if it didn’t, choose Space from the dropdown menu).
  4. Next, select only the third column and perform Split text to columns again. Google Sheets won’t recognize -- as a separator, so you will have to manually select Custom, type -- in the text field, and hit Enter. You should now have four columns.

Tip: Google Sheets will treat zipcodes as numbers and will delete leading zeros (so 06106 will become 6106). To fix that, select the column, and go to Format > Number > Plain text. Now you can manually re-add zeros. If your dataset is large, consider concatenating 0s using the formula introduced in the following section.

Combine separate columns into one

Now, let’s see how to perform the reverse action. Imagine you receive address data in separate columns, formatted like this:

| Street        | City       | State  | Zip   |
| ------------- | ---------- | ------ | ----- |
| 300 Summit St | Hartford   | CT     | 06106 |

The data comes is four columns: street address, city, state, and zipcode. Let’s say your mapping tool requires you to combine all of this terms into one location column, like that:

| Location                          |
| --------------------------------- |
| 300 Summit St, Hartford, CT 06106 |

You can write a simple formula to combine (or concatenate) terms using ampersands (&) as cells values connectors, and quoted spaces (" "), or spaces with commas (", "), or a dash with spaces on both sides (" - "), or anything else as term separators.

For example, imagine that a spreadsheet contains an address that is separated into four columns—Address, City, State, and Zip—as shown in columns A-D in Figure 4.4. In column E, you can add new header named Location and insert a formula in this format, to combine the items using ampersands (&) and separating them with commas (", ") or quoted spaces (" "), like this: =A2 & ", " & B2 & ", " & C2 & " " & D2.

Use ampersands to combine items and separate them with spaces.

Figure 4.4: Use ampersands to combine items and separate them with spaces.

Note: Lisa Charlotte Rost from Datawrapper has written a brilliant blog post talking about data preparation for charting and analysis in Google Sheets, which we recommend for further reading.

You are now able to split data to columns using custom separators, and concatenate values from different cells into one. But what if your table is trapped inside a PDF? In the next section, we will introduce Tabula and show you how to convert tables from PDF documents into tables that you can analyze in Google Sheets, Microsoft Excel, or similar packages.