Clean Data with OpenRefine

Consider a dataset that looks like the one in Figure 4.7. Can you spot any problems with it?

First 20 rows of the sample dataset. Can you spot any problems with it?

Figure 4.7: First 20 rows of the sample dataset. Can you spot any problems with it?

Notice how the funding amounts (last column) are not standardized. Some amounts have commas as thousands separators, some have spaces, and some start with a dollar character. Notice also how the Country column includes various spellings of North and South Korea. Datasets like this can be an absolute nightmare to analyze. Luckily, OpenRefine provides powerful tools to clean up and standardize such data.

Note: This data exerpt is from US Overseas Loans and Grants (Greenbook) dataset, which shows US economic and military assistance to various countries. We chose to only include assistance to South Korea and North Korea for the years between 2000 and 2018. We added deliberate misspellings and formatting issues for demonstration purposes (although we did not alter values).

Download this sample dataset or use your own file with messy data. Inspect the file in any spreadsheet software. You can see that the dataset has four columns: year (between 2000 and 2018, inclusive), country (North or South Korea), a US funding agency, and funding amount (in 2018 US dollars). Let’s now use OpenRefine to clean it up.

Set up OpenRefine

You can download a copy of OpenRefine for Linux, Mac, or Windows from the official download page. Just like Tabula, it runs in your browser and no data leaves your local machine, which is great for confidentiality.

If you work on a Mac, the downloaded file will be a .dmg file. You will likely encounter a security message that will prevent OpenRefine from launching. Go to System Preferences -> Security and Privacy, and hit Open Anyway button in the lower half of the window. If prompted with another window, click Open.

If you use Windows, unzip the downloaded file. Double-click the .exe file, and OpenRefine should open in your default browser.

Once launched, you should see OpenRefine in your browser with 127.0.0.1:3333 address (localhost, port 3333), like shown in Figure 4.8.

OpenRefine starting page.

Figure 4.8: OpenRefine starting page.

Load Data and Start a New Project

To begin cleaning up your messy dataset, you should load it into a new project. OpenRefine lets you upload a dataset from your local machine, or a remote URL on the web (including a Google Spreadsheet), or copy/paste data into a text field. OpenRefine is able to extract data directly from SQL databases, but this is beyond the scope of this book. We assume that you downloaded the sample dataset we provided (or you are using your own file), so let’s load it from your computer.

  1. Under Get data from: This computer, click Browse… and select the file. Click Next.
  2. Before you can start cleaning up data, OpenRefine allows you to make sure data is parsed properly. In our case, parsing means the way the data is split into columns. Make sure OpenRefine assigned values to the right columns, or change setting in Parse data as block at the bottom of the page until it starts looking meaningful, like shown in Figure 4.9.
  3. Hit Create Project in the upper-right corner.
OpenRefine parsing options.

Figure 4.9: OpenRefine parsing options.

Now when you’ve successfully read the data into a new project, let’s start the fun part: converting text into numbers, removing unnecessary characters, and fixing the spellings for North and South Koreas.

Convert Dollar Amounts from Text to Numbers

Once your project is created, you will see the first 10 rows of the dataset. You can change it to 5, 10, 25, or 50 by clicking the appropriate number in the header

Each column header has its own menu (callable by clicking the arrow-down button). Left-aligned numbers in a column are likely represented as text (as is the case with FundingAmount column in our example), and they need to be transformed into numeric format.

  1. To transform text into numbers, open the column menu, and go to Edit cells > Common transforms > To number.
  2. You will see that some numbers became green and right-aligned (success!), but most did not change. That is because dollar sign ($) and commas (,) confuse OpenRefine and prevent values to be converted into numbers.
  3. Let’s remove $ and , from the FundingAmount column. In the column menu, choose Edit cells > Transform. In the Expression window, type value.replace(',', '') and notice how commas disappear in the preview window. When you confirm your formula works, click OK.
  4. Now, repeat the previous step, but instead of a comma, remove the $ character. (Your expression will become value.replace('$', '')).
  5. In steps 3 and 4, we replaced text (string) values with other string values, making OpenRefine think this column is no longer numeric. As a result, all values are once again left-aligned and in black. Perform step 1 again to see that all but three cells turning green (successfully converting to numeric). Now we need to remove spaces and an a character at the end of one number. Fix those manually by hovering over cells, and clicking the edit button (in the new popup window, make sure to change Data type to number, and hit Apply, like in Figure 4.10).
Manually remove spaces and extra characters, and change data type to number.

Figure 4.10: Manually remove spaces and extra characters, and change data type to number.

At this point, all funding amounts should be clean numbers, right-aligned and colored in green. We’re ready to move on to the Country column and fix different spellings of Koreas.

Cluster Similar Spellings

When you combine different data sources, or process survey data where respondents wrote down their answers as opposed to selecting them from a dropdown menu, you might end up with multiple spellings of the same word (town name, education level – you name it!). One of the most powerful features of OpenRefine is the ability to cluster similar responses.

If you use our original sample file, take a look at the Country column and all variations of North and South Korea spellings. From Country column’s dropdown menu, go to Facet > Text facet. This will open up a window in the left-hand side with all spellings (and counts) of column values. 26 choices for a column that should have just two distinct values, North Korea and South Korea!

  1. To begin standardizing spellings, click on the arrow-down button of Country column header, and choose Edit cells > Cluster and edit. You will see a window like the one shown in Figure 4.11.
  2. You will have a choice of two clustering methods, key collision or nearest neighbor. Both methods can be powered by different functions, but let’s leave the default key collision with fingerprint function.
  3. OpenRefine will calculate a list of clusters. Values in Cluster column contains grouped spellings that OpenRefine considers the same. If you agree with a grouping, check the Merge? box, and assign the “true” value to the New Cell Value input box (see first cluster in Figure 4.11). In our example, this would be either North Korea or South Korea.
  4. You can go through all groupings, or stop after one or two and click Merge Selected & Re-Cluster button. The clusters you chose to merge will be merged, and grouping will be re-calculated (don’t worry, the window won’t go anywhere). Keep regrouping until you are happy with the result.

Spend some time playing with Keying function parameters, and notice how they produce clusters of different sizes and accuracy.

Cluster similar text values.

Figure 4.11: Cluster similar text values.

Export

Once you are done cleaning up and clustering data, save the clean dataset by clicking Export button in the upper-right corner of OpenRefine window. You can choose your format (we recommend CSV, or comma-separated value). Now you have a clean dataset that is ready to be processed and visualized.

Summary

In this chapter, we looked at cleaning up tables in Google Sheets, liberating tabular data trapped in PDFs using Tabula, and using OpenRefine to clean up very messy datasets. You will often find yourself using several of these tools on the same dataset before it becomes good enough for your analysis. We encourage you to learn more formulas in Google Sheets, and explore extra functionality of OpenRefine in your spare time. The more clean-up tools and techniques you know, the more able and adaptable you become to tackle more complex cases.

You now know how to clean up your data, so let’s proceed to visualizing it. In the following chapter, we will introduce you to a range of free data visualization tools that you can use to build interactive charts and embed them in your website.