Clean Data with OpenRefine

Look at the sample US Foreign Aid dataset shown in Figure 5.9. Can you spot any problems with it?

Can you spot any problems with this sample data?

Figure 5.9: Can you spot any problems with this sample data?

Notice how the Country column various spellings of North and South Korea. Also note how the FundingAmount column is not standardized. Some amounts use commas to separate thousands, while some uses spaces. Some amounts start with a dollar sign, and some do not. Datasets like this can be an absolute nightmare to analyze. Luckily, OpenRefine provides powerful tools to clean up and standardize data.

Note: This data excerpt is from US Overseas Loans and Grants (Greenbook) dataset, which shows US economic and military assistance to various countries. We chose to only include assistance to South Korea and North Korea for the years between 2000 and 2018. We added deliberate misspellings and formatting issues for demonstration purposes, but we did not alter values.

Set up OpenRefine

Let’s use OpenRefine to clean up this messy data. Download OpenRefine for Windows, Mac, or Linux. Just like Tabula, it runs in your browser and no data leaves your local machine, which is great for confidentiality.

If you use Windows, unzip the downloaded file. Double-click the .exe file, and OpenRefine should open in your default browser.

If you work on a Mac, the downloaded file will be a .dmg file to be installed. But you will likely see a security warning that prevents OpenRefine from launching because Apple cannot identify the developer for this open-source project. To resolve the problem, go to System Preferences > Security and Privacy > General tab, and click the Open Anyway button in the lower half of the window, as shown in Figure 5.10. If prompted with another window, click Open.

If your Mac displays a warning about launching Open Refine (on left), go to System Preferences - Security and Privacy - General tab and click Open Anyways (on right).

Figure 5.10: If your Mac displays a warning about launching Open Refine (on left), go to System Preferences - Security and Privacy - General tab and click Open Anyways (on right).

Once launched, you should see OpenRefine in your browser with 127.0.0.1:3333 address (localhost, port 3333), as shown in Figure 5.11.

OpenRefine starting page.

Figure 5.11: OpenRefine starting page.

Load Data and Start a New Project

To start cleaning up messy dataset, we need to load it into a new project. OpenRefine lets you upload a dataset from your local machine, or a remote web address (such as a Google Sheet). OpenRefine also can extract data directly from SQL databases, but this is beyond the scope of this book.

  1. Download the sample messy data on US Foreign Aid in CSV format to your computer.
  2. Under Get data from: This computer, click Browse… and select the CSV file you downloaded above. Click Next.
  3. Before you can start cleaning up data, OpenRefine allows you to make sure data is parsed properly. In our case, parsing means the way the data is split into columns. Make sure OpenRefine assigned values to the right columns, or change setting in Parse data as block at the bottom of the page until it starts looking meaningful, like shown in Figure 5.12.
  4. Hit Create Project in the upper-right corner.
OpenRefine parsing options.

Figure 5.12: OpenRefine parsing options.

Now when you’ve successfully read the data into a new project, let’s start the fun part: converting text into numbers, removing unnecessary characters, and fixing the spellings for North and South Koreas.

Convert Dollar Amounts from Text to Numbers

Once your project is created, you will see the first 10 rows of the dataset. You can change it to 5, 10, 25, or 50 by clicking the appropriate number in the header

Each column header has its own menu (callable by clicking the arrow-down button). Left-aligned numbers in a column are likely represented as text (as is the case with FundingAmount column in our example), and they need to be transformed into numeric format.

  1. To transform text into numbers, open the column menu, and go to Edit cells > Common transforms > To number.
  2. You will see that some numbers became green and right-aligned (success!), but most did not change. That is because dollar sign ($) and commas (,) confuse OpenRefine and prevent values to be converted into numbers.
  3. Let’s remove $ and , from the FundingAmount column. In the column menu, choose Edit cells > Transform. In the Expression window, type value.replace(',', '') and notice how commas disappear in the preview window. When you confirm your formula works, click OK.
  4. Now, repeat the previous step, but instead of a comma, remove the $ character. (Your expression will become value.replace('$', '')).
  5. In steps 3 and 4, we replaced text (string) values with other string values, making OpenRefine think this column is no longer numeric. As a result, all values are once again left-aligned and in black. Perform step 1 again to see that all but three cells turning green (successfully converting to numeric). Now we need to remove spaces and an a character at the end of one number. Fix those manually by hovering over cells, and clicking the edit button (in the new popup window, make sure to change Data type to number, and hit Apply, like in Figure 5.13).
Manually remove spaces and extra characters, and change data type to number.

Figure 5.13: Manually remove spaces and extra characters, and change data type to number.

At this point, all funding amounts should be clean numbers, right-aligned and colored in green. We’re ready to move on to the Country column and fix different spellings of Koreas.

Cluster Similar Spellings

When you combine different data sources, or process survey data where respondents wrote down their answers as opposed to selecting them from a dropdown menu, you might end up with multiple spellings of the same word (town name, education level – you name it!). One of the most powerful features of OpenRefine is the ability to cluster similar responses.

If you use our original sample file, take a look at the Country column and all variations of North and South Korea spellings. From Country column’s dropdown menu, go to Facet > Text facet. This will open up a window in the left-hand side with all spellings (and counts) of column values. 26 choices for a column that should have just two distinct values, North Korea and South Korea!

  1. To begin standardizing spellings, click on the arrow-down button of Country column header, and choose Edit cells > Cluster and edit. You will see a window like the one shown in Figure 5.14.
  2. You will have a choice of two clustering methods, key collision or nearest neighbor. Key collision clustering is a much faster technique that is appropriate for larger datasets, but it is less flexible. Nearest neighbor is a more computationally expensive approach and will be slow on larger datasets, but it allows for greater fine-tuning and precision. Both methods can be powered by different functions, which you can read about on the project’s Wiki page. For the purpose of this exercise, let’s leave the default key collision method with fingerprint function.
  3. OpenRefine will calculate a list of clusters. Values in Cluster column contains grouped spellings that OpenRefine considers the same. If you agree with a grouping, check the Merge? box, and assign the “true” value to the New Cell Value input box (see first cluster in Figure 5.14). In our example, this would be either North Korea or South Korea.
  4. You can go through all groupings, or stop after one or two and click Merge Selected & Re-Cluster button. The clusters you chose to merge will be merged, and grouping will be re-calculated (don’t worry, the window won’t go anywhere). Keep regrouping until you are happy with the result.

Spend some time playing with Keying function parameters, and notice how they produce clusters of different sizes and accuracy.

Cluster similar text values.

Figure 5.14: Cluster similar text values.

Export

Once you are done cleaning up and clustering data, save the clean dataset by clicking Export button in the upper-right corner of OpenRefine window. You can choose your format (we recommend CSV, or comma-separated value). Now you have a clean dataset that is ready to be processed and visualized.

Summary

In this chapter, we looked at cleaning up tables in Google Sheets, liberating tabular data trapped in PDFs using Tabula, and using OpenRefine to clean up very messy datasets. You will often find yourself using several of these tools on the same dataset before it becomes good enough for your analysis. We encourage you to learn more formulas in Google Sheets, and explore extra functionality of OpenRefine in your spare time. The more clean-up tools and techniques you know, the more able and adaptable you become to tackle more complex cases.

You now know how to clean up your data, so let’s proceed to visualizing it. In the following chapter, we will introduce you to a range of free data visualization tools that you can use to build interactive charts and embed them in your website.