Chapter 4 Clean Up Messy Data

More often than not, datasets will be messy and hard to visualize right away. They will have missing values, various spelling of the same categories, dates in different formats, text in numeric-only columns, multiple things in the same columns, and other unexpected things (see Figure 4.1 for inspiration). Don’t be surprised if you find yourself spending longer cleaning up data than actually analyzing and visualizing it—it is often the case for data analysts.

More often than not, raw data looks like this.

Figure 4.1: More often than not, raw data looks like this.

It is important to learn several tools in order to know which one to use to clean your data efficiently. We will start by looking at fairly basic data cleanup using Google Sheets. Keep in mind that the same principles (and in most cases even the same formulas) can be use in Microsoft Excel, LibreOffice Calc, Mac’s Numbers, or other spreadsheet packages.

We will then show you how to extract table data from PDF documents using a free tool called Tabula. Tabula is used by data journalists and researchers worldwide to analyze government spendings, procurement records and all sorts of other datasets that get trapped in PDFs.

At the end, we will introduce OpenRefine, an extremely powerful and versatile tool to clean up the messiest spreadsheets, such as those containing dozens of misspelled versions of universities or town names.