Source Your Data

Another way to reduce “bad data” issues is to clarify the source every time you download or create a new spreadsheet file. Add details about where the data came from, so that someone other than you, several years in the future, has sufficient information to understand its origin and limitations.

The first step is to label every data file that you download or create. All of us have experienced bad file names like these:

  • data.csv
  • file.xls
  • download.xlsx

Write a short but meaningful file name. While there’s no perfect system, a good strategy is to combine an abbreviation of the source (such as census, worldbank, or eurostat) with topic keywords and a date or date range. If you or your co-workers will be working on different versions of a downloaded file, include the current date in YYYY-MM-DD (year-month-day) format. If you plan to upload files to the web, type names in all lowercase and replace blank spaces with dashes (-) or underscores (_). Better file names look like this:

  • town-demographics-2019-12-02.csv
  • census2010_population_by_county.xls
  • eurostat-1999-2019-co2-emissions.xlsx
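
If you name many files, a small script can apply the naming strategy consistently. The following is a minimal Python sketch, not a required tool: the function name build_file_name and its parameters are our own illustration, and it assumes you want dashes rather than underscores as separators.

```python
import re
from datetime import date


def build_file_name(source, topic, when, extension="csv"):
    """Join source, topic keywords, and a date into a web-friendly file name.

    Hypothetical helper for illustration; the name and arguments are assumptions.
    """
    # date.isoformat() returns the date in YYYY-MM-DD format.
    parts = [source, topic, when.isoformat()]
    name = "-".join(parts).lower()
    # Replace blank spaces and other unsafe characters with a single dash.
    name = re.sub(r"[^a-z0-9_-]+", "-", name).strip("-")
    return f"{name}.{extension}"


print(build_file_name("Census", "town demographics", date(2019, 12, 2)))
# prints census-town-demographics-2019-12-02.csv
```

Because the date is written year first, files named this way sort chronologically in an ordinary alphabetical file listing.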

The second step is to save more detailed source notes about the data on a separate tab inside the spreadsheet (which works for multi-tab spreadsheet tools such as Google Sheets, LibreOffice, and Excel). Add a new tab named notes that describes the origins of the data, spells out any abbreviated labels, and records when the data was last updated, as shown in Figure 3.1. Add your own name and give credit to collaborators who worked with you. For CSV files, which do not support multiple tabs, create a separate text file with a parallel file name.
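
For CSV files, the parallel notes file can also be created with a short script. Here is one possible Python sketch. The function name write_source_notes, the "-notes.txt" suffix, and the field labels are our own assumptions, so adapt them to whatever convention your team prefers.

```python
from datetime import date
from pathlib import Path


def write_source_notes(data_file, origin, description, contributors):
    """Write a plain-text notes file next to a CSV, using a parallel name.

    Hypothetical helper for illustration; the suffix and labels are assumptions.
    """
    data_path = Path(data_file)
    # Example: town-demographics-2019-12-02.csv
    #       -> town-demographics-2019-12-02-notes.txt
    notes_path = data_path.with_name(data_path.stem + "-notes.txt")
    lines = [
        f"Data file: {data_path.name}",
        f"Source: {origin}",
        f"Description: {description}",
        f"Last updated: {date.today().isoformat()}",
        f"Contributors: {', '.join(contributors)}",
    ]
    notes_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return notes_path
```

Call the function with the path to your CSV file, a sentence about where the data came from, a longer description of any abbreviated labels, and a list of names to credit.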

Figure 3.1: Create separate spreadsheet tabs for data, notes, and backup.

The third step is to make a backup of the original data before cleaning or editing it. For a simple one-sheet file in a multi-tab spreadsheet tool, right-click on the tab containing the data to make a duplicate copy in another tab, also shown in Figure 3.1. Clearly label the new tab as a backup and leave it alone! For CSV files or more complex spreadsheets, create a separate backup file.
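
If you prefer to script the separate backup file, the sketch below shows one way to do it in Python. The function name back_up_original and the "-backup" label are our own assumptions. It refuses to overwrite an existing backup, in keeping with the advice to leave the backup alone.

```python
import shutil
from pathlib import Path


def back_up_original(data_file):
    """Copy a data file to a clearly labeled backup, never overwriting one.

    Hypothetical helper for illustration; the "-backup" label is an assumption.
    """
    original = Path(data_file)
    # Example: data.csv -> data-backup.csv
    backup = original.with_name(f"{original.stem}-backup{original.suffix}")
    if backup.exists():
        # Leave an existing backup alone so the original data is never lost.
        raise FileExistsError(f"Backup already exists: {backup}")
    # copy2 copies the contents and tries to preserve file timestamps.
    shutil.copy2(original, backup)
    return backup
```

Run it once, right after downloading the file and before any cleaning, so that the backup always reflects the data exactly as you received it.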

Make a habit of using these three sourcing strategies—filenames, notes, and backups—to reduce your chances of making “bad data” errors and to increase the credibility of your data visualizations. In the next section, we’ll address a related set of questions you should ask yourself regarding public versus private data.