Recognize Bad Data
When your data search produces some results, another key step is to open the file, quickly scroll through the content, and look for any warning signs that it might contain “bad data.” If you fail to catch a problem in your data at an early stage, it could lead to false conclusions and diminish the credibility of all of your work. Fortunately, members of the data visualization community have shared multiple examples of problems we’ve previously encountered, to help save newer members from making the same embarrassing mistakes. One popular crowd-sourced compilation by data journalists was The Quartz Guide to Bad Data, last updated in 2018. Watch out for spreadsheets containing these “bad data” warning signs:
- Missing values: If you see blank or “null” entries, does that mean data was not collected? Or maybe a respondent did not answer? If you’re unsure, find out from the data creator. Also beware when humans enter a
-1to represent a missing value, without thinking about its consequences on running spreadsheet calculations, such as SUM or AVERAGE.
- Missing leading zeros: One of the zip codes for Hartford, Connecticut is
06119. If someone converts a column of zip codes to numerical data, it will strip out the leading zero and appear as
6119. Similarly, the US Census Bureau lists every place using a FIPS code, and some of these also begin with a meaningful zero character. For example, the FIPS code for Los Angeles County, California is
037, but if someone accidentally converts a column of text to numbers, it will strip out the leading zero and convert that FIPS code to
37, which represents the state of North Carolina.
- 65536 rows or 255 columns: These are the maximum number of rows supported by older-style Excel spreadsheets, or columns supported by Apple Numbers spreadsheet, respectively. If your spreadsheet stops exactly at either of these limits, you probably have only partial data.
- Inconsistent date formats: For example, November 3rd, 2020 is commonly entered in spreadsheets in the US as
11/3/2020(month-date-year), while people in other locations around the globe commonly type it as
3/11/2020(date-month-year). Check your source.
- Dates such as January 1st 1900, 1904, or 1970: These are default timestamps in Excel spreadsheets and Unix operating systems, which may indicate the actual date was blank or overwritten.
- Dates similar to
43891: When you type
March 1during the year 2020 into Microsoft Excel, it automatically displays as
1-Mar, but is saved using Excel’s internal date system as
43891. If someone converts this column from date to text format, you’ll see Excel’s 5-digit number, not the dates you’re expecting.
Another way to review the quality of data entry in any column in a spreadsheet is to create a filter or a pivot table as described in chapter 2. This allows you to quickly inspect the range of values that appear in that column, and whether they match what you expected to find.
What should you do when you discover bad data in your project? Sometimes small issues are relatively straightforward, do not call into question the integrity of the entire dataset, and you can fix them using methods to clean up messy data that we describe in chapter 4. But larger issues are more problematic. Follow the source of your data stream to try to identify where the issue began. If you cannot find and fix the issue on your own, contact the data provider to ask for their input, since they should have a strong interest in improving the quality of the data. If they cannot resolve an important data issue, then you need to pause and think carefully. In this case, is it wiser to continue working with problematic data and add a cautionary note to readers, or should you stop using the dataset entirely and call attention to its underlying problem? These are not easy decisions, and you should ask for opinions from colleagues. In any case, never ignore the warning signs of bad data.
In the next section, you’ll learn more questions to help you know your data at a deeper level.