Normalize Your Data
When we work with data expressed in counts, such as 3,133 motor vehicle crash deaths in Florida in 2018, it usually makes no sense to compare these numbers until we normalize them. This means to adjust data that has been collected using different scales into a common reference scale, or in other words to convert raw data into rates to make more meaningful comparisons. Even if you’ve never heard the term, perhaps you’re already normalizing data without realizing it.
Here’s an example about motor vehicle safety that was inspired by visualization expert Alberto Cairo, with updated 2018 data from the Insurance Institute for Highway Safety (IIHS) and the US Department of Transportation.20 Over 36,000 people died in motor vehicle crashes in 2018, including car and truck drivers and occupants, motorcyclists, pedestrians, and bicyclists. Although only a small fraction of this data appears in the tables below, you can view all of the data in Google Sheets format, and save an editable copy to your Google Drive, to follow along in this exercise.
Let’s start with what appears to be a simple question, and see where our search for more meaningful comparisons takes us.
- Which US states had the lowest number of motor vehicle crash deaths? When we sort the data by the numbers of deaths, the District of Columbia appears to be the safest state with only 31 deaths, as shown in Table 6.1, even though Washington DC is not legally recognized as a state.
|State||Deaths|
|District of Columbia||31|
But wait, this isn’t a fair comparison. Take another look at the five states above and you may notice that all of them have much smaller populations than large states such as California and Texas, which appear at the very bottom of the full dataset. To paint a more accurate picture, let’s rephrase the question to adjust for population differences.
- Which US states had the lowest number of motor vehicle crash deaths when adjusted for population? Now let’s normalize the death data by taking into account the total population of each state. In our spreadsheet, we calculate it as Deaths / Population * 100,000. While it’s also accurate to divide deaths by population to find a per capita rate, those very small decimals would be difficult for most people to compare, so we multiply by 100,000 to present the results more clearly. When we sort the data, Washington DC appears to be the safest once again, with only 4.4 motor vehicle crash deaths per 100,000 residents, as shown in Table 6.2.
|State||Deaths||Population||Deaths per 100,000 population|
|District of Columbia||31||702,455||4.4|
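The spreadsheet formula above can also be sketched in a few lines of Python. This is only an illustration, not part of the original exercise: the function name `rate_per_100k` is our own invention, and the only figures used are the District of Columbia values from Table 6.2.

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Return deaths per 100,000 residents (Deaths / Population * 100,000)."""
    if population <= 0:
        raise ValueError("population must be a positive number")
    return deaths / population * 100_000


# District of Columbia, 2018, using the figures shown in Table 6.2
dc_deaths = 31
dc_population = 702_455

# The raw per capita rate is a very small decimal that is hard to compare
print(dc_deaths / dc_population)

# Multiplying by 100,000 gives a figure that is easier to read
print(round(rate_per_100k(dc_deaths, dc_population), 1))  # 4.4
```

Applying the same function to every row and sorting by the result would reproduce the ranking in the table.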
But wait, this still isn’t a fair comparison. Look at the five states on the list and you’ll notice that all of them are located along the Northeastern US corridor, which has a high concentration of public transit, such as trains and subways. If people in urban areas like New York and Boston are less likely to drive motor vehicles, or take shorter trips than people in rural states where homes are spread farther apart, that might affect our data. Let’s strive for a better comparison and rephrase the question again, this time to adjust for differences in mileage, not population.
- Which US states had the lowest number of motor vehicle crash deaths when adjusted for vehicle mileage? Once again, we normalize the death data by adjusting it to account for a different factor: vehicle miles traveled (VMT), the estimated total number of miles (in millions) traveled by cars, vans, trucks, and motorcycles, on all roads and highways in the state, in 2018. In our spreadsheet, we calculate it as Deaths / Vehicle Miles * 100. Because vehicle miles are recorded in millions, the multiplier of 100 expresses the result as deaths per 100 million miles, which presents it more clearly. This time Massachusetts appears to be the safest state, with only 0.54 motor vehicle crash deaths per 100 million miles traveled, as shown in Table 6.3. Also, note that the District of Columbia has fallen further down the list and been replaced by Minnesota.
|State||Deaths||Vehicle miles traveled (millions)||Deaths per 100 million vehicle miles traveled|
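The mileage calculation follows the same pattern, and a minimal Python sketch may help make the units explicit. The function name `deaths_per_100m_miles` is our own, and the sample state below is entirely hypothetical: its numbers were chosen only to make the arithmetic easy to check and do not come from the IIHS dataset.

```python
def deaths_per_100m_miles(deaths: int, vmt_millions: float) -> float:
    """Return deaths per 100 million vehicle miles traveled.

    vmt_millions is vehicle miles traveled expressed in millions of miles,
    so Deaths / VMT gives deaths per 1 million miles, and multiplying
    by 100 rescales that to deaths per 100 million miles.
    """
    if vmt_millions <= 0:
        raise ValueError("vehicle miles traveled must be positive")
    return deaths / vmt_millions * 100


# HYPOTHETICAL state, not real data: 500 deaths and 50,000 million miles
example_rate = deaths_per_100m_miles(deaths=500, vmt_millions=50_000)
print(round(example_rate, 2))  # 1.0
```

Keeping the unit in the parameter name (`vmt_millions`) is a small safeguard against mixing up raw miles and millions of miles, which would throw off the result by a factor of one million.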
Have we finally found the safest state as judged by motor vehicle crash deaths? Not necessarily. While we normalized the raw data relative to the population and amount of driving, the IIHS reminds us that several other factors may influence these numbers, such as vehicle types, average speed, traffic laws, weather, and so forth. But as Alberto Cairo points out, every time we refine our calculations to make a more meaningful comparison, our interpretation becomes a closer representation of the truth. “It’s unrealistic to pretend that we can create a perfect model,” Cairo writes. “But we can certainly come up with a good enough one.”21
As we demonstrated above, the most common way to normalize data is to adjust raw counts into relative rates, such as percentages or per capita. But there are many other ways to normalize data, so make sure you’re familiar with different methods when you find and question your data, as described in chapter 4. When working with historical data (also called time-series or longitudinal data), you may need to adjust for change over time. For example, it’s not fair to directly compare median household income in 1970 versus 2020, because $10,000 had far more purchasing power a half-century ago than it does today, due to inflation and related factors. For this reason, economists distinguish between nominal data (unadjusted) and real data (adjusted over time), typically by converting figures into “constant dollars” for a particular year that allow better comparisons by accounting for purchasing power.22 Also, economic data is often seasonally adjusted to improve comparisons for data that regularly varies across the year, such as employment or revenue during the summer tourism season versus the winter holiday shopping season. Another normalization method is to create an index to measure how values have risen or fallen in relation to a given reference point over time. Furthermore, statisticians often normalize data collected using different scales by calculating its standard score, also known as its z-score, to make better comparisons. While these methods are beyond the scope of this book, it’s important to be familiar with the broader concept: everyone agrees that it’s better to compare apples to apples, rather than apples to oranges.
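Although these methods are beyond the scope of this book, a rough Python sketch may help show what three of them involve: constant dollars, an index, and z-scores. Everything here is an assumption made for illustration. The function names are our own, the price index values are made up rather than taken from any official series, and the z-score uses the population standard deviation, which is only one common convention.

```python
from statistics import mean, pstdev


def to_constant_dollars(nominal: float, index_then: float, index_base: float) -> float:
    """Restate a nominal amount in the dollars of a chosen base year,
    given a price index value for the original year and for the base year."""
    return nominal * index_base / index_then


def to_index(values: list[float], base_position: int = 0) -> list[float]:
    """Express each value relative to a reference value, which is set to 100."""
    base = values[base_position]
    return [value / base * 100 for value in values]


def z_scores(values: list[float]) -> list[float]:
    """Express each value as its distance from the mean, in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all values are identical, so z-scores are undefined")
    return [(value - mu) / sigma for value in values]


# All inputs below are HYPOTHETICAL and chosen for easy arithmetic
print(to_constant_dollars(10_000, index_then=50, index_base=200))
print([round(v) for v in to_index([50, 55, 60])])
print([round(z, 2) for z in z_scores([2, 4, 6])])
```

In each case the goal is the same as in the crash-death example: put values on a common scale before comparing them.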
Finally, you do not always need to normalize your data, because sometimes its format already does this for you. Unlike raw numbers or simple counts, most measured variables do not need normalization because they already appear on a common scale. One example of a measured variable is median age, the age of the “middle” person in a population, when sorted from youngest to oldest. Since we know that humans live anywhere between 0 and 120 years or so, we can directly compare the median age among different populations. Similarly, another measured variable is median income, if measured in the same currency and in the same time period, because this offers a common scale that allows direct comparisons across different populations.
Now that you have a better sense of why, when, and how to normalize data, the next section will warn you to watch out for biased comparisons in data sampling methods.
20. Alberto Cairo, The Truthful Art: Data, Charts, and Maps for Communication (Pearson Education, 2016), https://www.google.com/books/edition/The_Truthful_Art/8dKKCwAAQBAJ, pp. 71-74.
21. Cairo, p. 95.
22. “What’s Real About Wages?” Federal Reserve Bank of St. Louis, The FRED Blog, February 8, 2018, https://fredblog.stlouisfed.org/2018/02/are-wages-increasing-or-decreasing/.