Recognize and Reduce Data Bias

We define bias as unfairly favoring one view over another. When working with data and designing visualizations, it’s important to be aware of different types of bias, so that you can recognize them as potential factors that may influence your perception, and reduce their presence in your own work. The first step toward reducing bias is to correctly identify its various types, which may stay hidden at first glance until we learn to call them out.

In Chapter 6: Make Clear Comparisons, we warned you to watch out for biased data sampling that may yield inaccurate results. For example, selection bias happens when samples appear to be random but are not, due to underlying processes. Consider comparing students in two different schools when one of them requires an application process: even if that school admits students by lottery, the families who chose to apply may differ from those who did not. Also beware of algorithmic bias that is built into our software, ranging from simple examples (such as web visitor IP addresses being converted into the nation’s geographic center) to more dangerous ones (TODO EXAMPLE from Ch6). Address both types of bias by exercising caution about the data you choose to analyze, making only meaningful comparisons, and describing issues in the visualization notes and companion text to call out possible biases.

When you analyze and visualize data, look out for additional biases that many cognitive psychologists believe are built into human behavior, both in ourselves and our audiences. Confirmation bias refers to the tendency to accept only claims that fit our preconceived notions of how the world works. [TODO: refer to a dataviz experiment by Thaler?] Counter it by actively searching for and considering alternative interpretations, and look at contradictory findings with open eyes. Pattern bias is the human tendency to see meaningful relationships in data, even when the numbers are random. [TODO: refer to another dataviz experiment by Thaler?] Avoid this type of bias by continually reminding yourself (and your readers) that data is noisy, and our brains are wired to see patterns even when none exist. Refer to books on statistical data analysis for appropriate tests to determine whether patterns that seem to jump out at you can be confirmed, and whether they are unlikely to have arisen by chance alone. [TODO: check wording]
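
One simple way to check whether an apparent pattern could have arisen by chance is a permutation test. The sketch below is only an illustration, not a substitute for a statistics text: the function name `permutation_p_value` is our own, the scores are made-up placeholder numbers, and we assume the pattern of interest is a difference between the averages of two groups. It shuffles the group labels many times and counts how often random shuffling produces a gap at least as large as the one observed.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=42):
    """Estimate a two-sided p-value for the difference in group means.

    Shuffles the pooled values many times and counts how often a random
    split produces a gap at least as large as the one actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / n_b
        if abs(mean_a - mean_b) >= observed:
            at_least_as_large += 1

    # Add 1 to both counts so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Hypothetical, made-up scores for two groups (not real data).
scores_a = [72, 75, 78, 74, 77, 79, 73, 76]
scores_b = [71, 76, 74, 78, 72, 75, 77, 73]

p = permutation_p_value(scores_a, scores_b)
print(f"Estimated p-value: {p:.3f}")
```

A large value suggests that random shuffling often produces a gap this big, so the "pattern" may be noise. A small value suggests the gap is harder to explain by chance alone, though it says nothing about what caused it.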

Just as people can lie with charts and maps, let’s not forget our long history of misleading audiences (and ourselves) with the word choices we make when describing data. Framing bias refers to negative or positive labels or conceptual categories that affect how we interpret information. For example, British statistician David Spiegelhalter notes that US hospitals tend to report mortality rates, while UK hospitals report survival rates. When weighing the risks of a surgical procedure for a member of your family, a 5 percent mortality rate seems worse than a 95 percent survival rate, even though they’re identical. Furthermore, Spiegelhalter observes that supplementing rates with raw numbers further heightens our impression of risk. For example, if we told you a surgical procedure had a 5 percent mortality rate and that 20 out of 400 patients died, it seems worse because we begin to imagine real people’s lives, not abstract percentages.[35] The best way to counter framing bias is to be aware of its potential effect on our minds and to call it out, as we’ve attempted to do here.
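
To see that these framings describe exactly the same outcome, here is a minimal sketch of the arithmetic behind the surgical example above. The helper name `frame_risk` is our own invention for illustration, and the inputs are the 20 deaths among 400 patients from the text.

```python
def frame_risk(deaths, patients):
    """Return three descriptions of the same outcome, each framed differently."""
    mortality = deaths / patients
    survival = (patients - deaths) / patients
    return [
        f"{mortality:.0%} mortality rate",            # negative frame
        f"{survival:.0%} survival rate",              # positive frame
        f"{deaths} out of {patients} patients died",  # raw-number frame
    ]


for statement in frame_risk(20, 400):
    print(statement)
# prints:
# 5% mortality rate
# 95% survival rate
# 20 out of 400 patients died
```

All three lines come from the same two numbers. Only the wording changes, which is why it helps to state in your text or chart notes which framing you chose.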

TODO INSERT AND EXPAND HERE? Also beware of algorithmic bias that people have built into our computer systems, which repeatedly favor some groups or outcomes over others, and often reinforce privileges held by dominant White, wealthy, masculine culture. As we write this, several examples of algorithmic bias and machine-learning bias have appeared in the news, such as facial recognition across racial groups and discrimination in home lending. [TODO: examples] Reduce bias by calling it out. Do not equate “digital” with “authoritative.” [TODO: cite more comprehensive books on this topic]

Intergroup bias refers to multiple ways that people privilege or discriminate by social categories, such as race, gender, class, and sexuality. In the wake of the Black Lives Matter movement, greater attention has been called to ways that intergroup bias pervades data visualization, and ways to counter its impact. Jonathan Schwabish and Alice Feng describe how they applied a racial equity lens to revise the Urban Institute’s Data Visualization Style Guide.[36] Some recommendations are straightforward and relatively simple to implement. For example, they recommend ordering group labels to focus on the data story, rather than listing “White” and “Men” at the top by default. Also, we should proactively acknowledge missing groups in our data by calling attention to those often omitted, such as non-binary and transgender people in US federal datasets. Furthermore, when choosing color palettes to represent people in charts and maps, avoid stereotypical colors (such as blue for men and pink for women), and on a more subtle level, avoid color-grouping Black, Latino, and Asian people as the polar opposites of White people.

Other proposals by Schwabish and Feng are likely to generate more discussion and debate. For example, they recommend that we stop placing disaggregated racial and ethnic data on the same chart, because it encourages a “deficit-based perspective” that judges lower-performing groups by the standards of higher-performing ones. Instead, they suggest plotting data about racial and ethnic groups on separate but adjacent charts, each with its own reference to state or national averages for comparison, as shown in Figure 15.14. The idea is interesting, but the example about Covid-19 pandemic data raises more questions about whose interests are served by revising how data is visualized. On one hand, if predominantly White audiences perceive racial disparities in Covid data to be caused by group behavior, then it makes sense to stop feeding racist stereotypes and no longer compare different groups in the same chart. On the other hand, if these racial disparities are caused in part by structural obstacles to quality jobs, housing, and health care, then do separate charts make it harder to identify and challenge the roots of systemic racism? Schwabish and Feng raise important issues for deeper reflection. Yet once again, data visualization is not always driven by clearly defined design rules. Instead, our mission is to find better ways to tell true and meaningful data stories, while working to identify and reduce bias all around us.

Figure 15.14: Schwabish and Feng recommend that we stop placing racial and ethnic data on the same chart (left), and instead use separate but adjacent charts with state or national averages as a comparison point (right).

TODO above: DECIDE if description and critique of Schwabish and Feng is clear, interesting, and feasible with or without image (which probably would need to be redone and simplified).

Now that we’ve introduced various types of bias to consider when working with data visualization in general, in the next section we’ll focus on two additional types of bias that are specific to mapping.

  35. David Spiegelhalter, The Art of Statistics: Learning from Data (Penguin UK, 2019), pp. 22-5.

  36. Jonathan Schwabish and Alice Feng, “Applying Racial Equity Awareness in Data Visualization,” preprint (Open Science Framework, August 27, 2020). See also this web post summary of the paper: Jonathan Schwabish and Alice Feng, “Applying Racial Equity Awareness in Data Visualization,” Medium, accessed October 16, 2020; and Urban Institute, “Urban Institute Data Visualization Style Guide,” 2020.