Mask or Aggregate Sensitive Data

Even if individual-level data is legally and publicly accessible, each of us is responsible for making ethical decisions about whether and how to use it when creating data visualizations. When working with sensitive data, two ethical questions to ask are: Does publicly sharing individual-level data risk causing more harm than good? And is there a way to tell the same data story without publicly sharing details that may intrude on individual privacy? There are no simple answers to these questions, since every situation is different and requires weighing the risk of individual harm against the benefits of broader knowledge about vital public issues. But this section describes some alternatives to blindly redistributing sensitive information, such as masking and aggregating individual-level data.

Imagine that you’re exploring crime data and wish to create an interactive map about the frequency of different types of 911 police calls across several neighborhoods. If you search for public data about police calls, as described in the Open Data section in this chapter, you’ll see different policies and practices for sharing individual-level data published by police call centers. In many US states, information about victims of sexual crimes or child abuse (such as the address where police were sent) is considered confidential and exempt from public release, so it’s not included in the open data. But some police departments publish open data about calls with the full address for other types of crimes, in a format like this:

| Date  | Full Address | Category           |
| Jan 1 | 1234 Main St | Aggravated Assault |

While this information is publicly available, you could cause physical or emotional harm to the victims by redistributing detailed information about a violent crime, including their full address, in your data visualization.

One alternative is to mask details in sensitive data. For example, some police departments hide the last few digits of street addresses in their open data reports to protect individual privacy, while still showing the general location, in a format like this:

| Date  | Masked Address | Category           |
| Jan 1 | 1XXX Main St   | Aggravated Assault |

You can also mask individual-level data yourself when appropriate, using a method similar to Find and Replace in your spreadsheet tool, as described in Chapter 4: Clean Up Messy Data.
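If your data is too large to mask by hand, the same idea can be scripted. The following is a minimal sketch in Python, not a method prescribed by any police department: the function name `mask_address` and the rule it applies (keep the first digit of the house number, replace the remaining digits with `X`, and fully mask single-digit numbers) are assumptions chosen to reproduce the `1XXX Main St` format shown above.

```python
import re

def mask_address(address: str) -> str:
    """Mask the leading house number of a street address.

    Hypothetical rule for illustration: keep the first digit and
    replace the remaining digits with "X", so "1234" becomes "1XXX".
    A single-digit house number is masked entirely.
    """
    address = address.strip()
    match = re.match(r"(\d+)(.*)", address)
    if match is None:
        # No leading house number (for example, an intersection):
        # return the text unchanged.
        return address
    number, rest = match.groups()
    if len(number) == 1:
        masked = "X"
    else:
        masked = number[0] + "X" * (len(number) - 1)
    return masked + rest

print(mask_address("1234 Main St"))  # prints: 1XXX Main St
```

Addresses without a leading house number pass through unchanged in this sketch, so review them separately before publishing.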

Another strategy is to aggregate individual-level data into larger groups, which can protect privacy while showing broader patterns. In the example above, if you're exploring crime data across different neighborhoods, you can group individual 911 calls into larger geographic areas, such as census tracts or named areas, in a format like this:

| Neighborhood | Crime Category     | Frequency |
| East Side    | Aggravated Assault | 13        |
| West Side    | Aggravated Assault | 21        |

Aggregating individual-level details into larger yet meaningful categories is also a better way to tell data stories about the bigger picture. To aggregate simple spreadsheet data, see the section on summarizing with pivot tables in Chapter 2. To geocode US addresses into census areas, pivot address points into a polygon map, or normalize data to create more meaningful maps, see Chapter 13: Transform Your Map Data.
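For readers who prefer a script to a pivot table, the aggregation step can be sketched in a few lines of Python. This example assumes each call record already carries a neighborhood label (for instance, after geocoding). The field names, the function name `aggregate_calls`, and the four sample rows are hypothetical and exist only to show the technique.

```python
from collections import Counter

# Hypothetical individual-level records, for illustration only.
calls = [
    {"neighborhood": "East Side", "category": "Aggravated Assault"},
    {"neighborhood": "West Side", "category": "Aggravated Assault"},
    {"neighborhood": "West Side", "category": "Aggravated Assault"},
    {"neighborhood": "East Side", "category": "Burglary"},
]

def aggregate_calls(records):
    """Count calls per (neighborhood, category) pair.

    The result contains only group totals, so no date or street
    address from the individual records is carried forward.
    """
    return Counter((r["neighborhood"], r["category"]) for r in records)

counts = aggregate_calls(calls)
for (neighborhood, category), frequency in sorted(counts.items()):
    print(f"| {neighborhood} | {category} | {frequency} |")
```

The output has one row per neighborhood and crime category, matching the shape of the aggregated table above.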

In the next section, you’ll learn how to explore datasets that governments and non-governmental organizations have intentionally shared with the public.