Differential Privacy for Census Data Explained

8/4/2020


Introduction

The U.S. Census Bureau has had a longstanding requirement to ensure that the data from individuals and individual households remains confidential. For the 2020 census, it plans to use a new approach for doing so: “differential privacy.” 

This webpage provides:

  • Background on differential privacy for policy generalists.
  • The current status of decision-making for implementing differential privacy.
  • Questions data users and redistricters may want to consider.
  • How data users can communicate with the Census Bureau on this topic.
  • Additional resources.

Check out NCSL’s letters to the U.S. House, Senate and the Census Bureau and the Bureau's response to NCSL's letter. On behalf of the states, NCSL has expressed concerns regarding the delays in releasing the census data to the states and the bureau’s use of differential privacy and its possible impact on the accuracy of census data. 

On March 5, NCSL held a webinar on Differential Privacy and the 2020 Census, and the recorded version is available. Viewers can learn about differential privacy, how this new methodology will work, and what effect, if any, it may have on census data and redistricting.

Background

The U.S. Census Bureau is required to do an “actual Enumeration” of all the people living in the U.S. every 10 years (U.S. Constitution, Article I, Section 2). The bureau also is required to keep personally identifiable information confidential for 72 years (92 Stat. 915; Public Law 95-416). Title 13, U.S. Code, Section 9, provides the mandate for the bureau to not “use the information furnished under the provisions of this title for any purpose other than the statistical purposes for which it is supplied; or make any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or permit anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports” (13 U.S.C. § 9 (2007)).

The dual requirement for an accurate count and the protection of respondents and their data creates a natural tension: The more accurate (and therefore usable) the reported data is, the easier it may be to identify individual responses. And yet, the more the raw data is altered before being reported (to protect confidentiality), the less usable the publicly released data is.

The bureau has provided a history of how it has handled this dual requirement in “Disclosure Avoidance Techniques Used for the 1960 Through 2010 Census.” The bureau has also created an infographic with this information, “A History of Privacy Protections.”

Due to Privacy Concerns, Reported Data Has Always Been Different from Raw Data

Since 2000, the bureau has used “data swapping” between census blocks as its main disclosure avoidance technique. (The census block is the smallest unit of geography maintained by the bureau.)

Consider a census block with just 20 people in it, including one Filipino American. Without any disclosure avoidance effort, it might be possible to figure out the identity of that individual. With data swapping, the Filipino American’s data might be swapped with that of an Anglo American from a nearby census block—a census block where other Filipino Americans reside. The details for the person would be aggregated with others, and therefore not identifiable, and yet the total population in both census blocks would remain accurate.
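The swap described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the two blocks, their composition and the `swap_records` helper are hypothetical, each record is reduced to a single label, and the bureau's actual swapping procedure is considerably more involved.

```python
from collections import Counter

def swap_records(block_a, block_b, index_a, index_b):
    """Return copies of two blocks with one record exchanged between them."""
    new_a, new_b = list(block_a), list(block_b)
    new_a[index_a], new_b[index_b] = block_b[index_b], block_a[index_a]
    return new_a, new_b

# Hypothetical blocks of 20 people each; every record is just a label.
block_a = ["Anglo American"] * 19 + ["Filipino American"]
block_b = ["Anglo American"] * 15 + ["Filipino American"] * 5

# Exchange the lone Filipino American record in block A with an
# Anglo American record from block B.
swapped_a, swapped_b = swap_records(block_a, block_b, index_a=19, index_b=0)

# Block totals are unchanged; only the reported characteristics move.
print(len(swapped_a), len(swapped_b))
print(Counter(swapped_a))
print(Counter(swapped_b))
```

In this toy version the total population of each block stays at 20, while the one record that could have been singled out now sits in a block with several similar records.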

Big Data Creates the Need for Greater Privacy Measures

Since the release of the 2010 census, bureau staff have realized that data analysts could take the many data products the bureau produces and cross-reference them with each other or with outside data sources to the point that individual privacy, or confidentiality, could be compromised. (This is possible now, as opposed to earlier decades, because of greater computing power and the growth of other databases, such as credit reporting.)

There is no evidence that confidentiality has been compromised so far, but that doesn’t change the theoretical possibility that it could happen.  

Because of that possibility, in the 2010s the bureau reviewed disclosure avoidance methods that could replace the current data swapping method. Differential privacy has been selected, and is described by the bureau at this webpage, which includes links to many presentations and papers on how differential privacy works.

Current Status

Although the decision to move to differential privacy was made in 2018, the parameters that guide this new disclosure avoidance method are still being evaluated. Final decisions on the details for the new approach are expected in late 2020, with the opportunity to provide feedback open now.

The use of differential privacy would mean that reported data will not be the same as the raw data. (Note that reported census data has never been the raw data; imputation has been used to assign people when they haven’t responded to census enumerators, and data swapping has been used for two decades for disclosure avoidance. And, the census has always had undercounts and overcounts in different areas and for different populations.)

How Differential Privacy Affects Reported Data

With differential privacy, the bureau has stated that the total population in each state will be “as enumerated,” but that all other levels of geography—including congressional districts down to townships and census blocks—could have some variance from the raw data. This is referred to by the Census Bureau as “injecting noise” into the data. No “noise” would be injected into the state total population, but in smaller geographic units, “noise” can be expected.  

The Census Bureau Must Determine the Level of Noise Injected Into the Data

Final decisions about the mathematical model used for differential privacy, and therefore the impact on reported data, have yet to be made. On one extreme, to have zero risk of privacy disclosure, all totals reported would have to have some “noise” injected (or some variation from the actual count). On the other extreme, if there were no noise injected, the risk of privacy disclosure would be great. These two variables—risk of disclosure and accuracy—can be measured against each other and, in fact, create a trade-off. The bureau refers to this as a “privacy loss budget.”
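The trade-off can be illustrated with the Laplace mechanism, a textbook differential privacy technique. This sketch is not the bureau's algorithm, whose details had not been finalized. It assumes a single count of 20 people and a hypothetical parameter `epsilon` standing in for the privacy loss budget: a smaller `epsilon` means more privacy and more noise.

```python
import random
import statistics

def noisy_count(true_count, epsilon, rng):
    """Add Laplace noise with scale 1/epsilon to a count."""
    # The difference of two exponential draws with rate epsilon is
    # Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

def mean_abs_error(true_count, epsilon, trials=10_000, seed=0):
    """Average distance between the noisy count and the true count."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return statistics.fmean(
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )

# Hypothetical budgets: the smaller the epsilon, the larger the typical error.
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(mean_abs_error(20, epsilon), 2))
```

The average error is roughly 1/epsilon, so tightening the budget tenfold makes a reported count about ten times noisier. That is the sense in which risk of disclosure and accuracy are traded against each other.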

Several Data Attributes Will Not Be Modified by Differential Privacy

The bureau’s proposal at the time of the creation of the 2010 Demonstration Data Products indicated that three data points will be kept “invariant,” or will be reported as enumerated: total state population, as mentioned above; census block-level total housing units; and census block-level group quarter counts. In 2010 and previous decades, all these were kept “invariant” along with most data at the census block level, with the exception of race. All other data, including total population numbers for lower geographic units and demographic characteristics, will vary to some extent this decade.

Differential privacy will mean that, except at the state level, population and voting age population will not be reported as enumerated. In addition, race and ethnicity data are likely to be farther from the “as enumerated” data than in past decades, when data swapping was used to protect small populations. (In 2010, at the block level, total population, voting age population, total housing units, occupancy status, group quarters count and group quarters type were all held invariant.) This may raise issues for racial bloc voting analyses.

While differential privacy is intended to protect confidentiality for respondents, it has implications for smaller subpopulations. For instance, the National Congress of American Indians notes, “The implementation of differential privacy could introduce substantial amounts of noise into statistics for small populations living in remote areas, potentially diminishing the quality of statistics about tribal nations.”

How Will This Change Affect Census Data Users?

Because of usability concerns, the bureau in October 2019 released 2010 Demonstration Data Products, which provide 2010 raw data treated with the new differential privacy method. Thus, data treated with the differential privacy method of disclosure avoidance can be compared with the 2010 released data (which had been treated with data swapping, the 2010 disclosure avoidance method).

One question for redistricters is whether the reported data from the 2010 Demonstration Data Products is so different from the 2010 data reported by the Census Bureau that it impacts redistricting. This data is available for anyone to use, and the bureau welcomes feedback on this and other questions.

From analyses done by the bureau in conjunction with the National Academy of Sciences Committee on National Statistics, and by outside data users, a few issues have surfaced. The Census Bureau is aware of these issues and is working to address them.

  • Rural areas will see a greater variance from the raw data than urban areas. Specifically, rural areas are likely to show increases in population and urban areas may show decreases in population. Reported population totals in districts composed of rural areas are also likely to be less accurate than those in districts created in more densely populated areas. The greater the difference between small and large counties or other units, the greater the variance will be.
  • Smaller subpopulations, such as specific racial groups, will be affected more than larger racial or ethnic groups.
  • Household data is separated from population data, leading to some logical inconsistencies, such as households that show a population of less than one, households with children but no adults, and areas that are known to be unpopulated that will have population assigned.
  • The impact on states will vary, depending on their overall demographics. 
  • Longitudinal studies based on census data may be compromised.

Questions Differential Privacy Has Prompted for Redistricters

  • Will the data for congressional reapportionment be precise? The answer is yes, in that the total population for each state will be reported as enumerated and will not be subject to noise.
  • Will population data below the state level be accurate enough for redistricting within a state, or within local jurisdictions? In other words, will redistricters be able to establish population equality between districts, and determine what is an effective minority district?  
  • Does differential privacy endanger compliance with PL 94-171, which requires the bureau to provide the states with data at the census block level, as is needed for legislative redistricting? This data includes population and race/ethnicity characteristics.
  • Is there more than one way to interpret the mandate for disclosure avoidance set forth in Title 13? How much should usability be compromised to protect confidentiality and vice versa, since both are legal requirements?
  • What data should be kept invariant? Without population at the census block level being invariant, deviations from ideal district size will be hard to calculate.
  • Will differential privacy make it harder to do longitudinal studies, in that the way new data will be treated is different from how previous data has been treated?
  • Can this method be changed enough before implementation to maintain the relationship between households and population?
  • Will census data stand up in court? Redistricters in most states and local jurisdictions are legally required to use census data, so the question is whether that data will be satisfactory to determine whether a plan meets the one-person, one-vote principle. Plaintiffs will carry the burden of proving that more accurate data is available.
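The concern about deviations from ideal district size can be made concrete with a small calculation. This is a sketch under stated assumptions, not a model of the bureau's method: the two districts, the block populations, the Laplace noise and the `epsilon` value are all hypothetical, and noisy block counts are simply rounded and floored at zero.

```python
import random

def laplace_noise(epsilon, rng):
    """Laplace noise with scale 1/epsilon, as a difference of exponentials."""
    return rng.expovariate(epsilon) - rng.expovariate(epsilon)

def percent_deviations(district_populations):
    """Percent deviation of each district from the ideal (average) size."""
    ideal = sum(district_populations) / len(district_populations)
    return [100 * (pop - ideal) / ideal for pop in district_populations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
epsilon = 0.5           # hypothetical per-block setting, not a bureau value

# Two hypothetical districts, each built from 50 blocks of 20 people.
enumerated_blocks = [[20] * 50, [20] * 50]
noisy_blocks = [
    [max(0, round(count + laplace_noise(epsilon, rng))) for count in district]
    for district in enumerated_blocks
]

enumerated_dev = percent_deviations([sum(d) for d in enumerated_blocks])
noisy_dev = percent_deviations([sum(d) for d in noisy_blocks])
print(enumerated_dev)  # districts are exactly equal as enumerated
print(noisy_dev)       # deviations computed from the noisy block counts
```

With the enumerated counts, both districts sit exactly at the ideal size. With noisy block counts, any deviation a redistricter computes reflects the injected noise as well as real population differences, and the two cannot be separated from the published numbers alone.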

For data users outside of the redistricting realm (businesses, policymakers, academics) differential privacy may raise other concerns, such as whether the details for small geographic regions or specific subpopulations are sufficiently accurate for decision-making.

Providing Feedback

Those who are interested in how the bureau balances confidentiality and usability—or, in census parlance, how the “privacy loss budget” should be allocated—can provide comments to the bureau through its data demonstration project, dcmd.2010.demonstration.data.products@census.gov.

While there is no cutoff date for comments, a final decision will be made by fall 2020. Comments received in the spring will be easier for the bureau to incorporate.

Additional Resources