De-identification Guidelines for Structured Data

June 2016

Acknowledgments

We would like to thank Dr. Khaled El Emam for reviewing an earlier version of these guidelines and providing helpful comments.

CONTENTS

Introduction
Scope of Guidelines
Overview of De-identification
Uses of De-identification
    Open Data
    Access to Information Requests
    Data Sharing within and among Institutions
Process for De-identifying Structured Data
    Step 1: Determine the Release Model
    Step 2: Classify Variables
    Step 3: Determine an Acceptable Re-identification Risk Threshold
    Step 4: Measure the Data Risk
    Step 5: Measure the Context Risk
    Step 6: Calculate the Overall Risk
    Step 7: De-identify the Data
    Step 8: Assess Data Utility
    Step 9: Document the Process
De-identification Governance
    Protecting Against Attribute Disclosure
    Ongoing and Regular Re-identification Risk Assessments
Conclusion

INTRODUCTION

As the demand for government-held data increases, institutions require effective processes and techniques for removing personal information. An important tool in this regard is de-identification.

"De-identification" is the general term for the process of removing personal information from a record or data set. De-identification protects the privacy of individuals because once de-identified, a data set is considered to no longer contain personal information. If a data set does not contain personal information, its use or disclosure cannot violate the privacy of individuals.1 Accordingly, the privacy protection provisions of the Freedom of Information and Protection of Privacy Act (FIPPA) and the Municipal Freedom of Information and Protection of Privacy Act (MFIPPA) would not apply to de-identified information.

It is important to note that de-identification does not reduce the risk of re-identification of a data set to zero. Rather, the process produces data sets for which the risk of re-identification is very small.

These guidelines will introduce institutions to the basic concepts and techniques of de-identification. They outline the key issues to consider when de-identifying personal information in the form of structured data, and they provide a step-by-step process that institutions can follow when removing personal information from data sets.

De-identification can be a complex and technically challenging process. These guidelines take a conservative approach to risk in order to simplify the calculations involved in measuring it. However, some degree of complexity in the process is unavoidable.

When dealing with issues that may arise in de-identification, it is important that you seek advice from technical staff, or other experts in the field (such as your freedom of information and privacy coordinator, or legal counsel). The information contained in these guidelines can serve as a starting point for discussions with those individuals.

Some of the complexity and challenges of de-identification can be addressed through the use of automated tools. While it is possible (and may be appropriate in certain circumstances) to de-identify data sets manually, there are many software tools available that can automate some aspects of the process. When seeking to de-identify a data set, you may wish to consider using de-identification software.

1 Note, however, that the same cannot be said with respect to the rights of groups of individuals. For a discussion of how to protect against harms relating to groups of individuals when de-identifying data sets, see the section on "De-identification Governance" below.

TERMINOLOGY

Some of the technical terms used in these guidelines are defined below.

adversary: individual or entity attempting to re-identify one or more individuals in the data set

brute force attack: trial-and-error attack that involves attempting all possible combinations to decode an encrypted value

masking: the process of removing a variable or replacing it with pseudonymous or encrypted information

one-way hash function: cryptographic mapping function that is practically impossible to reverse, that is, to recreate the input data from its encrypted value

re-identification: any process that re-establishes the link between identifiable information and an individual

release model: manner in which recipients of a data set are provided access to it

structured data (data set): collection of data in tabular form where every column represents a variable and every row represents a member or individual

target individual: individual targeted by an adversary for re-identification

variable: column of values in a data set representing a set of attributes
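The definitions of masking, one-way hash function, and brute force attack above can be illustrated together. The following is a minimal sketch in Python, not a method prescribed by these guidelines. It assumes a hypothetical four-digit identifier and uses SHA-256 and HMAC-SHA256 from the standard library as example hash functions. All function names, the sample value, and the key are illustrative assumptions. The sketch shows why a one-way hash alone may not be sufficient for masking: when the set of possible input values is small, an adversary can hash every candidate and compare the results. Adding a secret key prevents that attack for anyone who does not hold the key.

```python
import hashlib
import hmac
from typing import Iterable, Optional


def unkeyed_hash(value: str) -> str:
    """Mask a value with a plain one-way hash (SHA-256)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, key: bytes) -> str:
    """Mask a value with a keyed one-way hash (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def brute_force(target_digest: str, candidates: Iterable[str]) -> Optional[str]:
    """Try every candidate value until one hashes to the target digest."""
    for candidate in candidates:
        if unkeyed_hash(candidate) == target_digest:
            return candidate
    return None


# Hypothetical identifier with only 10,000 possible values (assumption).
candidates = [f"{n:04d}" for n in range(10_000)]

# Without a key, the masked value is recovered by trial and error.
masked = unkeyed_hash("4821")
recovered = brute_force(masked, candidates)

# With a secret key, the same attack finds no match.
# Illustrative key only; a real key would be randomly generated and kept secret.
key = b"illustrative-secret-key"
pseudonym = keyed_hash("4821", key)
recovered_keyed = brute_force(pseudonym, candidates)
```

In this sketch, `recovered` holds the original identifier while `recovered_keyed` is `None`. The keyed pseudonym is still consistent: the same input and key always produce the same output, so records for one individual can be linked without exposing the identifier.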

SCOPE OF GUIDELINES

Approaches to de-identification range from simple "cookie cutter" lists of variables to be removed or modified, to general, loosely defined techniques such as the "cell size of five" rule,2 to systematic risk-based methodologies. While it may be possible to de-identify data sets in different ways, these guidelines offer direction on taking a risk-based approach to de-identification.3
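For comparison with the risk-based approach, the "cell size of five" rule described in footnote 2 can be sketched in a few lines of Python. This is only an illustration of the rule as stated there: a cell is released only if its count is greater than or equal to five. The variable names, the age bands, the postal code prefixes, and the choice to represent a suppressed cell as `None` are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Optional

MIN_CELL_SIZE = 5  # threshold from the "cell size of five" rule


def aggregate_with_suppression(
    rows: Iterable[Hashable], min_cell_size: int = MIN_CELL_SIZE
) -> Dict[Hashable, Optional[int]]:
    """Count individuals per cell; suppress (None) any cell below the threshold."""
    counts = Counter(rows)
    return {
        cell: (n if n >= min_cell_size else None)
        for cell, n in counts.items()
    }


# Hypothetical records: (age band, postal code prefix) for each individual.
rows = (
    [("20-29", "K1A")] * 7
    + [("30-39", "K1A")] * 5
    + [("30-39", "M5V")] * 2
)

table = aggregate_with_suppression(rows)
```

Here the cells with seven and five individuals are released with their counts, while the cell with two individuals is suppressed. The rule looks only at cell counts and does not account for the release context, which is one reason these guidelines favour a risk-based methodology.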

Risk-based de-identification involves calculating an acceptable level of re-identification risk for a given data release. The calculation requires the consideration of a number of factors, including whether an adversary can know if a target individual is in the data set. If an adversary knows that a target individual is in the data set, this is called "prosecutor risk." For example, if a teenager's parents know that their child has participated in a survey and the results are to be released in de-identified form, the risk of the parents attempting to re-identify their child's responses would qualify as prosecutor risk. If an adversary does not, or cannot, know if a target

2 The cell size of five rule is the practice of releasing aggregate data about individuals only if the number of individuals counted for each cell of the table is greater than or equal to five.

3 The approach to de-identification presented in these guidelines is based largely on the risk-based de-identification methodology developed by Dr. Khaled El Emam. For a select list of books and articles written and co-authored by Dr. El Emam on the topic of de-identification, see Appendix A: Resources.
