Reidentification of Individuals in

Chicago's Homicide Database

A Technical and Legal Study

Salvador Ochoa	Jamie Rasmussen	Christine Robson	Michael Salib
	Collective address:	reidentify@mit.edu

Abstract

Many government agencies, hospitals, and other organizations collect personal data of a sensitive nature. Often, these groups would like to release their data for statistical analysis by the scientific community, but do not want to cause the subjects of the data embarrassment or harassment. To resolve this conflict between privacy and progress, data is often deidentified before publication: personally identifying information such as names, home addresses, and social security numbers is stripped from the data. We analyzed one such deidentified data set containing information about Chicago homicide victims over a span of three decades. By comparing the records in the Chicago data set with records in the Social Security Death Index, we were able to associate names with, or reidentify, 35% of the victims. This study details the reidentification method and results, and includes a legal review of U.S. regulations related to reidentification. Based on the findings of our project, we recommend removal of these databases from their online locations, and the establishment of national deidentification regulations.

Introduction

We are in an age of rapidly developing technologies that open up possibilities for privacy invasions never before conceived of. With the Internet, the world has been introduced to a new way to compile, exchange, and manipulate data at speeds and volumes heretofore unimagined. Indeed, laws and standards can scarcely keep up with the potentials for privacy invasion.

Our project involves publicly released databases, compiled by the United States government for statistical purposes, but disseminated in a manner that allows identification of individuals. In particular, we examined the Chicago Homicide data set, compiled by the Bureau of Justice Statistics and published online by the National Archive of Criminal Justice Data. By combining this data with the Social Security Death Index, also available online, we were able to determine the identity of 35% of the individuals who are supposedly listed anonymously in the database.

In this paper, we will review reidentification theory, paying special attention to the work of Professor Latanya Sweeney of Carnegie Mellon University on medical databases. We will also describe our methodology for reidentification, including the details of our database matching. A comprehensive analysis of the laws surrounding reidentification is also included. Based on the findings of our project, we recommend removal of these databases from their online locations, and the establishment of national deidentification regulations. We conclude the report with both legal and technical recommendations for protection against reidentification.

Reidentification theory

The purpose of this chapter is to provide an introduction to reidentification theory. Later chapters describe a homicide victim reidentification experiment in great detail. This section is intended as a primer for the non-technical reader, explaining many key terms and concepts that will be used throughout this document, so that he/she may fully understand the significance of the project. It is also intended as an overview of the modern trend of increased data collection and sharing, the privacy concerns resulting from such data sharing, and the reasons why reidentification is being done. The technical, informed reader may skip this section without any loss of information.

Key Terms and Concepts

Reidentification uses data linkage techniques to determine the identity of individuals whose information is stored as records in a deidentified database. To best understand this concept, we first define a few terms and then provide a simple example.

A database is a collection of data organized in such a way that a computer program can quickly search for and retrieve desired pieces of information. It is typically stored on magnetic disk or some other secondary storage device, and it is designed to allow for fast and efficient data-processing operations including the storage, retrieval, modification, and deletion of data.

A database can consist of multiple files, each of which is broken down into records. Each record is a complete set of information on a specific entity and is made up of any number of fields, each of which contains information pertaining to one individual aspect or attribute of the entity. For example, a student directory file contains records that may include four fields: a student name field, an address field, a phone number field, and a major field. Each record may also be considered an n-tuple of the n different fields that make up the record. A database can be modeled as a simple table where each row corresponds to an individual record and each column corresponds to a field.

Figure 1: Table Representation of a Student Directory Database

The above figure depicts a table representation of a student directory database. Each record, or row, contains the directory information for a single student. The record for Ben Bitdiddle is highlighted. Each record is made up of the four fields described earlier, shown as columns. The Address field is highlighted.
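The record and field vocabulary above can be sketched in a few lines of Python. This is only an illustration of the table model: the class name `StudentRecord` and every value other than the two student names mentioned in the text are hypothetical placeholders, not the contents of Figure 1.

```python
from typing import NamedTuple


class StudentRecord(NamedTuple):
    """One record (row): a 4-tuple of the fields described above."""
    name: str
    address: str
    phone: str
    major: str


# Hypothetical rows; only the two names appear in the text.
directory = [
    StudentRecord("Ben Bitdiddle", "12 Example St", "555-0101", "Computer Science"),
    StudentRecord("Joe Law", "34 Sample Ave", "555-0102", "Political Science"),
]

# A field (column) is the same attribute read across every record.
addresses = [record.address for record in directory]

# Each record is also an ordinary n-tuple, here with n = 4.
fields_per_record = len(directory[0])
```

Reading one attribute across all rows gives a column, and reading all attributes of one row gives a record, which is the sense in which the table and the n-tuple descriptions are equivalent.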

The term database is increasingly being used as shorthand for a database management system (DBMS), which is the actual software that is used to perform the data-processing operations mentioned earlier. More formally, a database management system is a collection of programs that enables you to store, modify, and extract information from a database. To be specific, we used PostgreSQL, a relational database management system, or RDBMS. These database systems are powerful because they require few assumptions about how data is related or how it will be extracted from the database, and unlike flat database systems, they can work with multiple files.

Requests for information from a database are made in the form of a query, which is a stylized question. For example, the query:

SELECT ALL WHERE MAJOR = "POLITICAL SCIENCE"

if run on the database in the above figure, would request all records in which the MAJOR field is "Political Science." This query would only result in one value: Joe Law. The set of rules for constructing queries is known as a query language. Although different DBMSs support different query languages, there is a semi-standardized query language called SQL (structured query language), which we used in our project.
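The query above is stylized: it does not name the table to search, which real SQL requires. A minimal runnable sketch follows, using Python's built-in `sqlite3` module as a stand-in for the PostgreSQL system used in the project. The table name `directory` and all row values other than the two student names are assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for the student directory of Figure 1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directory (name TEXT, address TEXT, phone TEXT, major TEXT)"
)
conn.executemany(
    "INSERT INTO directory VALUES (?, ?, ?, ?)",
    [
        # Hypothetical rows; only the names come from the text.
        ("Ben Bitdiddle", "12 Example St", "555-0101", "Computer Science"),
        ("Joe Law", "34 Sample Ave", "555-0102", "Political Science"),
    ],
)

# SQL form of the stylized query: every field of every matching record.
rows = conn.execute(
    "SELECT * FROM directory WHERE major = ?", ("Political Science",)
).fetchall()

names = [row[0] for row in rows]  # the name field of each matching record
conn.close()
```

With these sample rows, `names` contains the single entry "Joe Law", mirroring the result described above.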

Databases, as mentioned, allow for quick retrieval of desired data, or information. This allows for what is now referred to as data mining. Data mining describes finding previously unknown patterns, or relationships in a group of data. In order to support current research in a variety of fields, there has been a tremendous increase in the amount of information that is being collected and stored, so that data mining can produce more results.

Another aspect of databases, which begins to introduce us to the reidentification problem, is the ability to do data linkage. Data linkage refers to combining disparate pieces of entity-specific information to learn more about an entity. That is, a researcher can combine information from different databases about an entity if he/she can match the records. In Figure 2, data linkage of two databases is possible. One database has students’ major and GPA information while another has students’ biographic information. Each database has students’ names, so an administrative official could easily link the two databases using students’ names to make a single database with all of the students’ information.

Figure 2: Data Linkage
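The linkage just described can be sketched as a lookup on the shared field. The helper `link_on` and all sample values are hypothetical; the sketch also assumes that each name occurs only once in the second table, which real data does not guarantee.

```python
# Two hypothetical tables that share only the "name" field.
academic = [
    {"name": "Ben Bitdiddle", "major": "Computer Science", "gpa": 3.7},
    {"name": "Joe Law", "major": "Political Science", "gpa": 3.4},
]
biographic = [
    {"name": "Joe Law", "birth_date": "1980-02-14", "hometown": "Springfield"},
    {"name": "Ben Bitdiddle", "birth_date": "1979-09-30", "hometown": "Cambridge"},
]


def link_on(key, left, right):
    """Merge records from two tables that share the same value of `key`."""
    # Index the right-hand table by the linking field
    # (assumes key values are unique in `right`).
    index = {row[key]: row for row in right}
    # Keep only left-hand records that have a partner, merged into one record.
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


combined = link_on("name", academic, biographic)
```

Each entry of `combined` now carries the major, GPA, birth date, and hometown of one student, which is the "larger profile" that linkage makes possible.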

Although we have been discussing each record as corresponding to an entity, the databases that we are concerned about are those in which each record corresponds to an individual person. In other words, the databases we used in our experiment contain person-specific data, since we are interested in the reidentification of people. Data linkage is important in this respect since it allows for larger profiles.

Growth of Public Data

As a result of the many advancements in computer-related technology in recent years, primary and secondary data storage devices continue to become more affordable. High-speed network connections are also becoming more available to the average consumer as broadband connections such as DSL and cable are increasingly being offered and promoted by Internet service providers.

During recent years, however, as a result of the increased availability of storage devices, society has also been witness to what can only be described as a data explosion. Although we recognize that we live in the Information Age, what many do not realize is that much of the information that is being collected today is about individuals. Latanya Sweeney, one of the trailblazers in the field of reidentification research and theory, has described in her thesis on reidentification that "there has been tremendous growth in the collection of information being collected on individuals and this growth is related to access to inexpensive computers with large storage capacities." She also asserts that because the affordability of these systems will only increase in the years ahead, "the trend in collecting increasing amounts of information is expected to continue. As a result, many details in the lives of people are being documented in databases somewhere."

Her research has led her to find three major trends with regard to data collection: (1) "collect more;" (2) "collect specifically;" and (3) "collect if you can." As an example of the collect more trend, she describes how birth records moved from having only seven to fifteen fields per live birth at the beginning of the twentieth century, to about 25 fields in later years, and then jumped to over 100 fields per live birth as the availability and use of electronic equipment in hospitals and clinics increased in the latter part of the century. By "collect specifically," she means that instead of collecting tabular information, many entities are now collecting person-specific information. She lists supermarkets as an example; using the now familiar loyalty, or saver, cards, they can collect information about clients’ purchases. She also points to the fact that many entities are now collecting information simply because it has become possible for them to do so. These include immunization record databases, for example.

Sweeney, using what she refers to as the global disk storage per person factor, or GDSP, attempts to characterize the growth in person-specific data. By dividing the amount of disk storage space sold worldwide in a given year by the world population at that time, she obtains the GDSP, which she claims is "a crude measure of how much disk storage could possibly be used to collect person-specific data on the world population." Figure 3 depicts her estimates and illustrates how the GDSP value is growing dramatically.

	1983	1996	2000
GDSP (MB/person)	0.02	28	472

Figure 3: GDSP Over Time
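The arithmetic behind the GDSP can be made concrete with a short sketch. The function name `gdsp` and its parameter names are illustrative assumptions; the three values are the ones tabulated in Figure 3.

```python
def gdsp(storage_sold_mb, world_population):
    """Disk storage sold in a year divided by the world population (MB/person)."""
    return storage_sold_mb / world_population


# GDSP estimates from Figure 3, in MB per person.
gdsp_mb_per_person = {1983: 0.02, 1996: 28, 2000: 472}

# Growth factor between consecutive tabulated years.
years = sorted(gdsp_mb_per_person)
growth = {
    (start, end): gdsp_mb_per_person[end] / gdsp_mb_per_person[start]
    for start, end in zip(years, years[1:])
}
```

On these figures the GDSP grew roughly 1,400-fold between 1983 and 1996, and about 17-fold again in the four years from 1996 to 2000.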

Privacy Concerns

The amount of personal information collected should be enough to raise privacy concerns. However, the real problems arise when we begin to consider the availability of all of this information. As mentioned before, network connectivity is becoming ubiquitous; high-bandwidth connections especially are becoming popular as they become more affordable. Over the years, there has been a noticeable trend in making more databases available online, as well as offline, because of the ease of data transfer that it allows. Some states, such as Texas, have their birth and death registries online; medical data, including hospital discharge data, is readily available; and even health and criminal records are accessible.

The dramatic increase in databases available online is attributable to researchers’ interest in sharing data so that anyone can use the data to aid in their own studies. Some databases may be made available for more superficial reasons such as profit in the case of marketing databases. Along with what appear to be "innocent" databases, there is a great quantity of databases that contain personal, private information. These databases may include health records, police reports, etc. For example, health records can contain abortion records, which many women who have had abortions would surely not want to be made public.

Access Policies

The data holders, often the data collectors themselves, recognize that much of the information they are protecting may be personal, but they are also influenced by the fact that the data they hold may be the key for some important discovery. They are then forced to choose an access policy for their data. Latanya Sweeney also addresses this point in her PhD thesis. She states that there are four basic access policies: (1) private, meaning "insiders only;" (2) semi-private, or "limited access;" (3) semi-public, or "deniable access;" and, (4) public, meaning "no restrictions."

A private database, essentially, is one that is not shared with anyone. Usually, only the data collectors themselves have access to the data. Databases that are semi-private are fairly similar in that they are shared with only a very select few. There is usually some type of rigorous review process before access is granted. For databases that are ruled by either of these access policies the privacy concern is small. The private information is not being shared and data holders probably obtained their subjects’ information directly from them.

The privacy concern is more explicit in databases that are controlled by public or semi-public access policies. Semi-public databases are available to a great number of people. The number of people or entities denied access is very small compared to how many are granted access. Public databases have absolutely no restrictions and are available to anyone who requests access. For databases that contain personal information, but adhere to either of these access policies, the protection of the privacy of their subjects should be paramount. Subjects’ privacy can only be assured by anonymizing the released data.

Usefulness of Data

However, data holders are faced with an additional dilemma: as data is made more anonymous, it becomes less useful. That is, there is an inverse relationship between the anonymity and usefulness of data. For example, a researcher can make much more use of a fully identified database, one that retains all personally identifiable information such as name and address, than of purely aggregate statistics. R.J.A. Little states that methods to anonymize data "are known to reduce the analytic validity of files," because, as Sweeney explains, "any attempt to provide some anonymity protection, no matter how minimal, involves modifying the data and thereby distorting its contents." Thus, from a researcher’s point of view, the ideal is data that has not been modified at all.

The data holder must then determine to what extent the data must be anonymized. This, if possible, can be done on a per-release basis, weighing the subjects’ privacy against a recipient’s purported need for the information. Sweeney suggests that there are cases where the privacy of the data greatly outweighs any possible need by outsiders. This is the case for classified government data, or for a company’s employment records (a company does not want to give away the names of its high performers). In this case, all information is completely suppressed, i.e. no data is released. At the other extreme, there is the case where the recipient’s need overshadows any privacy concerns. In this case, the data is released with no modifications and all subjects completely identified. An example of this case is a public health official’s request for health records.

In between these two cases, however, there is an extremely wide band. Sweeney describes it as a continuum, with the two cases mentioned as the endpoints. She argues that most cases fall somewhere in this continuum and that the problem then becomes that data holders release data that is too distorted in an effort to anonymize, or is easily reidentifiable. That is, they do not achieve the "optimal release of data" — a release of data that is practically useful yet is minimally invasive to subjects’ privacy.

Deidentification

Since the focus of this document is on subjects’ privacy, we direct our attention to the case where a release of personal data is not completely anonymous. Investigators (i.e. Sweeney, other reidentification researchers, and we) have found that many database releases are made public under the mistaken assumption that simply removing explicit identifiers from the databases’ records makes them anonymous. Explicit identifiers are data fields that contain personally identifiable information; Sweeney defines explicit identifiers as, "a set of data elements, such as {name, address}, for which there exists a direct communication method where with no additional information, the designated person could be directly and uniquely contacted." Although they do not fit the definition of explicit identifiers, Social Security numbers are also usually removed from these supposedly anonymous databases because they are in such widespread use and their holders can be identified easily.

The removal of all explicit identifiers from a database is termed deidentification. It is important to note, however, that although a deidentified database may appear anonymous (see Figure 4), it certainly is not. Deidentification is a misnomer, since deidentified data is not equivalent to anonymous data. We define deidentified data simply as data that has undergone deidentification, meaning that explicit identifiers have been removed, generalized, or replaced with fictitious data, whereas anonymous data is data that cannot be manipulated to reidentify the subject of the data.

Figure 4: "Anonymizing" Effect of Deidentification on a Database
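As a minimal sketch of deidentification by removal, the following drops a set of explicit identifier fields from each record. The field names, the sample record, and the helper `deidentify` are hypothetical; generalizing values or replacing them with fictitious data, also mentioned above, is not shown.

```python
# Fields treated as explicit identifiers in this sketch (an assumption).
EXPLICIT_IDENTIFIERS = {"name", "address", "ssn"}


def deidentify(records, explicit=EXPLICIT_IDENTIFIERS):
    """Return copies of the records with explicit identifier fields removed."""
    return [
        {field: value for field, value in record.items() if field not in explicit}
        for record in records
    ]


# Hypothetical identified record.
patients = [
    {
        "name": "Alice Example",
        "address": "1 Sample Rd",
        "ssn": "000-00-0000",
        "zip": "02139",
        "birth_date": "1975-03-02",
        "sex": "F",
        "diagnosis": "condition A",
    },
]

released = deidentify(patients)
remaining_fields = sorted(released[0])
```

The released record still carries ZIP code, birth date, and sex alongside the diagnosis. Those surviving fields are what make the data deidentified rather than anonymous.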

Reidentification

The distinction between deidentified data and anonymous data thus lies in the ability to subject the data to reidentification. Reidentification is the discovery, or determination, of the identity of the individuals who are the subjects of a study through data linkage techniques. The term applies only when the data holders have attempted to deidentify the subjects in some manner. That is, a fully identified database cannot be said to undergo reidentification.

Within the vast amount of personal information that is being collected as part of the ‘data explosion,’ there is personal data that is extremely private for the subjects, data that they would not want to be connected to publicly. A later section provides a few examples of these data sets. In most cases, the data is only publicly available because the subjects have been assured of their privacy: they have been assured that the data will be anonymous. Reidentification, then, raises grave privacy concerns because of the simple fact that it voids the attempts of many researchers to protect the privacy they have guaranteed to their subjects. It is a tool for invasion of privacy, and it will be increasingly possible for reidentification to take place, with much greater ease and by a greater number of people, as the amount of data available continues to grow.

Reidentification is a relatively simple concept. It makes use of what Latanya Sweeney terms ‘quasi-identifiers.’ A quasi-identifier is "a set of data elements in entity-specific data that in combination associates uniquely or almost uniquely to an entity and therefore can serve as a means of directly or indirectly recognizing the specific entity that is the subject of the data." It is a set of characteristics that, combined, can act as a unique or near-unique identifier in the absence of explicit identifiers. For example, the set consisting of a person’s home ZIP code, gender, and birth date does not contain any explicit identifiers, but can be a quasi-identifier since this set can uniquely identify a large percentage of the population. Sweeney found that this quasi-identifier made 87% of the population in the United States unique and identifiable; birth date and full ZIP code alone make 97% of the Cambridge, Massachusetts population identifiable. In short, a few characteristics can make a person unique.

Using an exhaustive control data set, one can determine the quasi-identifier that uniquely identifies the largest number of individuals. An exhaustive control data set is a data set that contains personal information, including explicit identifiers, about a large percentage of the population from which the subjects of a deidentified database are drawn. For example, voter registration lists contain information such as the name, address, ZIP code, birth date, and gender of each voter, in addition to party affiliation and date registered, for a large percentage of the adults in a specific area. Thus, they often make excellent control data sets. It was by analyzing the Cambridge voter list as a control data set that Sweeney found that 97% of its population was uniquely identifiable, and that the quasi-identifier yielding this high percentage was {full ZIP, birth date}. The more specific fields a control data set contains, the better the quasi-identifiers that can be drawn from it. It is also important to note that a control data set does not have to be public. A company, for example, can use its own employee records as a control database, since they contain information about all of its employees.
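One way to search a control data set for a good quasi-identifier is to measure, for each combination of candidate fields, the fraction of records whose values are shared with no other record. The sketch below does this by brute force over a tiny, entirely hypothetical voter list. The function names are illustrative, and this is not a description of the procedure Sweeney used.

```python
from collections import Counter
from itertools import combinations


def unique_fraction(records, fields):
    """Fraction of records whose values for `fields` match no other record."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


def rank_quasi_identifiers(records, candidate_fields, max_size=3):
    """Score every field combination up to `max_size`, best first."""
    scored = []
    for size in range(1, max_size + 1):
        for fields in combinations(candidate_fields, size):
            scored.append((unique_fraction(records, fields), fields))
    # Highest unique fraction first; prefer fewer fields on ties.
    return sorted(scored, key=lambda item: (-item[0], len(item[1])))


# Hypothetical control data set.
voters = [
    {"name": "Voter A", "zip": "02139", "birth_date": "1970-01-01", "sex": "F"},
    {"name": "Voter B", "zip": "02139", "birth_date": "1970-01-01", "sex": "M"},
    {"name": "Voter C", "zip": "02139", "birth_date": "1982-06-15", "sex": "M"},
    {"name": "Voter D", "zip": "02140", "birth_date": "1982-06-15", "sex": "M"},
]

ranking = rank_quasi_identifiers(voters, ("zip", "birth_date", "sex"))
best_score, best_fields = ranking[0]
```

In this toy list no single field distinguishes more than one voter in four, while the three fields together distinguish all of them, which is the effect that makes a quasi-identifier useful.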

A data investigator — anyone with data storage space, (network) access, database software (a DBMS), and interest — can then use a good quasi-identifier to match a large number of the subjects of a deidentified database to the individuals named in the control database. That is, he/she will use data linkage techniques to match the private information in the deidentified database to an identity in the control database using the shared quasi-identifier information as the linking data. Figure 5 illustrates this process.

Figure 5: Linking a Deidentified Database with a Control Database

An Example

This subsection provides a simple, complete example of the reidentification process. We include it in order to better explain the procedure and illustrate how easily anyone can perform reidentification of subjects.

The example deidentified database contains information about subjects who have sexually transmitted diseases (STDs). The subjects considered their diagnosis private information and did not want to be identified as having been diagnosed with an STD. The data collectors guaranteed them that their identities would not be made public when they released their patient data. They thus deidentified their data, believing it was rendered anonymous, before releasing it. Figure 6 depicts the data that they made public.

Figure 6: Deidentified Private Information Made Public

Since all of the subjects live in the same area, as specified by the ZIP code field, and are of voting age, a suitable control database would be the voter registration list for their area. It is depicted in Figure 7 below.

Figure 7: A Control Database - Voter Registration List

A data investigator, looking at the two data sets, sees that both contain ZIP, birth date and sex information. This set of data can then be used as a quasi-identifier. Figure 8 illustrates this overlap in data.

Figure 8: Overlap in Data in the Two Data Sets

The data investigator can then attempt to match the subjects in the deidentified patient database with the individuals in the control database using the quasi-identifier as the basis for linkage of diagnosis to identity. The results of this linkage are shown in Figure 9.

Figure 9: A Reidentified Data Set

Although all of the subjects in our deidentified database were reidentified, this is not always the case. Sometimes the control data set does not contain a match, or contains more than one. It might still be possible to positively reidentify the subjects who fall into these categories, however, by looking more closely at other data fields.
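The linkage walked through in this example, including the cases with no match or with more than one, can be sketched as follows. The data, the field names, and the helper `reidentify` are hypothetical stand-ins for Figures 6 through 9, whose contents are not reproduced here.

```python
from collections import defaultdict

QUASI_IDENTIFIER = ("zip", "birth_date", "sex")


def reidentify(deidentified, control, fields=QUASI_IDENTIFIER):
    """Link deidentified records to named control records via shared fields."""
    # Index the control data set by its quasi-identifier values.
    index = defaultdict(list)
    for person in control:
        index[tuple(person[f] for f in fields)].append(person["name"])

    results = {"unique": [], "ambiguous": [], "unmatched": []}
    for record in deidentified:
        names = index.get(tuple(record[f] for f in fields), [])
        if len(names) == 1:
            results["unique"].append((names[0], record))   # positive match
        elif names:
            results["ambiguous"].append((names, record))   # several candidates
        else:
            results["unmatched"].append(record)            # not in control set
    return results


# Hypothetical voter registration list (control data set).
voters = [
    {"name": "Voter A", "zip": "02139", "birth_date": "1970-01-01", "sex": "F"},
    {"name": "Voter B", "zip": "02139", "birth_date": "1982-06-15", "sex": "M"},
    {"name": "Voter C", "zip": "02139", "birth_date": "1982-06-15", "sex": "M"},
]

# Hypothetical deidentified patient records.
patients = [
    {"zip": "02139", "birth_date": "1970-01-01", "sex": "F", "diagnosis": "condition A"},
    {"zip": "02139", "birth_date": "1982-06-15", "sex": "M", "diagnosis": "condition B"},
    {"zip": "02139", "birth_date": "1990-12-31", "sex": "F", "diagnosis": "condition C"},
]

results = reidentify(patients, voters)
```

Here the first patient links to exactly one voter, the second to two candidates, and the third to none. The ambiguous and unmatched groups are the ones that would need a closer look at other data fields.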

Reasons for Reidentification

The reidentification example illustrated how easy it is to do reidentification. However, we are left with the question: Who would reidentify? In fact, there are many people or entities that would be interested in reidentification of private, deidentified data subjects. This section provides a few reasons for which they may use reidentification.

Scientific Research

Scientific research is one of the main reasons much of the data available is ever collected and shared. As scientists form and test their hypotheses using deidentified data sets, they may find that they need additional information about the subjects in order to complete their research. They may need information that is simply more useful than the deidentified information they have. They may then wish to reidentify the subjects so that they can build a larger profile on each of the subjects, or on a select few.

For example, a medical researcher studying health issues may have a deidentified data set containing certain, general characteristics about some individuals’ medical histories. He finds that a few subjects have data that is unusual, or interesting in some way. If he could identify and contact those subjects in order to obtain more information about them, then it would be greatly beneficial for his research. Although this seems innocent enough, one must consider that some individuals may not want to be contacted or even have their information linked to them by anyone other than their doctor.

Investigative Reporting

Reidentification can be used for many different types of investigative reporting. Reporters may try to link personal information contained in deidentified data sets to celebrities or public officials and report the information gathered about them to the public at large.

Sweeney, in her thesis, provides an event that can be used as an example. She writes, "In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. GIC collected de-identified patient-specific data with nearly one hundred fields of information per encounter along the lines of the fields discussed in the NAHDO list for approximately 135,000 state employees and their families. Because the data were believed to be anonymous, GIC gave a copy of the data to researchers and sold a copy to industry." Among the data subjects were well-known, high-ranking officials, including the governor. Obviously, if his personal medical data could be reidentified, then the press could quickly make his private medical information public. Actually, Sweeney writes that the governor’s data could be uniquely identified using only his birth date, sex, and five-digit ZIP code.

Marketing

Marketing provides the impetus for much of the increased data collection characteristic of recent years. Marketers want to build the largest possible profiles of consumers in order to do more direct marketing. This would allow them to increase profits by narrowing the number of people they market certain products to, while, at the same time, increasing the probability of success for each direct marketing target.

Just recently, Doubleclick, Inc., an online marketing firm that tracks users’ browsing habits, sought to reidentify many of its subjects by buying a consumer database. Although it was thwarted by its own privacy policy, the privacy danger was real. Doubleclick would have been in a position to link individuals to their browsing habits and to sell this information to other product or service providers.

Blackmail

Blackmail is another possible motive for reidentification. Although reidentification may not seem well suited to targeting one particular individual, a malicious data investigator could reidentify celebrities, public officials, or anyone else with very personal information on record, and then threaten to make that information public unless the reidentified individual meets some demand.

There are already public databases that contain arrest data for certain police districts. If all such information were made available, then a data investigator could surely reidentify well-known individuals with their arrest record. They could then attempt to blackmail the individuals by threatening to make their record public.

Insurance

Health and life insurance companies have a very real motive for attempting to do reidentification. This may be another reason that the medical field has been attempting to bring attention to the reidentification issue. These insurance companies can attempt to reidentify individuals in deidentified hospital discharge data, which is widely available, or other patient data in order to collect a greater amount of information regarding individuals’ medical histories. They can then use this greater amount of information to deny certain individuals any type of insurance policy.

Political Action

Yet another reason for attempting reidentification is political. Recently, there was a case where an anti-abortion group posted the names and addresses of doctors who performed abortions. As doctors were killed, their names would be crossed out on this list. Now, however, with reidentification, it would be possible to identify the actual women who have had abortions. This is a frightening possibility, since public disclosure of their identities might subject them to harassment and danger, as well as discourage other women from seeking abortions.

Reidentification of women who have had abortions would be possible because hospitals and clinics collect and share a great amount of patient data. Within this data is also information regarding procedures performed, including abortions. A political activist could then separate out the subjects who are indicated as having had abortions and try to reidentify them.

Database Selection Criteria

Upon deciding we wanted to conduct reidentification experiments, our first step was to locate a database that contained deidentified data. In addition, the candidate data set had to have certain properties in order to be most useful to us. Specifically, it had to be small, since we had never done this before and wanted to start by working on a tractable problem that could be analyzed quickly without expending a great deal of time or computational resources. We also wanted the candidate database to contain incriminating or embarrassing information about the individuals that had been deidentified. After all, there is little point in expending a great deal of energy to reidentify people only to discover trivia. Trivial information about individuals is much less likely to be well-protected using strong deidentification techniques, and as a result, is unlikely to be representative of the challenges involved in reidentifying important data (like health care information).

Another criterion for our candidate data set was that it had to be easy to verify. From the beginning, we felt it was important to not only make successful reidentifications, but to have some method of verifying the legitimacy of those matches. While such considerations are significantly less important in a commercial setting because the cost of being wrong is so low, we did not conduct our experiments in such an environment. In addition, we wanted to focus on an area that had not been as widely explored as medical data. In particular, Latanya Sweeney has written a great deal on that subject and we feel there is little we could contribute in that area. Finally, we required that the candidate data set be available to the public at large for free or for a nominal fee. While many large corporations and government entities maintain large deidentified data sets for internal use, we felt the best way to illustrate the threat from reidentification would be to work only from publicly available sources.

We eventually settled on the Chicago Homicide Data set since it met all the criteria listed above. It was small, contained a wealth of embarrassing information, and was freely available to the general public. Additionally, it was in an area that had not received the strict privacy analysis and regulatory burdens that health care data had recently undergone. The data set contained enough personally identifying fields to make reidentification at least plausible and initially appeared to be easily verifiable, although this later turned out to not be the case.