Researchers collecting data about individuals have various obligations to meet if they wish to share or publish the data. They may be required to seek written consent from participants to share their data, or ensure that it is de-identified to a point where it can’t be re-identified. The following gives two case studies of occasions where data custodians thought that they had adequately de-identified data, but close examination found that re-identification may be possible.
Defining personal information
Under the NSW Privacy and Personal Information Protection Act, personal information is
“any information or opinion (including information or an opinion forming part of a database and whether or not recorded in a material form) about an individual whose identity is apparent or can reasonably be ascertained from the information or opinion”.
Personal information can include:
Name, address, email address, phone number, date of birth, photographs, voice and video recordings, biometric information, IP address, location information from a mobile device, and government identifiers such as tax file number, Medicare number and driver’s licence number.
Sensitive personal information includes information relating to an individual’s ethnic or racial origin, political opinions, religious or philosophical beliefs, trade union membership or sexual activities. The Australian Privacy Act recognises sensitive personal information as a special class of personal information.
The NSW Health Records and Information Privacy Act defines health information as personal information that is about an individual’s physical or mental health, health services they have received, genetic information, healthcare identifiers, and information collected in the course of providing health care or donating body parts or organs.
De-identification
De-identifying data is not a straightforward process, and researchers must take a number of factors into consideration before publishing data. As well as direct identifiers within the dataset itself, contextual information may make it possible to identify individuals. Thus, even if researchers have removed direct identifiers, the dataset may not be suitable for publication if participants consented only to the sharing of de-identified data. Refer to the de-identification pages from the privacy office for further guidance.
Case study: External data sources
In 2016 the Australian Department of Health published a dataset of Medicare and Pharmaceutical Benefits Scheme (PBS) data that researchers at the University of Melbourne were able to demonstrate had not been adequately de-identified. They were able to re-identify individuals by cross-referencing surgery information with media reports of prominent individuals’ surgery, and by using the dates on which a woman gave birth. Additionally, the researchers found that they could use the provider numbers in the dataset to identify multiple patients of the same medical practice. The Australian Privacy Commissioner found that the Department of Health had committed a privacy breach in publishing the dataset. The important lesson in this case is that a dataset can be re-identifiable even when no casual observer could identify anyone in it: it is enough that someone with additional contextual information could do so.
Case study: A survey of cancer patients
Researchers from UTS sought to publish responses to a survey of cancer patients about their experiences. All survey respondents provided information about the type of cancer they had had, when it had been diagnosed, how it had been treated, and their current status. In addition, respondents were given a number of opportunities to provide contextual comments throughout the survey. Close reading of these comments identified several instances of information that could, in context with other information, identify respondents. These included:
Contact details: One individual included their email address in their response. Email addresses, along with telephone numbers, are generally considered to be personal information.
Dates of military service: The dates of someone’s tenure of a job, including military service, can be used in combination with other data to identify them.
Treatment centre: Information about the specific centre that an individual attended, alongside the information about their cancer type, dates, and treatment could allow identification.
Participation in a support program: Several members of a sporting team for cancer survivors mentioned joining this team, making them potentially identifiable to each other. Several other respondents mentioned a specific support or exercise program that they had accessed. Depending on the size of the group or program, this may make them identifiable to other participants.
Lessons
Researchers need to be alert to the possibility that participants in a study may be identifiable from a dataset even if direct identifiers have been removed. Factors that may make participants identifiable include membership of a small population (e.g., having a rare disease, being a member of a small team in an organisation), or being an outlier in a study (e.g., significantly older than most of the cohort).
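One way to look for small populations and outliers before publication is to count how many records share each combination of indirect identifiers. The following is a minimal sketch only: the field names (age_band, cancer_type), the sample records and the threshold k are hypothetical, and an appropriate threshold depends on the dataset and on advice from the privacy office.

```python
from collections import Counter

def small_groups(records, quasi_identifiers, k=5):
    """Return each combination of quasi-identifier values shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical records: one respondent is much older than the rest of the cohort.
records = [
    {"age_band": "40-49", "cancer_type": "breast"},
    {"age_band": "40-49", "cancer_type": "breast"},
    {"age_band": "40-49", "cancer_type": "breast"},
    {"age_band": "90+", "cancer_type": "breast"},
]

risky = small_groups(records, ["age_band", "cancer_type"], k=3)
print(risky)  # {('90+', 'breast'): 1}
```

Any combination reported here describes a group small enough that its members may stand out. A count above the threshold does not by itself show that the data is safe to publish.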
Researchers need to take into consideration the possibility that external information may make an individual identifiable. For example, an athlete whose knee reconstruction was reported in the media may be identifiable in a dataset of knee surgery patients. Participants in a study may also be able to identify each other from details in the data, for example by recognising that they attended the same program.
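The knee surgery example can be illustrated with a toy linkage check: if an external record agrees with exactly one row of the released data on the fields both sources share, that row is effectively re-identified. This is a sketch under assumptions, not a description of how any real dataset was analysed. The field names, record IDs and the "Athlete A" entry are all invented for illustration.

```python
def link(released, external, keys):
    """Map each external person to a released record when exactly one row matches on all keys."""
    matches = {}
    for person in external:
        candidates = [
            row["record_id"]
            for row in released
            if all(row[k] == person[k] for k in keys)
        ]
        # A single candidate means the external details single out one record.
        if len(candidates) == 1:
            matches[person["name"]] = candidates[0]
    return matches

# Hypothetical released data with names removed.
released = [
    {"record_id": "r1", "procedure": "knee reconstruction", "month": "2015-03"},
    {"record_id": "r2", "procedure": "knee reconstruction", "month": "2015-07"},
    {"record_id": "r3", "procedure": "appendectomy", "month": "2015-03"},
]

# Hypothetical details taken from a media report.
external = [
    {"name": "Athlete A", "procedure": "knee reconstruction", "month": "2015-07"},
]

matches = link(released, external, ["procedure", "month"])
print(matches)  # {'Athlete A': 'r2'}
```

Once a record is linked in this way, every other field in that row, such as other treatments or dates, becomes attributable to the named person.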
Researchers seeking to publish survey data should pay particular attention to free-text responses when preparing their dataset, as participants may include information that could identify them, either directly or in context with other data.
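A simple automated pass can help flag free-text responses that contain patterned identifiers such as email addresses or phone numbers. The sketch below is illustrative only: the regular expressions are rough heuristics (the phone pattern assumes Australian-style numbers), and the sample responses are invented.

```python
import re

# Heuristic patterns only; they will miss some identifiers and may flag false positives.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumes Australian-style numbers: a leading 0 or +61 followed by nine digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\+?61|0)[\s-]?\d(?:[\s-]?\d){8}(?!\d)")

def flag_free_text(responses):
    """Return (index, kinds) for each response that appears to contain contact details."""
    flagged = []
    for index, text in enumerate(responses):
        hits = []
        if EMAIL_RE.search(text):
            hits.append("email")
        if PHONE_RE.search(text):
            hits.append("phone")
        if hits:
            flagged.append((index, hits))
    return flagged

# Invented sample responses.
responses = [
    "The nurses were wonderful.",
    "Happy to talk more, email me at jane@example.org",
    "Call me on 0400 000 000 if you need details.",
]

flagged = flag_free_text(responses)
print(flagged)  # [(1, ['email']), (2, ['phone'])]
```

A scan like this cannot recognise contextual details such as a named treatment centre, dates of military service or membership of a support program, so it supplements a close reading of the comments and does not replace it.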