Edwards, A. & Schatschneider, C. (2020). 5 things to check for data de-identification. Available at https://venngage.net/ps/5p6yjaAGTSs/new-5-things-to-check-for-data-deidentification. Reuse available under a CC BY 4.0 license.
Data De-identification Guide
Paper by Schatschneider, Edwards & Shero, 2021, available under a CC BY 4.0 license.
With any human-subject data, we must always be concerned with protecting the confidentiality of our participants when sharing data. We can do this through data de-identification. There are a number of steps you can take to minimize the risk of a participant being identified in your open-access data. The first step is to remove any obvious identifiers such as names, addresses, phone numbers, birthdates, social security numbers, and school and teacher names. When studying children, age is usually a particularly important variable, so consider converting birthdates and test dates to a calculation of age at each collection point.
Other demographic variables also need attention, such as race/ethnicity, SES, and English language proficiency. For example, there may be a small group of students who identify with a racial/ethnic group, or who have a language proficiency status, that has very small representation in the sample. One possible strategy would be to bin all the students from these small groups into a larger group called “Other”. Another strategy would be to remove these variables from the dataset altogether, but report the breakdown of race/ethnicity, SES, or English language proficiency as a frequency. It is also important to consider these identifying variables in concert with each other. That is, it may be possible to “cross-tabulate” demographic information such that the combination of variables allows for the identification of a single individual.
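As an illustration of converting birthdates and test dates to age, here is a minimal R sketch. The data frame and its column names (id, birthdate, test_date) are hypothetical, and dividing by 365.25 is only an approximation of age in years.

```r
# Hypothetical example data; the column names are assumptions for illustration.
dataset <- data.frame(
  id        = 1:3,
  birthdate = as.Date(c("2010-03-15", "2011-07-01", "2009-12-30")),
  test_date = as.Date(c("2018-09-10", "2018-09-10", "2018-09-12"))
)

# Age in years at testing, rounded to one decimal place.
# 365.25 approximates the average number of days in a year.
days_old <- as.numeric(difftime(dataset$test_date, dataset$birthdate, units = "days"))
dataset$age_at_test <- round(days_old / 365.25, 1)

# Drop the raw dates so that only the derived age is shared.
dataset$birthdate <- NULL
dataset$test_date <- NULL
```

Rounding age more coarsely (for example, to whole years) gives up precision in exchange for making exact birthdates harder to reconstruct.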
For some studies, the removal or binning of this information alone will be sufficient. However, if your data are at heightened risk of being sorted or disaggregated in such a way that a single person could become identifiable, there are several additional strategies you can employ to increase confidentiality.
Each of the following strategies can be used to further reduce the likelihood that a single case will be identified in your dataset.
First, you can use truncation (top- or bottom-coding) to recode any extreme values so that no single person has the lowest or highest value; instead, perhaps the lowest or highest 5 people in the study receive the same score. As a similar option, you can use rounding to recode values into larger bins across the entire distribution, so that no case is the only case with any given value.
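One way to make the lowest and highest 5 people share a score is to truncate at the 5th lowest and 5th highest observed values. The sketch below uses made-up scores, and k = 5 simply follows the example in the text.

```r
# Hypothetical scores; k = 5 follows the "lowest or highest 5 people" example.
scores <- c(52, 61, 63, 64, 66, 67, 70, 71, 73, 74, 75, 78, 80, 95, 99)
k <- 5

sorted    <- sort(scores)
lower_cut <- sorted[k]                       # 5th lowest value
upper_cut <- sorted[length(sorted) - k + 1]  # 5th highest value

# pmax() raises values below lower_cut; pmin() lowers values above upper_cut.
scores_truncated <- pmin(pmax(scores, lower_cut), upper_cut)

table(scores_truncated)
```

After truncation, at least k cases share the minimum and at least k share the maximum, so no single person holds an extreme value alone.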
Another method is to deliberately introduce noise: adding a small random number to, or subtracting it from, each observed score so that each subject's exact score cannot be known. A similar noise-introducing strategy is to randomly replace one individual's reported value with the average of their small group (also called blurring). These strategies should be employed with caution, as adding a small amount of random noise can reduce estimated relationships, and blurring can cause an overestimation of relationships.
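Both strategies can be sketched in a few lines of R. The data, the column names (group, score), the noise standard deviation of 0.5, and the number of blurred cases are all assumptions chosen for illustration; appropriate values depend on the scale of your measure.

```r
set.seed(42)  # fixed seed only so that the sketch is reproducible

# Hypothetical data; 'group' and 'score' are assumed column names.
dataset <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  score = c(10, 12, 11, 13, 20, 22, 21, 19, 30, 29, 31, 32),
  stringsAsFactors = FALSE
)

# Noise addition: add a small zero-mean random number to every score.
# sd = 0.5 is an assumption; it controls how far values can move.
dataset$score_noisy <- dataset$score + rnorm(nrow(dataset), mean = 0, sd = 0.5)

# Blurring: replace the values of a random subset of cases with the
# mean of their group. ave() returns the group mean for every row.
group_mean <- ave(dataset$score, dataset$group, FUN = mean)
blur_rows  <- sample(nrow(dataset), size = 3)
dataset$score_blurred <- dataset$score
dataset$score_blurred[blur_rows] <- group_mean[blur_rows]
```

Comparing correlations computed on the original and altered scores before sharing is one way to gauge how much these steps distort estimated relationships.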
Finally, rank swapping (also called switching) is particularly useful when geographical information (such as school or district) might be ascertainable. In rank swapping, subjects across schools might be matched on all relevant variables and then swapped, with the idea that some of the subjects in a school may not in fact be from that school, but their data are close enough to represent the student being swapped. This should also be approached with caution.
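The following R sketch shows a simplified form of swapping: students are grouped into strata that match on chosen variables, and school labels are shuffled within each stratum. It is not a full rank-swapping algorithm, and the data, column names, and choice of matching variables (grade and gender) are hypothetical.

```r
set.seed(123)  # fixed seed only so that the sketch is reproducible

# Hypothetical data; all column names are assumptions for illustration.
dataset <- data.frame(
  school = rep(c("School1", "School2"), each = 4),
  grade  = rep(c(3, 3, 4, 4), times = 2),
  gender = rep(c("F", "M"), times = 4),
  score  = c(88, 75, 92, 81, 85, 79, 90, 84),
  stringsAsFactors = FALSE
)

# Strata of students who match on the chosen variables.
stratum <- interaction(dataset$grade, dataset$gender, drop = TRUE)

# Within each stratum, shuffle the school labels among matched students.
# sample.int() on the length avoids surprises with single-member strata.
dataset$school_swapped <- ave(
  dataset$school, stratum,
  FUN = function(x) x[sample.int(length(x))]
)
```

Because labels are only permuted within strata, the number of students per school within each stratum is unchanged, while any individual student's recorded school may differ from the true one.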
The overall goal of these strategies is to try to ensure confidentiality. If you still feel that the risk of sharing your data is too great, or if there is no way to conceivably protect confidentiality, one last-resort possibility is to not share any individualized data. Instead, you can choose to only share summary statistics such as means, standard deviations, sample sizes and covariance matrices, and include them broken down by as many subgroups as possible. Even these summary statistics can provide opportunities for re-examination and reanalysis.
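As a sketch of this last-resort option, the R code below computes sample sizes, means, standard deviations, and covariance matrices by subgroup. The data and variable names (group, reading, math) are hypothetical.

```r
# Hypothetical data; 'group', 'reading', and 'math' are assumed names.
dataset <- data.frame(
  group   = rep(c("A", "B"), each = 5),
  reading = c(10, 12, 11, 13, 14, 20, 22, 21, 19, 23),
  math    = c(15, 14, 16, 18, 17, 25, 27, 24, 26, 28),
  stringsAsFactors = FALSE
)
measures <- c("reading", "math")

# Split the measures by subgroup, then summarize each piece.
summaries <- lapply(split(dataset[measures], dataset$group), function(d) {
  list(
    n    = nrow(d),       # sample size
    mean = colMeans(d),   # mean of each measure
    sd   = sapply(d, sd), # standard deviation of each measure
    cov  = cov(d)         # covariance matrix of the measures
  )
})

summaries$A$cov  # covariance matrix for subgroup A
```

Subgroups that are themselves very small may still pose a disclosure risk, so the same cell-size checks apply to the summaries you choose to report.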
We have provided an infographic on LDbase to help guide the de-identification process. Once you reach Step 5, you are ready to begin the last step: investigating whether combinations of variables might identify a particular individual. We have provided some example code in SPSS, SAS and R to help in this process. Start by checking the frequency distribution of each potentially identifying variable.
To do this in R:
table(dataset$variable)
To do this in SAS:
PROC FREQ DATA=dataset ; TABLES variable /list; RUN;
To do this in SPSS:
- Click on Analyze -> Descriptive Statistics -> Frequencies.
- Move the variable of interest into the right-hand column.
- Click on the Charts button, select Histograms, and then press the Continue button.
- Click OK to generate a frequency distribution table.
If any group has a frequency less than 5 then this could potentially contribute to re-identification. In order to lessen this risk, there are a few things that you can do:
For categorical groups:
Recode the variable to contain larger groups. For example, if you have 30 White participants, 15 Black participants, 3 Asian participants, and 2 Native American participants in your sample, consider creating an “Other” category to encompass both Asian and Native American participants.
To do this in R:
# Copy the original variable, then collapse the small groups into "Other"
dataset$race_recoded <- dataset$race
dataset$race_recoded[dataset$race == "Asian"] <- "Other"
dataset$race_recoded[dataset$race == "Native American"] <- "Other"
To do this in SAS:
data dataset;
  set dataset;
  if race = "Asian" then race = "Other";
  if race = "Native American" then race = "Other";
run;
To do this in SPSS:
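* A string target must be declared before RECODE; the width A20 is an assumption.
STRING race_recoded (A20).
RECODE race ('Asian' = 'Other') ('Native American' = 'Other') (ELSE = COPY) INTO race_recoded.
EXECUTE.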
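When a variable has many categories, the small ones can be found and collapsed programmatically rather than named one by one. This R sketch rebuilds the example sample from the text; the threshold of 5 follows the frequency rule above, and the object names min_n and small are hypothetical.

```r
# Hypothetical data reproducing the example counts from the text.
dataset <- data.frame(
  race = c(rep("White", 30), rep("Black", 15),
           rep("Asian", 3), rep("Native American", 2)),
  stringsAsFactors = FALSE
)

min_n  <- 5                              # minimum acceptable group size
counts <- table(dataset$race)            # frequency of each category
small  <- names(counts)[counts < min_n]  # categories below the threshold

# Collapse every small category into "Other"; keep the rest unchanged.
dataset$race_recoded <- ifelse(dataset$race %in% small, "Other", dataset$race)

# Re-check: the combined "Other" group could itself still be too small.
table(dataset$race_recoded)
```

Re-running the frequency table afterward matters because pooling small groups does not guarantee that the pooled group reaches the threshold.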
For continuous or ordered categories:
One possibly identifying variable is one that contains extreme values that single out an observation from the rest. For example, suppose a variable like income has most values ranging from $20,000 to $200,000 and only a handful of values above $200,000. A salary of $400,000 is instantly identifiable when there is only one, so the de-identification technique called top- or bottom-coding will help to eliminate this instant identification. Top- or bottom-coding involves the truncation of extreme codes for a given variable such that any values above (or below) a given value are given that value. In this example, all individuals with incomes greater than $200,000 will be given the code of $200,000 rather than their actual numerical value. This puts multiple extreme values into one category, preventing identification based on unique values while maintaining maximal information provided by that variable.
To do this in R:
# Copy the original variable, then cap values above 200000 at 200000
dataset$income_recoded <- dataset$income
dataset$income_recoded[dataset$income > 200000] <- 200000
To do this in SAS:
data dataset;
  set dataset;
  if income > 200000 then income = 200000;
run;
To do this in SPSS:
IF (income GT 200000) income = 200000.
EXECUTE.
Another possible identification problem with a numerical variable like income is that the unique values themselves may lead to identification. For example, if there is only one value of $23,247, even though there are plenty of other values right around it, that exact value may make it possible to identify that observation. Recoding into intervals or rounding eliminates this issue. When recoding into intervals or rounding, values within a given range are all given the same value. This can involve either setting specific ranges of values at a given interval or rounding to a level at which multiple values are contained within each group. For example, here we could use a $10,000 interval for income such that all values between $20,000 and $30,000 are grouped together, then $30,000-$40,000, and so on. This maintains the relative value of the variable while eliminating the exact numerical value that exposes it to re-identification problems.
To do this in R:
# Start with an empty label, then assign each income to its interval
dataset$income_recoded <- ""
dataset$income_recoded[dataset$income < 10000] <- "0-10000"
dataset$income_recoded[(dataset$income >= 10000) & (dataset$income < 20000)] <- "10000-20000"
dataset$income_recoded[(dataset$income >= 20000) & (dataset$income < 30000)] <- "20000-30000"
dataset$income_recoded[(dataset$income >= 30000) & (dataset$income < 40000)] <- "30000-40000"
# …and so on
To do this in SAS:
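data dataset;
  set dataset;
  /* Declare the length first; otherwise the first assignment fixes it at 7 characters and longer labels are truncated */
  length income_recoded $ 11;
  if income < 10000 then income_recoded = "0-10000";
  if income >= 10000 and income < 20000 then income_recoded = "10000-20000";
  if income >= 20000 and income < 30000 then income_recoded = "20000-30000";
  /* …and so on */
run;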
To do this in SPSS:
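* A string target must be declared before RECODE; the width A11 fits the longest label.
* Ranges are checked in order, so a boundary value such as 10000 falls in the first range that contains it.
STRING income_recoded (A11).
RECODE income (Lowest thru 10000 = '0-10000') (10000 thru 20000 = '10000-20000') (20000 thru 30000 = '20000-30000') /* and so on */ INTO income_recoded.
EXECUTE.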
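In R, the interval recoding can also be written without listing each range by hand, and rounding is a one-line alternative. The income values below are made up, and the upper break of 50000 is an assumption that would need to cover the real range of the data.

```r
# Hypothetical incomes, including the $23,247 example from the text.
income <- c(8500, 23247, 23900, 27100, 31050, 38999, 40000)

# cut() assigns each value to an interval. right = FALSE makes intervals
# closed on the left, so 20000 up to (but not including) 30000 is one bin.
# dig.lab = 6 keeps the labels from switching to scientific notation.
breaks <- seq(0, 50000, by = 10000)
income_interval <- cut(income, breaks = breaks, right = FALSE, dig.lab = 6)

# Rounding to the nearest 10,000 (a negative 'digits' rounds to tens,
# hundreds, and so on).
income_rounded <- round(income, -4)

table(income_interval)
```

The two approaches group values differently: intervals keep everything from 20,000 up to 30,000 together, while rounding groups values around each multiple of 10,000, so the choice should be reported alongside the shared data.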
Furthermore, the combination of multiple demographic variables may lead to cell sizes smaller than 5. Check the frequency in each cell when variables are combined.
To do this in R:
table(dataset$variable1, dataset$variable2, dataset$variable3)
To do this in SAS:
PROC FREQ DATA=dataset ; TABLES variable1 * (variable2 variable3) / nocol norow nopercent list; RUN;
To do this in SPSS:
To create a crosstab, click Analyze > Descriptive Statistics > Crosstabs.
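With several variables, scanning a cross-tabulation by eye becomes error-prone, so it can help to list only the cells that fall below the threshold. This R sketch uses hypothetical data and variable names, and treats cells with a count of 1 to 4 as small, following the cell size of 5 mentioned above.

```r
# Hypothetical data; 'variable1' and 'variable2' stand in for demographics.
dataset <- data.frame(
  variable1 = c(rep("A", 12), rep("B", 8)),
  variable2 = c(rep("X", 6), rep("Y", 6), rep("X", 7), rep("Y", 1)),
  stringsAsFactors = FALSE
)

# Turn the cross-tabulation into one row per cell with its count.
cells <- as.data.frame(table(dataset$variable1, dataset$variable2),
                       stringsAsFactors = FALSE)
names(cells) <- c("variable1", "variable2", "n")

# Keep cells that are occupied but hold fewer than 5 cases.
# Empty cells (n == 0) contain no one, so they are not flagged.
small_cells <- cells[cells$n > 0 & cells$n < 5, ]
small_cells
```

Each row of small_cells is a combination of values that may need recoding or removal before the data are shared.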
Individual variables should be recoded using the above in a way that creates larger cell sizes when combined with other variables. If recoding cannot eliminate small cell sizes, consider removing a problematic variable before uploading.
To do this in R:
# Keep every column except variable1 (safe even if variable1 is absent)
Newdata <- dataset[, colnames(dataset) != "variable1", drop = FALSE]
To do this in SAS:
data dataset;
  set dataset;
  drop variable1;
run;
To do this in SPSS:
To delete an existing variable from a dataset:
- In the Data View tab, click the column name (variable) that you wish to delete. This will highlight the variable column.
- Press Delete on your keyboard, or right-click on the selected variable and click “Clear.” The variable and associated values will be removed.
Infographic