How can I de-identify and re-identify data?

Some of the data I work with contains confidential information (names of individuals, dates, locations, etc.). But sometimes I have to share the “numbers” with other people, either to get help with the statistical analysis or to process them on more powerful machines where I can’t control who sees the data.

Ideally, I would like to work as follows:

  • Read the data into R (look at it, clean it, etc.).
  • Take the data frame I want to de-identify, run it through the package, and get two “files”: the de-identified data and a translation (key) file. I keep the latter (see the sketch after this list for the kind of interface I mean).
  • The de-identified data can be shared and processed without worry.
  • I re-identify the processed data using the translation file.
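Roughly, the workflow I am imagining looks like the sketch below. The function names (deidentify, reidentify), the example data frame mydata, and the column names are all made up, purely to illustrate the round trip:

 # Hypothetical interface -- no such package exists yet, as far as I know
 deidentify <- function(df, cols) {
   key <- lapply(df[cols], function(x) {
     u <- unique(x)
     setNames(u, paste0("id", seq_along(u)))        # "idN" -> original value
   })
   for (cl in cols) df[[cl]] <- names(key[[cl]])[match(df[[cl]], key[[cl]])]
   list(data = df, key = key)                       # share 'data', keep 'key'
 }

 reidentify <- function(df, key) {
   for (cl in names(key)) df[[cl]] <- unname(key[[cl]][df[[cl]]])
   df
 }

 res <- deidentify(mydata, cols = c("person", "location"))  # 'mydata' is made up
 saveRDS(res$key, "translation_key.rds")                    # this file stays with me
 # ... share res$data, get it back after processing ...
 restored <- reidentify(processed, readRDS("translation_key.rds"))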

I believe this can also be useful when uploading data for processing in the cloud (Amazon, etc.).

Have you been in this situation? At first I thought about writing my own “randomize” function, but then I realized there is no end to how tricky this gets (for example, offsetting timestamps without losing their order). Maybe an established method or tool already exists?
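For the timestamp example, the naive idea I had is something like this rough sketch (event_time is a made-up column name): shift every timestamp by one secret constant offset, which hides the real dates but keeps their order and spacing intact:

 offset <- as.difftime(sample(100:5000, 1), units = "days")  # the secret
 masked <- mydata$event_time + offset    # works for Date and POSIXct columns
 # order and intervals are preserved, so time-based analyses still make sense
 restored <- masked - offset             # subtract the offset to re-identify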

Thanks to everyone who contributes to the [r] tag here on Stack Overflow!

3 answers

One way to do this is with match. First I create a small data frame:

 foo <- data.frame(person=c("Mickey","Donald","Daisy","Scrooge"), score=rnorm(4))
 foo
    person       score
 1  Mickey -0.07891709
 2  Donald  0.88678481
 3   Daisy  0.11697127
 4 Scrooge  0.31863009

Then I make a key:

 set.seed(100)
 key <- as.character(foo$person[sample(1:nrow(foo))])

Obviously you should save this key somewhere. Now I can encode the people:

 foo$person <- match(foo$person, key)
 foo
   person      score
 1      2  0.3186301
 2      1 -0.5817907
 3      4  0.7145327
 4      3 -0.8252594

If I want the people's names back, I can index into key:

 key[foo$person]
 [1] "Mickey"  "Donald"  "Daisy"   "Scrooge"

Or use transform; this also works if the data has changed, as long as the person IDs remain unchanged:

 foo <- rbind(foo, foo[sample(1:4),], foo[sample(1:4,2),], foo)
 foo
    person      score
 1       2  0.3186301
 2       1 -0.5817907
 3       4  0.7145327
 4       3 -0.8252594
 21      1 -0.5817907
 41      3 -0.8252594
 31      4  0.7145327
 15      2  0.3186301
 32      4  0.7145327
 16      2  0.3186301
 11      2  0.3186301
 12      1 -0.5817907
 13      4  0.7145327
 14      3 -0.8252594
 transform(foo, person=key[person])
     person      score
 1   Mickey  0.3186301
 2   Donald -0.5817907
 3    Daisy  0.7145327
 4  Scrooge -0.8252594
 21  Donald -0.5817907
 41 Scrooge -0.8252594
 31   Daisy  0.7145327
 15  Mickey  0.3186301
 32   Daisy  0.7145327
 16  Mickey  0.3186301
 11  Mickey  0.3186301
 12  Donald -0.5817907
 13   Daisy  0.7145327
 14 Scrooge -0.8252594
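To spell out the “save this key somewhere” step (this is just one way to do it; the file name is arbitrary), you could persist the key with saveRDS and use it to re-identify after the processed data comes back:

 saveRDS(key, "person_key.rds")    # keep this file out of the shared data set
 # ... share foo, get it back after processing ...
 key <- readRDS("person_key.rds")
 foo$person <- key[foo$person]     # restore the real names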

Could you just assign a GUID to each row and then remove all the sensitive information from it? As long as your colleagues without security clearance only ever deal with the GUID, you can merge back any changes and additions they make simply by joining on the GUID. Then it's just a matter of creating ersatz values for the columns whose data you cleared: LastName1, LastName2, City1, City2, and so on.

EDIT: You would have one lookup table per cleared column, e.g. City, State, Zip, FirstName, LastName, each containing the distinct real (classified) values of that column and an integer. So “Jones” might be represented in the sanitized data set as LastName22, “Schenectady” as City343, and “90210” as Zipcode716. This gives your colleagues real values to work with (for example, they will have the same number of distinct cities as your real data, only with anonymized names), and the relationships in the anonymized data are preserved.

EDIT2: If the goal is to give your colleagues data that is still statistically meaningful, then date columns will need special handling. For example, if your colleagues need to compute statistics on people's ages, you should give them something close to the original date: not so close that it could identify anyone, but not so far off that it distorts the analysis.
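A rough R sketch of that idea (the data frame df, its lastname column, and the merge-back step are my own illustration, not part of the answer above): build a lookup table for each sensitive column, swap the real values for the generated labels before sharing, and join the lookup table back on afterwards.

 # Lookup table for one cleared column, e.g. last names
 lastname_key <- data.frame(
   real  = unique(df$lastname),
   alias = paste0("LastName", seq_along(unique(df$lastname))),
   stringsAsFactors = FALSE
 )

 # Sanitize: replace real values with their aliases (row_id stands in for a GUID)
 df$row_id   <- seq_len(nrow(df))
 df$lastname <- lastname_key$alias[match(df$lastname, lastname_key$real)]

 # Re-identify later by joining the lookup table back on the alias
 df <- merge(df, lastname_key, by.x = "lastname", by.y = "alias")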


Sounds like a statistical disclosure control problem. Check out the sdcMicro package.

EDIT: Just realized you have a slightly different problem. The point of statistical disclosure control is to deliberately “perturb” the data in order to reduce the risk of disclosure. By perturbing the data you lose some information: that is the price you pay for the reduced disclosure risk. Your data will contain less information, so your analyses may give different or weaker results than an analysis of the original data.

It depends on what you are going to do with your data.
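To illustrate the trade-off in plain R (this is not sdcMicro's own interface, just a toy example of additive noise, one kind of perturbation such packages implement): the more noise you add before sharing a column, the lower the disclosure risk, but summary statistics start drifting away from the originals.

 set.seed(1)
 income <- rlnorm(1000, meanlog = 10)                   # pretend this is the sensitive column
 noisy  <- income + rnorm(1000, sd = 0.5 * sd(income))  # additive noise before sharing

 c(mean(income), mean(noisy))    # means stay close
 c(sd(income), sd(noisy))        # the spread is inflated -> information loss
 cor(income, noisy)              # individual values no longer match exactly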
