Techniques for obfuscating cluster data and maintaining privacy in R

Question

background

I have some personal survey data that contains a column of confidential information: the geographic location of the survey respondents. Under no circumstances may this information be released.

As is usually the case in research studies, for users to correctly calculate variance with my survey data set, they will need either this geographic location (unacceptable) or, alternatively, a set of replicate weights. I can create this set of replicate weights; however, it is fairly easy to look at the correlations between these weights and back-calculate which respondents share a geographic location. This is also unacceptable.

To help me with this, you don't have to be familiar with replicate weights: just think of them as several columns of highly correlated cluster data.

I understand that if I want to maintain this clustering, a malicious data user will always be able to make semi-decent guesses about who shares a geographic location; I just want to make this guessing game less accurate. With un-obfuscated replicate weights, an evil data user can identify 100% of cases.
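To make the problem concrete, here is a minimal sketch on simulated data. The JK1-style (drop-one-cluster) replication scheme, the column names, and the cluster sizes are all assumptions chosen for illustration; they are not my real design. The point is only that, without obfuscation, respondents in the same cluster carry identical replicate-to-base weight ratios, so grouping rows by that pattern recovers who shares a location.

```r
set.seed(1)

# Simulated confidential column: 10 clusters of 5 respondents each (assumed sizes)
n_clusters  <- 10
cluster     <- rep(seq_len(n_clusters), each = 5)
base_weight <- runif(length(cluster), 50, 150)

# Assumed JK1-style replication: replicate r zeroes out cluster r and
# scales every other cluster up by n_clusters / (n_clusters - 1)
rep_factor <- sapply(seq_len(n_clusters), function(r) {
  ifelse(cluster == r, 0, n_clusters / (n_clusters - 1))
})
rep_weights <- base_weight * rep_factor

# What would be released: base weight plus replicate weights, no cluster column
released <- data.frame(base_weight, rep_weights)
colnames(released) <- c("weight", paste0("repwt", seq_len(n_clusters)))

# The attack: divide each replicate weight by the base weight and
# group rows that share the same pattern of ratios
ratios          <- as.matrix(released[, -1]) / released$weight
signature       <- apply(round(ratios, 8), 1, paste, collapse = " ")
guessed_cluster <- match(signature, unique(signature))

# Share of respondent pairs whose "same location?" status is guessed correctly
same_guess <- outer(guessed_cluster, guessed_cluster, "==")
same_truth <- outer(cluster, cluster, "==")
attack_accuracy <- mean(same_guess == same_truth)
print(attack_accuracy)
```

Any obfuscation technique would need to push `attack_accuracy` down while leaving the variance estimates computed from the replicate weights close to the originals.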

request

I'm looking for a technique that:

prevents users of the shared file from easily deriving common geographic locations from the correlations between my replicate weight variables
does not erase the correlations between my data columns (the replicate weight variables)
can be implemented on an R data.frame object without significant investment

I say "common" because an evil user may not know where the place is, but their being able to tell that two respondents are from the same place is an unacceptable possibility.

what i tried

I really don't want to reinvent the wheel here. I am looking for R syntax, an R package, or something else that would be relatively simple to implement. I found one, two, three, four documents describing methods that would be suitable for my purposes; unfortunately, none of the authors wanted to share the actual code for their implementation.

I can do simple things, such as adding or subtracting random values drawn from a normal distribution to my replicate weight columns, but I would rather rely on the work of someone who understands privacy better than I do.
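For reference, the simple approach I mean looks roughly like the sketch below. It runs on simulated data, and the JK1-style replicate weights and the noise level `noise_sd` are assumptions picked for illustration, not recommended values. It shows that normal noise breaks exact pattern matching while keeping each replicate column highly correlated with its original, but it says nothing about how well a more determined user could still cluster the noisy rows.

```r
set.seed(2)

# Simulated replicate weights (assumed JK1-style scheme, 10 clusters of 5)
n_clusters  <- 10
cluster     <- rep(seq_len(n_clusters), each = 5)
base_weight <- runif(length(cluster), 50, 150)
rep_factor  <- sapply(seq_len(n_clusters), function(r) {
  ifelse(cluster == r, 0, n_clusters / (n_clusters - 1))
})
rep_weights <- base_weight * rep_factor

# Naive obfuscation: add independent normal noise to every replicate weight
noise_sd <- 5   # hypothetical value; would need tuning against variance estimates
noise <- matrix(rnorm(length(rep_weights), mean = 0, sd = noise_sd),
                nrow = nrow(rep_weights))
noisy_weights <- pmax(rep_weights + noise, 0)   # keep weights non-negative

# Each noisy column should still track its original column closely
column_correlations <- diag(cor(rep_weights, noisy_weights))
print(round(column_correlations, 3))

# Exact matching on replicate-to-base ratios no longer groups respondents
noisy_ratios    <- noisy_weights / base_weight
noisy_signature <- apply(round(noisy_ratios, 8), 1, paste, collapse = " ")
n_distinct_signatures <- length(unique(noisy_signature))
print(n_distinct_signatures)
```

Even here the zeroed-out replicate for each cluster stays close to zero after the noise is added, which is exactly why I would rather use a method designed by someone who works on disclosure control.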

thanks!!!!

r privacy obfuscation

Anthony Damico Jun 13 '14 at 9:59

1 answer

Anthony Damico · Accepted Answer · 2014-06-15T10:38:18+0000

I wrote this nine-step tutorial to walk through the process, trying to answer my own question. I am not an expert in privacy or confidentiality and would like to hear both critiques of this idea and other ideas. thanks!

http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html
