background
I have some personal survey data that contains a column of confidential information: the geographic location of the survey respondents. Under no circumstances may this information be released.
as is usually the case in research studies, for users to correctly calculate the variance in my survey data set, they will need either this geographic location (unacceptable) or, alternatively, a set of replicate weights. I can create this set of replicate weights; however, it is fairly easy to look at the correlations between these weights and back-calculate which respondents share the same geographic location. this is also unacceptable.
to help me in this matter, you don't have to be familiar with replicate weights - just think of them as several columns of highly correlated, clustered data.
I understand that if I want to preserve this clustering, a malicious data user will always have semi-decent guesses about who shares a geographic location; I just want to make this guessing game less accurate. with un-obfuscated replicate weights, an evil data user can correctly identify 100% of cases.
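to make the problem concrete, here is a minimal sketch on made-up data. it assumes a delete-one-cluster jackknife, which may not match the replication scheme of any real survey; the column names (`weight`, `repwt1` and so on) and the 12-respondent toy file are hypothetical. it shows how someone holding only the released weights could regroup respondents by location.

```r
# toy example: 12 respondents in 4 made-up geographic clusters (hypothetical data)
set.seed(1)
n_clusters <- 4
cluster <- rep(seq_len(n_clusters), each = 3)   # the confidential column
base_weight <- runif(length(cluster), 50, 150)

# delete-one-cluster jackknife (an assumption): replicate k zeroes out
# cluster k and scales everyone else up by n / (n - 1)
rep_weights <- sapply(seq_len(n_clusters), function(k) {
  ifelse(cluster == k, 0, base_weight * n_clusters / (n_clusters - 1))
})
colnames(rep_weights) <- paste0("repwt", seq_len(n_clusters))

# what the data user receives: no cluster column, only weights
released <- data.frame(weight = base_weight, rep_weights)

# the attack: the ratio of each replicate weight to the main weight follows
# the same pattern for everyone in the same cluster
ratios <- round(as.matrix(released[, -1]) / released$weight, 6)
guessed <- as.integer(factor(apply(ratios, 1, paste, collapse = " ")))

# is every pair of respondents classified correctly as same / different place?
all(outer(guessed, guessed, "==") == outer(cluster, cluster, "=="))
```

the guessed group labels are arbitrary, so the attacker learns who shares a location, not where that location is.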
request
I'm looking for a technique that
- prevents users of shared files from easily deriving a common geographic location from correlations between my replication variables.
- does not erase the correlations between my data columns (replicated weight variables)
- can be implemented on an R data.frame object without significant investment
I say "common" because an evil user may not know where the place is, but their being able to tell that two respondents are from the same place is an unacceptable possibility.
what i tried
I really don't want to reinvent the wheel here. I am looking for R syntax, an R package, or something else that would be relatively simple to implement. I found one, two, three, four documents describing methods that would be suitable for my purposes; unfortunately, none of the authors wanted to share the actual code for their implementation.
I can do simple things, such as adding and subtracting random values to my replicate weight columns according to the normal distribution, but I would rather rely on the work of someone who understands privacy better than me.
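for reference, the simple approach I have in mind looks roughly like the sketch below. the helper `add_normal_noise()`, its `noise_fraction` parameter, and the toy data are all hypothetical, and this is not a vetted disclosure-control method, just the naive baseline I would like to improve on.

```r
set.seed(2)

# made-up stand-in for my real file: one main weight and four replicate weights
cluster <- rep(1:4, each = 3)
weight <- runif(12, 50, 150)
toy <- data.frame(weight = weight)
for (k in 1:4) {
  toy[[paste0("repwt", k)]] <- ifelse(cluster == k, 0, weight * 4 / 3)
}

# hypothetical helper: add normal noise to each replicate weight column,
# with the noise sd set to a fraction of that column's own sd
add_normal_noise <- function(df, rep_cols, noise_fraction = 0.05) {
  out <- df
  for (col in rep_cols) {
    noise <- rnorm(nrow(df), mean = 0, sd = noise_fraction * sd(df[[col]]))
    out[[col]] <- pmax(df[[col]] + noise, 0)  # keep weights non-negative
  }
  out
}

rep_cols <- paste0("repwt", 1:4)
noisy <- add_normal_noise(toy, rep_cols)

# how far did the correlations between replicate columns move?
max(abs(cor(toy[rep_cols]) - cor(noisy[rep_cols])))
```

my worry is the trade-off: noise small enough to leave the correlations intact probably still leaves the same-location pattern easy to spot, and noise large enough to hide it may distort the variance estimates.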
thanks!!!!
r privacy obfuscation
Anthony Damico