Retaining attributes when using gather from tidyr (attributes are not identical)

I have a data frame that needs to be split into two tables to satisfy Codd's third normal form. In a simple case, the original data frame looks something like this:

    > library(lubridate)
    > (df <- data.frame(hh_id = 1:2,
    +                   income = c(55000, 94000),
    +                   bday_01 = ymd(c(20150309, 19890211)),
    +                   bday_02 = ymd(c(19850911, 20000815)),
    +                   gender_01 = factor(c("M", "F")),
    +                   gender_02 = factor(c("F", "F"))))
      hh_id income    bday_01    bday_02 gender_01 gender_02
    1     1  55000 2015-03-09 1985-09-11         M         F
    2     2  94000 1989-02-11 2000-08-15         F         F

When I use the gather function, it warns that the attributes are not identical and drops the factor class for gender and the lubridate date class for bday (or other attributes in the real-world example). Is there a good tidyr solution that avoids losing each column's data type?

    > library(dplyr)   # provides select() and %>%
    > library(tidyr)
    > (person <- df %>%
    +     select(hh_id, bday_01:gender_02) %>%
    +     gather(key, value, -hh_id) %>%
    +     separate(key, c("key", "per_num"), sep = "_") %>%
    +     spread(key, value))
      hh_id per_num       bday gender
    1     1      01 1425859200      M
    2     1      02  495244800      F
    3     2      01  603158400      F
    4     2      02  966297600      F
    Warning message:
    attributes are not identical across measure variables; they will be dropped

    > lapply(person, class)
    $hh_id
    [1] "integer"

    $per_num
    [1] "character"

    $bday
    [1] "character"

    $gender
    [1] "character"

I can imagine a way to do this by gathering each set of variables of the same data type separately and then joining all the resulting tables, but there must be a more elegant solution that I am missing.
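For reference, the gather-separately-then-join fallback described above might look roughly like the sketch below. It is only an illustration under a few assumptions: the dates are built with base `as.Date()` rather than `ymd()` so that the example needs no lubridate, `across()` requires dplyr 1.0 or later, and the intermediate names `bdays` and `genders` are made up for this example.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, with dates built by base as.Date() (an assumption).
df <- data.frame(
  hh_id = 1:2,
  income = c(55000, 94000),
  bday_01 = as.Date(c("2015-03-09", "1989-02-11")),
  bday_02 = as.Date(c("1985-09-11", "2000-08-15")),
  gender_01 = factor(c("M", "F")),
  gender_02 = factor(c("F", "F"))
)

# Gather the date columns on their own; both columns share the same class.
bdays <- df %>%
  select(hh_id, starts_with("bday")) %>%
  gather(per_num, bday, -hh_id) %>%
  mutate(per_num = sub("^bday_", "", per_num))

# Gather the gender columns as character, then rebuild a single factor.
genders <- df %>%
  select(hh_id, starts_with("gender")) %>%
  mutate(across(starts_with("gender"), as.character)) %>%
  gather(per_num, gender, -hh_id) %>%
  mutate(per_num = sub("^gender_", "", per_num),
         gender = factor(gender))

# Join the per-type tables back together on household and person number.
person <- inner_join(bdays, genders, by = c("hh_id", "per_num"))
```

This works, but it needs one gather per data type plus a join, which is the verbosity the question is trying to avoid.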

2 answers

You can simply convert your dates to character, and then convert them back to dates at the end:

    (person <- df %>%
        select(hh_id, bday_01:gender_02) %>%
        mutate_each(funs(as.character), contains('bday')) %>%
        gather(key, value, -hh_id) %>%
        separate(key, c("key", "per_num"), sep = "_") %>%
        spread(key, value) %>%
        mutate(bday=ymd(bday)))

      hh_id per_num       bday gender
    1     1      01 2015-03-09      M
    2     1      02 1985-09-11      F
    3     2      01 1989-02-11      F
    4     2      02 2000-08-15      F

Alternatively, if you use Date instead of POSIXct, you can do something like this:

    library(stringr)   # provides str_extract()

    (person <- df %>%
        select(hh_id, bday_01:gender_02) %>%
        gather(per_num1, gender, contains('gender'), convert=TRUE) %>%
        gather(per_num2, bday, contains('bday'), convert=TRUE) %>%
        mutate(bday=as.Date(bday)) %>%
        mutate_each(funs(str_extract(., '\\d+')), per_num1, per_num2) %>%
        filter(per_num1 == per_num2) %>%   # keep rows where both keys refer to the same person
        rename(per_num=per_num1) %>%
        select(-per_num2))

Edit

The warning you see:

 Warning: attributes are not identical across measure variables; they will be dropped 

arises from gathering the gender columns, which are factors with different level vectors (see str(df) ). If you were to convert the gender columns to character, or synchronize their levels with something like

 df <- mutate(df, gender_02 = factor(gender_02, levels=levels(gender_01))) 

then you would see that the warning goes away when you execute

 person <- df %>% select(hh_id, bday_01:gender_02) %>% gather(key, value, contains('gender')) 
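As a side note that goes beyond the original answer: newer tidyr releases (1.0.0 and later) provide `pivot_longer()`, whose special `".value"` name in `names_to` reshapes each variable into its own column and combines values per column rather than across all measure columns. The sketch below is a minimal illustration, assuming such a tidyr version is installed and building the dates with base `as.Date()` instead of `ymd()` to avoid a lubridate dependency.

```r
library(tidyr)

# Same toy data as in the question, with dates built by base as.Date()
# (an assumption made to keep this example self-contained).
df <- data.frame(
  hh_id = 1:2,
  income = c(55000, 94000),
  bday_01 = as.Date(c("2015-03-09", "1989-02-11")),
  bday_02 = as.Date(c("1985-09-11", "2000-08-15")),
  gender_01 = factor(c("M", "F")),
  gender_02 = factor(c("F", "F"))
)

# Split names such as "bday_01" at "_": the first piece (".value") becomes
# the output column name, the second piece becomes the per_num key.
person <- pivot_longer(
  df[, c("hh_id", "bday_01", "bday_02", "gender_01", "gender_02")],
  cols = -hh_id,
  names_to = c(".value", "per_num"),
  names_sep = "_"
)

# Inspect the first class of each resulting column.
sapply(person, function(x) class(x)[1])
```

Because bday and gender are never stacked into one shared value column, the date and factor classes do not have to be reconciled with each other.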

You don't seem to like my base R solutions. Let me tempt you once more:

    (df <- data.frame(hh_id = 1:2,
                      income = c(55000, 94000),
                      bday_01 = ymd(c(20150309, 19890211)),
                      bday_02 = ymd(c(19850911, 20000815)),
                      gender_01 = factor(c("M", "F")),
                      gender_02 = factor(c("F", "F"))))

    reshape(df, idvar = 'hh_id', varying = list(3:4, 5:6), direction = 'long',
            v.names = c('bday','gender'), timevar = 'per_num')

    #     hh_id income per_num       bday gender
    # 1.1     1  55000       1 2015-03-09      M
    # 2.1     2  94000       1 1989-02-11      F
    # 1.2     1  55000       2 1985-09-11      F
    # 2.2     2  94000       2 2000-08-15      F