Understanding the apply and outer functions in R

Suppose I have data that looks like this:

    ID A B  C
     1 X 1 10
     1 X 2 10
     1 Z 3 15
     1 Y 4 12
     2 Y 1 15
     2 X 2 13
     2 X 3 13
     2 Y 4 13
     3 Y 1 16
     3 Y 2 18
     3 Y 3 19
     3 Y 4 10
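For anyone who wants to run the answers, the data above could be recreated roughly as follows. This is a sketch: the name `df` matches what the first two answers use, and storing A as a factor (`stringsAsFactors = TRUE`) is an assumption, not something stated in the question.

```r
# Rebuild the example data shown above.
# Assumption: A is a factor (the first answer guesses the same).
df <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10),
  stringsAsFactors = TRUE
)
str(df)
```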

I want to compare these values within each ID: if an ID changes its value of variable A at any point over the period given by variable B (which runs from 1 to 4), its rows go to data frame K, and if it does not, they go to data frame L.

So, for this dataset, K will look like

    ID A B  C
     1 X 1 10
     1 X 2 10
     1 Z 3 15
     1 Y 4 12
     2 Y 1 15
     2 X 2 13
     2 X 3 13
     2 Y 4 13

and L will look like

    ID A B  C
     3 Y 1 16
     3 Y 2 18
     3 Y 3 19
     3 Y 4 10

In terms of nested loops and if then else, this can be solved as shown below.

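    K <- df[0, ]  # rows of IDs whose A changes
    L <- df[0, ]  # rows of IDs whose A stays the same
    for (id in unique(df$ID)) {
      rows <- df[df$ID == id, ]
      m <- 0  # number of changes in A between consecutive rows
      for (j in seq_len(nrow(rows) - 1)) {
        if (rows$A[j] != rows$A[j + 1]) m <- m + 1
      }
      if (m == 0) L <- rbind(L, rows) else K <- rbind(K, rows)
    }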

In some posts I read that in R, nested loops can be replaced with the apply and outer functions. Could someone help me understand how they can be used in a case like this?

3 answers

In principle, you do not need a loop with conditionals here. All you have to do is check, within each ID (that is, each cycle of B), whether A has any variance, and then convert the result to a logical with ! . This requires converting A to numeric (I assume it is a factor in your real dataset; if it is not a factor, you can use FUN = function(x) length(unique(x)) inside ave instead). Then split accordingly. In base R, ave suits this kind of task, for example

 indx <- !with(df, ave(as.numeric(A), ID , FUN = var)) 

Or (if A is a character rather than a factor)

 indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L 

Then just run split

    split(df, indx)
    # $`FALSE`
    #   ID A B  C
    # 1  1 X 1 10
    # 2  1 X 2 10
    # 3  1 Z 3 15
    # 4  1 Y 4 12
    # 5  2 Y 1 15
    # 6  2 X 2 13
    # 7  2 X 3 13
    # 8  2 Y 4 13
    #
    # $`TRUE`
    #    ID A B  C
    # 9   3 Y 1 16
    # 10  3 Y 2 18
    # 11  3 Y 3 19
    # 12  3 Y 4 10

This will return a list with two data frames.
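To get the K and L data frames the question asks for, the two list elements can be pulled out by name. A minimal sketch, assuming the example data with A stored as a factor; the names `parts`, `K` and `L` are only illustrative:

```r
# Example data (assumption: A is a factor)
df <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10),
  stringsAsFactors = TRUE
)

# ave() returns one value per row: the group's variance, repeated within the group.
# Negating turns zero variance (A never changes) into TRUE.
indx <- !with(df, ave(as.numeric(A), ID, FUN = var))

parts <- split(df, indx)
K <- parts[["FALSE"]]  # IDs whose A changes
L <- parts[["TRUE"]]   # IDs whose A is constant
```

Here the list names are the strings "FALSE" and "TRUE", so `[[` with a character index is the safest way to select them.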


Similarly with data.table

    library(data.table)
    # var() on a factor is an error in current R, so convert to numeric first
    setDT(df)[, indx := !var(as.numeric(A)), by = ID]
    split(df, df$indx)

Or with dplyr

    library(dplyr)
    df %>%
      group_by(ID) %>%
      mutate(indx = !var(as.numeric(A))) %>%
      ungroup() %>%
      split(., .$indx)  # refer to the column through the piped data

Since you want to understand apply and not just get it done, you can consider tapply. As a demonstration:

    > tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
      1   2   3 
    "K" "K" "L" 

In slightly simplified English: go through all of df$A grouped by df$ID and apply the function to the values of df$A in each group (that is the x in the anonymous function): if the number of unique values is greater than 1, the result is "K", otherwise it is "L".
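The per-ID labels can then be mapped back onto the rows to actually split the data. A sketch under the same assumptions as the demonstration (a data frame `df` with columns ID and A); the names `grp`, `row_label` and `parts` are illustrative:

```r
# Example data (assumption: A is a factor)
df <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10),
  stringsAsFactors = TRUE
)

# One label per ID: "K" if A takes more than one value, otherwise "L"
grp <- tapply(df$A, df$ID, function(x) if (length(unique(x)) > 1) "K" else "L")

# grp is named by ID, so index it with each row's ID to get one label per row
row_label <- as.vector(grp[as.character(df$ID)])

parts <- split(df, row_label)  # parts$K and parts$L
```

Unlike ave, tapply returns one value per group, which is why the extra lookup by ID is needed before splitting.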


We can do this using data.table. We convert the 'data.frame' to a 'data.table' ( setDT(df1) ). Grouped by 'ID', we check whether the number of unique elements in 'A' ( uniqueN(A) ) is greater than 1 and store the result in a column ind. We can then split the dataset on that column.

    library(data.table)
    setDT(df1)[, ind := uniqueN(A) > 1, by = ID]
    setDF(df1)
    split(df1[-5], df1$ind)
    #$`FALSE`
    #   ID A B  C
    #9   3 Y 1 16
    #10  3 Y 2 18
    #11  3 Y 3 19
    #12  3 Y 4 10
    #$`TRUE`
    #  ID A B  C
    #1  1 X 1 10
    #2  1 X 2 10
    #3  1 Z 3 15
    #4  1 Y 4 12
    #5  2 Y 1 15
    #6  2 X 2 13
    #7  2 X 3 13
    #8  2 Y 4 13

Or in a similar way, with dplyr we can use n_distinct to create a logical column and then split on that column.

    library(dplyr)
    df2 <- df1 %>%
      group_by(ID) %>%
      mutate(ind = n_distinct(A) > 1)
    split(df2, df2$ind)

Or a base R option with table. We take the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negating ( !! ) the output converts the zero counts to FALSE and all other frequencies to TRUE. Taking rowSums then counts the distinct values of 'A' per 'ID', and comparing that with 1 gives 'indx'. We match the ID column of 'df1' to the names of 'indx', use this to replace each 'ID' with TRUE/FALSE, and split the dataset on the result.

    indx <- rowSums(!!table(df1[1:2])) > 1
    lst <- split(df1, indx[match(df1$ID, names(indx))])
    lst
    #$`FALSE`
    #   ID A B  C
    #9   3 Y 1 16
    #10  3 Y 2 18
    #11  3 Y 3 19
    #12  3 Y 4 10
    #$`TRUE`
    #  ID A B  C
    #1  1 X 1 10
    #2  1 X 2 10
    #3  1 Z 3 15
    #4  1 Y 4 12
    #5  2 Y 1 15
    #6  2 X 2 13
    #7  2 X 3 13
    #8  2 Y 4 13
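To see what each step of the table approach produces, the one-liner can be unpacked. A sketch using the example data; the intermediate names `tab`, `present` and `n_values` are illustrative and not part of the answer's code:

```r
# Example data (assumption: A is a factor)
df1 <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10),
  stringsAsFactors = TRUE
)

tab      <- table(df1[1:2])   # rows: ID, columns: A, cells: row counts
present  <- !!tab             # TRUE where an ID has at least one row with that A
n_values <- rowSums(present)  # number of distinct A values per ID
indx     <- n_values > 1      # TRUE for IDs whose A changes

print(tab)
print(n_values)
print(indx)
```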

If we need separate datasets in the global environment, we can set the names of the list elements to the object names we want and use list2env (not recommended)

 list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv) 
