I have an R data frame with two columns. Column x is categorical and column y is continuous. Here is an example:
    library(dplyr)
    x <- c(1,1,1,1,1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4)
    y <- runif(length(x), 0, 1)
    df <- data.frame(x, y)
    # number of rows for each value of x
    df_sum <- df %>% group_by(x) %>% summarise(count = n())
Think of each categorical value of x as the identifier of a series of a particular type, and of y as the values in that series. In the end, I want to compare a selected subset of all possible pairs of series using the my_func() function.
First, I need to define "good" tuples and create an iterable for use in the second part of the task.
To find "good" tuples, I need to compare the number of rows for each categorical value of x in df_sum. I want to find all pairs of categorical values of x where the ratio of their numbers of observations is between 0.9 and 1.5.
For example, x = 1 has 7 rows and x = 2 has 5 rows, so the ratio 7/5 = 1.4 falls into this range. Therefore, I want to save the tuple (1,2).
The function is symmetric: my_func(s1, s2) = my_func(s2, s1).
Therefore, I do not need to save (2,1) if I already have (1,2). Once I have all the good tuples, I want to loop over them, run my_func(s1, s2) on each, and save (s1, s2, my_func(s1, s2)) as a row of a data frame.
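To make the filtering step concrete, here is a minimal sketch in base R. It uses table() instead of df_sum so that it runs on its own, and it rests on one assumption that the description above leaves open: a pair counts as "good" if the ratio of counts falls in [0.9, 1.5] in either direction. The names counts, pairs, in_range, keep, s1 and s2 are illustrative only.

```r
# Same x as in the example; table() gives the row count per category.
x <- c(1,1,1,1,1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4)
counts <- table(x)

# All unordered pairs of categories, one column per pair,
# so (2,1) never appears once (1,2) is present.
pairs <- combn(names(counts), 2)

# Row counts for the first and second member of each pair.
n1 <- as.numeric(counts[pairs[1, ]])
n2 <- as.numeric(counts[pairs[2, ]])

# ASSUMPTION: a pair is "good" if the ratio is in range in either direction.
in_range <- function(r) r >= 0.9 & r <= 1.5
keep <- in_range(n1 / n2) | in_range(n2 / n1)

# One row per good tuple.
good_tuples <- data.frame(
  s1 = as.numeric(pairs[1, keep]),
  s2 = as.numeric(pairs[2, keep])
)
```

With the example x this keeps (1,2) and (1,4). If only the ratio in the listed order (first count divided by second) should qualify, dropping the second in_range() term leaves just (1,2).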
If good_tuples were a Python-like list [(1,2), ...], I would write pseudocode, for example:
    for tuple in good_tuples:
        s1 <- df[df$x == tuple[0], 'y']
        s2 <- df[df$x == tuple[1], 'y']
        my_func(s1, s2)
Ideally, I could run the loop in parallel with something like mapply.
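The loop could be sketched roughly as follows, assuming good_tuples is stored as a data frame with one row per tuple. Here my_func is a hypothetical symmetric stand-in (absolute difference of means), the two tuples are hard-coded for illustration, and compare_pair and n_cores are names made up for this sketch. Note that mapply itself runs sequentially; mcmapply from the parallel package, which ships with R, has the same calling pattern and forks worker processes on Unix-like systems.

```r
library(parallel)  # part of base R distributions; provides mcmapply

set.seed(1)  # only so the sketch is reproducible
x <- c(1,1,1,1,1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4)
df <- data.frame(x = x, y = runif(length(x), 0, 1))

# HYPOTHETICAL stand-in for my_func; symmetric in its two arguments.
my_func <- function(s1, s2) abs(mean(s1) - mean(s2))

# ASSUMED output of the filtering step: one row per good tuple.
good_tuples <- data.frame(s1 = c(1, 1), s2 = c(2, 4))

# Pull out the two series named by a tuple and compare them.
compare_pair <- function(a, b) my_func(df$y[df$x == a], df$y[df$x == b])

# Sequential: mapply walks s1 and s2 in lockstep, one call per tuple.
good_tuples$result <- mapply(compare_pair, good_tuples$s1, good_tuples$s2)

# Parallel: forking is not available on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
result_par <- mcmapply(compare_pair, good_tuples$s1, good_tuples$s2,
                       mc.cores = n_cores)
```

After this, good_tuples holds the columns s1, s2 and result, which matches the (s1, s2, my_func(s1, s2)) layout described above. For only a handful of tuples the forking overhead will likely outweigh any gain, so the parallel variant is only worth trying when my_func is expensive or the list of tuples is long.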