How do I split a data frame row-wise into n pieces, apply a function, and combine the results?

Question

I have a data.frame with 130,209 rows.

    > head(dt)
            mLow1 mHigh1 mLow2 mHigh2 meanLow meanHigh        fc    mean
    A_00001 37.00  12.75 99.25  78.50  68.125   45.625 1.4931507 56.8750
    A_00002 31.00  21.50 84.75  53.00  57.875   37.250 1.5536913 47.5625
    A_00003 72.50  26.50 81.75  74.75  77.125   50.625 1.5234568 63.8750

I want to split the data.frame into 12 pieces, apply the scale function to the fc column of each piece, and then combine the pieces again. There is no grouping variable here, otherwise I would use ddply. In addition, since 130,209 is not evenly divisible by 12, the resulting data.frames will be unbalanced: 11 of them will have 10,851 rows and the last will have 10,848 rows. That is fine.

So, how do I split a data.frame by row into n pieces (in this case 12), apply a function to each piece, and then combine them? Any help would be greatly appreciated.

Update: Using the two top answers, I get different results. Using @Ben Bolker's solution:

    mLow1 mHigh1 mLow2 mHigh2          UID       gene_id meanLow meanHigh mean         fc
      1.5   3.25     1   1.25 MGLibB_00021 0610010K14Rik    1.25     2.25 1.75 -0.5231249
      1.5   3.25     1   1.25 MGLibA_00034 0610037L13Rik    1.25     2.25 1.75 -0.5231249
      1.5   3.25     1   1.25 MGLibB_00058 1100001G20Rik    1.25     2.25 1.75 -0.5231249
      1.5   3.25     1   1.25 MGLibA_00061 1110001A16Rik    1.25     2.25 1.75 -0.5231249
      1.5   3.25     1   1.25 MGLibA_00104 1110034G24Rik    1.25     2.25 1.75 -0.5231249
      1.5   3.25     1   1.25 MGLibA_00110 1110038F14Rik    1.25     2.25 1.75 -0.5231249

Using @MichaelChirico's answer:

    mLow1 mHigh1 mLow2 mHigh2          UID       gene_id meanLow meanHigh mean        fc  fc_scaled
      1.5   3.25     1   1.25 MGLibB_00021 0610010K14Rik    1.25     2.25 1.75 0.5555556 -0.5089608
      1.5   3.25     1   1.25 MGLibA_00034 0610037L13Rik    1.25     2.25 1.75 0.5555556 -0.5089608
      1.5   3.25     1   1.25 MGLibB_00058 1100001G20Rik    1.25     2.25 1.75 0.5555556 -0.5089608
      1.5   3.25     1   1.25 MGLibA_00061 1110001A16Rik    1.25     2.25 1.75 0.5555556 -0.5089608
      1.5   3.25     1   1.25 MGLibA_00104 1110034G24Rik    1.25     2.25 1.75 0.5555556 -0.5089608
      1.5   3.25     1   1.25 MGLibA_00110 1110038F14Rik    1.25     2.25 1.75 0.5555556 -0.5089608

+5

split r apply

Komal Rathi Jul 31 '15 at 19:24

3 answers

I don't think the structure of dt matters (unless you are using some of its own values to define the split). Does this help?

    spl.dt <- split( dt , cut(1:nrow(dt), 12) )
    lapply( spl.dt, my_fun)
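To make the full split, apply, and combine round trip concrete, here is a minimal base R sketch on a small made-up data frame. The toy data and the `scale_fc` helper are assumptions for illustration only; `scale_fc` stands in for `my_fun`, and `unsplit()` is one possible way to put the pieces back in their original row order.

```r
# Toy data standing in for the real dt (assumption: only fc matters here)
set.seed(1)
dt <- data.frame(id = 1:50, fc = rnorm(50))

# Grouping index: 12 roughly equal-sized blocks of consecutive rows
grp <- cut(seq_len(nrow(dt)), 12, labels = FALSE)

# Hypothetical stand-in for my_fun: scale fc within one chunk
scale_fc <- function(d) {
  d$fc_scaled <- c(scale(d$fc))  # c() turns the one-column matrix into a vector
  d
}

# Split, apply, then combine back in the original row order
spl.dt <- split(dt, grp)
res <- unsplit(lapply(spl.dt, scale_fc), grp)
```

After this, `res` has the same rows as `dt` plus an `fc_scaled` column that has mean 0 and standard deviation 1 within each block.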

+4

42- Aug 1 '15 at 3:51

Using data.table you can do:

    library(data.table)
    setDT(dt)[,scale(fc),by=rep(1:nn,each=ceiling(KK/nn),length.out=KK)]

Here KK is 130,209 and nn is 12. Reproducible data:

    set.seed(100)
    KK<-130209L; nn<-12L
    dt<-data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
                   mLow2=rnorm(KK),mHigh2=rnorm(KK),
                   meanLow=rnorm(KK),meanHigh=rnorm(KK),
                   fc=rnorm(KK),mean=rnorm(KK))

So there is no need to split the data and recombine it.

If you want to add this to the data frame, and not just extract it, you can use the := operator to assign by reference:

 setDT(dt)[,fc_scaled:=scale(fc)...]
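The grouping vector passed to `by=` does the real work here. As a check on the arithmetic, the sketch below rebuilds the same index in base R and counts the rows per block. It then adds the scaled column with base R's `ave()` rather than data.table, purely so the example needs no extra packages; the name `grp` and the one-column toy `dt` are assumptions for illustration.

```r
KK <- 130209L; nn <- 12L

# Same index as in by=: labels 1..nn, each repeated ceiling(KK/nn) times,
# truncated to KK entries
grp <- rep(1:nn, each = ceiling(KK / nn), length.out = KK)
table(grp)  # blocks 1-11 have 10851 rows, block 12 has 10848

# Base R alternative (not data.table): scale fc within each block
set.seed(100)
dt <- data.frame(fc = rnorm(KK))
dt$fc_scaled <- ave(dt$fc, grp, FUN = function(x) c(scale(x)))
```

The block sizes match the 11 x 10,851 plus 1 x 10,848 split described in the question.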

+2

MichaelChirico Jul 31 '15 at 19:44

Ben Bolker · Accepted Answer · 2015-07-31T19:32:30+0000

ggplot2 has a convenience function cut_number() that will do this for you: it cuts a numeric vector into n groups with approximately equal numbers of observations. If you don't want the overhead of loading this package, you can look at ggplot2:::breaks for the necessary logic.

Reproducible example stolen from @MichaelChirico:

    set.seed(100)
    KK<-130209L; nn<-12L
    library("dplyr")
    dt <- data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
                     mLow2=rnorm(KK),mHigh2=rnorm(KK),
                     meanLow=rnorm(KK),meanHigh=rnorm(KK),
                     fc=rnorm(KK),mean=rnorm(KK)) %>%
        arrange(mean)

Apologies to those who don't like pipes:

    library("ggplot2")  ## for cut_number()
    dt %>%
        mutate(grp=cut_number(mean,12)) %>%
        group_by(grp) %>%
        mutate(fc=c(scale(fc))) %>%
        ungroup() %>%
        select(-grp) %>%       ## drop grouping variable
        as.data.frame -> dt2   ## convert back to data frame, assign result

It turns out that c() is needed around scale(); otherwise the fc variable ends up with some attributes that confuse tail() ...
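A small illustration of why the c() wrapper matters, using a made-up four-element vector: scale() returns a one-column matrix carrying centering and scaling attributes, and c() reduces it to a plain numeric vector.

```r
x <- c(2, 4, 6, 8)  # toy input, for illustration only

s <- scale(x)
is.matrix(s)            # TRUE: a one-column matrix
names(attributes(s))    # includes "dim", "scaled:center", "scaled:scale"

v <- c(s)               # c() strips those attributes
is.null(attributes(v))  # TRUE: a plain numeric vector
```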

The same logic should work with plyr or with base R split-apply-combine as well (the key is using cut_number() to define the grouping variable).
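For readers who prefer base R, here is a rough sketch of the same idea. It approximates equal-count binning with `quantile()` and `cut()`; this is an assumption about what such logic could look like, not a copy of the ggplot2 internals, and the helper name `equal_count_groups` and the smaller 1,000-row data set are made up for illustration.

```r
# Hypothetical helper: bin x into n groups with roughly equal counts
equal_count_groups <- function(x, n) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

set.seed(100)
dt <- data.frame(fc = rnorm(1000), mean = rnorm(1000))

# Group rows by the value of mean, as the pipeline above does
grp <- equal_count_groups(dt$mean, 12)

# Scale fc within each group; c() keeps the result a plain vector
dt$fc <- ave(dt$fc, grp, FUN = function(x) c(scale(x)))
```

Note that this groups rows by the value of `mean`, not by row position, so it only matches a positional split when the data are already sorted by `mean`. It also assumes the quantile breaks are distinct, which holds for continuous data like this but can fail with heavily tied values.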
