Counting occurrences of a number per row using dplyr

Question

This question should have a simple, elegant solution, but I can't figure it out, so here it is:

Let's say I have the following dataset, and I want to count the number of 2s present in each row using dplyr.

    set.seed(1)
    ID <- LETTERS[1:5]
    X1 <- sample(1:5, 5, T)
    X2 <- sample(1:5, 5, T)
    X3 <- sample(1:5, 5, T)
    df <- data.frame(ID, X1, X2, X3)
    library(dplyr)

The following works:

 df %>% rowwise %>% mutate(numtwos = sum(c(X1,X2,X3) == 2))

But how can I avoid entering all column names?

I know this is probably easier to do without dplyr, but more generally I want to know how to use dplyr's mutate with multiple columns without typing out all the column names.

+3

r dplyr

C_z_ Jun 09 '16 at 16:54


4 answers

Here is another alternative using purrr (in later releases, by_row() was moved out of purrr into the purrrlyr package):

    library(purrr)
    df %>%
      by_row(function(x) {
        sum(x[-1] == 2)
      }, .to = "numtwos", .collate = "cols")

Which gives:

    # Source: local data frame [5 x 5]
    #
    #       ID    X1    X2    X3 numtwos
    #   <fctr> <int> <int> <int>   <int>
    # 1      A     2     5     2       2
    # 2      B     2     5     1       1
    # 3      C     3     4     4       0
    # 4      D     5     4     2       1
    # 5      E     2     1     4       1

As noted in the NEWS, row-based functionals are still maturing in dplyr:

We are still figuring out what belongs in dplyr and what belongs in purrr. Expect a lot of experimentation and a lot of changes with these features.
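Things did change after that note was written: dplyr 1.0.0 added c_across(), which lets a rowwise() computation select columns by range instead of naming each one. The following is only a sketch under that assumption (dplyr 1.0.0 or later installed). The data frame is typed in from the printed output above rather than regenerated with sample(), so the result does not depend on the R version.

```r
library(dplyr)

# Values copied from the printed output in this thread.
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2L, 2L, 3L, 5L, 2L),
  X2 = c(5L, 5L, 4L, 4L, 1L),
  X3 = c(2L, 1L, 4L, 2L, 4L)
)

res <- df %>%
  rowwise() %>%                                    # compute one row at a time
  mutate(numtwos = sum(c_across(X1:X3) == 2)) %>%  # X1:X3 selects the columns by range
  ungroup()                                        # drop the rowwise grouping

res$numtwos
```

This is still a row-by-row computation, so it should not be expected to scale like the vectorised rowSums() approach.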

Benchmark

We can see how rowwise() with do() compares with purrr::by_row() for this type of problem, and how both "perform" against rowSums() and the tidy-data approach:

    library(tidyr)  # needed for gather()
    library(microbenchmark)

    # Repeat the 5-row data frame 10,000 times
    largedf <- df[rep(seq_len(nrow(df)), 10e3), ]

    microbenchmark(
      steven = largedf %>%
        by_row(function(x) {
          sum(x[-1] == 2)
        }, .to = "numtwos", .collate = "cols"),
      psidom = largedf %>%
        rowwise %>%
        do(data_frame(numtwos = sum(.[-1] == 2))) %>%
        cbind(largedf, .),
      gopala = largedf %>%
        gather(key, value, -ID) %>%
        group_by(ID) %>%
        summarise(numtwos = sum(value == 2)) %>%
        inner_join(largedf, .),
      evan = largedf %>%
        mutate(numtwos = rowSums(. == 2)),
      times = 10L,
      unit = "relative"
    )

Results:

    # Unit: relative
    #    expr         min          lq        mean      median         uq         max neval cld
    #  steven 1225.190659 1261.466936 1267.737126 1227.762573 1276.07977 1339.841636    10   b
    #  psidom 3677.603240 3759.402212 3726.891458 3678.717170 3728.78828 3777.425492    10   c
    #  gopala    2.715005    2.684599    2.638425    2.612631    2.59827    2.572972    10   a
    #    evan    1.000000    1.000000    1.000000    1.000000    1.00000    1.000000    10   a

+5

Steven beaupré Jun 09 '16 at 17:44


One approach is to use a combination of dplyr and tidyr to convert data to a long format and perform calculations:

    library(dplyr)
    library(tidyr)
    df %>%
      gather(key, value, -ID) %>%
      group_by(ID) %>%
      summarise(numtwos = sum(value == 2)) %>%
      inner_join(df, .)

The output is as follows:

      ID X1 X2 X3 numtwos
    1  A  2  5  2       2
    2  B  2  5  1       1
    3  C  3  4  4       0
    4  D  5  4  2       1
    5  E  2  1  4       1
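In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The same long-format idea might look like the sketch below, assuming that tidyr version is available. The names "key" and "value" simply mirror the ones used above, and the data frame is typed in from the printed output so the result is reproducible.

```r
library(dplyr)
library(tidyr)

# Values copied from the printed output in this thread.
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2L, 2L, 3L, 5L, 2L),
  X2 = c(5L, 5L, 4L, 4L, 1L),
  X3 = c(2L, 1L, 4L, 2L, 4L)
)

# Reshape to one row per (ID, column) pair, then count the 2s per ID.
counts <- df %>%
  pivot_longer(-ID, names_to = "key", values_to = "value") %>%
  group_by(ID) %>%
  summarise(numtwos = sum(value == 2))

# Attach the counts back onto the original wide data frame.
res <- inner_join(df, counts, by = "ID")

res$numtwos
```

As with the gather() version, this relies on ID being unique per row; duplicated IDs would be pooled into a single count.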

+2

Gopala Jun 09 '16 at 17:32


You can use do, but it returns only the new column rather than adding it to the original data frame, so you need to bind the result back yourself.

    df %>% rowwise %>% do(numtwos = sum(.[-1] == 2)) %>% data.frame
      numtwos
    1       2
    2       1
    3       0
    4       1
    5       1

Add cbind to bind the new column to the original data frame:

    df %>% rowwise %>% do(numtwos = sum(.[-1] == 2)) %>% data.frame %>% cbind(df, .)
      ID X1 X2 X3 numtwos
    1  A  2  5  2       2
    2  B  2  5  1       1
    3  C  3  4  4       0
    4  D  5  4  2       1
    5  E  2  1  4       1

+1

Psidom Jun 09 '16 at 17:33


evan.oman · Accepted Answer · 2016-06-09T16:59:17+0000

Try rowSums:

    > set.seed(1)
    > ID <- LETTERS[1:5]
    > X1 <- sample(1:5, 5, T)
    > X2 <- sample(1:5, 5, T)
    > X3 <- sample(1:5, 5, T)
    > df <- data.frame(ID, X1, X2, X3)
    > df
      ID X1 X2 X3
    1  A  2  5  2
    2  B  2  5  1
    3  C  3  4  4
    4  D  5  4  2
    5  E  2  1  4
    > rowSums(df == 2)
    [1] 2 1 0 1 1

Alternatively, with dplyr:

    > df %>% mutate(numtwos = rowSums(. == 2))
      ID X1 X2 X3 numtwos
    1  A  2  5  2       2
    2  B  2  5  1       1
    3  C  3  4  4       0
    4  D  5  4  2       1
    5  E  2  1  4       1
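Note that rowSums(. == 2) compares every column, including ID. That is harmless here because no ID equals 2, but if an identifier column could contain that value, the comparison can be limited to the columns of interest. The sketch below assumes dplyr 1.0.0 or later for across(); the data frame is typed in from the printed output rather than regenerated with sample().

```r
library(dplyr)

# Values copied from the printed output in this thread.
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2L, 2L, 3L, 5L, 2L),
  X2 = c(5L, 5L, 4L, 4L, 1L),
  X3 = c(2L, 1L, 4L, 2L, 4L)
)

# across(X1:X3) returns just those columns as a data frame; comparing it
# with 2 gives a logical matrix, and rowSums() counts the TRUEs per row.
res <- df %>%
  mutate(numtwos = rowSums(across(X1:X3) == 2))

res$numtwos
```

Replacing X1:X3 with where(is.numeric) would select every numeric column without naming any of them, which stays vectorised and keeps the speed advantage of rowSums().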
