How to get a unique counter of identifiers by columns in R?

Question

I have legal data that look like this. I am using RStudio.

> head(gsu[,107:117])
    HtoODay PAOSLDME DUSHD POELRD XCAB WESDF BILOE HYPERDIF IMPSENS      Billing MALLAMP
42        0     <NA>    No     No  <NA>  <NA>  <NA>       No    <NA>  Hourly      NA
61        0     <NA>    Yes    Yes <NA>   Yes  <NA>      Yes    <NA>  Hourly      NA
230       0     <NA>    No     Yes <NA>  <NA>  <NA>      Yes    <NA>  Hourly      NA
235       0     <NA>    No     No  <NA>  <NA>  <NA>      Yes    <NA>  Hourly      NA
302       0     <NA>    No     No  <NA>  <NA>   No        No    <NA>  Hourly      NA
336       3     <NA>    No     No   Yes  <NA>  <NA>       No    <NA> Consult      NA
>

I want to count the rows that contain at least one Yes. By this, I mean that if Yes occurs in any column of a row, that row is scored as 1, regardless of the Yes or No values in the other columns.

For example, row 61 counts as 1 even though it contains several Yes values, and row 336 also counts as 1, with only a single Yes.

Essentially, how do I count the rows that contain Yes in any column, without counting multiple Yes values within the same row more than once?

+4

r

Jason matney May 10 '15 at 4:19


2 answers

rowSums(df=="Yes", na.rm=TRUE)>=1

gives

#   42    61   230   235   302   336 
#FALSE  TRUE  TRUE  TRUE FALSE  TRUE
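
To turn this per-row indicator into the single total the question asks for, the logical vector can be summed. A minimal sketch, assuming a small data frame `df` that reproduces four of the columns from the `head()` output in the question (the reconstruction and the names `has_yes` and `n_yes_rows` are illustrative, not the original data):

```r
# Illustrative reconstruction of four columns from the question's sample rows
df <- data.frame(
  DUSHD    = c("No", "Yes", "No", "No", "No", "No"),
  POELRD   = c("No", "Yes", "Yes", "No", "No", "No"),
  XCAB     = c(NA, NA, NA, NA, NA, "Yes"),
  HYPERDIF = c("No", "Yes", "Yes", "Yes", "No", "No"),
  row.names = c("42", "61", "230", "235", "302", "336"),
  stringsAsFactors = FALSE
)

# TRUE for every row with at least one "Yes"; NA cells are ignored
has_yes <- rowSums(df == "Yes", na.rm = TRUE) >= 1

# Each qualifying row contributes exactly 1 to the total
n_yes_rows <- sum(has_yes)
n_yes_rows
```

Because `has_yes` is logical, `sum()` counts each row once no matter how many Yes values it holds.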

+7

ExperimenteR May 10 '15 at 4:25


akrun · Accepted Answer · 2015-05-10T05:45:00+0000

Another variant:

(1:nrow(gsu) %in% which(gsu=='Yes', arr.ind=TRUE)[,1])+0L
#[1] 0 1 1 1 0 1

or

 apply(gsu=='Yes' & !is.na(gsu), 1, any) + 0L
 #   42  61 230 235 302 336 
 #   0   1   1   1   0   1

or

 Reduce(`|`,as.data.frame(gsu=='Yes' & !is.na(gsu))) + 0L
 #[1] 0 1 1 1 0 1

or

  do.call(`pmax`, c(lapply(gsu,`==`, 'Yes'), na.rm=TRUE))
  #[1] 0 1 1 1 0 1

Benchmarks

set.seed(24)
gsu1 <- as.data.frame(matrix(sample(c(NA, 'Yes', 'No', LETTERS), 
    4000*4000, replace=TRUE), ncol=4000), stringsAsFactors=FALSE) 

akrun1 <- function() (1:nrow(gsu1) %in% which(gsu1=='Yes', 
           arr.ind=TRUE)[,1]) +0L
akrun2 <- function() do.call(`pmax`, c(lapply(gsu1, `==`, 'Yes'), 
           na.rm=TRUE))
ExperimenteR <- function() rowSums(gsu1=="Yes", na.rm=TRUE)>=1

library(microbenchmark)
microbenchmark(akrun1(), akrun2(), ExperimenteR(), unit='relative', times=20L)
 #Unit: relative
 #        expr      min       lq     mean   median       uq      max neval cld
 #     akrun1() 1.244682 1.293628 1.293696 1.294336 1.319209 1.277138    20   b
 #     akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a 
 # ExperimenteR() 1.213802 1.296464 1.276666 1.295421 1.280282 1.209436    20   b
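
The three benchmarked functions return different types (0/1 integers versus a logical vector), so it may be worth confirming that they agree before comparing timings. A quick sketch on a much smaller data set; `small`, `r1`, `r2`, `r3` and `all_agree` are names introduced here for illustration:

```r
set.seed(24)
# Same construction as gsu1, but 50 x 20 so the check runs instantly
small <- as.data.frame(matrix(sample(c(NA, "Yes", "No", LETTERS),
    50 * 20, replace = TRUE), ncol = 20), stringsAsFactors = FALSE)

# Same expressions as akrun1(), akrun2() and ExperimenteR() above
r1 <- (1:nrow(small) %in% which(small == "Yes", arr.ind = TRUE)[, 1]) + 0L
r2 <- do.call(pmax, c(lapply(small, `==`, "Yes"), na.rm = TRUE))
r3 <- rowSums(small == "Yes", na.rm = TRUE) >= 1

# r3 is logical while r1 and r2 are 0/1, so compare after coercing r3
all_agree <- all(r1 == as.integer(r3)) && all(r2 == as.integer(r3))
all_agree
```

Whichever variant is used, wrapping the result in `sum()` gives the total number of rows containing at least one Yes.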
