How to get the number of rows for a specific value in a column

I am trying to get the row count for a specific value in a column. I have columns for name, year, and major. How can I find out, for example, how many BIO majors there are in this list?

I have DF <- (NAME, YEAR, MAJOR, GPA). I want a function that lets me eliminate any major with fewer than 20 people.

So I want something like this, but in real R code:

    DF <- function(x) {
      ## Y <- get number of people for each major
      ## GPA[DF$Y < 20] <- NA
    }

Any help would be appreciated.

3 answers

I think the two proposed methods are too complicated. Try either of these, the second of which is obviously the “right way.” :-) (Borrowing @gung's example.)

    # 1
    > tapply(DF$MAJOR, DF$MAJOR, length)
     BIO ECON HIST  LIT MATH 
     181  155  297  303   64 
    # 2
    > table(DF$MAJOR)

     BIO ECON HIST  LIT MATH 
     181  155  297  303   64 

And as far as efficiency?

    > system.time( {dt = data.table(DF)
    +     foo <- dt[,.N,by=MAJOR] })
       user  system elapsed 
      1.384   0.027   1.417 
    > system.time(foo <- table(DF$MAJOR) )
       user  system elapsed 
      0.110   0.025   0.134 
    # edit:
    > system.time( {dt = as.data.table(DF)
    +     foo <- dt[,.N,by=MAJOR] })
       user  system elapsed 
      0.064   0.022   0.086 

To answer the question in the comments about how to associate the tabulated count with each student's record: look at the ave function, then subset the rows either with "[" extraction or with subset:

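    # count students per major; pass the function as FUN=, since a bare third
    # argument would be treated by ave() as another grouping variable
    DF$group.size <- ave(seq_along(DF$MAJOR), DF$MAJOR, FUN = length)
    newDF <- DF[ DF$group.size >= 20 , ]   # 20 is the cutoff from the question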
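The same idea can be wrapped in a function, as the question asks. What follows is only a minimal sketch in base R: the name drop_small_majors, the min_size argument, and the toy data are made up for illustration, and it assumes the data frame has a MAJOR column.

```r
# Hypothetical helper: keep only rows whose MAJOR occurs at least `min_size` times.
drop_small_majors <- function(df, min_size = 20) {
  counts <- table(df$MAJOR)                      # rows per major
  keep   <- names(counts)[counts >= min_size]    # majors that are large enough
  out    <- df[df$MAJOR %in% keep, , drop = FALSE]
  if (is.factor(out$MAJOR)) {
    out$MAJOR <- droplevels(out$MAJOR)           # forget the removed majors
  }
  out
}

# Toy data (made up for illustration): 25 BIO, 30 HIST, 5 MATH
toy <- data.frame(
  NAME  = as.character(1:60),
  MAJOR = rep(c("BIO", "HIST", "MATH"), times = c(25, 30, 5)),
  GPA   = seq(2, 4, length.out = 60)
)

kept <- drop_small_majors(toy, min_size = 20)
table(kept$MAJOR)   # MATH (5 students) no longer appears
```

This drops the rows outright. If you would rather blank out the GPA as in the question's pseudocode, the same `%in%` test can be negated and used to assign NA instead.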

Again, this is a job for the grouping feature of the data.table package. It has a “.N” symbol, which holds the number of rows in each group and gives you exactly what you need. Borrowing from the previous answer:

    > N = 1000
    > set.seed(2)
    > dt <- data.table(NAME=as.character(1:N),
    +                  YEAR=sample(c("Freshman","Sophomore","Junior","Senior"),
    +                              size=N, replace=T),
    +                  MAJOR=sample(c("BIO","ECON","HIST","LIT","MATH"),size=N,
    +                               replace=T, prob=c(.20, .15, .30, .30, .05)),
    +                  GPA=runif(N, min=0, max=4))
    > dt[,.N,by=MAJOR]
       MAJOR   N
    1:  HIST 297
    2:   LIT 303
    3:   BIO 181
    4:  ECON 155
    5:  MATH  64

So now it is a single line. And it is also fast (using N = 1,000,000):

    > system.time( foo <- cbind(levels(unique(DF$MAJOR)),
    +     lapply(unique(DF$MAJOR), function(x){ sum(DF$MAJOR==x) })) )
       user  system elapsed 
      0.616   0.050   0.665 
    > dt = data.table(DF)
    > system.time( foo <- dt[,.N,by=MAJOR] )
       user  system elapsed 
      0.039   0.002   0.042 

The basic way to count how many of something you have is to sum a logical vector, in which each element is TRUE (counted as 1) if the corresponding source element is the thing you want to count, and FALSE (counted as 0) otherwise.

Let's start with some data:

    N = 1000
    set.seed(2)
    DF <- data.frame(NAME=as.character(1:N),
                     YEAR=sample(c("Freshman","Sophomore","Junior","Senior"),
                                 size=N, replace=TRUE),
                     MAJOR=sample(c("BIO","ECON","HIST","LIT","MATH"), size=N,
                                  replace=TRUE, prob=c(.20, .15, .30, .30, .05)),
                     GPA=runif(N, min=0, max=4),
                     # needed in R >= 4.0 so that MAJOR is a factor, as assumed below
                     stringsAsFactors=TRUE)

This is how to find out how many BIO majors you have:

    sum(DF$MAJOR=="BIO")
    [1] 181
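To see why summing works, it can help to look at the intermediate logical vector on a tiny example. This is just an illustrative sketch; the majors vector below is made up and is not part of the data above.

```r
majors <- c("BIO", "HIST", "BIO", "MATH", "BIO")  # made-up toy vector

is_bio <- majors == "BIO"   # TRUE FALSE TRUE FALSE TRUE
sum(is_bio)                 # 3: TRUE counts as 1, FALSE as 0
mean(is_bio)                # 0.6: the share of BIO entries

# If the column may contain NA, the comparison yields NA too,
# so ask sum() to skip those entries:
sum(c("BIO", NA, "HIST") == "BIO", na.rm = TRUE)  # 1
```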

If you want to know how many you have for each major, you can get the list of majors with ?unique and then apply the function above to that list with ?lapply:

 lapply(unique(DF$MAJOR), function(x){ sum(DF$MAJOR==x) }) 

Here's a slightly prettier version:

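    # as.character(unique(...)) keeps the labels in the same order as the counts;
    # levels(unique(...)) would list them alphabetically and mislabel the rows
    cbind(as.character(unique(DF$MAJOR)),
          lapply(unique(DF$MAJOR), function(x){ sum(DF$MAJOR==x) }))
         [,1]   [,2]
    [1,] "HIST" 297 
    [2,] "LIT"  303 
    [3,] "BIO"  181 
    [4,] "ECON" 155 
    [5,] "MATH" 64  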

You should be able to take it from here.


Update: @DWin is right, I made it too complicated. Since DF$MAJOR is a factor, you can simply do:

    > summary(DF$MAJOR)
     BIO ECON HIST  LIT MATH 
     181  155  297  303   64 