Efficiently counting value occurrences in a large data frame

I have a large data frame (~600K rows) with a string column (link):

    doc_id,link
    1,http://example.com
    1,http://example.com
    2,http://test1.net
    2,http://test2.net
    2,http://test5.net
    3,http://test1.net
    3,http://example.com
    4,http://test5.net

and I would like to count how many times each string value occurs in the data frame. The result should look like this:

    link, count
    http://example.com, 3
    http://test1.net, 2
    http://test2.net, 1
    http://test5.net, 2

Is there an efficient way to do this in R? Converting the data frame to a matrix does not work because of its size. I am currently using the plyr package, but it is too slow.

1 answer

The table function counts occurrences, and it is very fast compared to ddply. So maybe something like this:

    # some sample data
    set.seed(42)
    df <- data.frame(doc_id=1:10, link=sample(letters[1:3], 10, replace=TRUE))

    cnt <- as.data.frame(table(df$link))
    # Assign appropriate names (optional)
    names(cnt) <- c("link", "count")
    cnt

Which gives the following output:

      link count
    1    a     2
    2    b     3
    3    c     5
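
To connect this back to the question, here is a minimal sketch that applies the same table approach to the sample rows from the question. The data frame is typed in by hand for illustration, and stringsAsFactors = FALSE is my own choice to keep link as plain character strings; neither comes from the original answer.

```r
# The question's sample rows, entered by hand (illustrative only).
df <- data.frame(
  doc_id = c(1, 1, 2, 2, 2, 3, 3, 4),
  link = c("http://example.com", "http://example.com", "http://test1.net",
           "http://test2.net", "http://test5.net", "http://test1.net",
           "http://example.com", "http://test5.net"),
  stringsAsFactors = FALSE
)

# table() tallies each distinct link; as.data.frame() turns the tally
# into a two-column data frame (Var1, Freq), which is then renamed.
cnt <- as.data.frame(table(df$link), stringsAsFactors = FALSE)
names(cnt) <- c("link", "count")
cnt
```

This should give one row per distinct link with the counts 3, 2, 1 and 2 asked for in the question. If you want the most frequent links first, cnt[order(-cnt$count), ] reorders the result.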