How to count the number of non-empty fields in a delimited file?

You can count the number of fields per line in a comma-, tab-, or otherwise delimited text file using utils::count.fields.

Here's a reproducible example:

    d <- data.frame(
      x = c(1, NA, 3, NA, 5),
      y = c(NA, "b", "c", NA, NA),
      z = c(NA, "beta", "gamma", NA, "epsilon")
    )
    fname <- "test.csv"
    write.csv(d, fname, na = "", row.names = FALSE)
    count.fields(fname, sep = ",")
    ## [1] 3 3 3 3 3 3

I want to count the number of non-empty fields in each row. I can do this in a clunky way by reading everything in and counting the values that are not NA.

    d2 <- read.csv(fname, na.strings = "")
    rowSums(!is.na(d2))
    ## [1] 1 2 3 0 2

I would much prefer a way that only scans the file (like count.fields), so that I can target specific sections to read in.

Is there a better way to count the number of non-empty fields in a delimited file?

1 answer

This should be fully portable, provided you have the Rcpp, inline and BH packages installed:

    library(Rcpp)
    library(inline)

    csvblanks <- '
      string data = as<string>(filename);
      ifstream fil(data.c_str());
      if (!fil.is_open()) return(R_NilValue);

      typedef tokenizer< escaped_list_separator<char> > Tokenizer;

      vector<string> fields;
      vector<int> retval;
      string line;

      while (getline(fil, line)) {
        int numblanks = 0;
        Tokenizer tok(line);
        for(Tokenizer::iterator beg=tok.begin(); beg!=tok.end(); ++beg){
          numblanks += (beg->length() == 0) ? 1 : 0 ;
        };
        retval.push_back(numblanks);
      }
      return(wrap(retval));
    '

    count_blanks <- rcpp(
      signature(filename="character"),
      body=csvblanks,
      includes=c("#include <iostream>",
                 "#include <fstream>",
                 "#include <vector>",
                 "#include <string>",
                 "#include <algorithm>",
                 "#include <iterator>",
                 "#include <boost/tokenizer.hpp>",
                 "using namespace Rcpp;",
                 "using namespace std;",
                 "using namespace boost;")
    )

After that, you can call count_blanks(FULLPATH), and it will return a vector with the number of empty fields on each line.

I ran it on this file:

    "DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"
    1,2,3,4,5
    1,,3,4,5
    1,2,3,4,5
    1,2,,4,5
    1,2,3,4,5
    1,2,3,,5
    1,2,3,4,5
    1,2,3,4,
    1,2,3,4,5
    1,,3,,5
    1,2,3,4,5
    ,2,,4,
    1,2,3,4,5

via:

    count_blanks("/tmp/a.csv")
    ## [1] 0 0 1 0 1 0 1 0 1 0 2 0 3 0

WARNINGS

  • It quite obviously does not skip the header line, so it could use a logical header parameter with the corresponding C/C++ code (which would be quite simple).
  • If you consider whitespace (i.e. [:space:]+) to be "empty", you will need something more complex than the call to length. This is one possible way to deal with it, should you need to.
  • It uses the default configuration of Boost's escaped_list_separator, which is defined here. This can also be configured with the quote and delimiter characters (which would let you mimic read.csv / read.table more closely).

This will more closely approximate the performance of count.fields / C_countfields and will save you from wasting memory by reading in every line just to find the lines you ultimately want to target. I don't think that pre-allocating space for the returned numeric vector would add much speed, but you can see the discussion here, which shows how to do it if necessary.
