This should be fully portable as long as you have the Rcpp and BH packages installed:
    library(Rcpp)
    library(inline)

    csvblanks <- '
      // open the file; return NULL to R if that fails
      string data = as<string>(filename);
      ifstream fil(data.c_str());
      if (!fil.is_open()) return(R_NilValue);

      typedef tokenizer< escaped_list_separator<char> > Tokenizer;

      vector<int> retval;
      string line;

      // one result per line: the number of zero-length fields
      while (getline(fil, line)) {
        int numblanks = 0;
        Tokenizer tok(line);
        for (Tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
          numblanks += (beg->length() == 0) ? 1 : 0;
        }
        retval.push_back(numblanks);
      }

      return(wrap(retval));
    '

    count_blanks <- rcpp(
      signature(filename="character"),
      body=csvblanks,
      includes=c("#include <iostream>",
                 "#include <fstream>",
                 "#include <vector>",
                 "#include <string>",
                 "#include <algorithm>",
                 "#include <iterator>",
                 "#include <boost/tokenizer.hpp>",
                 "using namespace Rcpp;",
                 "using namespace std;",
                 "using namespace boost;")
    )
After that, you can call count_blanks(FULLPATH), and it will return an integer vector giving the number of empty fields on each line of the file.
I used it for this file:

    "DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"
    1,2,3,4,5
    1,,3,4,5
    1,2,3,4,5
    1,2,,4,5
    1,2,3,4,5
    1,2,3,,5
    1,2,3,4,5
    1,2,3,4,
    1,2,3,4,5
    1,,3,,5
    1,2,3,4,5
    ,2,,4,
    1,2,3,4,5
via:

    count_blanks("/tmp/a.csv")
WARNINGS
- It clearly does not skip the header row, so it could use a header logical parameter with the corresponding C/C++ code (which would be fairly simple).
- If you consider whitespace-only fields (i.e. [:space:]+) to be "empty", you will need something more complex than calling length. This is one of the possible ways to deal with it if you need to.
- It uses the default configuration of the Boost escaped_list_separator function, which is defined here. It can also be configured with custom quote and delimiter characters (which lets you mimic read.csv / read.table more closely).
This should come closer to the performance of count.fields / C_countfields, and it spares you the memory cost of reading every line into R just to find the lines you ultimately want to target. I don't think that pre-allocating space for the returned vector will speed things up significantly, but you can see the discussion here, which shows how to do this, if necessary.