How to read a text file in R when each record is a paragraph, and some records have 4 fields and others have 6

How can I read in a text file in which each record is a paragraph and each new line denotes a separate field? The complication is that some records have 4 lines and some have 6. @DWin answered my earlier question for the case where the number of fields differs by one, but it all fell apart when the difference was two. You can find his answer here.

So here is my latest simulated input text:

    TheInstitute 5467
    telephone line 4125526987 x 4567
    datetime 2011110516 12:56
    blay blay blah who knows what, but anyway it may have a comma

    TheInstitute 5467
    telephone line 4125526987 x 4567
    datetime 2011110516 12:58
    blay blay blah who knows what

    TheInstitute 5467
    telephone line 412552999 x 4999
    bump phone line 4125527777
    bump pony pony oops 4125527777
    datetime 2011110516 12:59
    blay blay blah who knows what

    TheInstitute 5467
    telephone line 4125526987 x 4567
    bump phone line 4125527777
    bump pony pony oops 4125527777
    datetime 2011110516 13:51
    blay blay blah who knows what, but anyway it may have a comma

    TheInstitute 5467
    telephone line 4125526987 x 4567
    datetime 2011110516 14:56
    blay blay blah who knows what

This is what the output should look like. It is actually one step away from what I need. I put the dput text representation of the R data.frame below. You will see that everything is in the data frame, but in some rows the field values are shifted by two columns, because some records have two additional fields.

 structure(list(institution = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "TheInstitute 5467", class = "factor"), telephoneline = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c("telephone line 4125526987 x 4567", "telephone line 412552999 x 4999"), class = "factor"), date.or.bump = structure(c(2L, 3L, 1L, 1L, 4L), .Label = c("bump phone line 4125527777", "datetime 2011110516 12:56", "datetime 2011110516 12:58", "datetime 2011110516 14:56"), class = "factor"), field4 = structure(c(2L, 1L, 3L, 3L, 1L), .Label = c("blay blay blah who knows what", "blay blay blah who knows what, but anyway it may have a comma", "bump pony pony oops 4125527777"), class = "factor"), field5 = structure(c(1L, 1L, 2L, 3L, 1L), .Label = c("", "datetime 2011110516 12:59", "datetime 2011110516 13:51"), class = "factor"), field6 = structure(c(1L, 1L, 2L, 3L, 1L), .Label = c("", "blay blay blah who knows what", "blay blay blah who knows what, but anyway it may have a comma" ), class = "factor")), .Names = c("institution", "telephoneline", "date.or.bump", "field4", "field5", "field6"), class = "data.frame", row.names = c(NA, -5L)) 

PS: I believe the way to share a data frame is with dput. Or can I upload the .Rdata file here?

3 answers

There may be a more elegant way, but this should do the job.

    x <- readLines("foo.txt")   # read data with readLines
    nx <- !nchar(x)             # locate lines with only empty strings
    # create a list (split by empty lines, with empty lines removed)
    y <- split(x[!nx], cumsum(nx)[!nx])
    # determine largest number of columns
    maxLength <- max(sapply(y, length))
    # pad each list element with empty strings
    z <- lapply(y, function(x) c(x, rep("", maxLength - length(x))))
    # create final matrix
    out <- do.call(rbind, z)
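To try the steps above without creating foo.txt, here is a minimal sketch that feeds two of the question's records through a textConnection. The `sample_text` vector and the use of `textConnection` are assumptions made for illustration; the remaining lines follow the answer's steps.

```r
# Hypothetical in-memory stand-in for foo.txt: two records from the question,
# one with 4 fields and one with 6, separated by an empty line.
sample_text <- c(
  "TheInstitute 5467",
  "telephone line 4125526987 x 4567",
  "datetime 2011110516 12:56",
  "blay blay blah who knows what, but anyway it may have a comma",
  "",
  "TheInstitute 5467",
  "telephone line 412552999 x 4999",
  "bump phone line 4125527777",
  "bump pony pony oops 4125527777",
  "datetime 2011110516 12:59",
  "blay blay blah who knows what"
)

con <- textConnection(sample_text)
x <- readLines(con)
close(con)

nx <- !nchar(x)                        # TRUE for the empty separator lines
y <- split(x[!nx], cumsum(nx)[!nx])    # one list element per record
maxLength <- max(sapply(y, length))    # length of the longest record
z <- lapply(y, function(r) c(r, rep("", maxLength - length(r))))
out <- do.call(rbind, z)

dim(out)   # one row per record, one column per field position
```

Because shorter records are padded on the right, the datetime line sits in column 3 for the 4-field record and in column 5 for the 6-field record, which is the column shift the question describes.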

Update:

Here is another solution using plyr::rbind.fill:

    library(plyr)               # provides rbind.fill
    x <- readLines("foo.txt")   # read data with readLines
    nx <- !nchar(x)             # locate lines with only empty strings
    # create final data.frame
    out <- rbind.fill(lapply(split(x[!nx], cumsum(nx)[!nx]),
                             function(x) data.frame(t(x))))

Another strategy is to pick a marker string of your choosing (call it EOL) to mark the end of each line, and then paste all the lines together.

You can then use two rounds of strsplit, first to extract the records and then to extract the fields from each record. (Records will be separated by two consecutive EOLs, and fields by a single EOL.)

    EOL <- " !@ "   # (for instance)
    x <- readLines("filename.R")
    x <- paste(x, collapse=EOL)[[1]]
    x <- strsplit(x, paste(EOL, EOL, sep=""))           # Split apart records
    lapply(x, FUN=function(X) strsplit(X, EOL))[[1]]    # Split apart fields w/in records

This method appeals to me because it is close to what I would like to do when I first read in the file (i.e. use "\n\n" as the sep character), but cannot do with either scan or readLines.
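The two rounds of splitting can be illustrated on a tiny in-memory vector. This is only a sketch: `lines_in` is a hypothetical stand-in for the result of readLines, and `fixed = TRUE` is an addition here so that the marker is matched literally rather than as a regular expression.

```r
EOL <- " !@ "   # marker string; assumed not to occur anywhere in the data

# Hypothetical stand-in for the lines returned by readLines():
# a 2-field record, an empty separator line, then a 3-field record
lines_in <- c("a1", "a2", "", "b1", "b2", "b3")

joined  <- paste(lines_in, collapse = EOL)                            # one long string
records <- strsplit(joined, paste0(EOL, EOL), fixed = TRUE)[[1]]      # split on double EOL
fields  <- strsplit(records, EOL, fixed = TRUE)                       # split on single EOL

fields
```

The empty line contributes nothing between two markers, so each record boundary shows up as two consecutive EOLs, which is what the first strsplit keys on.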


Read the data.

    dat <- readLines("filename.txt")

Split the data into records (based on the solution by Josh O'Brien).

    dat_rec <- lapply(strsplit(paste(dat, collapse="\n"), split="\n\n")[[1]],
                      function(x) strsplit(x, split="\n")[[1]])

Convert the data to named vectors (assuming the last field of each record is a comment, and each value starts with a digit).

    dat_rec_vn <- lapply(dat_rec, function(x) {
      vn <- gsub(" ", "_", sub(" ", "", gsub("^(\\D*) \\d.*$", "\\1", x[-length(x)])))
      y <- gsub("^(\\D*) (\\d.*)$", "\\2", x[-length(x)])
      names(y) <- vn
      return(y)
    })

Get unique field names in the data.

  vn <- unique(unlist(lapply(dat_rec_vn,names),use.names=FALSE)) 

Combine the fields into a matrix and name its columns.

    dat_mat <- do.call(rbind, lapply(dat_rec_vn, function(x) {
      y <- vector(mode="character", length=length(vn))
      y[match(names(x), vn)] <- x
      return(y)
    }))
    colnames(dat_mat) <- vn
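To see how matching on field names lines up records of different lengths, here is a minimal self-contained sketch of the steps above. The `dat_rec` list is a hypothetical hand-built stand-in for the result of the record-splitting step, using two records from the question; the intermediate name `body` is also introduced only for readability.

```r
# Hypothetical records, already split into fields: one with 4 lines, one with 6
dat_rec <- list(
  c("TheInstitute 5467",
    "telephone line 4125526987 x 4567",
    "datetime 2011110516 12:56",
    "blay blay blah who knows what"),
  c("TheInstitute 5467",
    "telephone line 412552999 x 4999",
    "bump phone line 4125527777",
    "bump pony pony oops 4125527777",
    "datetime 2011110516 12:59",
    "blay blay blah who knows what")
)

# Name each value by the non-digit prefix of its line; the last line
# (the comment) is dropped, as in the answer
dat_rec_vn <- lapply(dat_rec, function(x) {
  body <- x[-length(x)]
  vn <- gsub(" ", "_", sub(" ", "", gsub("^(\\D*) \\d.*$", "\\1", body)))
  y <- gsub("^(\\D*) (\\d.*)$", "\\2", body)
  names(y) <- vn
  y
})

# All field names seen in any record, in order of first appearance
vn <- unique(unlist(lapply(dat_rec_vn, names), use.names = FALSE))

# Place each record's values into the columns that match its field names
dat_mat <- do.call(rbind, lapply(dat_rec_vn, function(x) {
  y <- vector(mode = "character", length = length(vn))
  y[match(names(x), vn)] <- x
  y
}))
colnames(dat_mat) <- vn

dat_mat[, "datetime"]
```

Because values are placed by name rather than by position, both datetime values end up in the same column, and the fields missing from the shorter record are left as empty strings. Note that this variant discards the comment line of each record.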

SECOND solution (using gawk)

    gawk_cmd <- "gawk 'BEGIN{FS=\"\\n\";RS=\"\";OFS=\"\\t\";ORS=\"\\n\"} {$1=$1; print $0}' test_multi.txt"
    dat <- strsplit(system(gawk_cmd, intern=TRUE), split="\t")
    NF <- do.call(max, lapply(dat, length))
    M <- do.call(rbind, lapply(dat, "[", seq(NF)))
