Extract data from text files

Similar questions seem to exist for other languages, but I can't find one for R.

I have several text files in subdirectories of a directory; they all have the same extension (.log) and contain a mixture of text and data. I want to extract a couple of lines from each of these relatively large files.

For example, one file looks like this:

    blahblahblah
     NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS = 210
    blahblahblah
     ----------------------------------------
     CPU timing information for all processes
     ========================================
     0: 8853.469 + 133.948 = 8987.417
     1: 8850.817 + 126.587 = 8977.405
     2: 8851.925 + 128.576 = 8980.501
     3: 8847.992 + 125.871 = 8973.864
     ----------------------------------------
     ddikick.x: exited gracefully.
    blahblahblah

I want to collect the number of basis functions (210 in this example) and the total CPU times.

The line "NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =" is unique within each file; i.e. if I open the file in a text editor and search for this string, I get only that one line. The same goes for "CPU timing information for all processes" and "exited gracefully."

I appreciate that it seems that I haven’t done much to help myself, but I just don’t know where to start. If someone can point me in the right direction, I hope I can fill in the rest.

After the help provided to me by @Ben (see below), here is the code I used:

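    library(stringr)

    filesearch <- function(x) {
      f <- readLines(x)
      # the basis-function line appears once per file; take the number at its end
      cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f, value = TRUE)
      val <- as.numeric(str_extract(cline, "[0-9]+$"))
      # the four per-process timing lines start two lines below the header
      coline <- grep("^ +CPU timing information", f)
      numstr <- sapply(str_extract_all(f[coline + 2:5], "[0-9.]+"), as.numeric)
      # sum the per-process totals (row 4) and divide by 60
      cline1 <- sum(numstr[4, ]) / 60
      output <- c(val, cline1)
      return(cat(output, "\n"))
    }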

I sourced this function, typed in the name of the file I needed each time, and then copied the two results into another file by hand. Not as elegant as I would like, but it saved me a lot of time doing it this way. Thanks again @Ben.

1 answer

Maybe

    library(stringr)
    f <- readLines("datafile.txt")
    cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f, value = TRUE)
    val <- as.numeric(str_extract(cline, "[0-9]+$"))

will work?

To get the other values, try

    cline <- grep("^ +CPU timing information", f)
    (numstr <- sapply(str_extract_all(f[cline+2:5], "[0-9.]+"), as.numeric))
    ##          [,1]     [,2]     [,3]     [,4]
    ## [1,]    0.000    1.000    2.000    3.000
    ## [2,] 8853.469 8850.817 8851.925 8847.992
    ## [3,]  133.948  126.587  128.576  125.871
    ## [4,] 8987.417 8977.405 8980.501 8973.864

sapply transposes the matrix of values (each line of the file becomes a column), so the last row is the bit we want (it corresponds to the last column in the file). Extract it using numstr[4,] or numstr[nrow(numstr),] or tail(numstr,1).
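To see why the totals end up in the last row, here is a small self-contained sketch. It assumes the four timing lines from the example log and uses base R's regmatches()/gregexpr() in place of str_extract_all() so that it runs without extra packages; the variable names are only illustrative.

```r
# Stand-in for the four timing lines of the example log file
lines <- c(" 0: 8853.469 + 133.948 = 8987.417",
           " 1: 8850.817 + 126.587 = 8977.405",
           " 2: 8851.925 + 128.576 = 8980.501",
           " 3: 8847.992 + 125.871 = 8973.864")

# One numeric vector per line: process id, two CPU times, total
per_line <- lapply(regmatches(lines, gregexpr("[0-9.]+", lines)), as.numeric)

# sapply binds each line's vector as a COLUMN, so fields run down the rows
numstr <- sapply(per_line, identity)
dim(numstr)                       # fields (rows) by lines (columns)

# The last row therefore holds the last field (the total) of every line
totals <- numstr[nrow(numstr), ]
identical(totals, numstr[4, ])
all.equal(totals, as.vector(tail(numstr, 1)))
```

Using numstr[nrow(numstr), ] rather than numstr[4, ] keeps working if a line ever carries a different number of fields, as long as the total stays last.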

(edit: allow spaces before the "CPU timing" line) (edit: do it right!)

(To do this for all log files, package it in a function and use list.files(pattern="\\.log$") in combination with sapply ...)
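A rough sketch of that last suggestion might look like the following. It is not the code from the question: extract_log and summarise_logs are hypothetical names, base R's sub() stands in for the stringr calls, lapply() plus rbind() is used instead of sapply() to get one row per file, and it assumes exactly four processes per log, as in the example. Summing the totals and dividing by 60 follows what the question's function does. recursive = TRUE is added because the logs sit in subdirectories.

```r
# Pull the basis-function count and the summed CPU totals from one log file.
extract_log <- function(path) {
  f <- readLines(path)

  # The basis-function line occurs once per file; take the integer after "="
  bline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f, value = TRUE)
  nbasis <- as.numeric(sub(".*= *([0-9]+).*", "\\1", bline))

  # Header line, then "=====", then one line per process
  start <- grep("^ *CPU timing information", f)
  timing <- f[start + 2:5]           # assumption: exactly four processes

  # Keep only what follows the "=" on each timing line, i.e. the total
  totals <- as.numeric(sub(".*= *", "", timing))

  data.frame(file = path, nbasis = nbasis, cpu_total = sum(totals) / 60)
}

# Apply extract_log() to every .log file below `dir`, one row per file.
summarise_logs <- function(dir = ".") {
  logs <- list.files(dir, pattern = "\\.log$", recursive = TRUE,
                     full.names = TRUE)
  do.call(rbind, lapply(logs, extract_log))
}
```

Calling results <- summarise_logs("path/to/logs") and then write.csv(results, "summary.csv", row.names = FALSE) would replace the manual copying step; both paths here are placeholders.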
