How to read information from text files?

I have hundreds of text files with the following information in each file:

 *****Auto-Corelation Results******
 1 .09 -.19 .18 non-Significant
 *****STATISTICS FOR MANN-KENDELL TEST******
 S= 609
 VAR(S)= 162409.70
 Z= 1.51
 Random : No trend at 95%
 *****SENs STATISTICS ******
 SEN SLOPE = .24

Now I want to read all these files and "collect" Sen statistics from each file (for example, .24 ) and compile them into one file along with the corresponding file names. I have to do it in R.

I worked with CSV files but not sure how to use text files.

This is the code I'm using now:

 require(gtools)
 GG <- grep("*.txt", list.files(), value = TRUE)
 GG <- mixedsort(GG)
 S <- sapply(seq(GG), function(i){
   X <- readLines(GG[i])
   grep("SEN SLOPE", X, value = TRUE)
 })
 spl <- unlist(strsplit(S, ".*[^.0-9]"))
 SenStat <- as.numeric(spl[nzchar(spl)])
 SenStat <- data.frame(SenStat, file = GG)
 write.table(SenStat, "sen.csv", sep = ", ", row.names = FALSE)

The current code cannot correctly read all the values and gives this error:

 Warning message: NAs introduced by coercion 

Also, I do not get the file names in a separate output column. Please help!


Diagnostics 1

The code also reads the = sign. This is the print(spl) output:

  [1] ""       "5.55"   ""       "-.18"   ""       "3.08"   ""       "3.05"   ""       "1.19"   ""       "-.32"
 [13] ""       ".22"    ""       "-.22"   ""       ".65"    ""       "1.64"   ""       "2.68"   ""       ".10"
 [25] ""       ".42"    ""       "-.44"   ""       ".49"    ""       "1.44"   ""       "=-1.07" ""       ".38"
 [37] ""       ".14"    ""       "=-2.33" ""       "4.76"   ""       ".45"    ""       ".02"    ""       "-.11"
 [49] ""       "=-2.64" ""       "-.63"   ""       "=-3.44" ""       "2.77"   ""       "2.35"   ""       "6.29"
 [61] ""       "1.20"   ""       "=-1.80" ""       "-.63"   ""       "5.83"   ""       "6.33"   ""       "5.42"
 [73] ""       ".72"    ""       "-.57"   ""       "3.52"   ""       "=-2.44" ""       "3.92"   ""       "1.99"
 [85] ""       ".77"    ""       "3.01"

Diagnostics 2

I think I found a problem. The negative sign is a little more complicated. In some files it appears like this:

 SEN SLOPE =-1.07
 SEN SLOPE = -.11

Because there is no space after the = in the first case, I get NA for it, while the code reads the second one correctly. How can I change the regex to fix this? Thank you.

+6
4 answers

Suppose "text.txt" is one of your text files. After reading it into R with readLines , you can use grep to find the line containing SEN SLOPE . Without additional arguments, grep returns the index number(s) of the element(s) in which the regular expression was found; here we find it is the 11th line. Add value = TRUE to get the line as it was read.

 x <- readLines("text.txt")
 grep("SEN SLOPE", x)
 ## [1] 11
 ( gg <- grep("SEN SLOPE", x, value = TRUE) )
 ## [1] "SEN SLOPE = .24"

To find all .txt files in the working directory, we can use list.files with a regular expression.

 list.files(pattern = "\\.txt$")  ## "\\.txt$" is a regex; "*.txt" is glob syntax
 ## [1] "text.txt"

APPLYING THIS TO MULTIPLE FILES

I created a second text file text2.txt with a different SEN SLOPE value to illustrate how to apply this method to multiple files. We can use sapply and then strsplit to get the required spl values.

 GG <- list.files(pattern = "\\.txt$")
 S <- sapply(seq_along(GG), function(i){
   X <- readLines(GG[i])
   ## added 04/23/14 to account for empty files (as per comment)
   ifelse(length(X) > 0, grep("SEN SLOPE", X, value = TRUE), NA)
 })
 ## regex changed to capture up to and including "=" and
 ## surrounding space, if any - 04/23/14 (as per comment)
 spl <- unlist(strsplit(S, split = ".*((=|(\\s=))|(=\\s|\\s=\\s))"))
 SenStat <- as.numeric(spl[nzchar(spl)])

Then we can put the results in a data frame and send it to a file using write.table

 ( SenStatDf <- data.frame(SenStat, file = GG) )
 ##   SenStat      file
 ## 1    0.46 text2.txt
 ## 2    0.24  text.txt

We can write it to a file using

 write.table(SenStatDf, "myFile.csv", sep = ", ", row.names = FALSE) 

UPDATED 07/21/2014:

Since the result is written to a file anyway, this can be made much simpler (and faster) with

 ( SenStatDf <- cbind(
   SenSlope = c(lapply(GG, function(x){
     y <- readLines(x)
     z <- y[grepl("SEN SLOPE", y)]
     unlist(strsplit(z, split = ".*=\\s+"))[-1]
   }), recursive = TRUE),
   file = GG
 ) )
 #      SenSlope file
 # [1,] ".46"    "test2.txt"
 # [2,] ".24"    "test.txt"

It can then be written to a file and read back into R with

 write.table(SenStatDf, "myFile.txt", row.names = FALSE)
 read.table("myFile.txt", header = TRUE)
 #   SenSlope      file
 # 1     1.24 test2.txt
 # 2     0.24  test.txt
+10

First create a sample text file:

 cat('*****Auto-Corelation Results******
 1 .09 -.19 .18 non-Significant
 *****STATISTICS FOR MANN-KENDELL TEST******
 S= 609
 VAR(S)= 162409.70
 Z= 1.51
 Random : No trend at 95%
 *****SENs STATISTICS ******
 SEN SLOPE = .24', file = 'samp.txt')

Then read it in:

 tf <- readLines('samp.txt') 

Now extract the appropriate line:

 sen_text <- grep('SEN SLOPE',tf,value=T) 

And then get the value after the equals sign:

 sen_value <- as.numeric(unlist(strsplit(sen_text,'='))[2]) 

Then combine these results across all of your files (left out here, since the file layout from the original question isn't available).
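A minimal sketch of that combining step, assuming the .txt files sit in the working directory and each contains exactly one SEN SLOPE line (the output file name sen.csv is a placeholder):

```r
# Hypothetical wrapper around the steps above: collect the value
# after "=" from every .txt file in the working directory.
files <- list.files(pattern = "\\.txt$")
SenStat <- sapply(files, function(f) {
  tf <- readLines(f)
  sen_text <- grep('SEN SLOPE', tf, value = TRUE)
  as.numeric(unlist(strsplit(sen_text, '='))[2])
})
res <- data.frame(file = files, SenStat = SenStat, row.names = NULL)
write.csv(res, "sen.csv", row.names = FALSE)
```

Note this inherits the same limitation discussed in the question: splitting on '=' alone keeps a leading minus sign only if it sits after the = with no intervening characters stripped, so check a few files by hand first.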

+4

If your text files always have this format (for example, SEN SLOPE is always on line 11) and the layout is identical across all your files, you can do what you need in just two lines.

 char_vector <- readLines("Path/To/Document/sample.txt")
 statistic <- as.numeric(strsplit(char_vector[11], " ")[[1]][5])

This will give you 0.24.

Then you iterate over all your files using an apply function or a for loop.

For clarity:

 > char_vector[11]
 [1] "SEN SLOPE = .24"

and

 > strsplit(char_vector[11], " ")
 [[1]]
 [1] "SEN"   "SLOPE" "="     ""      ".24"

So you want element [[1]][5] of the strsplit result.
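The iteration over all files could then be sketched like this (assuming every file really does keep SEN SLOPE on line 11 with the same spacing; "myDir" is a placeholder path):

```r
# Hypothetical loop over all files; this breaks if any file
# deviates from the fixed line-11 layout shown above.
files <- list.files("myDir", pattern = "\\.txt$", full.names = TRUE)
stats <- sapply(files, function(f) {
  char_vector <- readLines(f)
  as.numeric(strsplit(char_vector[11], " ")[[1]][5])  # mirrors [[1]][5] above
})
data.frame(file = basename(files), SenStat = stats, row.names = NULL)
```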

+1

Step 1: save the full file names in one variable:

 dataDir <- "Path/To/Files"  # directory containing the .txt files
 fileNames <- dir(dataDir, full.names = TRUE)

Step 2: read and process one of the files and make sure it gives the correct result:

 data.frame(
   file = basename(fileNames[1]),
   SEN.SLOPE = as.numeric(tail(
     strsplit(grep('SEN SLOPE', readLines(fileNames[1]), value = TRUE), "=")[[1]], 1))
 )

Step 3: do it for all fileNames:

 do.call(rbind, lapply(fileNames, function(fileName)
   data.frame(
     file = basename(fileName),
     SEN.SLOPE = as.numeric(tail(
       strsplit(grep('SEN SLOPE', readLines(fileName), value = TRUE), "=")[[1]], 1))
   )
 ))

Hope this helps!

+1
