How do you parse a sysmon file to extract certain information using R?

I am trying to read a bunch of files of this type with R, parse the information, and put the data into a data frame in a particular format.

This is the contents of the file:

 last_run                        current_run                     seconds
 ------------------------------- ------------------------------- -----------
 Jul  4 2016  7:17AM             Jul  4 2016  7:21AM                     226

 Engine Utilization (Tick %)        User Busy   System Busy    I/O Busy        Idle
 -------------------------       ------------  ------------  ----------  ----------
 ThreadPool : syb_default_pool
  Engine 0                              5.0 %         0.4 %      22.4 %      72.1 %
  Engine 1                              3.9 %         0.5 %      22.8 %      72.8 %
  Engine 2                              5.6 %         0.3 %      22.5 %      71.6 %
  Engine 3                              5.1 %         0.4 %      22.7 %      71.8 %
 -------------------------       ------------  ------------  ----------  ----------
 Pool Summary        Total             336.1 %        25.6 %    1834.6 %    5803.8 %
                     Average             4.2 %         0.3 %      22.9 %      72.5 %
 -------------------------       ------------  ------------  ----------  ----------
 Server Summary      Total             336.1 %        25.6 %    1834.6 %    5803.8 %
                     Average             4.2 %         0.3 %      22.9 %      72.5 %

 Transaction Profile
 -------------------
 Transaction Summary                  per sec      per xact       count  % of total
 -------------------------       ------------  ------------  ----------  ----------
   Committed Xacts                      137.3           n/a       41198         n/a

 Average Runnable Tasks                 1 min         5 min      15 min  % of total
 -------------------------       ------------  ------------  ----------  ----------
 ThreadPool : syb_default_pool
   Global Queue                           0.0           0.0         0.0       0.0 %
   Engine 0                               0.0           0.1         0.1       0.6 %
   Engine 1                               0.0           0.0         0.0       0.0 %
   Engine 2                               0.2           0.1         0.1       2.6 %
 -------------------------       ------------  ------------  ----------
 Pool Summary        Total                7.2           5.9         6.1
                     Average              0.1           0.1         0.1
 -------------------------       ------------  ------------  ----------
 Server Summary      Total                7.2           5.9         6.1
                     Average              0.1           0.1         0.1

 Device Activity Detail
 ----------------------
 Device: /dev/vx/rdsk/sybaserdatadg/datadev_125
   datadev_125                        per sec      per xact       count  % of total
 -------------------------       ------------  ------------  ----------  ----------
 Total I/Os                               0.0           0.0           0         n/a
 -------------------------       ------------  ------------  ----------  ----------
 Total I/Os                               0.0           0.0           0       0.0 %
 -----------------------------------------------------------------------------
 Device: /dev/vx/rdsk/sybaserdatadg/datadev_126
   datadev_126                        per sec      per xact       count  % of total
 -------------------------       ------------  ------------  ----------  ----------
 Total I/Os                               0.0           0.0           0         n/a
 -------------------------       ------------  ------------  ----------  ----------
 Total I/Os                               0.0           0.0           0       0.0 %
 -----------------------------------------------------------------------------
 Device: /dev/vx/rdsk/sybaserdatadg/datadev_127
   datadev_127                        per sec      per xact       count  % of total
 -------------------------       ------------  ------------  ----------  ----------
 Reads
   APF                                    0.0           0.0           5       0.4 %
   Non-APF                                0.0           0.0           1       0.1 %
 Writes                                   3.8           0.0        1128      99.5 %
 -------------------------       ------------  ------------  ----------  ----------
 Total I/Os                               3.8           0.0        1134       0.1 %
 Mirror Semaphore Granted                 3.8           0.0        1134     100.0 %
 Mirror Semaphore Waited                  0.0           0.0           0       0.0 %
 -----------------------------------------------------------------------------
 Device: /sybaser/database/sybaseR/dev/sybaseR.datadev_000
   GPS_datadev_000                    per sec      per xact       count  % of total
 -------------------------       ------------  ------------  ----------  ----------
 Reads
   APF                                    7.9           0.0        2372      55.9 %
   Non-APF                                5.5           0.0        1635      38.6 %
 Writes                                   0.8           0.0         233       5.5 %
 -------------------------       ------------  ------------  ----------  ----------
 Total I/Os                              14.1           0.0        4240       0.3 %
 Mirror Semaphore Granted                14.1           0.0        4239     100.0 %
 Mirror Semaphore Waited                  0.0           0.0           2       0.0 %

I need to extract "Jul 4 2016 7:21AM" as a date, the "Server Summary" Average value ("4.2 %") from the "Engine Utilization (Tick %)" section, and the count (41198) from the "Transaction Summary" in the "Transaction Profile" section.

So my data frame should look something like this:

 Date               Cpu  Count
 Jul 4 2016 7:21AM  4.2  41198
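For reference, this target row can be sketched directly in R; the strptime format string below is my assumption about how "Jul 4 2016 7:21AM" should be parsed:

```r
# Sketch of the target data frame, built by hand from the sample values.
# The format string "%b %d %Y %I:%M%p" is an assumption (12-hour clock + AM/PM).
target <- data.frame(
  Date  = strptime("Jul 4 2016 7:21AM", "%b %d %Y %I:%M%p", tz = "UTC"),
  Cpu   = 4.2,
  Count = 41198L,
  stringsAsFactors = FALSE
)
```

Keeping Date as a real date-time (rather than a string) makes later sorting and plotting easier.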

Can someone help me parse this file to get this output?

I tried something like this:

 read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9]) 

to get this line:

 Average 4.2 % 0.3 % 22.9 % 72.5 % 

But I want to extract only the Average line that appears right after

"Engine Utilization (Tick %)", since there can be many lines starting with Average. The Average line displayed immediately after "Engine Utilization (Tick %)" is the one I want.

How can I modify this line to extract that information from the file:

 read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9]) 

Is it possible to use grep in this read.table line to search for specific characters?
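One way this could work, sketched with a few inlined sample lines (the variable names are illustrative): locate the section header with grep(), then take the first Average line after it:

```r
# Sketch: find "Engine Utilization (Tick %)", then the first "Average"
# line that follows it (sample lines inlined for illustration).
txt <- c("Engine Utilization (Tick %)  User Busy  System Busy",
         " Pool Summary  Average  4.2 %  0.3 %  22.9 %  72.5 %",
         "Transaction Profile",
         " Server Summary  Average  0.1  0.1  0.1")
i_sec <- grep("Engine Utilization (Tick %)", txt, fixed = TRUE)[1]
i_avg <- grep("Average", txt, fixed = TRUE)
i_avg <- min(i_avg[i_avg > i_sec])             # first Average after the header
fields <- scan(text = txt[i_avg], what = "", quiet = TRUE)
cpu <- as.numeric(fields[grep("Average", fields) + 1])  # value after "Average"
```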


%%%% Shot 1 - Got something working

 extract <- function(filenam = "file.txt"){
   txt <- readLines(filenam)

   ## date of current run:
   ## assumed to be on 2nd line following the first line matching "current_run"
   ii <- 2 + grep("current_run", txt, fixed=TRUE)[1]
   line_current_run <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   date_current_run <- paste(line_current_run[5:8], collapse=" ")

   ## Cpu:
   ## assumed to be on line following the first line matching "Server Summary"
   ## which comes after the first line matching "Engine Utilization ..."
   jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
   ii <- grep("Server Summary", txt, fixed=TRUE)
   ii <- 1 + min(ii[ii > jj])
   line_Cpu <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   Cpu <- line_Cpu[2]

   ## Count:
   ## assumed to be on 2nd line following the first line matching "Transaction Summary"
   ii <- 2 + grep("Transaction Summary", txt, fixed=TRUE)[1]
   line_count <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   count <- line_count[5]

   data.frame(Date=date_current_run, Cpu=Cpu, Count=count,
              stringsAsFactors=FALSE)
 }

 print(extract("file.txt"))

 ##file.list <- dir("./")
 file.list <- rep("file.txt", 3)
 merged <- do.call("rbind", lapply(file.list, extract))
 print(merged)

 file.list <- rep("file.txt", 2000)
 print(system.time(merged <- do.call("rbind", lapply(file.list, extract))))
 ## runs in about 2.5 secs on my laptop

%%% Shot 2: first attempt to extract a (potentially variable) number of device columns

 extractv2 <- function(filenam = "file2.txt"){
   txt <- readLines(filenam)

   ## date of current run:
   ## assumed to be on 2nd line following the first line matching "current_run"
   ii <- 2 + grep("current_run", txt, fixed=TRUE)[1]
   line_current_run <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   date_current_run <- paste(line_current_run[5:8], collapse=" ")

   ## Cpu:
   ## assumed to be on line following the first line matching "Server Summary"
   ## which comes after the first line matching "Engine Utilization ..."
   jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
   ii <- grep("Server Summary", txt, fixed=TRUE)
   ii <- 1 + min(ii[ii > jj])
   line_Cpu <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   Cpu <- line_Cpu[2]

   ## Count:
   ## assumed to be on 2nd line following the first line matching "Transaction Summary"
   ii <- 2 + grep("Transaction Summary", txt, fixed=TRUE)[1]
   line_count <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   count <- line_count[5]

   ## Total I/Os
   ## 1. Each line "Device:" is assumed to be the header of a block of lines
   ##    containing info about a single device (there are 4 such blocks
   ##    in your example);
   ## 2. each block is assumed to contain one or more lines matching
   ##    "Total I/Os";
   ## 3. the relevant count data is assumed to be contained in the last
   ##    of such lines (at column 4), for each block.
   ## Approach: loop on the line numbers of those lines matching "Device:"
   ## to get: A. counts; B. device names
   ii_block_dev <- grep("Device:", txt, fixed=TRUE)
   ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
   nblocks <- length(ii_block_dev)

   ## A. get counts for each device
   ## for each block, select *last* line matching "Total I/Os"
   ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get clean code
   ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
     ## select lines matching "Total I/Os" within each block
     IOs_per_block <- ii_lines_IOs[ii_lines_IOs > ii_block_dev_aux[block] &
                                   ii_lines_IOs < ii_block_dev_aux[block+1]]
     tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
   })
   lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
     Filter(function(v) v != "", strsplit(strng, " ")[[1]])
   })
   IOs_counts <- sapply(lines_IOs, function(v) v[5])

   ## B. get device names:
   ## assumed to be on lines following each "Device:" match
   ii_devices <- 1 + ii_block_dev
   device_names <- sapply(ii_devices, function(ii){
     Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   })

   ## Create a data.frame with "device_names" as column names and "IOs_counts" as
   ## the values of a single row.
   ## Sorting the device names by order() will help produce the same column names
   ## if different sysmon files list the devices in different order
   ord <- order(device_names)
   devices <- as.data.frame(structure(as.list(IOs_counts[ord]),
                                      names=device_names[ord]),
                            check.names=FALSE) ## Prevent R from messing with our device names

   data.frame(stringsAsFactors=FALSE, check.names=FALSE,
              Date=date_current_run, Cpu=Cpu, Count=count, devices)
 }

 print(extractv2("file2.txt"))

 ## WATCH OUT:
 ## merging will ONLY work if all devices have the same names across sysmon files!!
 file.list <- rep("file2.txt", 3)
 merged <- do.call("rbind", lapply(file.list, extractv2))
 print(merged)

%%%%%%%%% Shot 3: extract two tables, one with one row and the second with a variable number of rows (depending on which devices are listed in each sysmon file).

 extractv3 <- function(filenam = "file2.txt"){
   txt <- readLines(filenam)

   ## date of current run:
   ## assumed to be on 2nd line following the first line matching "current_run"
   ii <- 2 + grep("current_run", txt, fixed=TRUE)[1]
   line_current_run <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   date_current_run <- paste(line_current_run[5:8], collapse=" ")

   ## Cpu:
   ## assumed to be on line following the first line matching "Server Summary"
   ## which comes after the first line matching "Engine Utilization ..."
   jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
   ii <- grep("Server Summary", txt, fixed=TRUE)
   ii <- 1 + min(ii[ii > jj])
   line_Cpu <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   Cpu <- line_Cpu[2]

   ## Count:
   ## assumed to be on 2nd line following the first line matching "Transaction Summary"
   ii <- 2 + grep("Transaction Summary", txt, fixed=TRUE)[1]
   line_count <- Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
   count <- line_count[5]

   ## first part of output: fixed three-column structure
   fixed <- data.frame(stringsAsFactors=FALSE,
                       Date=date_current_run, Cpu=Cpu, Count=count)

   ## Total I/Os
   ## 1. Each line "Device:" is assumed to be the header of a block of lines
   ##    containing info about a single device (there are 4 such blocks
   ##    in your example);
   ## 2. each block is assumed to contain one or more lines matching
   ##    "Total I/Os";
   ## 3. the relevant count data is assumed to be contained in the last
   ##    of such lines (at column 4), for each block.
   ## Approach: loop on the line numbers of those lines matching "Device:"
   ## to get: A. counts; B. device names
   ii_block_dev <- grep("Device:", txt, fixed=TRUE)
   if(length(ii_block_dev) == 0){
     variable <- data.frame(stringsAsFactors=FALSE,
                            date_current_run=date_current_run,
                            device_names=NA, IOs_counts=NA)
   } else {
     ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
     nblocks <- length(ii_block_dev)
     if(length(ii_lines_IOs) == 0){
       sprintf("WEIRD datapoint at date %s: I have %d devices but 0 I/O lines??",
               date_current_run, nblocks)
       ##stop()
     }
     ## A. get counts for each device
     ## for each block, select *last* line matching "Total I/Os"
     ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get clean code
     ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
       ## select lines matching "Total I/Os" within each block
       IOs_per_block <- ii_lines_IOs[ii_lines_IOs > ii_block_dev_aux[block] &
                                     ii_lines_IOs < ii_block_dev_aux[block+1]]
       tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
     })
     lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
       Filter(function(v) v != "", strsplit(strng, " ")[[1]])
     })
     IOs_counts <- sapply(lines_IOs, function(v) v[5])

     ## B. get device names:
     ## assumed to be on lines following each "Device:" match
     ii_devices <- 1 + ii_block_dev
     device_names <- sapply(ii_devices, function(ii){
       Filter(function(v) v != "", strsplit(txt[ii], " ")[[1]])
     })

     ## Create a data.frame with three columns: date, device, counts
     variable <- data.frame(stringsAsFactors=FALSE,
                            date_current_run=rep(date_current_run, length(IOs_counts)),
                            device_names=device_names,
                            IOs_counts=IOs_counts)
   }
   list(fixed=fixed, variable=variable)
 }

 print(extractv3("file2.txt"))

 file.list <- c("file.txt", "file2.txt", "file3.txt")
 res <- lapply(file.list, extractv3)
 fixed.merged <- do.call("rbind", lapply(res, function(r) r$fixed))
 print(fixed.merged)
 variable.merged <- do.call("rbind", lapply(res, function(r) r$variable))
 print(variable.merged)

Manipulating text files can sometimes be easier with dedicated programs. For example, gawk is specifically designed to search for patterns in text files and output data from them. We can use a short gawk script to get the required data ready to load into R. Note: each line of the script consists of a search pattern followed by an action enclosed in {}. NR is a counter holding the number of lines read so far.

 BEGIN {OFS = ""; ORS = ""}
 /current_run/         {dat_line = NR+2; cpu_done = 0}
 /Server Summary/      {cpu_line = NR+1}
 /Transaction Summary/ {cnt_line = NR+2}
 NR == dat_line                {print "'",$5," ",$6," ",$7," ",$8,"' "}
 NR == cpu_line && cpu_done==0 {print $2," "; cpu_done = 1}
 NR == cnt_line                {print $5,"\n"}

Save this script as "ext.awk", then extract the data from all files into an R data frame (provided they are all located in the same folder and have the extension .txt) with

 df <- read.table(text=system("gawk -f ext.awk *.txt", T), col.names = c("Date","Cpu","Count")) 

NB: gawk comes pre-installed on most Linux distributions. On Windows you may need to install it from http://gnuwin32.sourceforge.net/packages/gawk.htm


For reading the files, here I assume CSV as the file type. For other types, see http://www.r-tutor.com/r-introduction/data-frame/data-import

 >utilization <- read.csv(file="", head=TRUE)
 >serverSummary <- read.csv(file="", head=TRUE)
 >transcProfile <- read.csv(file="", head=TRUE)

==> merge() accepts only two data frames at a time

 >data <- merge(utilization, serverSummary)
 >dataframe <- merge(data, transcProfile)
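Since merge() is binary, chaining it over many tables can also be written with Reduce(); a toy sketch with made-up data frames sharing an id key:

```r
# Toy data frames sharing an "id" key (names are illustrative).
a <- data.frame(id = 1:2, x = c("a", "b"))
b <- data.frame(id = 1:2, y = c(10, 20))
d <- data.frame(id = 1:2, z = c(TRUE, FALSE))
# Reduce() folds the two-argument merge() over the whole list.
merged_all <- Reduce(merge, list(a, b, d))
```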

You will now have all the columns in the dataframe:

 >dataframe 

Extract columns on demand ==> The subset() function is the easiest way to select variables and observations

 >subset(dataframe, select=c("last_run","Average","Transaction Profile")) 

Now you can write it to CSV or any type of file

 >write.csv(dataframe, file = "MyData.csv") 

To merge all files

 multmerge = function(mypath){
   filenames = list.files(path=mypath, full.names=TRUE)
   datalist = lapply(filenames, function(x){read.csv(file=x, header=T)})
   Reduce(function(x,y) {merge(x,y)}, datalist)
 }

After you run the code to define the function, you are all set to use it. The function takes a path, which should be the name of the folder containing all the files you want to read and merge together, and only those files. With that in mind, I have two tips:

First, before using this function, create a new folder in a short directory (for example, the path could be "C://R//mergeme") and save all of the files you would like to merge in that folder. Also, make sure that the column the merge will be performed on is formatted the same way (and has the same name) in each file. Suppose you saved your 20 files into the mergeme folder at "C://R//mergeme" and you would like to read and merge them. To use my function, you use the following syntax:

 mymergeddata = multmerge("C://R//mergeme") 

After executing this command, you have a fully merged data frame with all of your variables matched to each other.

Now you can subset the data by the required columns.


Use readLines or stringi::stri_read_lines to read the contents of the file as a character vector. The latter is usually faster, but it is less mature and occasionally chokes on unusual content.

 lines <- readLines("the file name") 

For fast regular expression matching, stringi is usually your best bet. rebus.datetimes lets you generate a regular expression from a strptime date format.

Finding the current run date

The line where current_run appears is found with:

 library(stringi)
 library(rebus.datetimes)
 i_current_run <- which(stri_detect_fixed(lines, "current_run"))

To extract the dates, this code only looks at the 2nd line after the one where current_run appears, but the code is vectorized, so you can easily scan all the lines if you have files where this assumption does not hold.

 date_format <- "%b%t%d%t%Y%t%H:%M%p"
 rx_date <- rebus.datetimes::datetime(date_format, io = "input")
 extracted_dates <- stri_extract_all_regex(lines[i_current_run + 2], rx_date)
 current_run_date <- strptime(extracted_dates[[1]][2], date_format, tz = "UTC")
 ## [1] "2016-07-04 07:21:00 UTC"

Finding the user busy %

The "Engine Utilization" section is located via

 i_engine_util <- which(
   stri_detect_fixed(lines, "Engine Utilization (Tick %)")
 )

We need the first instance of the Server Summary that appears after this line.

 n_lines <- length(lines)
 i_server_summary <- i_engine_util + min(which(
   stri_detect_fixed(lines[(i_engine_util + 1):n_lines], "Server Summary")
 ))

Use a regular expression to extract a number from the next line.

 user_busy <- as.numeric(
   stri_extract_first_regex(lines[i_server_summary + 1], "[0-9]+(?:\\.[0-9])")
 )
 ## [1] 4.2

Finding the number of committed xacts

The Committed Xacts line is found with:

 i_comm_xacts <- which(stri_detect_fixed(lines, "Committed Xacts")) 

The count value is a run of digits surrounded by spaces.

 xacts_count <- as.integer(
   stri_extract_all_regex(lines[i_comm_xacts], "(?<= )[0-9]+(?= )")
 )
 ## [1] 41198

Combining Results

 data.frame(
   Date  = current_run_date,
   CPU   = user_busy,
   Count = xacts_count
 )
