The description is quite long, so please bear with me:
I have log files ranging in size from 300 MB to 1.5 GB that need to be filtered by a search key.
The log format looks something like this:
24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,828 [INFO] 567890 (Blah : Blah1) Service-name:: Content( May span multiple lines)
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2) Service-name: Multiple line content. Printing Object[
ID1=fac-adasd
ID2=123231
ID3=123108
Status=Unknown
Code=530007
Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
4 May 2017 17:00:06,831 [INFO] 567890 (Blah : Blah2) Service-name:: Content( May span multiple lines)
Given the search key 123456, I need to get the following:
24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2) Service-name: Multiple line content. Printing Object[
ID1=fac-adasd
ID2=123231
ID3=123108
Status=Unknown
Code=530007
Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
The following awk script does the job, but it is very slow:
gawk '/([0-9]{1}|[0-9]{2})\s\w+\s[0-9]{4}/{n=0}/123456/{n=1} n'
It takes about 8 minutes to search a 1 GB log file, and I need to do this for many such files. To top it all off, I have several such search keys, which makes the whole task practically impossible.
My initial solution was to use multithreading. I created a fixed thread pool (Executors.newFixedThreadPool) and submitted one task per file that needs to be filtered. Inside each task, I spawn a new process from Java that runs the gawk script through bash and writes its output to a file; afterwards I merge all the output files.
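Roughly, the setup looks like the sketch below (the class, method, and file names are only illustrative, and here gawk is launched directly via ProcessBuilder rather than through bash):

import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GawkFilterRunner {

    // Run the gawk filter on one log file and redirect its output to a per-file result.
    static void filterWithGawk(File logFile, String key, File outFile) throws Exception {
        String script = "/([0-9]{1}|[0-9]{2})\\s\\w+\\s[0-9]{4}/{n=0} /" + key + "/{n=1} n";
        ProcessBuilder pb = new ProcessBuilder("gawk", script, logFile.getAbsolutePath());
        pb.redirectErrorStream(true);
        pb.redirectOutput(outFile);
        pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        String key = "123456";
        List<File> logFiles = Arrays.asList(new File("a.log"), new File("b.log")); // illustrative
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (File f : logFiles) {
            pool.submit(() -> {
                try {
                    filterWithGawk(f, key, new File(f.getName() + "." + key + ".out"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        // ...after all tasks finish, the per-file outputs are concatenated into one result
    }
}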
Although this may seem counterintuitive, since the filtering is I/O-bound rather than CPU-bound, it did give me a speedup compared to running the script on each file sequentially.
But this is still not enough: the whole run takes 2 hours for a single search key over 27 GB of log files. On average I have 4 such search keys, and I need to collect all of their results and merge them.
My method is inefficient because:
A) It reads each log file several times when multiple search keys are given, causing even more I/O.
B) It incurs the overhead of spawning a process inside each thread.
A simple solution to both would be to move away from awk and do all of it in Java, using some regular-expression library. The question is: which regex library can give me the desired result?
With awk I have the /filter/{action} construct, which lets me capture a range of lines spanning multiple records (as shown above). How can I do the same in Java?
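What I have in mind is something along the lines of the sketch below: read each file once, treat every line matching the timestamp pattern as the start of a new record, and keep printing lines while the current record contains the key, which is the same state-machine logic as the awk one-liner. The class and file names are only illustrative:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class LogKeyFilter {

    // A line starting a new log record, e.g. "24 May 2017 17:00:06,827 ..."
    private static final Pattern RECORD_START =
            Pattern.compile("^([0-9]{1,2})\\s\\w+\\s[0-9]{4}\\s");

    // Stream the file once and emit every record that mentions the key.
    static void filter(Path logFile, String key, Path outFile) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(logFile, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            boolean keep = false;                      // plays the role of awk's variable n
            String line;
            while ((line = in.readLine()) != null) {
                if (RECORD_START.matcher(line).find()) {
                    keep = false;                      // new record: stop printing by default
                }
                if (line.contains(key)) {
                    keep = true;                       // key seen: print until the next record
                }
                if (keep) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        filter(Paths.get("service.log"), "123456", Paths.get("service.123456.out"));
    }
}

The same loop could check all four keys in a single pass and write to one output per key, so each file would be read only once no matter how many keys there are, which would also address point A above.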
I am open to all kinds of suggestions. An extreme option, for example, would be to store the log files on a shared file system such as S3 and process them with several machines.
I am new to Stack Overflow and I don't even know whether I can post this here. But I've been working on this for the past week, and I need someone with experience to help me with it. Thanks in advance.