The description is quite long, so please bear with me:
I have log files ranging in size from 300 MB to 1.5 GB that need to be filtered by a search key.
The log format looks something like this:
24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,828 [INFO] 567890 (Blah : Blah1) Service-name:: Content( May span multiple lines)
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2) Service-name: Multiple line content. Printing Object[
ID1=fac-adasd
ID2=123231
ID3=123108
Status=Unknown
Code=530007
Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
4 May 2017 17:00:06,831 [INFO] 567890 (Blah : Blah2) Service-name:: Content( May span multiple lines)
Given the search key 123456, I need to get the following:
24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2) Service-name: Multiple line content. Printing Object[
ID1=fac-adasd
ID2=123231
ID3=123108
Status=Unknown
Code=530007
Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
The following awk script does the job, but it is very slow:
gawk '/([0-9]{1}|[0-9]{2})\s\w+\s[0-9]{4}/{n=0}/123456/{n=1} n'
It takes about 8 minutes to search a 1 GB log file, and I need to do this for many such files. To top it all off, I have several such search keys, which makes the whole task practically impossible.
My initial solution was to use multithreading. I created a fixed thread pool (Executors.newFixedThreadPool) and submitted one task per file that needs to be filtered. Inside each task, I spawn a new process from Java that runs the gawk script through bash and writes its output to a file; afterwards I merge all the output files.
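Roughly, the setup looks like the sketch below (the class, method, and file names are only illustrative, and here gawk is launched directly via ProcessBuilder rather than through bash):

import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GawkFilterRunner {

    // Run the gawk filter on one log file and redirect its output to a per-file result.
    static void filterWithGawk(File logFile, String key, File outFile) throws Exception {
        String script = "/([0-9]{1}|[0-9]{2})\\s\\w+\\s[0-9]{4}/{n=0} /" + key + "/{n=1} n";
        ProcessBuilder pb = new ProcessBuilder("gawk", script, logFile.getAbsolutePath());
        pb.redirectErrorStream(true);
        pb.redirectOutput(outFile);
        pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        String key = "123456";
        List<File> logFiles = Arrays.asList(new File("a.log"), new File("b.log")); // illustrative
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (File f : logFiles) {
            pool.submit(() -> {
                try {
                    filterWithGawk(f, key, new File(f.getName() + "." + key + ".out"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        // ...after all tasks finish, the per-file outputs are concatenated into one result
    }
}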
Although this may seem counterintuitive, since the filtering is I/O-bound rather than CPU-bound, it did give me a speedup compared to running the script on each file sequentially.
But this is still not enough: the whole run takes 2 hours for a single search key over 27 GB of log files. On average I have 4 such search keys, and I need to collect all of their results and merge them.
My method is inefficient because:
A) It reads each log file several times when multiple search keys are given, causing even more I/O.
B) It incurs the overhead of spawning a process inside each thread.
A simple solution to both would be to move away from awk and do all of it in Java, using some regular-expression library. The question is: which regex library can give me the desired result?
With awk I have the /filter/{action} construct, which lets me capture a range of lines spanning multiple records (as shown above). How can I do the same in Java?
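What I have in mind is something along the lines of the sketch below: read each file once, treat every line matching the timestamp pattern as the start of a new record, and keep printing lines while the current record contains the key, which is the same state-machine logic as the awk one-liner. The class and file names are only illustrative:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class LogKeyFilter {

    // A line starting a new log record, e.g. "24 May 2017 17:00:06,827 ..."
    private static final Pattern RECORD_START =
            Pattern.compile("^([0-9]{1,2})\\s\\w+\\s[0-9]{4}\\s");

    // Stream the file once and emit every record that mentions the key.
    static void filter(Path logFile, String key, Path outFile) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(logFile, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            boolean keep = false;                      // plays the role of awk's variable n
            String line;
            while ((line = in.readLine()) != null) {
                if (RECORD_START.matcher(line).find()) {
                    keep = false;                      // new record: stop printing by default
                }
                if (line.contains(key)) {
                    keep = true;                       // key seen: print until the next record
                }
                if (keep) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        filter(Paths.get("service.log"), "123456", Paths.get("service.123456.out"));
    }
}

The same loop could check all four keys in a single pass and write to one output per key, so each file would be read only once no matter how many keys there are, which would also address point A above.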
I am open to all kinds of suggestions. An extreme option, for example, would be to store the log files on a shared file system such as S3 and process them with several machines.
I am new to Stack Overflow and I don't even know whether I can post this here. But I've been working on this for the past week, and I need someone with experience to help me with it. Thanks in advance.