How to zgrep the last line of a gz file without reading the whole file

Here is my problem: I have a set of large gz log files, and each line begins with a timestamp in datetime format, for example: 2014-03-20 05:32:00.

I need to check which set of log files contains certain data. For the beginning of the data, I just do:

 zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz 

But how can I do the same with the last line, without processing the whole file the way zcat would (too heavy):

 zcat foo.gz | tail -1 

Additional information: these logs are created with the timestamp of the first recorded data, so if I want to look up logs at 14:00:00, I also have to search files created before 14:00:00, since a file may be opened at 13:50:00 and closed at 14:10:00.
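
A minimal sketch of that file-selection idea (the filenames, pattern, and hour range below are illustrative, not from the original question): search both the files named for the target hour and those named for the previous interval.

 # illustrative only: to find 14:00:00 entries, also scan files whose names
 # start before 14:00, since such a file may have stayed open past 14:00
 zgrep -m 1 '2014-03-20 14:00' 20140320-1{3,4}*gz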

bash shell grep logging
1 answer

The easiest solution would be to change the log rotation to create small files.

The next simplest solution is to use a compression tool that supports random access.

Projects such as dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which allow a program aware of that extra information to seek within the file. While it exists in the standard, vanilla gzip does not add such markers, either by default or by option.

Files compressed by these random-access-friendly utilities are slightly larger (by 2-20%) due to the markers themselves, but they fully support decompression with gzip or another utility that is unaware of the markers.
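
As a small illustration of that compatibility (assuming htslib's bgzip is installed; it is one of the BGZF tools, though this example is not from the original answer): a BGZF file is still a valid gzip stream, so plain gzip tools can read it even though only BGZF-aware tools can seek in it.

 # assumes htslib's bgzip is available; output remains readable by plain gzip
 bgzip -c big.log > big.log.gz     # gzip-compatible file with BGZF sync points
 zcat big.log.gz | tail -n1        # ordinary gzip tools still decompress it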

You can learn more at this question about random access in various compression formats.

There is also Peter Cock's Blasted Bioinformatics blog, which has several posts on the subject.


xz experiments

xz (the LZMA format) actually has random-access support at the block level, but with the default settings you only get a single block.

File creation

xz can concatenate multiple archives together, in which case each archive has its own block. GNU split can do this easily:

 split -b 50M --filter 'xz -c' big.log > big.log.sp.xz 

This tells split to break big.log into 50 MB pieces (before compression) and run each through xz -c, which writes the compressed piece to standard output. We then collect that standard output into a single file named big.log.sp.xz.
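
As a quick sanity check (not part of the original answer), xz --list can confirm that the concatenated file really contains one block per 50 MB piece, whereas a plain single-invocation xz file would show only one block:

 xz --list big.log.sp.xz    # the Blocks column should match the number of 50M pieces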

To do this without GNU, you need a loop:

 split -b 50M big.log big.log-part
 for p in big.log-part*; do xz -c $p; done > big.log.sp.xz
 rm big.log-part*

Parsing

You can get a list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes of overhead (found by comparing the size against hd big.log.sp0.xz |grep 7zXZ). Extract that block with tail -c and pipe it through xz. Since the question above asks for the last line of the file, I then pipe it through tail -n1:

 SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
 tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
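
For repeated use, the same two steps can be wrapped in a small shell function (the name xz_last_line and the packaging are my addition, not from the original answer; the 36-byte overhead assumption is unchanged):

 # hypothetical helper; same logic as above, parameterized on the file name
 xz_last_line() {
   local f=$1
   local size
   size=$(xz --verbose --list "$f" | awk 'END { print $5 + 36 }')
   tail -c "$size" "$f" | unxz -c | tail -n1
 }
 xz_last_line big.log.sp.xz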

Side note

Version 5.1.1 introduced support for the --block-size flag:

 xz --block-size=50M big.log 

However, I was not able to extract a specific block from it, since it does not include full headers between blocks. I suspect this is nontrivial to do from the command line.

gzip experiments

gzip also supports concatenation. I (briefly) tried mimicking this process for gzip, without luck. gzip --verbose --list does not give enough information, and it appears the headers are too variable to find.

Doing this would require adding sync flush points, and since their size varies with the size of the last buffer in the previous compression, it is too hard to do on the command line (use dictzip or another of the previously discussed tools).

I did apt-get install dictzip and played with it, but only a little. It does not work without arguments, and it creates a (massive!) .dz archive that neither dictunzip nor gunzip could understand.

bzip2 experiments

bzip2 has headers we can find. This is still a bit messy, but it works.

Creation

This is similar to the xz procedure above:

 split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2 

Note that this is significantly slower than xz (48 minutes for bzip2 versus 17 minutes for xz versus 1 minute for xz -0), and also significantly larger (97M for bzip2 versus 25M for xz -0 versus 15M for xz), at least for my test log file.

Parsing

This is a bit more involved because we do not have a nice index. We have to guess where to go, and we have to err on the side of scanning too much, but with a massive file we would still save I/O.

My guess for this test was 50,000,000 (out of the original 52,428,800), a pessimistic guess that is not pessimistic enough for, say, an H.264 movie.

 GUESS=50000000
 LAST=$(tail -c$GUESS big.log.sp.bz2 \
   |grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
 tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1

This just takes the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off the end of the file. Only that part is decompressed and piped into tail.

Because this has to query the compressed file twice and do an extra scan (the grep call seeking the header examines the whole guessed space), this is a suboptimal solution. See also the section below on how slow bzip2 really is.
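
One practical caveat with the guess-based approach (my addition, not from the original answer): if the guess is smaller than the last compressed block, grep finds no header and the pipeline silently misbehaves. A minimal sketch of the same steps with a basic sanity check:

 # same steps as above, but fail loudly if no bzip2 block header is found
 GUESS=50000000
 OFFSET=$(tail -c$GUESS big.log.sp.bz2 | grep -abo 'BZh91AY&SY' | awk -F: 'END { print $1 }')
 if [ -z "$OFFSET" ]; then
   echo "guess too small: no block header in the last $GUESS bytes" >&2
 else
   tail -c $((GUESS - OFFSET)) big.log.sp.bz2 | bunzip2 -c | tail -n1
 fi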

Outlook

Given how fast xz is, it is easily the best bet; even its fastest option (xz -0) is reasonably quick to compress and decompress, and it creates a smaller file than gzip or bzip2 on the log file I tested with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.

                     ----- No Random Access -----    ------ Random Access ------
 FORMAT       SIZE    RATIO    WRITE    READ          SIZE    RATIO    WRITE   SEEK
 -----------------------------------------------------------------------------------
 (original)   7211M   1.0000     -      0:06          7211M   1.0000     -     0:00
 bzip2          96M   0.0133   48:31    3:15            97M   0.0134   47:39   0:00
 gzip           79M   0.0109    0:59    0:22
 dictzip       605M   0.0839    1:36   (fail)
 xz -0          25M   0.0034    1:14    0:12            25M   0.0035    1:08   0:00
 xz             14M   0.0019   16:32    0:11            14M   0.0020   16:44   0:00

The timing tests were not comprehensive: I did not average anything and disk caching was in use. Still, they look about right; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it lets an otherwise non-multithreaded utility consume multiple threads).
