The fastest way to print a specific part of a file using bash commands

I am currently using sed to print the required part of the file. For example, I used the following command

sed -n 89001,89009p file.xyz 

However, this is rather slow as the file size increases (my file is currently 6.8 GB). I tried to follow this link and used the command

 sed -n '89001,89009{p;q}' file.xyz 

But this command only prints the 89001st line. Please help me.

+5
6 answers

The syntax is slightly different:

 sed -n '89001,89009p;89009q' file.xyz 
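A quick sanity check of this syntax on a small numbered file generated with seq (the file path /tmp/nums.txt is just an example stand-in for the real 6.8 GB file):

```shell
# Fabricate a small numbered test file
seq 1 100 > /tmp/nums.txt

# Print lines 5-9, then quit as soon as line 9 has been handled,
# so sed never reads past the end of the range
sed -n '5,9p;9q' /tmp/nums.txt
```

This prints the numbers 5 through 9, one per line, and stops.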

UPDATE:

Since there is an answer using awk , I ran a little comparison, and as I expected sed is slightly faster:

 $ wc -l large-file
 100000000 large-file
 $ du -h large-file
 954M large-file
 $ time sed -n '890000,890010p;890010q' large-file > /dev/null

 real    0m0.141s
 user    0m0.068s
 sys     0m0.000s

 $ time awk 'NR>=890000{print} NR==890010{exit}' large-file > /dev/null

 real    0m0.433s
 user    0m0.208s
 sys     0m0.008s

UPDATE2:

There is a faster awk variant, suggested by @EdMorton, but it is still not as fast as sed :

 $ time awk 'NR>=890000{print; if (NR==890010) exit}' large-file > /dev/null

 real    0m0.252s
 user    0m0.172s
 sys     0m0.008s

UPDATE3:

The fastest way of all turns out to be a combination of head and tail :

 $ time head -890010 large-file | tail -10 > /dev/null

 real    0m0.085s
 user    0m0.024s
 sys     0m0.016s
+7
 awk 'NR>=89001{print; if (NR==89009) exit}' file.xyz 
+4

This is easier to read in awk; performance should be similar to sed :

 awk 'NR>=89001{print} NR==89009{exit}' file.xyz 

You can replace {print} with a bare semicolon, since printing is awk's default action for a pattern with no action block.
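Using that default action, the command above can be shortened as follows (a sketch against a small seq-generated file; the path /tmp/nums.txt is illustrative):

```shell
seq 1 100 > /tmp/nums.txt

# A bare pattern with no action block prints the matching line by default,
# so 'NR>=5;' is equivalent to 'NR>=5{print}'
awk 'NR>=5; NR==9{exit}' /tmp/nums.txt
```

Both forms print lines 5 through 9 and then exit.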

+2

Dawid Grabowski's helpful answer is the way to go with sed [1] ; Ed Morton's helpful answer is a viable awk alternative; and the tail + head combination will usually be the fastest [2] .

As for why your approach didn't work:

A two-address expression such as 89001,89009 selects an inclusive range of lines, bounded by the start and end addresses (line numbers in this case).

The associated function list {p;q;} is then executed for each line in the selected range.

Thus, line # 89001 is the first line that causes the function list to run: right after that line is printed ( p ), the q function executes - which terminates processing immediately, without reading any further lines.

To prevent this premature termination, Dawid's answer separates printing ( p ) all lines in the range from quitting ( q ), using two commands separated by ; :

  • 89001,89009p prints all lines in the range
  • 89009q quits when the end of the range is reached.
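The difference is easy to reproduce on a small file (numbers fabricated with seq as a stand-in for file.xyz):

```shell
seq 1 20 > /tmp/f.txt

# q inside the range's function list fires on the FIRST line of the range,
# so only one line comes out
sed -n '5,9{p;q}' /tmp/f.txt     # prints only line 5

# q attached to the end address alone fires only on line 9,
# so the whole range is printed first
sed -n '5,9p;9q' /tmp/f.txt      # prints lines 5 through 9
```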

[1] A slightly less repetitive reformulation that should work equally well ( $ represents the last line, which is never reached due to the second command):
sed -n '89001,$ p; 89009 q'

[2] A less repetitive reformulation of the head + tail solution from Dawid's answer is tail -n +89001 file | head -n 9 . While it limits how many bytes of no interest must be read, the data is still sent through the pipe in pipe-buffer-sized chunks (a typical pipe-buffer size is 64 KB). With GNU utilities (Linux) this is the fastest solution, but on OSX with the stock (BSD) utilities, sed is the fastest.
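Applied to a small seq-generated stand-in file, the reformulation looks like this:

```shell
seq 1 100 > /tmp/nums.txt

# tail -n +N outputs from line N onward; head -n 5 then keeps
# only the first 5 of those lines, i.e. lines 5 through 9
tail -n +5 /tmp/nums.txt | head -n 5
```

Note the difference in the tail argument: +N means "starting at line N", while a plain N means "the last N lines".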

+2

Another way to do this would be to use a combination of head and tail :

 $ time head -890010 large-file | tail -10 > /dev/null

 real    0m0.085s
 user    0m0.024s
 sys     0m0.016s

It is faster than both sed and awk .

0

sed must scan from the beginning of the file to find the Nth line, which is what makes it slow for large N. To speed things up, split the large file into fixed line intervals and record their byte offsets in an index file. Then use dd to skip past the early parts of the large file before handing the rest to sed .

Create an index file using:

 #!/bin/bash
 INTERVAL=1000
 LARGE_FILE="big-many-GB-file"
 INDEX_FILE="index"
 LASTSTONE=123
 MILESTONE=0

 # Record the byte offset of every INTERVAL-th line in the index file,
 # stopping once the offset no longer advances (end of file)
 echo $MILESTONE > $INDEX_FILE
 while [ $MILESTONE != $LASTSTONE ]; do
     LASTSTONE=$MILESTONE
     MILESTONE=$(dd if="$LARGE_FILE" bs=1 skip=$LASTSTONE 2>/dev/null | head -n$INTERVAL | wc -c)
     MILESTONE=$(($LASTSTONE+$MILESTONE))
     echo $MILESTONE >> $INDEX_FILE
 done
 exit

Then find the line using: ./this_script.sh 89001

 #!/bin/bash
 INTERVAL=1000
 LARGE_FILE="big-many-GB-file"
 INDEX_FILE="index"

 # Zero-based line number requested
 LN=$(($1-1))
 # Byte offset of the start of the interval containing the line
 OFFSET=$(head -n$((1+($LN/$INTERVAL))) $INDEX_FILE | tail -n1)
 # Line number relative to the start of that interval (1-based)
 LN=$(($LN-(($LN/$INTERVAL)*$INTERVAL)))
 LN=$(($LN+1))
 dd if="$LARGE_FILE" bs=1 skip=$OFFSET 2>/dev/null | sed -n "$LN"p
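The core trick in both scripts can be seen in isolation: compute the byte offset of a known line once, then dd straight to it so sed only has to count the remaining lines (a sketch with a small seq-generated file; the path /tmp/bigfile is illustrative):

```shell
seq 1 1000 > /tmp/bigfile

# Byte count of lines 1-100 = byte offset where line 101 starts
# ($((...)) strips any whitespace padding from wc's output)
OFFSET=$(($(head -n 100 /tmp/bigfile | wc -c)))

# Skip the first 100 lines at the byte level, then print the
# first remaining line - sed never sees the skipped bytes
dd if=/tmp/bigfile bs=1 skip="$OFFSET" 2>/dev/null | sed -n '1p'   # prints 101
```

With a prebuilt index, the offset lookup replaces the head | wc step, so the cost of locating a line no longer grows with its position in the file.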
-2
