The fastest way to print a specific part of a file using bash commands

I am currently using sed to print the required part of the file. For example, I used the following command

sed -n 89001,89009p file.xyz 

However, this is rather slow as the file size increases (my file is currently 6.8 GB). I tried to follow this link and used the command

 sed -n '89001,89009{p;q}' file.xyz 

But this command only prints the 89001st line. Please help me.

+5
6 answers

The syntax is slightly different:

 sed -n '89001,89009p;89009q' file.xyz 
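A quick sanity check of this syntax on a small numbered file generated with seq (the file path /tmp/nums.txt is just an example stand-in for the real 6.8 GB file):

```shell
# Fabricate a small numbered test file
seq 1 100 > /tmp/nums.txt

# Print lines 5-9, then quit as soon as line 9 has been handled,
# so sed never reads past the end of the range
sed -n '5,9p;9q' /tmp/nums.txt
```

This prints the numbers 5 through 9, one per line, and stops.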

UPDATE:

Since there is an answer using awk , I ran a little comparison, and as I expected sed is slightly faster:

 $ wc -l large-file
 100000000 large-file
 $ du -h large-file
 954M large-file
 $ time sed -n '890000,890010p;890010q' large-file > /dev/null

 real    0m0.141s
 user    0m0.068s
 sys     0m0.000s

 $ time awk 'NR>=890000{print} NR==890010{exit}' large-file > /dev/null

 real    0m0.433s
 user    0m0.208s
 sys     0m0.008s

UPDATE2:

There is a faster awk variant, suggested by @EdMorton, but it is still not as fast as sed :

 $ time awk 'NR>=890000{print; if (NR==890010) exit}' large-file > /dev/null

 real    0m0.252s
 user    0m0.172s
 sys     0m0.008s

UPDATE3:

The fastest way of all turns out to be a combination of head and tail :

 $ time head -890010 large-file | tail -10 > /dev/null

 real    0m0.085s
 user    0m0.024s
 sys     0m0.016s
+7
 awk 'NR>=89001{print; if (NR==89009) exit}' file.xyz 
+4

This is easier to read in awk; performance should be similar to sed :

 awk 'NR>=89001{print} NR==89009{exit}' file.xyz 

You can replace {print} with a bare semicolon, since printing is awk's default action for a pattern with no action block.
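Using that default action, the command above can be shortened as follows (a sketch against a small seq-generated file; the path /tmp/nums.txt is illustrative):

```shell
seq 1 100 > /tmp/nums.txt

# A bare pattern with no action block prints the matching line by default,
# so 'NR>=5;' is equivalent to 'NR>=5{print}'
awk 'NR>=5; NR==9{exit}' /tmp/nums.txt
```

Both forms print lines 5 through 9 and then exit.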

+2

Dawid Grabowski's helpful answer is the way to go with sed [1] ; Ed Morton's helpful answer is a viable awk alternative; and the tail + head combination will usually be the fastest [2] .

As for why your approach didn't work:

A two-address expression such as 89001,89009 selects an inclusive range of lines, bounded by the start and end addresses (line numbers in this case).

The associated function list {p;q;} is then executed for each line in the selected range.

Thus, line # 89001 is the first line that causes the function list to run: right after that line is printed ( p ), the q function executes - which terminates processing immediately, without reading any further lines.

To prevent this premature termination, Dawid's answer separates printing ( p ) all lines in the range from quitting ( q ), using two commands separated by ; :

  • 89001,89009p prints all lines in the range
  • 89009q quits when the end of the range is reached.
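The difference is easy to reproduce on a small file (numbers fabricated with seq as a stand-in for file.xyz):

```shell
seq 1 20 > /tmp/f.txt

# q inside the range's function list fires on the FIRST line of the range,
# so only one line comes out
sed -n '5,9{p;q}' /tmp/f.txt     # prints only line 5

# q attached to the end address alone fires only on line 9,
# so the whole range is printed first
sed -n '5,9p;9q' /tmp/f.txt      # prints lines 5 through 9
```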

[1] A slightly less repetitive reformulation that should work equally well ( $ represents the last line, which is never reached due to the second command):
sed -n '89001,$ p; 89009 q'

[2] A less repetitive reformulation of the head + tail solution from Dawid's answer is tail -n +89001 file | head -n 9 . While it limits how many bytes of no interest must be read, the data is still sent through the pipe in pipe-buffer-sized chunks (a typical pipe-buffer size is 64 KB). With GNU utilities (Linux) this is the fastest solution, but on OSX with the stock (BSD) utilities, sed is the fastest.
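Applied to a small seq-generated stand-in file, the reformulation looks like this:

```shell
seq 1 100 > /tmp/nums.txt

# tail -n +N outputs from line N onward; head -n 5 then keeps
# only the first 5 of those lines, i.e. lines 5 through 9
tail -n +5 /tmp/nums.txt | head -n 5
```

Note the difference in the tail argument: +N means "starting at line N", while a plain N means "the last N lines".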

+2

Another way to do this would be to use a combination of head and tail :

 $ time head -890010 large-file | tail -10 > /dev/null

 real    0m0.085s
 user    0m0.024s
 sys     0m0.016s

It is faster than both sed and awk .

0

sed must scan from the beginning of the file to find the Nth line, which is what makes it slow for large N. To speed things up, split the large file into fixed line intervals and record their byte offsets in an index file. Then use dd to skip past the early parts of the large file before handing the rest to sed .

Create an index file using:

 #!/bin/bash
 INTERVAL=1000
 LARGE_FILE="big-many-GB-file"
 INDEX_FILE="index"
 LASTSTONE=123
 MILESTONE=0

 # Record the byte offset of every INTERVAL-th line in the index file,
 # stopping once the offset no longer advances (end of file)
 echo $MILESTONE > $INDEX_FILE
 while [ $MILESTONE != $LASTSTONE ]; do
     LASTSTONE=$MILESTONE
     MILESTONE=$(dd if="$LARGE_FILE" bs=1 skip=$LASTSTONE 2>/dev/null | head -n$INTERVAL | wc -c)
     MILESTONE=$(($LASTSTONE+$MILESTONE))
     echo $MILESTONE >> $INDEX_FILE
 done
 exit

Then find the line using: ./this_script.sh 89001

 #!/bin/bash
 INTERVAL=1000
 LARGE_FILE="big-many-GB-file"
 INDEX_FILE="index"

 # Zero-based line number requested
 LN=$(($1-1))
 # Byte offset of the start of the interval containing the line
 OFFSET=$(head -n$((1+($LN/$INTERVAL))) $INDEX_FILE | tail -n1)
 # Line number relative to the start of that interval (1-based)
 LN=$(($LN-(($LN/$INTERVAL)*$INTERVAL)))
 LN=$(($LN+1))
 dd if="$LARGE_FILE" bs=1 skip=$OFFSET 2>/dev/null | sed -n "$LN"p
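The core trick in both scripts can be seen in isolation: compute the byte offset of a known line once, then dd straight to it so sed only has to count the remaining lines (a sketch with a small seq-generated file; the path /tmp/bigfile is illustrative):

```shell
seq 1 1000 > /tmp/bigfile

# Byte count of lines 1-100 = byte offset where line 101 starts
# ($((...)) strips any whitespace padding from wc's output)
OFFSET=$(($(head -n 100 /tmp/bigfile | wc -c)))

# Skip the first 100 lines at the byte level, then print the
# first remaining line - sed never sees the skipped bytes
dd if=/tmp/bigfile bs=1 skip="$OFFSET" 2>/dev/null | sed -n '1p'   # prints 101
```

With a prebuilt index, the offset lookup replaces the head | wc step, so the cost of locating a line no longer grows with its position in the file.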
-2
