How to transfer the contents of a large tar.gz file to STDOUT?

I have a large.tar.gz file containing about 1 million files, roughly a quarter of which are HTML files, and I want to parse several lines from each of those HTML files.

I would rather not extract the contents of large.tar.gz to a folder and then parse the HTML files. Instead, I would like to know how to send the contents of the HTML files in large.tar.gz directly to STDOUT so that I can grep / parse the information I want from them.

I suppose there should be some kind of magic like:

 tar -special_flags large.tar.gz | grep_only_files_with_extension html | xargs -n1 head -n 99999 | ./parse_contents.pl - 

Any ideas?

1 answer

With GNU tar, use this to extract the matching files from the .tar.gz to stdout:

 tar -xOzf large.tar.gz --wildcards '*.html' | grep ... 

-O, --to-stdout : extract files to standard output
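
One thing to keep in mind is that -O writes every matching member into a single concatenated stream, so file boundaries are lost. If you need to handle each HTML file separately, GNU tar's --to-command option may be a better fit: it runs a command once per member, feeds the member's contents on stdin, and exports the member name in TAR_FILENAME. Below is a minimal sketch of both approaches, assuming GNU tar. The archive demo.tar.gz and the files in it are made-up stand-ins for large.tar.gz, and the sed call stands in for your own parser.

```shell
#!/bin/sh
# Sketch only: assumes GNU tar; demo.tar.gz and its members are hypothetical.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Build a tiny archive with two HTML files and one non-HTML file.
printf '<html>\n<title>A</title>\n</html>\n' > a.html
printf '<html>\n<title>B</title>\n</html>\n' > b.html
printf 'not html\n' > notes.txt
tar -czf demo.tar.gz a.html b.html notes.txt

# 1) Stream all *.html members to stdout as one concatenated stream
#    and count the title lines; nothing is extracted to disk.
titles=$(tar -xOzf demo.tar.gz --wildcards '*.html' | grep -c '<title>')
echo "title lines found: $titles"

# 2) Run a command once per matching member. The member's contents
#    arrive on stdin and its name is available as $TAR_FILENAME.
#    Here the command prints a header and the second line of each file.
per_file=$(tar -xzf demo.tar.gz --wildcards '*.html' \
  --to-command='echo "== $TAR_FILENAME"; sed -n 2p')
echo "$per_file"
```

The per-member command here reads its whole input (sed rather than head) so that tar never writes into a closed pipe. For a million-member archive, starting one process per file has a noticeable cost, so the single-stream form with -O is likely faster when file boundaries do not matter.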
