How to transfer the contents of a large tar.gz file to STDOUT?

Question

I have a file, large.tar.gz, containing about 1 million files, about a quarter of which are HTML files, and I want to parse several lines of each HTML file inside.

I would rather not extract the contents of large.tar.gz to a folder and then parse the HTML files. Instead, I would like to know how to send the contents of the HTML files in large.tar.gz directly to STDOUT, so that I can grep / parse the information I want from them.

I suppose there should be some kind of magic like:

 tar -special_flags large.tar.gz | grep_only_files_with_extension html | xargs -n1 head -n 99999 | ./parse_contents.pl -

Any ideas?

+6

bash

719016 Dec 9 '15 at 10:45

1 answer

Cyrus · Accepted Answer · 2015-12-09T10:50:00+0000

With GNU tar, use this to write the contents of the matching files in the .tar.gz to stdout:

 tar -xOzf large.tar.gz --wildcards '*.html' | grep ...

-O, --to-stdout : extract files to standard output
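As a rough sketch of how this could be tried on a small scale: the script below builds a tiny stand-in for large.tar.gz (the site/ directory and its file names are made up for illustration) and then reads it in two ways. The first is the -O pipeline from above, which concatenates all matching members into one stream. The second assumes GNU tar's --to-command option, which runs a command once per extracted member and exports the member name as TAR_FILENAME; that may be useful when only the first few lines of each file are wanted, since -O alone does not mark where one file ends and the next begins.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a tiny stand-in for large.tar.gz (hypothetical sample files).
workdir=$(mktemp -d)
mkdir "$workdir/site"
printf '<html>\n<title>A</title>\n</html>\n' > "$workdir/site/a.html"
printf '<html>\n<title>B</title>\n</html>\n' > "$workdir/site/b.html"
printf 'not html\n' > "$workdir/site/notes.txt"
# Members are listed explicitly so their order in the archive is fixed.
tar -czf "$workdir/large.tar.gz" -C "$workdir" \
    site/a.html site/b.html site/notes.txt

# 1) Concatenate every *.html member to stdout and grep the combined stream.
titles=$(tar -xOzf "$workdir/large.tar.gz" --wildcards '*.html' | grep '<title>')
echo "$titles"

# 2) Run a command once per member. The single quotes keep $TAR_FILENAME
#    unexpanded here, so the shell started by tar expands it per member.
#    sed prints only the first two lines of each file.
per_file=$(tar -xzf "$workdir/large.tar.gz" --wildcards '*.html' \
    --to-command='echo "== $TAR_FILENAME"; sed -n 1,2p')
echo "$per_file"

rm -rf "$workdir"
```

Both forms read the archive sequentially and write nothing to disk. The --to-command form starts one process per member, so on a million-file archive it will likely be noticeably slower than the plain -O pipeline.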
