I have a file, large.tar.gz, containing about 1 million files, about a quarter of which are HTML files, and I want to parse several lines of each HTML file inside it.
I would rather not extract the contents of large.tar.gz to a folder and then parse the HTML files. Instead, I would like to know how to stream the contents of the HTML files in large.tar.gz directly to STDOUT so that I can grep/parse the information I want from them.
I suppose there should be some kind of magic like:
tar -special_flags large.tar.gz | grep_only_files_with_extension html | xargs -n1 head -n 99999 | ./parse_contents.pl -
Any ideas?
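For reference, here is a minimal sketch of the kind of thing I am after, assuming GNU tar (its `--wildcards` and `-O`/`--to-stdout` options). The demo archive and the file names in it are made up for illustration and only stand in for large.tar.gz:

```shell
# Build a tiny demo archive (hypothetical file names) standing in for large.tar.gz.
workdir=$(mktemp -d)
mkdir -p "$workdir/site"
printf '<html>one</html>\n' > "$workdir/site/a.html"
printf '<html>two</html>\n' > "$workdir/site/b.html"
printf 'not html\n' > "$workdir/site/notes.txt"
tar -czf "$workdir/demo.tar.gz" -C "$workdir" site

# Extract only members matching *.html and write their contents to stdout
# (-O) instead of to disk; members that do not match are skipped.
html_stream=$(tar -xzf "$workdir/demo.tar.gz" --wildcards '*.html' -O)
printf '%s\n' "$html_stream"
```

With the real archive this would presumably become `tar -xzf large.tar.gz --wildcards '*.html' -O | ./parse_contents.pl -`. One caveat: the files arrive concatenated on stdout, so the boundaries between individual HTML files are lost unless the parser can detect them itself.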