Efficiently extract a single file from a .tar archive

Question

I have a 2 GB .tgz file.

I want to extract just one 2 KB .txt file from it.

I have the following code:

    import tarfile
    from contextlib import closing

    with closing(tarfile.open("myfile.tgz")) as tar:
        subdir_and_files = [
            tarinfo for tarinfo in tar.getmembers()
            if tarinfo.name.startswith("myfile/first/second/text.txt")
        ]
        print subdir_and_files
        tar.extractall(members=subdir_and_files)

The problem is that it takes at least one minute to retrieve the extracted file. It seems that extractall extracts the whole archive but saves only the file I asked for.

Is there a more efficient way to achieve this?

python

Middleware Apr 08 '15 at 6:55

1 answer

Yaroslav Rakhmatullin · Answer 1 · 2017-09-13T16:51:39+0000

No.

The tar format is not suited to quickly extracting single files. In most cases this is made worse because the tar file is wrapped in a compressed stream, which has to be decompressed from the start. I would suggest 7z.

Yes, sort of.

If you know that there is only one file with this name, or if you want only one file, you can interrupt the extraction process after the first hit.

e.g.

full scan of the archive

    $ time tar tf /var/log/apache2/old/2016.tar.xz
    2016/
    2016/access.log-20161023
    2016/access.log-20160724
    2016/ssl_access.log-20160711
    2016/error.log-20160815
    (...)
    2016/error.log-20160918
    2016/ssl_request.log-20160814
    2016/access.log-20161017
    2016/access.log-20160516
    time: Real 0m1.5s User 0m1.4s System 0m0.2s

scan of the archive from memory (output discarded)

    $ time tar tf /var/log/apache2/old/2016.tar.xz > /dev/null
    time: Real 0m1.3s User 0m1.2s System 0m0.2s

abort after the first file

    $ time tar tf /var/log/apache2/old/2016.tar.xz | head -n1
    2016/
    time: Real 0m0.0s User 0m0.0s System 0m0.0s

abort after three files

    $ time tar tf /var/log/apache2/old/2016.tar.xz | head -n3
    2016/
    2016/access.log-20161023
    2016/access.log-20160724
    time: Real 0m0.0s User 0m0.0s System 0m0.0s

abort after some file in the "middle"

    $ time tar xf /var/log/apache2/old/2016.tar.xz 2016/access.log-20160724 | head -n1
    time: Real 0m0.9s User 0m0.9s System 0m0.1s

abort after some file at the bottom

    $ time tar xf /var/log/apache2/old/2016.tar.xz 2016/access.log-20160516 | head -n1
    time: Real 0m1.1s User 0m1.1s System 0m0.2s

What this shows is that if you close GNU tar's output pipe (standard output) by exiting after the first line (head -n1), the tar process dies as well.

You can see that reading the entire archive takes longer than aborting after some file close to the "bottom" of the archive. You can also see that aborting after a file near the top takes significantly less time.
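To see the same ordering effect from Python, here is a minimal sketch that counts how many member headers the tarfile module reads before it reaches a given file. The archive is built in memory, and the member names (2016/log-000 and so on) and both helper functions are made up for illustration. It counts headers rather than measuring time, so it only illustrates why position in the archive matters, not the timings above.

```python
import io
import tarfile


def build_archive(names):
    """Return the bytes of an uncompressed tar archive with one small file per name."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in names:
            data = name.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def members_read_until(archive_bytes, wanted):
    """Count the member headers read before `wanted` is found or the archive ends."""
    count = 0
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as tar:
        # Iterating over the TarFile reads one header at a time, in archive order.
        for member in tar:
            count += 1
            if member.name == wanted:
                break
    return count


# Hypothetical member names, loosely modelled on the log archive above.
names = ["2016/log-%03d" % i for i in range(100)]
archive = build_archive(names)

print(members_read_until(archive, names[0]))        # file at the top: 1
print(members_read_until(archive, names[49]))       # file in the middle: 50
print(members_read_until(archive, "no-such-file"))  # full scan: 100
```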

I would not do this if I could choose the archive format.

So...

Instead of whatever it is that Python people like, iterate with tar.getmembers() (or whatever gives you one file at a time in that library) and abort when you encounter the wanted result, rather than expanding all files into a list.
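A minimal sketch of that idea in Python, assuming the standard tarfile module: iterating over the TarFile object itself yields members one at a time as the archive is read, whereas getmembers() scans the whole archive before returning its list. The function name extract_first_match and its parameters are hypothetical, and it copies the file contents to a destination path of your choosing instead of recreating the directory structure.

```python
import shutil
import tarfile


def extract_first_match(archive_path, member_name, dest_path):
    """Copy the first regular file named member_name to dest_path, then stop reading.

    Returns True if the member was found, False otherwise.
    """
    # The default mode "r" detects gzip, bz2 or xz compression automatically.
    with tarfile.open(archive_path) as tar:
        # Members are yielded in archive order; nothing past the match is read.
        for member in tar:
            if member.name == member_name and member.isfile():
                source = tar.extractfile(member)
                with open(dest_path, "wb") as target:
                    shutil.copyfileobj(source, target)
                return True
    return False
```

For the archive in the question, the call would look like extract_first_match("myfile.tgz", "myfile/first/second/text.txt", "text.txt"). The compressed stream still has to be decompressed up to the position of the file, so the gain depends on how close to the top of the archive the file sits.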
