Why do people use tarballs?

As a mostly Windows developer, I might be missing something cultural in the Linux community, but it has always confused me that files offered for download are first put into a .tar archive and then zipped. Why the two-step process? Doesn't zipping already group the files? Is there some other benefit I'm not aware of?

+72
linux package archive
Nov 17 '08 at 15:25
15 answers

bzip and gzip work on single files, not groups of files. Plain old zip (and pkzip) operate on groups of files and have the concept of an archive built in.

The *nix philosophy is one of small tools that do very specific jobs very well and can be piped together. That's why there are two tools here that have specific tasks, and they are designed to fit well together. It also means you can use tar to group files and then have a choice of compression tool (bzip, gzip, etc.).
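For example, a minimal sketch of that piping (any other stream compressor could stand in for gzip here):

 # tar writes the archive to stdout (-f -); gzip compresses the stream
 tar -cf - directory1 | gzip > directory1.tar.gz

 # the reverse: gunzip decompresses to stdout, tar reads the archive from stdin
 gunzip -c directory1.tar.gz | tar -xf -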

+116
Nov 17 '08 at 15:27

It is odd that nobody else has mentioned that modern versions of GNU tar allow you to compress as you bundle:

 tar -czf output.tar.gz directory1 ...
 tar -cjf output.tar.bz2 directory2 ...

You can also use the compressor of your choice, provided it supports '-c' (write to standard output / read from standard input) and '-d' (decompress):

 tar -cf output.tar.xxx --use-compress-program=xxx directory1 ... 

This will allow you to specify any alternative compressor.
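For instance (a sketch, assuming the zstd tools are installed; zstd is not part of the original answer and merely stands in for any alternative compressor):

 # compress with zstd via an external program
 tar -cf output.tar.zst --use-compress-program=zstd directory1

 # extract the same way; tar calls 'zstd -d' for you when extracting
 tar -xf output.tar.zst --use-compress-program=zstd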

[Added: when extracting, GNU tar automatically detects files compressed with gzip or bzip2 and runs the corresponding program. That is, you can use:

 tar -xf output.tar.gz
 tar -xf output.tgz        # a synonym for the .tar.gz extension
 tar -xf output.tar.bz2

and they will be handled properly. If you used a non-standard compressor, you need to specify how to decompress when extracting.]
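As a sketch of that last case (xz is used here purely as a stand-in for "some other compressor"; recent GNU tar versions actually auto-detect it too):

 # explicitly route the archive through the decompressor
 xz -dc output.tar.xz | tar -xf -

 # or let tar invoke it for you
 tar -xf output.tar.xz --use-compress-program=xz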

The reason for the separation is, as in the selected answer, the separation of duties. Among other things, it means that people could use the cpio program for packaging files (instead of tar) and then use the compressor of their choice (once upon a time the preferred compressor was pack, later it was compress (which was much more effective than pack), then gzip, which ran rings around both of its predecessors and is fully competitive with zip (which was ported to Unix but is not native there), and now bzip2, which in my experience usually has a 10-20% advantage over gzip).

[Added: someone noted in their answer that cpio has funky conventions. That's true, but until GNU tar got the relevant option ('-T -'), cpio was the better command when you did not want to archive everything under a given directory - you could choose exactly which files were archived. The drawback of cpio was that you not only could select the files - you had to select them. There is still one place where cpio scores: it can do an in-situ copy from one directory hierarchy to another without any intermediate storage:

 cd /old/location; find . -depth -print | cpio -pvdumB /new/place 

By the way, the '-depth' option to find is important in this context - it copies the contents of directories before setting the permissions on the directories themselves. When I checked the command before adding this addendum to the answer, I copied some read-only directories (permission 555); when I went to delete the copy, I had to relax the permissions on the directories before 'rm -fr /new/place' could finish. Without the -depth option, the cpio command would have failed. I only remembered this when I went to do the cleanup - the formula quoted above is automatic for me (mostly from many repetitions over the years).]
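Going back to the earlier point about choosing exactly which files get archived, a small illustration (hypothetical file names; cpio -o writes the selected files as an archive to standard output):

 # archive only the C sources under the current directory
 find . -name '*.c' -print | cpio -o > /tmp/sources.cpio

 # and read them back somewhere else
 cd /somewhere/else && cpio -idv < /tmp/sources.cpio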

+25
Nov 17 '08 at 15:41

An important difference is the nature of the two types of archives.

TAR files are little more than a concatenation of the file contents with some headers, while gzip and bzip2 are stream compressors that, in tarballs, are applied to the whole concatenation.

ZIP files are a concatenation of individually compressed files, with some headers. In fact, the DEFLATE algorithm is used by both zip and gzip, and with appropriate binary fiddling you could take the payload of a gzip stream and put it into a zip file with the appropriate header and directory entries.

This means the two different archive types have different trade-offs. For large collections of small files, TAR followed by a stream compressor will usually achieve a higher compression ratio than ZIP, because the stream compressor has more data from which to build its dictionary frequencies and can therefore squeeze out more redundancy. On the other hand, a (file-length-preserving) error in a ZIP file will corrupt only the files whose compressed data was affected, whereas stream compressors generally cannot meaningfully recover from mid-stream errors. Thus, ZIP files are more resilient to corruption, as part of the archive will still be accessible.
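A rough way to see the ratio difference for yourself (just a sketch; the source file and the exact numbers are arbitrary and will vary by system):

 # many small, similar files: the stream compressor can exploit
 # redundancy across files, while zip compresses each file separately
 mkdir smallfiles
 for i in $(seq 1 200); do cp /etc/services smallfiles/file$i.txt; done
 tar -czf smallfiles.tar.gz smallfiles
 zip -qr smallfiles.zip smallfiles
 ls -l smallfiles.tar.gz smallfiles.zip   # the .tar.gz is usually much smaller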

+24
Nov 17 '08 at 15:49

The funny thing is: you can get behaviour never anticipated by the authors of tar and gzip. For example, you can not only gzip a tar file, you can also tar gzipped files to produce a files.gz.tar (which would technically be closer to the way pkzip works). Or you can put another program into the pipeline, for example some cryptography, and you can choose an arbitrary order of tarring, gzipping and encrypting. Whoever wrote the cryptography program does not need to have the slightest idea how it will be used; all it has to do is read from standard input and write to standard output.
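For instance, inserting GnuPG into the pipeline (a sketch, assuming gpg is installed; symmetric encryption with -c is just one of the options):

 # tar -> gzip -> encrypt, all in one pipeline
 tar -cf - directory1 | gzip | gpg -c -o directory1.tar.gz.gpg

 # decrypt -> gunzip -> untar to reverse it
 gpg -d directory1.tar.gz.gpg | gunzip | tar -xf -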

+16
Nov 17 '08 at 19:46

In the Unix world, most applications are designed to do one thing and do it well. The most popular compression utilities on Unix, gzip and bzip2, only do file compression. tar does the file concatenation. Piping tar's output into a compression utility does everything you need without adding undue complexity to either piece of software.

+7
Nov 17 '08 at 15:31

Another reason it is so common is that tar and gzip are present on almost the entire *NIX install base out there. I believe this is probably the single biggest reason. It is also why zip files are extremely common on Windows: support is built in, regardless of the superior routines in RAR or 7z.

GNU tar also lets you create/extract these archives with a single command (one step):

  • Create archive:
  • tar -cvjf destination.tar.bz2 *.files
  • tar -cvzf destination.tar.gz *.files

  • Extract archive (the -C part is optional; the current directory is used by default):
  • tar -xvjf archive.tar.bz2 -C destination_path
  • tar -xvzf archive.tar.gz -C destination_path

That is what I have committed to memory from my many years on Linux, and more recently on Nexenta (OpenSolaris).

+7
Nov 17 '08 at 16:14

I think you were looking for more historical context here. The original compression programs compressed just one file; tar is what puts several files into one file. Therefore tarring and then zipping is a two-step process. Why it still dominates today is anyone's guess.

From the Wikipedia article on tar (file format):

In computing, tar (derived from tape archive) is both a file format (in the form of a type of archive bitstream) and the name of a program used to handle such files. The format was standardized by POSIX.1-1988 and later POSIX.1-2001. Originally developed as a raw format used for tape backup and other sequential-access devices for backup purposes, it is now commonly used to collect many files into one larger file for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.

+5
Nov 17 '08 at 15:34

tar is popular mostly for historical reasons. There are several alternatives readily available. Some of them do roughly the same job as tar, but they could never overtake tar in popularity, for several reasons.

  • cpio (unfamiliar syntax; theoretically more consistent, but people prefer what they know, so tar prevailed)
  • ar (popular a long time ago, now used for packaging library files)
  • shar (self-extracting shell scripts; had all sorts of problems, but popular nonetheless)
  • zip (not available on many Unices because of licensing issues)

The main advantage (and also drawback) of tar is that it has neither a file header nor a central directory of contents. For many years it therefore never suffered from size limitations (until this decade, when the 8 GB limit on individual files inside the archive became a problem - it was solved years ago).

Apparently the one drawback of tar.gz (or ar.Z, for that matter) - that you have to decompress the whole archive in order to extract individual files or list the archive's contents - never hurt people badly enough to turn them away from tar in significant numbers.
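To make that concrete (a sketch with hypothetical archive names): listing a zip only needs its central directory, while listing a tar.gz has to decompress the whole stream:

 unzip -l big.zip        # reads only the central directory at the end of the file
 tar -tzf big.tar.gz     # decompresses the entire archive just to list it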

+3
Nov 17 '08 at 17:07

tar is UNIX because UNIX is tar

In my opinion, the reason we are still using tar today is that it is one of the (probably rare) cases where the UNIX approach got it perfectly right from the very beginning.

Taking a close look at the stages involved in creating archives, I hope you will agree that the way the different tasks are separated here is the UNIX philosophy at its finest:

  • one tool (tar, to give it a name here) that specializes in transforming any selection of files, directories and symbolic links, including all relevant metadata such as timestamps, owners and permissions, into one byte stream.

  • and just another, arbitrarily interchangeable tool (gzip, bzip2, xz, to name just a few options) that transforms any input byte stream into another (hopefully) smaller output stream.

Using this approach, you get a number of advantages for both the user and the developer:

  • Extensibility: pair tar with any compression algorithm that already exists, or with any compression algorithm yet to be invented, without changing anything about the inner workings of tar at all.

    As soon as the brand new "hyper-zip-ultra" or whatever compression tool comes out, you are ready to use it, embracing your new servant with the whole power of tar.

  • Stability: tar has been in heavy use since the early '80s and has run on numerous operating systems and machines.

    Not having to reinvent the wheel of preserving ownership, permissions, timestamps and so on over and over again for every new archiving tool not only saves a lot of (otherwise unnecessarily spent) development time, it also guarantees the same reliability for every new application.

  • Consistency: the user interface stays the same all the time (see the sketch after this list).

    There is no need to remember that to restore permissions with tool A you have to pass the --i-hope-you-remember-this-one option, while with tool B you have to use --this-time-its-another-one, whereas with tool C it's --hope-you-didnt-try-the-tool-a-switch.

    Whereas with tool D you would have really messed it up if you didn't use --if-you-had-used-tool-bs-switch-your-files-would-have-been-deleted-now.
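The sketch mentioned at the consistency point above (assuming a reasonably recent GNU tar, which auto-detects the compression format):

 # one interface, any compressor
 tar -xf release.tar.gz
 tar -xf release.tar.bz2
 tar -xf release.tar.xz
 tar -xpf release.tar.gz   # -p restores permissions the same way for all of them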

+3
Mar 19 '13 at 3:39 on

gzip and bzip2 are just compressors, not archiver software. Hence the combination: you need tar to bundle all the files together.

ZIP and RAR combine the two processes in one.

+2
Nov 17 '08 at 15:28

Usually in the *nix world, sets of files are distributed as tarballs and then, optionally, gzipped. gzip is a simple file-compression program; it does not do the file bundling that tar or zip do.

At one time, zip did not handle some of the things that Unix tar and Unix file systems consider normal, such as symbolic links, mixed-case files, etc. I don't know if that has changed, but it is why we use tar.
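For instance, a quick check that tar carries the symbolic link itself rather than the file it points to (just a sketch; zip, as far as I recall, follows symlinks by default unless told otherwise with -y):

 ln -s /etc/hosts hosts-link
 tar -cf links.tar hosts-link
 tar -tvf links.tar        # the listing shows: hosts-link -> /etc/hosts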

+2
Nov 17 '08 at 15:29

Tar = groups files into 1 file

GZip = compresses a file

They split the process into 2 steps. That's it.

In a Windows environment, you may use WinZip or WinRAR, which produce a zip. Their software groups the files and compresses them in one go - you just don't see that as two steps.

+1
Nov 17 '08 at 15:27

For the same reason Mac users love disk images: they are a really convenient way to archive stuff and then pass it around, upload or download it, email it, etc.

And they are easier to use and more portable than zip files, IMHO.

+1
Jun 09 '09 at 18:18

Back in my Altos-XENIX days (1982) we started using tar (tape archiver) to extract files from 5 1/4" floppy disks or streaming tape, as well as to copy to these media. Its functionality is very similar to the BACKUP.EXE and RESTORE.EXE add-ons in DOS 5.0 and 6.22, letting you span multiple media when the data does not fit on just one. The drawback was that if one of the media had problems, the whole set was useless. tar and dd go back to UNIX System III and remain standard utilities shipped with UNIX-like OSes, probably for backward-compatibility reasons.

+1
Jun 27. '10 at 5:23

Tar is not only a file format, it is also a tape format. Tapes store data serially, and every storage implementation was different. Tar was the method by which you could get data off a disk and onto a tape in a way that other people could read it back without needing your particular program.

Compression programs came along later, and *nix still had only one way to make a single file that contained multiple files.

I think it is just inertia that has kept the tar.gz trend going. Pkzip started with compression and archiving in one fell swoop, but then DOS systems were not usually attached to tape drives!

From the Wikipedia article on tar (file format):

In computing, tar (derived from tape archive) is both a file format (in the form of a type of archive bitstream) and the name of a program used to handle such files. The format was standardized by POSIX.1-1988 and later POSIX.1-2001. Originally developed as a raw format used for tape backup and other sequential-access devices for backup purposes, it is now commonly used to collect many files into one larger file for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.

0
Nov 17 '08 at 18:53


