Best way to modify a file when using pipes?

I often have shell-scripting tasks in which I use this pattern:

cat file | some_script > file 

This is unsafe: cat may not have read the entire file before some_script starts writing to it. I really don't want to write the result to a temporary file (it's slow, and I don't want the added complexity of coming up with a unique name).

Perhaps there is a standard shell command that will buffer the entire stream until EOF is reached? Something like:

 cat file | bufferUntilEOF | script > file 

Ideas?

+8
bash shell pipe
8 answers

You are looking for sponge.
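
sponge comes from the moreutils package: it soaks up all of its standard input before opening and writing the named file, which is exactly the bufferUntilEOF behavior described in the question. Typical use:

 some_script < file | sponge file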

+4

Using a temporary file is the right solution here. When you use > redirection, it is handled by the shell, and no matter how many commands are in your pipeline, the shell is free to truncate and overwrite the output file before any of those commands runs (it opens the output file while setting up the pipeline).
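
You can see the truncation with a throwaway experiment (demo.txt is just an illustrative name; strictly speaking it is a race, but the file almost always ends up empty):

 printf 'hello\n' > demo.txt
 cat demo.txt | tr a-z A-Z > demo.txt
 cat demo.txt   # usually prints nothing: the shell truncated demo.txt before cat could read it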

+4

Like many others, I like to use temporary files. I use the shell's process ID as part of the temporary name, so that if multiple instances of the script run at the same time they won't collide. Finally, I only overwrite the original file if the script succeeds, using short-circuit evaluation with the && operator; it's a little terse, but very handy for simple command lines. Putting it all together, it looks like this:

 some_script < file > smscrpt.$$ && mv smscrpt.$$ file 

This leaves the temporary file behind if the command fails. If you want to clean up on failure as well, you can change it to:

 some_script < file > smscrpt.$$ && mv smscrpt.$$ file || rm smscrpt.$$ 

By the way, I also got rid of the useless use of cat and replaced it with input redirection.
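
If you would rather not manage the cleanup by hand, a variant of the same idea (just a sketch, assuming bash with mktemp available) uses a trap so the temporary file is removed no matter how the script exits:

 #!/bin/bash
 tmp=$(mktemp) || exit 1      # unique temporary file
 trap 'rm -f "$tmp"' EXIT     # cleaned up on any exit; a no-op after a successful mv
 some_script < file > "$tmp" && mv "$tmp" file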

+3

Using mktemp(1) or tempfile(1) saves you the trouble of coming up with a unique file name.
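
For example (a minimal sketch; the generated name is random):

 TMP=$(mktemp)                # creates a unique file such as /tmp/tmp.Xq3rZ9fLkT and prints its name
 some_script < file > "$TMP" && mv "$TMP" file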

+2

Using a temporary file is IMO better than trying to buffer the data in the pipeline.

Buffering the entire stream almost defeats the purpose of using a pipeline in the first place.

+1

I think the best way is to use a temporary file. However, if you want another approach, you can use something like awk to buffer the input in memory before your application starts receiving it. The following script buffers all of the input into the lines array before printing it to the next consumer in the pipeline.

 { lines[NR] = $0; } END { for (line_no=1; line_no<=NR; ++line_no) { print lines[line_no]; } } 

You can collapse it into a single line if you want:

 cat file | awk '{lines[NR]=$0;} END {for(i=1;i<=NR;++i) print lines[i];}' > file

Be aware, though, that the one-liner still has the original race: the shell truncates file while setting up the final redirection, before cat has read it, so the buffering inside awk does not by itself make writing back to the same file safe.

With all this, I still recommend using a temporary file for output and then overwriting the original file.

+1

In response to the OP's question about getting sponge behavior without external dependencies, and building on @D.Shawley's answer: you can get a sponge-like effect with only a gawk dependency, which is not uncommon on Unix or Unix-like systems:

 cat foo | gawk -voutfn=foo '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}' 

The NR>0 check controls the truncation of the output file: when there is at least one line, the first print uses >, which opens outfn and truncates it, and the remaining lines are appended with >>. With empty input, outfn is never touched.
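
For readability, here is the same gawk program spelled out with comments; the behavior is identical to the one-liner above:

 cat foo | gawk -v outfn=foo '
     { lines[NR] = $0 }                 # buffer every line of stdin in memory
     END {
         if (NR > 0)
             print lines[1] > outfn     # ">" opens outfn, truncating it
         for (i = 2; i <= NR; ++i)
             print lines[i] >> outfn    # append the remaining buffered lines
     }'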

To use this in a shell script, change -voutfn=foo to -voutfn="$1", or whatever syntax your shell uses for filename arguments. For example:

 #!/bin/bash
 cat "$1" | gawk -voutfn="$1" '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'

Please note that, unlike the real sponge, this approach is limited by available RAM; sponge actually buffers in a temporary file when necessary.

+1

I think you want to use mktemp. Something like this will work:

 FILE=example-input.txt
 TMP=$(mktemp)
 some_script <"$FILE" >"$TMP"
 mv "$TMP" "$FILE"
0
