I will start with a disclaimer: I do not have Python 3.5 (so I cannot use the run function), and I could not reproduce your problem on Windows (Python 3.4.4) or Linux (3.1.6). Nonetheless...
Problems with subprocess.PIPE and friends
The subprocess.run docs say it is just a convenience interface to the older subprocess.Popen-and-communicate() technique. The subprocess.Popen.communicate docs warn that:
The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
This appears to be exactly your problem. Unfortunately, the docs do not say how much data counts as "large", nor what happens once "too much" data has been read. Just, "don't do that, then."
The docs for subprocess.call deserve a little more attention ...
Do not use stdout=PIPE or stderr=PIPE with this function. The child process will block if it generates enough output to a pipe to fill up the OS pipe buffer, as the pipes are not being read from.
... as do the docs for subprocess.Popen.wait :
This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.
Of course, this implies that Popen.communicate is the solution to this problem, but communicate's own docs say "do not use this method if the data size is large", which is exactly the situation in which the wait docs tell you to use communicate. (Perhaps it "avoids that" by silently dropping data on the floor?)
Disappointingly, I see no way to use subprocess.PIPE safely unless you are certain you can read from the pipe faster than your child process writes to it.
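To make that concrete, here is a minimal sketch of the pattern those warnings describe, assuming a Unix-like system where the yes command is available (it is only a stand-in for any program that produces a lot of output):

import subprocess

# Illustration of the deadlock described above: the pipe is created but
# never read from, so once the child fills the OS pipe buffer it blocks
# on write, and wait() never returns.
p = subprocess.Popen(
    ['yes'],                 # any sufficiently chatty child will do
    stdout=subprocess.PIPE,  # output goes to a pipe nobody reads
    universal_newlines=True,
)
p.wait()                     # blocks forever once the buffer is full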
In your post, you keep all of your data in memory ... twice over. That is unlikely to be efficient, especially if the data is already in a file.
If you are allowed to use a temporary file, you can easily compare the two files one line at a time. This avoids all of the subprocess.PIPE mess, and it is much faster too, because it only uses a little bit of RAM at a time. (The IO from your subprocess may also be faster, depending on how your operating system handles output redirection.)
Again, I cannot test run, so here is a slightly older Popen-and-communicate solution (minus main and the rest of your setup):
import io
import subprocess
import tempfile


def are_text_files_equal(file0, file1):
    '''
    Both files must be opened in "update" mode ('+' character), so
    they can be rewound to their beginnings.  Both files will be read
    until just past the first differing line, or to the end of the
    files if no differences were encountered.
    '''
    file0.seek(io.SEEK_SET)
    file1.seek(io.SEEK_SET)
    for line0, line1 in zip(file0, file1):
        if line0 != line1:
            return False
    # Both files were identical to this point.  See if either file
    # has more data.
    next0 = next(file0, '')
    next1 = next(file1, '')
    if next0 or next1:
        return False
    return True


def compare_subprocess_output(exe_path, input_path):
    with tempfile.TemporaryFile(mode='w+t', encoding='utf8') as temp_file:
        with open(input_path, 'r+t') as input_file:
            p = subprocess.Popen(
                [exe_path],
                stdin=input_file,
                stdout=temp_file,        # No more PIPE.
                stderr=subprocess.PIPE,  # <sigh>
                universal_newlines=True,
            )
            err = p.communicate()[1]  # No need to store output.
            # Compare input and output files...  This must be inside
            # the `with` block, or the TemporaryFile will close before
            # we can use it.
            if are_text_files_equal(temp_file, input_file):
                print('OK')
            else:
                print('Failed: ' + str(err))
    return
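For reference, a call might look something like this (the executable and input paths are placeholders, not taken from your question):

# Hypothetical invocation; substitute your own paths.
compare_subprocess_output('./my_test_program', 'huge_input.txt')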
Unfortunately, since I cannot reproduce your problem, even with a million lines of input, I cannot say whether this works. If nothing else, it should give you wrong answers faster.
Option: regular file
If you want to keep the output of your test run in foo.txt (as in your command-line example), you would redirect your subprocess's output to a regular file instead of a TemporaryFile. That solution is recommended in J.F. Sebastian's answer.
I can't tell from your question whether you actually want foo.txt, or if it is just a side effect of your two-step test-then-diff: your command-line example saves the test results to a file, while your Python script does not. Saving the output would be handy if you ever want to investigate a test failure, but then you would need to come up with a unique file name for each test you run, so that they do not overwrite each other.
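If you do want to keep a per-run output file, a minimal sketch might look like the following; the file-naming scheme and paths are my own assumptions, and are_text_files_equal is the helper defined above:

import subprocess
import time

def run_test_to_file(exe_path, input_path):
    # One regular output file per run, named so that runs do not
    # overwrite each other; the timestamp scheme is just one option.
    out_path = 'foo_{}.txt'.format(int(time.time()))
    with open(input_path, 'r+t') as input_file, \
         open(out_path, 'w+t') as out_file:
        p = subprocess.Popen(
            [exe_path],
            stdin=input_file,
            stdout=out_file,         # regular file instead of a TemporaryFile
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        err = p.communicate()[1]
        if are_text_files_equal(out_file, input_file):
            print('OK')
        else:
            print('Failed: ' + str(err))
    return out_path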