How to start pdftk subprocess during wsgi?

Question


I need to start the pdftk process while serving a web request in Django and wait for it to complete. My current pdftk code is as follows:

    proc = subprocess.Popen(["/usr/bin/pdftk", "/tmp/infile1.pdf", "/tmp/infile2.pdf", "cat", "output", "/tmp/outfile.pdf"])
    proc.communicate()

This works fine under the dev server (running as the www-data user). But as soon as I switch to mod_wsgi, changing nothing else, the code hangs at proc.communicate(), and "outfile.pdf" is left as an open file descriptor.

I have tried several other ways of invoking the subprocess (including plain old os.system); setting stdin/stdout/stderr to PIPE or to various file descriptors changes nothing. Using shell=True stops proc.communicate() from hanging, but then pdftk fails to create the output file under both the dev server and mod_wsgi. This discussion seems to indicate that there might be some deeper voodoo going on with OS signals and pdftk that I don't understand.

Are there any workarounds to get a subprocess call like this to work correctly under mod_wsgi? I am avoiding PyPDF for merging the PDF files because I need to merge a large number of files (several hundred), and it runs out of memory (PyPDF has to keep every source PDF open in memory while combining them).

I am doing this on a recent Ubuntu, with Python 2.6 and 2.7.


python django subprocess pdftk mod-wsgi

user85461 Sep 25 '11 at 3:26


2 answers

Update: Combining two PDFs with pdftk in Python 3:

Several years have passed since this question was posted (2011). The original poster said that os.system did not work for them on the older Python versions they were running:

Python 2.6 and
Python 2.7

In Python 3.4, os.system worked for me:

    import os
    os.system("pdftk " + template_file + " fill_form " + data_file + " output " + export_file)

Python 3.5 adds subprocess.run, which takes the command as a list of arguments:

    import subprocess
    subprocess.run(["pdftk", template_file, "fill_form", data_file, "output", export_file])

I used absolute paths for my files:
- template_file = "/var/www/myproject/static/"

I ran this with Django 1.10, and the result was saved to export_file.
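How the command is handed to subprocess matters. The sketch below is an illustration only: it uses echo as a stand-in for pdftk (an assumption, so that it runs anywhere on a POSIX system), and run_capture is a hypothetical helper name. It shows the three common call shapes. The last one may be relevant to the shell=True behaviour described in the question, although that is a guess and not something the original posts confirm.

```python
import subprocess


def run_capture(cmd, **kwargs):
    """Run cmd, wait for it to finish, and return its stdout as text."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, universal_newlines=True, **kwargs
    )
    return result.stdout


# 1. List of arguments, no shell: each item reaches the program unchanged.
as_list = run_capture(["echo", "hello world"])

# 2. One string through the shell: the shell splits it into words itself.
as_string = run_capture("echo hello   world", shell=True)

# 3. List combined with shell=True (POSIX): only the first item is the
#    command. The remaining items become arguments of the shell itself,
#    so echo runs with no arguments at all and prints an empty line.
list_with_shell = run_capture(["echo", "hello world"], shell=True)
```

With a real pdftk call, shape 3 would start pdftk without any file arguments, so passing either a list without shell=True or a single string with shell=True is the safer choice.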

How to merge two PDF files and display the output in PDF format:

    import os
    import time

    from django.core.files.storage import FileSystemStorage
    from django.http import HttpResponse, HttpResponseNotFound
    from fdfgen import forge_fdf

    # The code below is the body of a Django view; organization_name,
    # address_line_1, request_date and amount are defined earlier in it.
    template_file = "/var/www/myproject/template.pdf"
    data_file = "/var/www/myproject/data.fdf"
    export_file = "/var/www/myproject/pdf_output.pdf"

    fields = {}
    fields['organization_name'] = organization_name
    fields['address_line_1'] = address_line_1
    fields['request_date'] = request_date
    fields['amount'] = amount
    field_list = [(field, fields[field]) for field in fields]

    # Generate the FDF form data and write it where pdftk can read it.
    fdf = forge_fdf("", field_list, [], [], [])
    fdf_file = open(data_file, "wb")
    fdf_file.write(fdf)
    fdf_file.close()

    os.system("pdftk " + template_file + " fill_form " + data_file + " output " + export_file)
    time.sleep(1)

    fs = FileSystemStorage()
    if fs.exists(export_file):
        with fs.open(export_file) as pdf:
            return HttpResponse(pdf, content_type='application/pdf; charset=utf-8')
    else:
        return HttpResponseNotFound('The requested pdf was not found in our server.')


Tim Langeman Mar 24 '17 at 12:39


Graham Dumpleton · Accepted Answer · 2011-09-25 04:33

Try using absolute file system paths for the input and output files. The current working directory under Apache will not be the directory from which you ran the dev server; it can be anything.

Second attempt, after eliminating the obvious.
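As a minimal sketch of the absolute-path advice: the helper below (resolve_paths is a hypothetical name, not part of Django or mod_wsgi) anchors file names at an explicitly chosen base directory, so the result does not depend on whatever working directory Apache happens to use.

```python
import os


def resolve_paths(base_dir, names):
    """Return absolute paths for names, anchored at base_dir, not the cwd."""
    base_dir = os.path.abspath(base_dir)
    # os.path.join leaves a name untouched if it is already absolute.
    return [os.path.normpath(os.path.join(base_dir, name)) for name in names]


# Example with the file names from the question.
inputs = resolve_paths("/tmp", ["infile1.pdf", "infile2.pdf"])
```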

The pdftk program is a Java program that relies on being able to generate and receive the SIGPWR signal to trigger garbage collection or other actions. The problem is that in Apache/mod_wsgi daemon mode, signals are blocked in the request handler threads so that they are only received by the main thread, which watches for the events that trigger process shutdown. When you fork a process to run pdftk, it unfortunately inherits the blocked sigmask from the request handler thread. As a consequence, the Java garbage collection process is impeded and pdftk fails in strange ways.

The only known solution for this is to use Celery: have the front end submit a job to the Celery queue, and let the Celery worker fork and execute pdftk. Because this is done from a process that was not created by Apache, you will not have this problem.

For more information, search Google for mod_wsgi and pdftk, specifically in the mod_wsgi group on Google Groups.
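The inheritance described above can be observed without Apache or pdftk. The following is a small demonstration under stated assumptions, not mod_wsgi's actual code: it blocks a signal in the calling thread, starts a child Python interpreter, and asks the child whether that signal is blocked. The names CHILD_CODE, child_sees_blocked and demo are made up for this sketch, and SIGUSR1 is used as a stand-in on platforms that lack SIGPWR.

```python
import signal
import subprocess
import sys

# SIGPWR is Linux-specific; fall back to SIGUSR1 elsewhere (assumption).
SIG = getattr(signal, "SIGPWR", signal.SIGUSR1)

# Tiny child program: print whether the signal number in argv[1] is blocked.
CHILD_CODE = (
    "import signal, sys\n"
    "blocked = signal.pthread_sigmask(signal.SIG_BLOCK, [])\n"
    "print(int(sys.argv[1]) in {int(s) for s in blocked})\n"
)


def child_sees_blocked(sig):
    """Start a child process and report whether sig is blocked inside it."""
    out = subprocess.check_output(
        [sys.executable, "-c", CHILD_CODE, str(int(sig))]
    )
    return out.strip() == b"True"


def demo(sig=SIG):
    """Return (blocked in child before, blocked in child after blocking)."""
    # Start from a known state: make sure sig is not blocked in this thread.
    old_mask = signal.pthread_sigmask(signal.SIG_UNBLOCK, [sig])
    try:
        before = child_sees_blocked(sig)
        # Block sig in this thread, as a request handler thread would.
        signal.pthread_sigmask(signal.SIG_BLOCK, [sig])
        after = child_sees_blocked(sig)
    finally:
        # Restore whatever mask the caller had.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return before, after
```

On a Unix system, demo() returns (False, True): the child only sees the signal as blocked once the parent thread has blocked it, which is the situation a pdftk process started from a request handler thread ends up in.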
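A rough sketch of what the worker side could look like, assuming the pdftk path and the "cat ... output" arguments from the question. The function names are hypothetical, and the Celery wiring is shown only as comments because it needs the third-party celery package, a running broker and a result backend; the app name and URLs are placeholders, not values from the original posts.

```python
import subprocess

PDFTK = "/usr/bin/pdftk"  # path taken from the question


def pdftk_cat_command(inputs, output, pdftk=PDFTK):
    """Build the argument list for `pdftk in1 in2 ... cat output out`."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return [pdftk, *inputs, "cat", "output", output]


def merge_pdfs(inputs, output, pdftk=PDFTK):
    """Run pdftk and wait for it. Intended to run in the worker process."""
    subprocess.check_call(pdftk_cat_command(inputs, output, pdftk))
    return output


# Hypothetical Celery wiring (placeholders throughout):
#
#   from celery import Celery
#   app = Celery("pdfjobs", broker="redis://localhost:6379/0",
#                backend="redis://localhost:6379/0")
#   merge_pdfs_task = app.task(merge_pdfs)
#
#   # In the Django view: queue the job, then wait for the worker to finish.
#   merge_pdfs_task.delay(inputs, output).get(timeout=60)
```

The point of the split is that subprocess.check_call runs inside the Celery worker, which was not started by Apache, while the Django view only queues the job and waits for the result.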

http://groups.google.com/group/modwsgi/search?group=modwsgi&q=pdftk&qt_g=Search+this+group
