I am looking for a C# solution for importing data from PDF documents into our database in a commercial application. Our customers will be trying to import any arbitrary document. Normally I would write this off as a complete impossibility, but the documents they import will be in their own layout.
My plan is to render the PDF files to still images, and then let users set up their own templates, which essentially pull out the text at predefined pixel offsets in the PDF using OCR. For tables, they would define the location of the table and a set of additional values for the column and row sizes. We can then apply the template to that type of document.
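To make the template idea concrete, here is a rough sketch of how a table definition could expand into one rectangle per cell, each of which would then be cropped and handed to OCR. The names (`PixelRect`, `TableRegion`, `TemplateGeometry.CellRects`) are made up for illustration and do not come from any library, and the sketch assumes every page is rendered at the same resolution the template was drawn at.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: a rectangle in image pixel coordinates.
public readonly struct PixelRect
{
    public PixelRect(int x, int y, int width, int height)
    {
        X = x; Y = y; Width = width; Height = height;
    }

    public int X { get; }
    public int Y { get; }
    public int Width { get; }
    public int Height { get; }
}

// Hypothetical type: what a user stores in a template for one table,
// i.e. the top-left corner plus the size of each column and row in pixels.
public sealed class TableRegion
{
    public TableRegion(int x, int y, IReadOnlyList<int> columnWidths, IReadOnlyList<int> rowHeights)
    {
        X = x; Y = y; ColumnWidths = columnWidths; RowHeights = rowHeights;
    }

    public int X { get; }
    public int Y { get; }
    public IReadOnlyList<int> ColumnWidths { get; }
    public IReadOnlyList<int> RowHeights { get; }
}

public static class TemplateGeometry
{
    // Expands a table definition into a [row, column] grid of cell rectangles
    // by accumulating column widths left to right and row heights top to bottom.
    public static PixelRect[,] CellRects(TableRegion table)
    {
        var cells = new PixelRect[table.RowHeights.Count, table.ColumnWidths.Count];
        int y = table.Y;
        for (int row = 0; row < table.RowHeights.Count; row++)
        {
            int x = table.X;
            for (int col = 0; col < table.ColumnWidths.Count; col++)
            {
                cells[row, col] = new PixelRect(x, y, table.ColumnWidths[col], table.RowHeights[row]);
                x += table.ColumnWidths[col];
            }
            y += table.RowHeights[row];
        }
        return cells;
    }
}
```

For example, a table anchored at (100, 200) with column widths 50 and 80 and row heights 20 and 30 would put the cell in the second row and second column at (150, 220) with size 80 x 30.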
So, what I'm really looking for is two libraries: one to convert PDF files to images, and another to run OCR on those images.
Requirements:
- Is pure C# or has a supported C# wrapper around its native DLL.
- Does not shell out to processes - wrappers that essentially just build command line parameters and launch an external executable are not acceptable in this case.
- If FOSS, allows us to buy our way out of the usual FOSS licensing requirements (i.e. publishing our source code) by paying a license fee.
Of course, we are not opposed to paying for a commercial solution, but we would prefer not to be stuck paying per-distribution fees.
I know that this is a rather specific set of requirements - perhaps some people will find this question too localized - but I hope someone can suggest an approach and some libraries that may help me, as well as others in the future.
What I have looked at so far for the PDF side:
- iTextSharp - The documentation is a book you have to buy, which is not a good start. There seems to be very little useful public documentation about rendering PDF files to images. Licensing is opaque; it looks like we would have to pay for every customer we distribute to.
- Docotic.Pdf - Only handles text, which is not what we need.
- pdftohtml - Again, does not produce images. It would be messy to port to C# too.
- PdfFileParser - Still not what we need.
- GhostScript - Looks like exactly what we want, but requires shelling out to the program.
For the OCR side, I will probably end up using Tesseract, since the Apache license is workable and it has received good reviews. If there is an alternative, I would be interested in that too.