C# for rendering PDF files and OCRing the resulting images?

Question

I am looking for a C# solution to import data from PDF documents into our database, within a commercial application. Our customers will want to import arbitrary documents. Normally I would write this off as a complete impossibility, but the documents they import will each be in the customer's own layout.

My plan is to render the PDF files to still images, and then let users define their own templates, which essentially pull out the text at predefined pixel offsets in the rendered page using OCR. For tables, they would define the location of the table plus a set of additional values for the sizes of columns and rows. We can then apply the template to every document of that type.
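As a rough illustration of the table part of such a template, the sketch below turns a table origin plus column widths and row heights into one pixel rectangle per cell, which could then be passed to OCR one at a time. The `TableTemplate` type and its members are hypothetical names made up for this example; they do not come from any of the libraries discussed here.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Hypothetical template type: a table described by its top-left origin
// plus column widths and row heights, all in pixels of the rendered page.
public sealed class TableTemplate
{
    public Point Origin { get; }
    public IReadOnlyList<int> ColumnWidths { get; }
    public IReadOnlyList<int> RowHeights { get; }

    public TableTemplate(Point origin,
                         IReadOnlyList<int> columnWidths,
                         IReadOnlyList<int> rowHeights)
    {
        Origin = origin;
        ColumnWidths = columnWidths;
        RowHeights = rowHeights;
    }

    // Returns cells[row, column]; each entry is the pixel rectangle of one cell.
    public Rectangle[,] GetCellRectangles()
    {
        var cells = new Rectangle[RowHeights.Count, ColumnWidths.Count];
        int y = Origin.Y;
        for (int row = 0; row < RowHeights.Count; row++)
        {
            int x = Origin.X;
            for (int col = 0; col < ColumnWidths.Count; col++)
            {
                cells[row, col] = new Rectangle(x, y, ColumnWidths[col], RowHeights[row]);
                x += ColumnWidths[col]; // next cell starts where this one ends
            }
            y += RowHeights[row]; // next row starts below this one
        }
        return cells;
    }
}
```

Storing widths and heights rather than absolute cell coordinates keeps the template small and lets a user move the whole table by changing only the origin. The rectangles are only valid for images rendered at the resolution the template was drawn on.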

So, what I'm really looking for is two libraries: one for converting PDF files to images, the other for running OCR on those images.

Requirements:

Is pure C#, or has a supported C# wrapper around its native DLL.
Does not shell out to other processes - wrappers that essentially just build command-line parameters and run an external executable are not acceptable in this case.
In the case of FOSS, allows us to be released from the usual FOSS licensing requirements (i.e. publishing our source code) by paying a license fee.

Of course, we are not opposed to paying for a commercial solution, but we would prefer not to pay a fee for each distribution of our software.

I know that this is a rather specific set of requirements, and some people may find this question too localized, but I hope someone can offer an approach and some libraries that will help me, as well as others in the future.

What I have looked into so far for the PDF side:

iTextSharp - The documentation is a book you have to buy, which is not a good start. There seems to be little useful publicly available documentation on turning PDF files into images. Licensing is opaque; it looks like we would have to pay for every customer we distribute to.
Docotic.Pdf - Only handles text, which is not what we need.
pdftohtml - Again, does not create images. It would also be a mess to port to C#.
PdfFileParser - Still not what we need.
GhostScript - Looks like it does what we want, but requires shelling out to an external program.

For the OCR side, I will probably end up using Tesseract, since the Apache license is acceptable and it has received good reviews. If there is an alternative, I would be interested in that too.

+4

c# pdf ocr pdf-rendering

Polynomial May 31 '12 at 10:38


2 answers

I would recommend Amyuni PDF Creator .Net for this task.

1st scenario:
If your PDFs are well defined (no missing font information, etc.), you can extract text directly from the PDF by specifying a rectangular area in the GetObjectsInRectangle method. You should also use the acGetRectObjectsOptimize option:

Optimize text objects before returning them. That is, combine text objects close to each other into one text object.

2nd scenario:
If there are images involved that also contain text, rendering the entire page to an image and then applying OCR may be a better choice. You can do this with Amyuni PDF Creator .Net using the ExportToTiff, ExportToJPeg, or RasterizePageRange methods.

From the documentation:

IacDocument.RasterizePageRange method: The RasterizePageRange method converts the contents of a page into a color or grayscale image. When archiving documents or performing OCR, it is sometimes preferable to store all pages as images rather than as complex text and graphic operations.

Then you can use the OCR add-on, which integrates with Tesseract OCR, and after that we are back in the 1st scenario (GetObjectsInRectangle). To apply OCR to your files, you can use the OCRPageRange method.

void OCRPageRange (int startPage, int EndPage, string Language, acOCROptions Options)

On licensing, Amyuni PDF Creator.Net provides a free license (for each application).

Usual disclaimer applies

+2

yms May 31 '12 at 13:30


Bobrovsky · Accepted Answer · 2012-05-31T17:58:06+0000

I think you might want to give Docotic.Pdf another chance.

The library can extract text fragments, words, and even individual characters along with their bounding boxes. Please have a look at the sample for extracting words from PDF files.
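When words come with bounding boxes, a template region can be applied without OCR at all. The sketch below is only an illustration of that idea: `WordBox` and `TemplateMatcher` are hypothetical names, not Docotic.Pdf types, and it assumes that word bounds are in PDF points (1/72 inch) with a top-left origin, while the template region was drawn in pixels on a page rendered at a known DPI. A real library may use different units or a bottom-left origin, so check before relying on this.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Hypothetical holder for one extracted word; a real library's type will differ.
public sealed class WordBox
{
    public string Text { get; }
    public RectangleF Bounds { get; } // assumed: PDF points, top-left origin

    public WordBox(string text, RectangleF bounds)
    {
        Text = text;
        Bounds = bounds;
    }
}

public static class TemplateMatcher
{
    // Converts a pixel rectangle on a page rendered at `dpi` into PDF points.
    public static RectangleF PixelsToPoints(Rectangle pixels, float dpi)
    {
        float scale = 72f / dpi; // 72 points per inch
        return new RectangleF(pixels.X * scale, pixels.Y * scale,
                              pixels.Width * scale, pixels.Height * scale);
    }

    // Joins the words whose centre lies inside the region, sorted top to
    // bottom and then left to right.
    public static string TextInRegion(IEnumerable<WordBox> words, RectangleF regionInPoints)
    {
        var hits = words
            .Where(w => regionInPoints.Contains(
                w.Bounds.X + w.Bounds.Width / 2f,
                w.Bounds.Y + w.Bounds.Height / 2f))
            .OrderBy(w => w.Bounds.Y)
            .ThenBy(w => w.Bounds.X)
            .Select(w => w.Text);
        return string.Join(" ", hits);
    }
}
```

Testing the word's centre rather than requiring full containment tolerates words that slightly overlap the region's edge. Sorting by exact Y is a simplification: words on the same line with slightly different baselines would need to be grouped into lines first.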

In addition, Docotic.Pdf can create images from PDF files and draw pages on System.Drawing.Graphics. See the "Draw and print Pdf" group of samples.

Disclaimer: I am one of the developers of the library.
