Reading PDF using itextsharp, where the PDF is not English

Question


I am trying to read this PDF using iTextSharp in C# and convert it to a Word file. The conversion also needs to preserve tables and fonts in the Word document. With English PDFs it works fine, but with some Indian languages, such as Hindi or Marathi, it does not work.

    public string ReadPdfFile(string Filename)
    {
        StringBuilder text = new StringBuilder();
        try
        {
            if (File.Exists(Filename))
            {
                PdfReader pdfReader = new PdfReader(Filename);
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                    text.Append(currentText);
                }
                // Close the reader once, after all pages have been read
                pdfReader.Close();
            }
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
        textBox1.Text = text.ToString();
        return text.ToString();
    }

+5

.net ms-word pdf itextsharp c#-4.0

Rahul Rajput Mar 13 '13 at 12:24


3 answers

As @mkl said, we need more information on why something is not working. But I can tell you a couple of things that can help you.

Firstly, SimpleTextExtractionStrategy is very simple. If you read the documentation for it, you will see that:

If the PDF does not display text from top to bottom, this will result in the text not being an accurate representation of how it looks in the PDF.

This means that although the PDF file may look like it should be read from top to bottom, its content could have been written in a different order. In the PDF file you are referencing, the second visual line is written first. Check out my post here for a slightly smarter text extraction strategy that tries to put the text back in top-to-bottom order. When I run my code on the first page of your PDF file, it seems to extract each "line" correctly.
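To illustrate the idea of putting text back in visual order, here is a minimal sketch. It is not the strategy from the linked post and it does not use iTextSharp at all: TextChunk and ReadingOrder are hypothetical names, and the coordinates are assumed to have been collected already (for example by a custom render listener). iTextSharp also ships a LocationTextExtractionStrategy that sorts text by position, which may be enough for simple cases.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; not an iTextSharp type.
public sealed class TextChunk
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }   // PDF user space: Y grows upward

    public TextChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ReadingOrder
{
    // Groups chunks whose baselines lie within `tolerance` of the first chunk
    // of the current line, then emits lines top to bottom and chunks left to right.
    public static string Rebuild(IEnumerable<TextChunk> chunks, float tolerance = 2f)
    {
        var lines = new List<List<TextChunk>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            if (lines.Count > 0 &&
                Math.Abs(lines[lines.Count - 1][0].Y - chunk.Y) <= tolerance)
            {
                lines[lines.Count - 1].Add(chunk);
            }
            else
            {
                lines.Add(new List<TextChunk> { chunk });
            }
        }

        return string.Join("\n",
            lines.Select(line =>
                string.Join(" ", line.OrderBy(c => c.X).Select(c => c.Text))));
    }
}
```

The fixed tolerance is a simplification; real documents with mixed font sizes or rotated text would need something smarter.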

Secondly, PDF files do not have the concept of tables. They simply have text and lines drawn at specific places, and none of them are connected to each other. This means that you will need to work out each row yourself and build your own table concept. You will not find any code in iTextSharp that does this for you. Personally, I would not even try to write one.

Thirdly, text extraction is designed to extract text; it has nothing to do with fonts. If you want font information, you will have to build that logic in yourself. Check out my post here for a start.

+4

Chris Haas Mar 13 '13 at 14:11


@Rahul Rajput Have you fixed this problem? I am facing the same problem. Can you share your strategy?

0

Tushar Dhamal May 25 '19 at 14:55


mkl · Accepted Answer · 2013-03-22 09:27

I inspected your file with a special focus on your sample "मतदार", which is extracted as "मतदरर" from the topmost line of the document pages.

In a nutshell:

The document itself contains the information that, for example, the glyphs "मतदार" in the heading represent the text "मतदरर". You should ask the source of your document for a version in which the font information is not misleading. If that is not possible, you should go for OCR.

More details:

The top line of the first page is generated by the following operations in the page content stream:

 /9 280 Tf
 (-12"!%$"234%56*5) Tj

The first line selects the font named /9 at a size of 280 (an operation at the beginning of the page scales everything by a factor of 0.05, so the effective size is the 14 units you observe in the file).

The second line prints glyphs. The glyphs are referenced inside the parentheses using the custom encoding of that font.

When a program tries to extract the text, it has to derive the actual characters from these glyph references using information from the font.

The font /9 on the first page of your PDF file is defined by the following objects:

 242 0 obj<<
 /Type/Font/Name/9/BaseFont 243 0 R/FirstChar 33/LastChar 94
 /Subtype/TrueType/ToUnicode 244 0 R/FontDescriptor 247 0 R/Widths 248 0 R>>
 endobj
 243 0 obj/CDAC-GISTSurekh-Bold+0
 endobj
 247 0 obj<<
 /Type/FontDescriptor/FontFile2 245 0 R/FontBBox 246 0 R/FontName 243 0 R
 /Flags 4/MissingWidth 946/StemV 0/StemH 0/CapHeight 500/XHeight 0
 /Ascent 1050/Descent -400/Leading 0/MaxWidth 1892/AvgWidth 946/ItalicAngle 0>>
 endobj

Thus, there is no /Encoding element, but at least there is a reference to a /ToUnicode map. A program extracting the text therefore has to rely on that /ToUnicode mapping.

The stream referenced by /ToUnicode contains the following mappings of interest when extracting the text of (-12"!%$"234%56*5):

 <21> <21> <0930>
 <22> <22> <0930>
 <24> <24> <091c>
 <25> <25> <0020>
 <2a> <2a> <0031>
 <2d> <2d> <092e>
 <31> <31> <0924>
 <32> <32> <0926>
 <33> <33> <0926>
 <34> <34> <002c>
 <35> <35> <0032>
 <36> <36> <0030>

(Already here you can see that several character codes are mapped to the same Unicode code point ...)
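A quick way to see this is to group the character codes by the code point they map to. The following is only a sketch: the class and method names are made up for illustration, and the table is typed in by hand from the /ToUnicode excerpt above rather than parsed from the PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToUnicodeCheck
{
    // Character code -> Unicode code point, copied by hand from the excerpt above.
    public static readonly Dictionary<byte, int> Map = new Dictionary<byte, int>
    {
        { 0x21, 0x0930 }, { 0x22, 0x0930 }, { 0x24, 0x091C }, { 0x25, 0x0020 },
        { 0x2A, 0x0031 }, { 0x2D, 0x092E }, { 0x31, 0x0924 }, { 0x32, 0x0926 },
        { 0x33, 0x0926 }, { 0x34, 0x002C }, { 0x35, 0x0032 }, { 0x36, 0x0030 },
    };

    // Returns every code point that more than one character code maps to,
    // together with the (sorted) character codes that collide on it.
    public static Dictionary<int, List<byte>> FindCollisions(IDictionary<byte, int> map)
    {
        return map.GroupBy(kv => kv.Value, kv => kv.Key)
                  .Where(g => g.Count() > 1)
                  .ToDictionary(g => g.Key, g => g.OrderBy(b => b).ToList());
    }
}
```

For the excerpt, FindCollisions(ToUnicodeCheck.Map) reports two collisions: U+0930 (codes 0x21 and 0x22) and U+0926 (codes 0x32 and 0x33). Distinct glyphs that share a code point are a hint that the map cannot round-trip the visible text.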

Thus, text extraction should result in:

 - = 0x2d -> 0x092e = म
 1 = 0x31 -> 0x0924 = त
 2 = 0x32 -> 0x0926 = द
 " = 0x22 -> 0x0930 = र   instead of ा
 ! = 0x21 -> 0x0930 = र
 % = 0x25 -> 0x0020 = (space)
 $ = 0x24 -> 0x091c = ज
 " = 0x22 -> 0x0930 = र
 2 = 0x32 -> 0x0926 = द
 3 = 0x33 -> 0x0926 = द
 4 = 0x34 -> 0x002c = ,
 % = 0x25 -> 0x0020 = (space)
 5 = 0x35 -> 0x0032 = 2
 6 = 0x36 -> 0x0030 = 0
 * = 0x2a -> 0x0031 = 1
 5 = 0x35 -> 0x0032 = 2
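The same lookup can be reproduced in a few lines of C#. This is only a sketch of what an extractor does with a /ToUnicode map, not iTextSharp's actual implementation: ToUnicodeDecoder is a hypothetical name, the table is typed in by hand from the mappings quoted earlier, and each character of the string operand is assumed to be one single-byte character code.

```csharp
using System.Collections.Generic;
using System.Text;

public static class ToUnicodeDecoder
{
    // Character code -> Unicode code point, copied by hand from the /ToUnicode excerpt.
    public static readonly Dictionary<int, int> Map = new Dictionary<int, int>
    {
        { 0x21, 0x0930 }, { 0x22, 0x0930 }, { 0x24, 0x091C }, { 0x25, 0x0020 },
        { 0x2A, 0x0031 }, { 0x2D, 0x092E }, { 0x31, 0x0924 }, { 0x32, 0x0926 },
        { 0x33, 0x0926 }, { 0x34, 0x002C }, { 0x35, 0x0032 }, { 0x36, 0x0030 },
    };

    // Translates each single-byte character code of a Tj string operand
    // into the Unicode text that the font's /ToUnicode map claims for it.
    public static string Decode(string shown)
    {
        var sb = new StringBuilder();
        foreach (char code in shown)
        {
            int codePoint;
            // Codes missing from the map become U+FFFD (replacement character).
            sb.Append(Map.TryGetValue(code, out codePoint)
                ? char.ConvertFromUtf32(codePoint)
                : "\uFFFD");
        }
        return sb.ToString();
    }
}
```

Calling ToUnicodeDecoder.Decode("-12\"!%$\"234%56*5") yields "मतदरर जरदद, 2012", whose first word is exactly the "मतदरर" from the question. No choice of extraction strategy can change that, because the wrong characters come from the map itself.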

Thus, the text that iTextSharp (and Adobe Reader, too!) extracts from the heading on the first page is exactly what the document's font information claims is correct.

Since the cause is misleading mapping information in the font definition, it is not surprising that there are misinterpretations throughout the document.
