Reading PDF content using iTextSharp in C#

Question

I use this code to read PDF content using iTextSharp. It works fine when the content is English, but it does not work when the content is Persian or Arabic. Here is a sample non-English PDF for testing. The result looks something like this:

ÙŽ → Ü † Ø§ Ù '"Ø¨ ~~ Ø · Ø" ÛŒØ¿ÛŒ "> Ù ~ Ø²Ø¾Ø§ Ù Ù Ù Ø Ù Ù Ù Ù Ù ... Ø ÛŒ'" Ø¨ '• Ø³ Â © Karl Seguin foppersian.codeplex.com www.codebetter.com 1 1 Ù '' Ø¨ ~~ Ø · Ø "ÙŽ"> Ü † Ø§ ÛŒØ¿ÛŒÛŒ> ~~
Ù‡Ù…Ø§Ù†Ø±Ø¨ Ù„ÙˆØµØ§ ÛŒØ³ÛŒÙˆÙ† Ù…Ø±Ù† Ø¯ÛŒÙ„ÙˆØª Ø±ØªÙ‡Ø¨ Ø±Ø§Ø²ÙØ§ 

What is the solution?

    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
                text.Append(currentText);
                pdfReader.Close();
            }
        }
        return text.ToString();
    }

+4

c# pdf itextsharp

Shahin Apr 17 2018-12-12T00:

1 answer

Chris Haas · Accepted Answer · 2012-04-17 13:07

In .NET, once you have a string, you have a string, and it is Unicode, always. The actual in-memory implementation is UTF-16, but that doesn't matter. Never, ever, ever decompose a string into bytes, reinterpret those bytes as a different encoding, and turn the result back into a string. It makes no sense and will almost always fail.

Your problem is in this line:

 currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));

I am going to split it into a couple of lines to illustrate:

    byte[] bytes = Encoding.UTF8.GetBytes("ی"); // bytes now holds 0xDB8C
    byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes); // converted now holds 0xC39BC592
    string final = Encoding.UTF8.GetString(converted); // final now holds ÛŒ

This code will garble anything above the 127 ASCII barrier. Drop the re-encoding line and you should be good.
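To make the "above 127" point concrete, here is a minimal sketch of the same round trip with the source encoding passed in explicitly. ISO-8859-1 is used as an assumed stand-in for a single-byte Encoding.Default (which varies by machine), and the ReEncode helper is a hypothetical name, not part of iTextSharp.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Reproduces the re-encoding line, with the "source" encoding
    // passed explicitly instead of relying on Encoding.Default.
    public static string ReEncode(string text, Encoding assumedSource)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        byte[] converted = Encoding.Convert(assumedSource, Encoding.UTF8, utf8Bytes);
        return Encoding.UTF8.GetString(converted);
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 stands in for a single-byte default code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        Console.WriteLine(ReEncode("PDF", latin1) == "PDF");       // True: ASCII survives
        Console.WriteLine(ReEncode("\u06CC", latin1) == "\u06CC"); // False: the letter is garbled
        Console.WriteLine(ReEncode("\u06CC", latin1).Length);      // 2: one character became two
    }
}
```

ASCII text passes through unchanged because its UTF-8 bytes are the same single bytes in ISO-8859-1, which is why the bug only shows up with Persian or Arabic content.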

As a side note, it is entirely possible that whatever creates the string is doing it wrong, which is not that rare. But you need to fix that problem before it becomes a string, at the byte level.
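As a rough illustration of fixing things at the byte level: the encoding is chosen once, at the point where raw bytes become a string. The byte array and the Decode helper below are hypothetical and stand in for whatever actually produces the data.

```csharp
using System;
using System.Text;

public static class DecodeOnceDemo
{
    // Decode raw bytes exactly once, with the encoding they were written in.
    public static string Decode(byte[] raw, Encoding actualEncoding)
    {
        return actualEncoding.GetString(raw);
    }

    public static void Main()
    {
        // Hypothetical input: the UTF-8 bytes of the Persian letter U+06CC.
        byte[] raw = { 0xDB, 0x8C };

        string right = Decode(raw, Encoding.UTF8);
        string wrong = Decode(raw, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(right == "\u06CC"); // True: decoded with the correct encoding
        Console.WriteLine(wrong.Length);      // 2: decoded with the wrong one, already garbled
    }
}
```

If the wrong encoding is used at this step, the place to repair it is here, where the bytes are still available, not in a later conversion of the resulting string.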

EDIT

The code should be the same as yours, except that one line must be deleted. Also, no matter what you use to display the text, make sure it supports Unicode. Also, as @kuujinbo said, make sure you are using the latest version of iTextSharp. I tested this with 5.2.0.0.

    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

EDIT 2

The above code fixes the encoding problem, but it does not fix the order of the strings themselves. Unfortunately, this problem appears to be at the PDF level itself.

Consequently, showing text in such right-to-left writing systems requires either positioning each glyph individually (which is tedious and costly) or representing text with show strings (see 9.2, "Organization and Use of Fonts") whose character codes are given in reverse order.

PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings

When reordering strings as described above, the content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I have looked at and generated do not seem to actually do this. I could absolutely be wrong on this part because this is very much not my specialty, so you will have to dig further.
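If a particular PDF does turn out to store a purely right-to-left line in reverse order, one naive workaround is to reverse the extracted line by text element. This is only a sketch under that assumption: ReverseTextElements is a hypothetical helper, it is not a bidirectional text algorithm, and it would wrongly reverse embedded numbers or Latin words.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ReverseOrderDemo
{
    // Reverses a string by text element, so that a base character
    // keeps any combining marks attached to it.
    public static string ReverseTextElements(string line)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(line);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        elements.Reverse();
        return string.Concat(elements);
    }

    public static void Main()
    {
        // Hypothetical extracted text: U+0633 U+0644 U+0627 U+0645 stored in reverse.
        string extracted = "\u0645\u0627\u0644\u0633";
        Console.WriteLine(ReverseTextElements(extracted) == "\u0633\u0644\u0627\u0645"); // True
    }
}
```

Whether a given line needs this at all depends on how the PDF was produced, so it would have to be checked per document.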
