Strange spaces when parsing PDF

I need to parse a PDF document. I have already implemented the parser using the iText library, and so far it has worked without problems.

But now I need to parse another document, and the extracted text has very strange spaces in the middle of words. As an example, I get:

Vo rber eitung auf die Motorr adsaison . Viele Motorr adf ahr er

The fragments of each word should be joined, but somehow the PDF parser adds spaces inside the words. When I copy and paste the contents of the PDF into a text file, I do not get these spaces.

At first I thought this was caused by the PDF parsing library I use, but I get the same problem with a different library.

I looked at the singleSpaceWidth of the parsed words and noticed that it always changes where a space is added. I tried to reassemble the words manually, but since there is no real pattern for recombining them, that is almost impossible.

Does anyone have a similar problem or even a solution to this problem?

As requested, here is more information:

Analysis using SemTextExtractionStrategy:

    PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);
    SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // Set the page number on the strategy. Is used in the Parsing strategies.
        semTextExtractionStrategy.pageNumber = i;
        // Parse text from page
        PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy);
    }

Here is the SemTextExtractionStrategy method that actually parses the text. There I manually add a space after each parsed chunk, but somehow the words are already split up when they are detected:

    @Override
    public void parseText(TextRenderInfo renderInfo, int pageNumber) {
        this.pageNumber = pageNumber;
        String text = renderInfo.getText();
        currTextBlock.getText().append(text + " ");
        ....
    }

Here is the entire SemTextExtractionStrategy class; it only calls the method shown above (parseText):

    public class SemTextExtractionStrategy implements TextExtractionStrategy {

        // Text Extraction Strategies
        public ColumnDetecter columnDetecter = new ColumnDetecter();

        // Image Extraction Strategies
        public ImageRetriever imageRetriever = new ImageRetriever();

        public int pageNumber = -1;

        public ArrayList<TextParsingStrategy> textParsingStrategies = new ArrayList<TextParsingStrategy>();
        public ArrayList<ImageParsingStrategy> imageParsingStrategies = new ArrayList<ImageParsingStrategy>();

        public SemTextExtractionStrategy() {
            // Add all text parsing strategies which are later on applied on the extracted text
            // textParsingStrategies.add(fontSizeMatcher);
            textParsingStrategies.add(columnDetecter);

            // Add all image parsing strategies which are later on applied on the extracted text
            imageParsingStrategies.add(imageRetriever);
        }

        @Override
        public void beginTextBlock() {
        }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            // TEXT PARSING
            for (TextParsingStrategy strategy : textParsingStrategies) {
                strategy.parseText(renderInfo, pageNumber);
            }
        }

        @Override
        public void endTextBlock() {
        }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) {
            for (ImageParsingStrategy strategy : imageParsingStrategies) {
                strategy.parseImage(renderInfo);
            }
        }
    }
java pdf whitespace pdf-parsing itext
3 answers

I processed this PDF file with the following Ghostscript command:

    gs -o out.pdf -q -sDEVICE=pdfwrite -dOptimize=false -dUseFlateCompression=false -dCompressPages=false -dCompressFonts=false whitespacesProblem.pdf

This command created an out.pdf file whose streams are not compressed, so it is easier to read. The interesting part is on line 52, which I have wrapped across several lines here:

 [ (&;&)-287.988 (672744)29.9906 (+\(%)30.01 (+!4)29.9876 (&4)-287.989 (%4)30.0039 (&1&8)-287.975 (3=\)!)-288.021 (*&4)30.0212 (&=23)-287.996 (+1%)-287.99 (\(=&)-288.011 (8&1&)-287.974 (672744)29.9906 (+\(3+=378$)-250.977 (#7\)!) ]TJ 

Between parentheses are text characters. I changed some of them and looked at the rendering of the PDF file to see which character the glyph represents. Then I decoded the text:

 [ (ele)-287.988 (Motorr)29.9906 *** (adf)30.01 *** (ahr)29.9876 *** (er)-287.989 (fr)30.0039 (euen)-287.975 (sich)-288.021 ... ] 

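So there are position adjustments between the groups of characters. In a TJ array, a negative number (here about -288) moves the following text further away, which is a real word gap, while a positive number (here about 30) pulls it slightly closer. In your case the small values are probably font kerning. The question is how your PDF library interprets these adjustments, and it seems to me that even these "negative spaces" of about 30 end up as a space in the resulting string.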
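To illustrate the point, here is a minimal Java sketch that joins the decoded pieces from the array above and inserts a space only where the adjustment is strongly negative. It is not how iText or any other library works internally; the class name TjSpacing, the Piece type, and the cutoff of -200 are assumptions picked to fit the numbers in this sample.

```java
import java.util.List;

class TjSpacing {

    /** One decoded string of a TJ array plus the number that follows it (hypothetical type). */
    static final class Piece {
        final String text;
        final double adjustment;

        Piece(String text, double adjustment) {
            this.text = text;
            this.adjustment = adjustment;
        }
    }

    // Assumed cutoff: adjustments below this are treated as word gaps.
    // In this sample the word gaps are about -288 and the kerning values about +30.
    static final double WORD_GAP_THRESHOLD = -200.0;

    /** Concatenates the pieces, adding a space only after a strongly negative adjustment. */
    static String join(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pieces.size(); i++) {
            Piece p = pieces.get(i);
            sb.append(p.text);
            boolean isLast = (i == pieces.size() - 1);
            if (!isLast && p.adjustment < WORD_GAP_THRESHOLD) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Piece> pieces = List.of(
                new Piece("ele", -287.988),
                new Piece("Motorr", 29.9906),
                new Piece("adf", 30.01),
                new Piece("ahr", 29.9876),
                new Piece("er", -287.989),
                new Piece("fr", 30.0039),
                new Piece("euen", -287.975),
                new Piece("sich", -288.021));
        System.out.println(join(pieces)); // prints "ele Motorradfahrer freuen sich"
    }
}
```

A fixed cutoff only works for this one font and size; a real extractor would have to compare the adjustment with the width of the font's space character.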


Spurious spaces in PDF text extraction are a known issue, as described in Roland's answer and in the first comment on https://issues.apache.org/jira/browse/TIKA-724

The answer that also worked for me is the one given by huuhungus at https://github.com/smalot/pdfparser/issues/72

It is specific to PDFParser: if you know you will run into this problem, change the code in PDFParser that actually adds the extra space.

In src/Smalot/PdfParser/Object.php, comment out this line:

  $text .= ' '; 

This does not fix it completely, but the result is acceptable.

Other libraries may also have similar interim fixes, so in some cases they may help with this problem.
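For the iText strategy in the question, a comparable change might be to stop appending a space after every chunk and to add one only when the horizontal gap to the previous chunk is large enough. The following is a minimal sketch without any iText dependency: Chunk is a hypothetical stand-in for the values a TextRenderInfo provides (in iText 5 these would presumably come from getBaseline().getStartPoint(), getBaseline().getEndPoint() and getSingleSpaceWidth()), and the half-space-width threshold is an assumption. It also assumes all chunks lie on the same line and run left to right.

```java
import java.util.List;

class GapBasedJoiner {

    /** Hypothetical stand-in for one rendered text chunk and its horizontal extent. */
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;
        final float singleSpaceWidth;

        Chunk(String text, float startX, float endX, float singleSpaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.singleSpaceWidth = singleSpaceWidth;
        }
    }

    /** Joins chunks, adding a space only if the gap exceeds half a space width (assumed rule). */
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        for (Chunk current : chunks) {
            if (previous != null) {
                float gap = current.startX - previous.endX;
                if (gap > current.singleSpaceWidth / 2f) {
                    sb.append(' ');
                }
            }
            sb.append(current.text);
            previous = current;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: kerned chunks overlap slightly, the last one follows a real gap.
        List<Chunk> chunks = List.of(
                new Chunk("Motorr", 0f, 30f, 3f),
                new Chunk("adf", 29.7f, 45f, 3f),
                new Chunk("ahr", 44.7f, 60f, 3f),
                new Chunk("er", 59.7f, 70f, 3f),
                new Chunk("fr", 72.9f, 80f, 3f));
        System.out.println(join(chunks));
    }
}
```

With these made-up coordinates the kerned fragments are joined into one word and only the real gap before the last chunk produces a space.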


Since your document is laid out in columns, the error is most likely inside the SemTextExtractionStrategy class. I suspect that the ColumnDetecter class is to blame, not iText. I can only guess that it detects columns by their width and then extracts the text based on that.

If you only need the text, the implementation could be simpler, depending on the size of the columns.
