I need to parse a PDF document. I already implemented the parser using the iText library, and so far it has worked without problems.
But now I need to parse another document, which produces very strange gaps in the middle of words. As an example, I get:
Vo rber eitung auf die Motorr adsaison . Viele Motorr adf ahr er
The fragments should be joined into whole words (Vorbereitung, Motorradsaison, Motorradfahrer), but somehow the PDF parser adds spaces inside the words. When I copy and paste the contents of the PDF into a text file, I do not get these spaces.
At first I thought this was caused by the PDF parsing library I use, but I get the same problem with a different library.
I looked at the singleSpaceWidth of the parsed words, and I noticed that it always changes at the point where a space is added. I tried to reassemble the words manually, but since there is no real pattern for recombining them, that is almost impossible.
Has anyone had a similar problem, or even found a solution to it?
As requested, here is more information:
Parsing with SemTextExtractionStrategy:
    PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);
    SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        ....
    }
Here is the SemTextExtractionStrategy method that actually parses the text. In it, I manually add a space after each parsed word, but somehow the words already arrive split during extraction:
    @Override
    public void parseText(TextRenderInfo renderInfo, int pageNumber) {
        this.pageNumber = pageNumber;
        String text = renderInfo.getText();
        currTextBlock.getText().append(text + " ");
        ....
    }
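One thing that might be worth trying, instead of appending a space after every chunk, is to add the space only when the horizontal gap to the previous chunk is large enough. The following is only a rough sketch of that idea, not a tested fix for this PDF. `GapBasedJoiner`, `Chunk`, and `GAP_FACTOR` are made-up names and do not exist in iText, and the code assumes the chunks come in reading order on a single line. In iText 5 the positions would presumably come from `renderInfo.getBaseline().getStartPoint()` and `getEndPoint()`, and the space width from `renderInfo.getSingleSpaceWidth()`.

```java
import java.util.List;

final class GapBasedJoiner {

    /** Hypothetical holder for one extracted text chunk (not an iText type). */
    static final class Chunk {
        final String text;
        final float startX;      // x coordinate where the chunk begins
        final float endX;        // x coordinate where the chunk ends
        final float spaceWidth;  // width of a single space in the chunk's font

        Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    // Assumption: a gap wider than half a space counts as a word break.
    // This threshold is a guess and may need tuning for a given document.
    static final float GAP_FACTOR = 0.5f;

    /** Joins chunks of one line, inserting a space only across large gaps. */
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk current : chunks) {
            if (prev != null && needsSpace(prev, current)) {
                sb.append(' ');
            }
            sb.append(current.text);
            prev = current;
        }
        return sb.toString();
    }

    /** Decides whether a space belongs between two neighbouring chunks. */
    static boolean needsSpace(Chunk prev, Chunk next) {
        // If either side already carries whitespace, do not add more.
        if (prev.text.endsWith(" ") || next.text.startsWith(" ")) {
            return false;
        }
        float gap = next.startX - prev.endX;
        return gap > prev.spaceWidth * GAP_FACTOR;
    }
}
```

With something like this, `parseText` would collect the chunk positions and call `join` once per line instead of appending `text + " "` directly. Line breaks and vertical offsets are not handled in this sketch.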
Here is the entire SemTextExtractionStrategy class, but it only calls the method shown above (parseText):
public class SemTextExtractionStrategy implements TextExtractionStrategy {