iOS - distinguish between background text (watermark) and real text in a PDF

I have a PDF with a watermark in its background. When I scan the page to highlight the word under the touch point and a watermark or annotation lies behind that word, the watermark text is selected instead, because it is found first in the touch area.

I use CGPDFScanner to scan the text.

My question is how to detect whether the scanned text is background text or real text in the PDF. How can I distinguish between standard text and annotation text?

Thanks.

1 answer

In general, you have no chance of reliably differentiating between "background" and "real" text. The text is drawn somewhere on the page in some order, and what counts as foreground, background, normal text, and so on is a matter of human perception and may not be reflected at all in the structure of the PDF content stream.

You can try some educated guesses, for example assuming that "real" text has strong colors while background text has lighter colors, or that "real" text runs in horizontal lines while background text is often diagonal. But these are heuristics that you cannot rely on.

On the other hand, in the case of tagged PDF files you may have a chance: the watermark may be marked as an artifact.
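As a rough sketch of how the artifact check could look: in a tagged PDF, artifacts are normally wrapped in a marked-content sequence that opens with `BMC` or `BDC` using the tag `Artifact` and closes with `EMC`, and such sequences can nest. The struct and function names below are hypothetical, not part of any Apple API. On iOS, the two handlers would presumably be called from operator callbacks registered for `BMC`, `BDC` and `EMC`, with the tag name popped from the scanner (without its leading slash).

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical bookkeeping for marked-content nesting while scanning a page. */
typedef struct {
    int depth;          /* current marked-content nesting depth */
    int artifact_depth; /* depth at which the outermost Artifact began, 0 if none */
} MarkedContentState;

/* Call when the scanner meets BMC or BDC; tag is the name without the slash. */
static void begin_marked_content(MarkedContentState *s, const char *tag) {
    s->depth++;
    if (s->artifact_depth == 0 && strcmp(tag, "Artifact") == 0) {
        s->artifact_depth = s->depth;
    }
}

/* Call when the scanner meets EMC. */
static void end_marked_content(MarkedContentState *s) {
    if (s->depth == 0) {
        return; /* unbalanced EMC in a damaged stream: ignore it */
    }
    if (s->artifact_depth == s->depth) {
        s->artifact_depth = 0; /* the Artifact sequence itself is closing */
    }
    s->depth--;
}

/* Text shown while this returns true belongs to an artifact, e.g. a watermark. */
static bool inside_artifact(const MarkedContentState *s) {
    return s->artifact_depth != 0;
}
```

Text found while `inside_artifact` is true could then be skipped when looking for the word under the touch point. This only helps if the producer of the PDF actually tagged the watermark.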

PS: I just saw that you shared your file again. In the case of your document, the heuristics I mentioned will work: the background text is grayish and is printed diagonally.

Thus, during the scan you need to track the fill color and/or the transformation matrix. As soon as the scanner finds text, you know whether it is background or foreground based on the current color value and/or matrix.

Keep in mind that it will not be this easy with every document.
