In general, there is no reliable way to differentiate between "background" and "real" text. The text is drawn somewhere on the page in some order, and what counts as foreground, background, normal text, and so on is a matter of human perception and may not be reflected at all in the structure of the PDF content stream.
You can try some educated guesses, for example assuming that the "real" text has strong colors while the background text is in lighter colors, or that the "real" text runs in horizontal lines while the background text is often diagonal. But this is only a heuristic that you can't rely on for sure.
On the other hand, with tagged PDF files you may have a chance: the watermark may be marked as an artifact.
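In a content stream, artifacts are wrapped in a marked-content sequence that starts with `/Artifact ... BMC` or `/Artifact ... BDC` and ends with `EMC`. A minimal sketch of tracking that nesting could look like the following. It assumes the content stream has already been tokenized into `(operator, operands)` pairs by whatever PDF library you use; the function name `flag_artifact_text` and the sample data are hypothetical, made up for illustration.

```python
# Operators that actually paint text on the page.
TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}


def flag_artifact_text(ops):
    """Return (operands, is_artifact) for every text-showing operator.

    `ops` is an assumed, pre-tokenized list of (operator, operands)
    pairs; producing it is left to your PDF parser.
    """
    tag_stack = []  # tags of the currently open marked-content sequences
    result = []
    for operator, operands in ops:
        if operator in ("BMC", "BDC"):
            # The first operand of BMC/BDC is the tag, e.g. /Artifact or /P.
            tag_stack.append(operands[0])
        elif operator == "EMC":
            if tag_stack:
                tag_stack.pop()
        elif operator in TEXT_SHOW_OPS:
            # Text is an artifact if any enclosing sequence is tagged /Artifact.
            result.append((operands, "/Artifact" in tag_stack))
    return result


# Hypothetical token list: a watermark followed by a normal paragraph.
sample = [
    ("BDC", ["/Artifact", "<< /Subtype /Watermark >>"]),
    ("BT", []),
    ("Tj", ["(DRAFT)"]),
    ("ET", []),
    ("EMC", []),
    ("BDC", ["/P", "<< /MCID 0 >>"]),
    ("BT", []),
    ("Tj", ["(Hello)"]),
    ("ET", []),
    ("EMC", []),
]

flags = flag_artifact_text(sample)
```

This only helps if the producer of the PDF actually tagged the watermark; many files are not tagged at all.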
PS: I just saw that you shared your file again. For your document, the heuristic I mentioned will work: the background text is grayish and is drawn diagonally.
Thus, while scanning the content stream, you need to track the fill color and/or the transformation matrix. As soon as the scanner finds text, you can tell whether it is background or foreground from the current color value and/or matrix.
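As a rough sketch of that check, the function below takes the fill color and matrix that were current when a piece of text was found and guesses whether it is background. Everything here is an assumption for illustration: the names, the thresholds, the choice to require both signals rather than either one, and the idea that your scanner hands you an RGB fill color in the range 0 to 1 plus the effective six-number matrix `(a, b, c, d, e, f)` (text matrix combined with the current transformation matrix).

```python
import math


def rotation_degrees(matrix):
    """Rotation angle of a PDF matrix (a, b, c, d, e, f) in degrees."""
    a, b, _c, _d, _e, _f = matrix
    return math.degrees(math.atan2(b, a))


def is_grayish(rgb, tolerance=0.1, min_lightness=0.5):
    """True if the channels are nearly equal and the color is fairly light."""
    return max(rgb) - min(rgb) <= tolerance and min(rgb) >= min_lightness


def is_diagonal(matrix, angle_tolerance=5.0):
    """True if the text is rotated away from every multiple of 90 degrees."""
    remainder = abs(rotation_degrees(matrix)) % 90.0
    off_axis = min(remainder, 90.0 - remainder)
    return off_axis > angle_tolerance


def looks_like_background(fill_rgb, matrix):
    """Heuristic guess: grayish AND diagonal text is treated as background."""
    return is_grayish(fill_rgb) and is_diagonal(matrix)


# Hypothetical values: light gray text rotated by 45 degrees ...
angle = math.radians(45)
watermark_matrix = (math.cos(angle), math.sin(angle),
                    -math.sin(angle), math.cos(angle), 100, 100)
watermark_guess = looks_like_background((0.8, 0.8, 0.8), watermark_matrix)

# ... and black, unrotated body text.
body_guess = looks_like_background((0.0, 0.0, 0.0), (1, 0, 0, 1, 72, 720))
```

The thresholds would need tuning per document, and a document with light gray horizontal body text or a dark diagonal watermark would defeat this particular combination.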
Keep in mind that this will not be so easy with every document.