What is a quick and unsupervised way to check the quality of extracted PDF text?

I am working on a rather large corpus with tens of thousands of articles. I am currently using PDFBox for extraction with great success, and I am looking for a way to programmatically check each file to determine whether the extraction was reasonably successful or not. I'm thinking about running a spell check on each of them, but the languages may differ, and I don't yet know which languages I will encounter. Scoring the text with a natural-language recognizer could also be an idea.

Oh, and whatever the method, it should also play well with Java, be fast, and be relatively easy to integrate.

+4
3 answers

Try a spell check that learns automatically. It is not as scary as it sounds. Start with a large dictionary containing all the words you are likely to come across. It can cover several languages.

When scanning a PDF, allow a certain percentage of unknown words (say, 5%). If any of these words is repeated often enough (say, 5 times), add it to the dictionary. If a PDF contains more than 5% unknown words, it is very likely one that could not be processed properly.

The scanner will learn over time, allowing you to lower the allowed percentage of unknown words if necessary. If that is too much effort, a sufficiently large dictionary on its own should also work well.

If you do not have a dictionary, manually verify a few documents and use them to train the scanner. After a dozen files or so, your new dictionary should be large enough to give reasonable results.
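A minimal Java sketch of this dictionary-learning check might look like the following. The class name, the 5% and 5-occurrence thresholds, and the seed-dictionary constructor are just illustrative assumptions, not a specific library:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the self-learning dictionary check described above.
// Thresholds (5% unknown words allowed, promotion after 5 occurrences)
// are arbitrary illustrative values.
public class ExtractionQualityCheck {

    private final Set<String> dictionary = new HashSet<>();
    private final Map<String, Integer> unknownCounts = new HashMap<>();

    public ExtractionQualityCheck(Set<String> seedDictionary) {
        dictionary.addAll(seedDictionary);
    }

    // Returns true if the extracted text looks acceptable.
    public boolean looksOk(String extractedText) {
        String[] words = extractedText.toLowerCase().split("[^\\p{L}]+");
        int total = 0;
        int unknown = 0;
        for (String word : words) {
            if (word.isEmpty()) continue;
            total++;
            if (!dictionary.contains(word)) {
                unknown++;
                // Promote frequently repeated unknown words: they are
                // probably legitimate vocabulary we simply did not know yet.
                int count = unknownCounts.merge(word, 1, Integer::sum);
                if (count >= 5) {
                    dictionary.add(word);
                    unknownCounts.remove(word);
                }
            }
        }
        // An empty extraction is a failure; otherwise allow up to 5% unknown words.
        return total > 0 && (double) unknown / total <= 0.05;
    }
}
```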

+2

You could simply run the corpus against a list of stop words (the most common words, which search engines ignore, such as "and" and "the"), but then you obviously need stop-word lists for all possible/probable languages first.
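As an illustration of this idea, a small sketch in Java follows. The tiny stop-word lists and the minHits threshold are placeholders; real lists would be loaded from files, one per expected language:

```java
import java.util.List;
import java.util.Map;

// Sketch of the stop-word idea: if a text contains hardly any of the
// most common words of *any* expected language, the extraction is suspect.
public class StopWordCheck {

    // Tiny illustrative samples only.
    private static final Map<String, List<String>> STOP_WORDS = Map.of(
            "en", List.of("the", "and", "of", "to", "in", "is"),
            "de", List.of("der", "die", "und", "das", "ist", "ein"),
            "fr", List.of("le", "la", "et", "les", "des", "est"));

    // Returns true if at least minHits stop words of some language occur.
    public static boolean looksLikeNaturalText(String text, int minHits) {
        String lower = " " + text.toLowerCase().replaceAll("\\s+", " ") + " ";
        for (List<String> words : STOP_WORDS.values()) {
            int hits = 0;
            for (String w : words) {
                if (lower.contains(" " + w + " ")) {
                    hits++;
                }
            }
            if (hits >= minHits) {
                return true;
            }
        }
        return false;
    }
}
```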

+2

Of course, no method would be perfect.

There are usually two classes of problems with extracted text:

1 - Nothing is extracted. This may be because you have a scanned document or because something is invalid in the PDF file.

These are usually easy to detect; you do not need complex code to check for them (a small sketch follows below).

2 - You get garbage. Most of the time this happens because the PDF file is strangely encoded. It may be a home-made encoding that is not properly declared, or perhaps the PDF author needed characters that PDF does not recognize (for example, the Turkish S with cedilla was absent from the Adobe glyph list for some time: you could not create a correctly encoded file containing it, so you had to cheat to get it to appear visually on the page).
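Checking for the first class (nothing extracted) is mostly a matter of noticing that almost no text came out relative to the page count. A minimal sketch using the PDFBox 2.x API (in 1.x the stripper lives in org.apache.pdfbox.util); the 50-characters-per-page threshold is an arbitrary illustrative value:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Sketch of the "nothing extracted" check.
public class EmptyExtractionCheck {

    public static boolean extractsAlmostNothing(File pdf) throws IOException {
        try (PDDocument document = PDDocument.load(pdf)) {
            String text = new PDFTextStripper().getText(document);
            int pages = Math.max(1, document.getNumberOfPages());
            // Scanned or broken PDFs typically yield only a handful of
            // characters (or none at all) per page.
            return text.trim().length() / pages < 50;
        }
    }
}
```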

I use an n-gram-based method to detect the language of PDF files from the extracted text (with different technologies, but the idea is the same). Files whose language could not be recognized are usually good suspects for this kind of problem...
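If you want an off-the-shelf detector for this kind of check, one option is Apache Tika's n-gram based LanguageIdentifier (1.x API, deprecated in later versions); the answer does not say which technology it actually uses, so this is just one possible illustration:

```java
import org.apache.tika.language.LanguageIdentifier;

// Sketch of the "unrecognized language == suspect file" idea, using
// Tika's n-gram based detector as one possible off-the-shelf option.
public class LanguageSanityCheck {

    public static boolean languageLooksPlausible(String extractedText) {
        LanguageIdentifier identifier = new LanguageIdentifier(extractedText);
        // If the detector cannot settle on any language with reasonable
        // confidence, the text is a good candidate for being garbage.
        return identifier.isReasonablyCertain();
    }
}
```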

As for spell checking, I suspect it will give you a ton of false positives, especially with multiple languages involved!

+1
