Of course, no method will be perfect.
Problematic text extractions usually fall into two classes:
1 - Nothing is retrieved. This usually means you have a scanned document, or something in the PDF file is invalid.
This case is easy to detect; you do not need complex code to test for it (see the sketch after this list).
2 - You get garbage, most of the time because the PDF file is strangely encoded. This can be due to an improvised encoding that is not properly declared, or because the PDF author needed characters that the standard mappings did not cover (for example, the Turkish S with cedilla was absent from the Adobe glyph list for some time: you could not create a correctly encoded file containing it, so you had to cheat to get it to show up visually on the page).
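A minimal sketch of the first check, assuming the pypdf package (any extractor that returns per-page text works the same way); the file name and the length cut-off are just illustrative values:

```python
# Flag PDFs from which little or no text can be extracted (case 1).
from pypdf import PdfReader

def extracted_text(path: str) -> str:
    """Concatenate the text of every page; pages with no text yield ''."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

text = extracted_text("example.pdf")   # hypothetical file name
if len(text.strip()) < 50:             # arbitrary cut-off for "nothing retrieved"
    print("little or no text extracted: likely a scanned or broken PDF")
```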
I use an ngram-based method to detect the language of PDF files from the extracted text (with different technologies, but the idea is the same). Files whose language cannot be recognized are usually good suspects for extraction problems...
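To make the idea concrete, here is a toy sketch of such a check: score the extracted text (the `text` variable from the sketch above) against per-language character-trigram profiles and flag files whose best match is weak. The reference snippets, the profile size, and the 0.2 threshold are arbitrary illustrations, not the actual setup I use.

```python
# Toy ngram-based language check: unrecognized language => suspect extraction.
from collections import Counter

def trigram_profile(sample: str, top_n: int = 300) -> set[str]:
    """Return the set of the most frequent character trigrams in `sample`."""
    sample = " ".join(sample.lower().split())   # normalise whitespace
    counts = Counter(sample[i:i + 3] for i in range(len(sample) - 2))
    return {gram for gram, _ in counts.most_common(top_n)}

# Tiny reference profiles -- in practice you would build these from
# a few kilobytes of clean text per language.
REFERENCE = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog "
                          "this is a sample of plain english text"),
    "fr": trigram_profile("le renard brun saute par dessus le chien "
                          "ceci est un exemple de texte en français"),
}

def best_language(candidate: str) -> tuple[str, float]:
    """Return (language, overlap score in [0, 1]) for the closest profile."""
    profile = trigram_profile(candidate)
    scores = {lang: len(profile & ref) / max(len(profile), 1)
              for lang, ref in REFERENCE.items()}
    lang = max(scores, key=scores.get)
    return lang, scores[lang]

lang, score = best_language(text)
if score < 0.2:   # arbitrary threshold: no language matched well
    print("language not recognized -- extraction is probably garbage")
```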
As for spell checking, I suspect it will give you a ton of false positives, especially if you have multiple languages!