I have some documents that went through OCR conversion from PDF to HTML. As a result, they ended up with a lot of random Unicode characters in places where the converter got confused (e.g. ellipses). They also legitimately contain plenty of non-English but still alphabetic characters, such as é and Russian letters.
Is there a way to write a regex that matches any alphabetic Unicode character (from the alphabet of any language)? Or one that matches only non-alphabetic characters? Either would be very helpful. I'm using Perl, if that changes anything. Thanks!
Eli