How to filter chat messages by normalizing letter shapes?

Question


I filter chat messages in a chat system, where it is desirable to restrict strings to Latin-1 English. Users tend to get creative with their input, for example:

ßòógīě§

instead of

 Boogies

In Java, there are Unicode normalization methods that can remove diacritics, but I'm more interested in normalizing letter shapes to English and the Latin-1 character set.
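For reference, the diacritic-stripping approach mentioned above might look roughly like this. It is a minimal sketch using `java.text.Normalizer`; the class and method names (`DiacriticStripper`, `stripDiacritics`) are made up for illustration. It decomposes each character (NFD) and then drops the combining marks.

```java
import java.text.Normalizer;

// Hypothetical helper, named here only for illustration.
final class DiacriticStripper {

    // Decompose to NFD so accents become separate combining marks,
    // then delete every character in Unicode category M (marks).
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00F2\u00F3g\u012B\u011B" is o-grave, o-acute, g, i-macron, e-caron.
        System.out.println(stripDiacritics("\u00F2\u00F3g\u012B\u011B")); // prints "oogie"
    }
}
```

Characters such as ß (U+00DF) and § (U+00A7) have no decomposition, so they pass through unchanged, which is exactly the gap the question is about.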

Are there any tables, libraries, or methods that can map common Unicode characters outside Latin-1 to their visually nearest forms? For instance:

 ß -> B
 § -> S
 ¥ -> Y
 ¤ -> o
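To make the idea concrete, a hand-rolled version of such a table could be sketched as below. This is an assumption-laden illustration, not an existing library: the class name `LookalikeFolder`, the method `fold`, and the table itself are hypothetical, and the table contains only the four pairs listed above. It needs Java 9+ for `Map.of` and only handles characters in the Basic Multilingual Plane.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical helper: folds look-alike characters onto plain letters.
final class LookalikeFolder {

    // Hand-picked table with only the four pairs from the question.
    // A real table would have to be far larger.
    private static final Map<Character, Character> LOOKALIKES = Map.of(
            '\u00DF', 'B',   // ß
            '\u00A7', 'S',   // §
            '\u00A5', 'Y',   // ¥
            '\u00A4', 'o');  // ¤

    static String fold(String input) {
        // First strip diacritics: decompose, then drop combining marks.
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Then replace each remaining character that appears in the table.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            char mapped = LOOKALIKES.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example string from the question, written with escapes.
        System.out.println(fold("\u00DF\u00F2\u00F3g\u012B\u011B\u00A7")); // prints "BoogieS"
    }
}
```

Because § maps to an uppercase S here, the result is "BoogieS" rather than "Boogies", so any word-list comparison afterwards would presumably need to be case-insensitive.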

I suspect the answer is "No, that would be too big, just filter them all out," but I can hope ...


java text unicode character-encoding filtering

izb Oct 11 '10 at 9:10


2 answers

aioobe · Answer 1 · 2010-10-11T09:16:11+0000

I think your best option is to use an OCR (Optical Character Recognition) engine. After all, this is exactly what you need: something that interprets the glyphs as readable A-Z characters as well as possible. (Remember to render the chat messages to an image using the same font as in your chat client.)

Two Java-OCR libraries:

Michael Borgwardt · Answer 2 · 2010-10-11T09:19:20+0000

The right decision is not to install idiotic "profanity filters" (which, I believe, is what is behind this request). If the community cannot behave itself in this regard, moderate it manually and ban offenders, or shut it down. Having to deal with the Scunthorpe problem offends your users far more than some cursing kids do.
