I want to remove accents, and diacritical marks more generally, from a string so that I can run an accent-insensitive search. Based on some reading about Unicode character categories, I came up with the following:
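    #include <QString>

    QString unaccent(const QString s)
    {
        // Decompose each character into its base character plus combining marks.
        QString s2 = s.normalized(QString::NormalizationForm_D);
        QString out;
        for (int i = 0, j = s2.length(); i < j; i++) {
            // Strip diacritical marks: keep everything that is not a combining mark.
            if (s2.at(i).category() != QChar::Mark_NonSpacing &&
                s2.at(i).category() != QChar::Mark_SpacingCombining) {
                out.append(s2.at(i));
            }
        }
        return out;
    }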
It seems to work well enough for languages written in the Latin script, but I'd like to know how well it holds up for other scripts (Arabic, Cyrillic, CJK, ...), which I can't check myself because I don't know those languages.
In particular, I would like to know:
- Which Unicode normalization form is better suited to this problem: NormalizationForm_KD or NormalizationForm_D?
- Is it enough to remove characters in the categories Mark_NonSpacing and Mark_SpacingCombining, or should more categories be included?
- Are there other improvements to the code above that would make it work as well as possible for all languages?
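
To make the filtering step easier to discuss without Qt, here is a minimal sketch of it in standard C++. It assumes the input has already been decomposed (for example by a normalization call), and it uses a hard-coded range, the Combining Diacritical Marks block U+0300..U+036F, as a simplified stand-in for a real category lookup. The names isCombiningDiacritic and stripMarks are hypothetical, and the range does not cover all of Mark_NonSpacing or Mark_SpacingCombining.

```cpp
#include <string>

// Hypothetical helper: true for code points in the Combining Diacritical
// Marks block (U+0300..U+036F). This is an assumption made for illustration,
// not a full Unicode category check like QChar::category().
bool isCombiningDiacritic(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Drops combining marks from text that is ALREADY decomposed.
// No normalization is performed here.
std::u32string stripMarks(const std::u32string& decomposed) {
    std::u32string out;
    out.reserve(decomposed.size());
    for (char32_t c : decomposed) {
        if (!isCombiningDiacritic(c)) {
            out.push_back(c);  // keep base characters and everything else
        }
    }
    return out;
}
```

With this sketch, a decomposed "e" followed by U+0301 comes out as a plain "e". A character such as U+00F8 (ø) passes through unchanged because it is a single code point with no combining mark to remove, which is the kind of case I am unsure about in the questions above.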