Removing Accents from QString

I want to remove accents, and diacritical marks more generally, from a string so that I can run an accent-insensitive search. After some reading about Unicode character categories, I came up with the following:

    #include <QString>

    QString unaccent(const QString &s)
    {
        // Decompose first so that accents become separate combining characters.
        const QString s2 = s.normalized(QString::NormalizationForm_D);
        QString out;
        for (int i = 0, j = s2.length(); i < j; i++) {
            // strip diacritic marks
            const QChar::Category cat = s2.at(i).category();
            if (cat != QChar::Mark_NonSpacing && cat != QChar::Mark_SpacingCombining) {
                out.append(s2.at(i));
            }
        }
        return out;
    }

It seems to work well enough for Latin-script languages, but I'm interested in how well it holds up for other scripts (Arabic, Cyrillic, CJK, ...), which I can't check myself because I don't know these languages.

In particular, I would like to know:

  • Which Unicode normalization form is best suited to this problem: NormalizationForm_KD or NormalizationForm_D?
  • Is it enough to remove characters in the categories Mark_NonSpacing and Mark_SpacingCombining, or should more categories be included?
  • Are there other improvements to the code above that would make it work as well as possible for all languages?
1 answer
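    #include <QRegExp>
    #include <QString>

    QString unaccent(const QString &s)
    {
        QString output(s.normalized(QString::NormalizationForm_D));
        // Keeps only ASCII letters and whitespace: combining marks are removed,
        // but so are digits, punctuation and all non-Latin characters.
        return output.replace(QRegExp("[^a-zA-Z\\s]"), "");
    }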
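
One caveat about the character class [^a-zA-Z\s] used above: it keeps only ASCII letters and whitespace, so besides combining marks it also deletes digits, punctuation and every non-Latin character. The following is a rough standard-library sketch (no Qt, so it compiles on its own) that contrasts such a whitelist with the mark filter from the question. It assumes the input is already decomposed, i.e. what NormalizationForm_D would produce. The names stripMarks, keepAsciiLetters and isCombiningMark are made up for illustration, and isCombiningMark checks only two hand-picked ranges as a stand-in for QChar::category().

```cpp
#include <string>

// Hypothetical helper: recognises only two hand-picked ranges of combining
// marks. Real code should query the Unicode general category instead, as
// QChar::category() does in the question.
bool isCombiningMark(char32_t c)
{
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks block
        || (c >= 0x064B && c <= 0x0652);  // Arabic vowel marks, fathatan..sukun
}

// Same idea as the loop in the question: drop marks, keep everything else.
// The input is assumed to be in decomposed form already.
std::u32string stripMarks(const std::u32string &decomposed)
{
    std::u32string out;
    for (char32_t c : decomposed) {
        if (!isCombiningMark(c)) {
            out.push_back(c);
        }
    }
    return out;
}

// Approximation of the [^a-zA-Z\s] whitelist: keep ASCII letters and
// ASCII whitespace only, drop every other code point.
std::u32string keepAsciiLetters(const std::u32string &decomposed)
{
    std::u32string out;
    for (char32_t c : decomposed) {
        const bool letter = (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
        const bool space = (c == U' ') || (c >= 0x09 && c <= 0x0D);
        if (letter || space) {
            out.push_back(c);
        }
    }
    return out;
}
```

For the decomposed input "cafe" + U+0301 + " 42", stripMarks returns "cafe 42" while keepAsciiLetters returns "cafe " with the digits gone. For a Cyrillic word the whitelist returns an empty string. The mark filter has its own catch there: й decomposes to и + U+0306, so stripping marks turns й into и, which is a separate letter in Russian, and whether that is acceptable depends on how loose the search is meant to be.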

Source: https://habr.com/ru/post/924614/

