Removing Accents from QString

Question


I want to remove accents, and diacritical marks more generally, from a string so that I can run accent-insensitive searches. Based on some reading about Unicode character categories, I came up with the following:

QString unaccent(const QString s)
{
    QString s2 = s.normalized(QString::NormalizationForm_D);
    QString out;
    for (int i=0,j=s2.length(); i<j; i++)
    {
        // strip diacritic marks
        if (s2.at(i).category()!=QChar::Mark_NonSpacing &&
            s2.at(i).category()!=QChar::Mark_SpacingCombining)
        {
            out.append(s2.at(i));
        }
    }
    return out;
}

It seems to work well enough for Latin-script languages, but I would like to know whether it is adequate for other writing systems (Arabic, Cyrillic, CJK, ...), which I can't check myself because I don't know those languages.

In particular, I would like to know:

Which Unicode normalization form is best suited to this problem: NormalizationForm_KD or NormalizationForm_D?
Is it enough to remove characters in the categories Mark_NonSpacing and Mark_SpacingCombining, or should more categories be included?
Are there other improvements to the above code that would make it work as well as possible for all languages?


qt unicode

Daniel Vérité Sep 05 '12 at 9:35


1 answer

Heitor · Answer 1 · 2012-10-09T18:48:29+0000

QString unaccent(const QString s)
{
    QString output(s.normalized(QString::NormalizationForm_D));
    return output.replace(QRegExp("[^a-zA-Z\\s]"), "");
}
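Note that the character class [^a-zA-Z\s] removes everything that is not an ASCII letter or whitespace. Digits, punctuation and all Arabic, Cyrillic and CJK letters are dropped along with the combining marks. For comparison, here is a minimal sketch of the "strip only the marks" step from the question, written in plain standard C++ rather than Qt. The names isCombiningMark and stripCombiningMarks are hypothetical. The hard-coded block ranges are only a rough stand-in for a real QChar::Mark_NonSpacing category lookup, and the input is assumed to be already decomposed (as QString::normalized would produce).

```cpp
#include <string>

// Hypothetical helper: true for code points in the main Unicode
// combining-mark blocks. This only approximates the Mark_NonSpacing
// category; it is not a full general-category lookup.
bool isCombiningMark(char32_t c) {
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x1AB0 && c <= 0x1AFF)   // Combining Diacritical Marks Extended
        || (c >= 0x1DC0 && c <= 0x1DFF)   // Combining Diacritical Marks Supplement
        || (c >= 0x20D0 && c <= 0x20FF)   // Combining Diacritical Marks for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);  // Combining Half Marks
}

// Hypothetical helper: copies the input, skipping combining marks.
// Assumes the caller has already decomposed the text (NFD or NFKD);
// precomposed characters such as U+00E9 pass through unchanged.
std::u32string stripCombiningMarks(const std::u32string& decomposed) {
    std::u32string out;
    out.reserve(decomposed.size());
    for (char32_t c : decomposed) {
        if (!isCombiningMark(c)) {
            out.push_back(c);
        }
    }
    return out;
}
```

With this approach, the decomposed form of "été 2012" becomes "ete 2012", and the digits survive. The same filter also turns Cyrillic й (U+0438 followed by U+0306 once decomposed) into и, which is a different letter in Russian, so whether stripping marks is appropriate outside Latin scripts depends on the language.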
