Slugify and character transliteration in C#

I am trying to translate the following slugify method from PHP to C#: http://snipplr.com/view/22741/slugify-a-string-in-php/

Edit: For convenience, here is the code above:

    /**
     * Modifies a string to remove all non ASCII characters and spaces.
     */
    static public function slugify($text)
    {
        // replace non letter or digits by -
        $text = preg_replace('~[^\\pL\d]+~u', '-', $text);

        // trim
        $text = trim($text, '-');

        // transliterate
        if (function_exists('iconv'))
        {
            $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
        }

        // lowercase
        $text = strtolower($text);

        // remove unwanted characters
        $text = preg_replace('~[^-\w]+~', '', $text);

        if (empty($text))
        {
            return 'n-a';
        }

        return $text;
    }

I had no problem coding the rest, except that I cannot find the C# equivalent of the following line of PHP code:

 $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text); 

Edit: The purpose of this is to transliterate non-ASCII characters, for example to turn "Reformáció Genfi Emlékműve Előtt" into "reformacio-genfi-emlekmuve-elott".

+12
c# internationalization slug transliteration
Jan 31 '10 at 23:18
3 answers

I would also like to add that //TRANSLIT removes apostrophes, which @jxac's solution does not address. I'm not sure why, but if you first encode the string as Cyrillic and then decode it as ASCII, you get the same behavior as //TRANSLIT.

    var str = "éåäöíØ";
    var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));
    // => "eaaoiO"
+11
Jan 31 '10 at 23:49

There is a .NET library for transliteration on CodePlex: unidecode. It usually does the trick, using the Unidecode tables ported from Python.

+8
Jul 15 '10 at 13:18

conversion to string:

    byte[] unicodeBytes = Encoding.Unicode.GetBytes(str);
    byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
    string asciiString = Encoding.ASCII.GetString(asciiBytes);

conversion to bytes:

 byte[] ascii = Encoding.ASCII.GetBytes(str); 

@Thomas Levesque is right, the string will get encoded by the output stream ...

to remove diacritics (accent marks), you can use the String.Normalize function, as described here:

http://www.siao2.com/2007/05/14/2629747.aspx

which should take care of most cases (where the glyph is really a base character plus an accent mark). for a more aggressive char mapping (to take care of cases like the Scandinavian slashed o [Ø], digraphs and other exotic glyphs), there is a table-based approach:

http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx

this includes about 1000 character mappings in addition to normalization.
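
a minimal sketch of how the two ideas above can be combined, assuming nothing beyond the standard library: decompose with String.Normalize, drop the combining marks, then send whatever did not decompose through a lookup table. the class name, the method name and the five table entries are made up for illustration; this is not the table from the linked CodeProject article.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Illustrative entries only (an assumption, not the CodeProject table).
    private static readonly Dictionary<char, string> Fallback = new Dictionary<char, string>
    {
        { '\u00D8', "O" },  // Ø
        { '\u00F8', "o" },  // ø
        { '\u00C6', "AE" }, // Æ
        { '\u00E6', "ae" }, // æ
        { '\u00DF', "ss" }, // ß
    };

    public static string Fold(string text)
    {
        // FormD splits e.g. "é" into "e" followed by a combining acute accent.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks left behind by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Characters that have no decomposition go through the lookup table.
            if (Fallback.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

anything that neither decomposes nor appears in the table passes through unchanged, so a later ASCII-only filter is still needed if the result must be pure ASCII.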

(note that all punctuation is removed by the regular expression replace in your example)
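
for reference, a rough port of the PHP function from the question might look like the sketch below. it is not an exact equivalent: String.Normalize stands in for iconv's //TRANSLIT, and the last regex uses an explicit ASCII class because \w in .NET also matches non-ASCII letters. the class name SlugHelper is hypothetical.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class SlugHelper
{
    public static string Slugify(string text)
    {
        // Replace every run of characters that are not letters or digits with "-".
        text = Regex.Replace(text, @"[^\p{L}\d]+", "-");

        // Trim leading and trailing dashes.
        text = text.Trim('-');

        // Transliterate (assumed stand-in for //TRANSLIT): decompose, then drop combining marks.
        var sb = new StringBuilder(text.Length);
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Lowercase.
        text = sb.ToString().ToLowerInvariant();

        // Remove anything that is still not an ASCII letter, digit, underscore or dash.
        text = Regex.Replace(text, @"[^a-z0-9_-]+", "");

        return text.Length == 0 ? "n-a" : text;
    }
}
```

with this sketch, characters that have no decomposition (such as Ø) are dropped by the final replace rather than mapped, so a mapping table would still be needed for those.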

+1
Jan 31 '10 at 23:36