Slugify URL Algorithm in C #?

So, I searched and looked at the slug tag on SO and found only two convincing solutions:

  • Service transliteration in C #
  • How to convert super or index to plain text in C #

This is only a partial solution to the problem. I could have manually done this myself, but I am surprised that there is no solution yet.

So, is there an implementation of slugify alrogithm in C # and / or .NET that properly addresses Latin characters, unicode, and various other language problems?

+65
c # slug
May 27 '10 at 11:44
source share
5 answers

http://predicatet.blogspot.com/2009/04/improved-c-slug-generator-or-how-to.html

public static string GenerateSlug(this string phrase) { string str = phrase.RemoveAccent().ToLower(); // invalid chars str = Regex.Replace(str, @"[^a-z0-9\s-]", ""); // convert multiple spaces into one space str = Regex.Replace(str, @"\s+", " ").Trim(); // cut and trim str = str.Substring(0, str.Length <= 45 ? str.Length : 45).Trim(); str = Regex.Replace(str, @"\s", "-"); // hyphens return str; } public static string RemoveAccent(this string txt) { byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt); return System.Text.Encoding.ASCII.GetString(bytes); } 
+128
May 27 '10 at 12:39 a.m.
source share

Here you will find a way to generate a URL in C # . This function removes all accents (Marcel’s answer), replaces spaces, removes invalid characters, trims dashes from the end and replaces double occurrences of β€œ-” or β€œ_”

the code:

 public static string ToUrlSlug(string value){ //First to lower case value = value.ToLowerInvariant(); //Remove all accents var bytes = Encoding.GetEncoding("Cyrillic").GetBytes(value); value = Encoding.ASCII.GetString(bytes); //Replace spaces value = Regex.Replace(value, @"\s", "-", RegexOptions.Compiled); //Remove invalid chars value = Regex.Replace(value, @"[^a-z0-9\s-_]", "",RegexOptions.Compiled); //Trim dashes from end value = value.Trim('-', '_'); //Replace double occurences of - or _ value = Regex.Replace(value, @"([-_]){2,}", "$1", RegexOptions.Compiled); return value ; } 
+16
Jan 26 '13 at 16:15
source share

Here is my performance, based on the answers of Joan and Marcel. The changes I made are as follows:

  • Use the generally accepted method to remove accents.
  • Explicit Regex caching to improve speed.
  • Other word separators are recognized and normalized to hyphens.

Here is the code:

 public class UrlSlugger { // white space, em-dash, en-dash, underscore static readonly Regex WordDelimiters = new Regex(@"[\s—–_]", RegexOptions.Compiled); // characters that are not valid static readonly Regex InvalidChars = new Regex(@"[^a-z0-9\-]", RegexOptions.Compiled); // multiple hyphens static readonly Regex MultipleHyphens = new Regex(@"-{2,}", RegexOptions.Compiled); public static string ToUrlSlug(string value) { // convert to lower case value = value.ToLowerInvariant(); // remove diacritics (accents) value = RemoveDiacritics(value); // ensure all word delimiters are hyphens value = WordDelimiters.Replace(value, "-"); // strip out invalid characters value = InvalidChars.Replace(value, ""); // replace multiple hyphens (-) with a single hyphen value = MultipleHyphens.Replace(value, "-"); // trim hyphens (-) from ends return value.Trim('-'); } /// See: http://www.siao2.com/2007/05/14/2629747.aspx private static string RemoveDiacritics(string stIn) { string stFormD = stIn.Normalize(NormalizationForm.FormD); StringBuilder sb = new StringBuilder(); for (int ich = 0; ich < stFormD.Length; ich++) { UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]); if (uc != UnicodeCategory.NonSpacingMark) { sb.Append(stFormD[ich]); } } return (sb.ToString().Normalize(NormalizationForm.FormC)); } } 



This still does not solve the problem of the non-Latin symbol. A completely alternative solution would be to use Uri.EscapeDataString to convert the string to hexadecimal:

 string original = "桋试公司"; // %E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8 string converted = Uri.EscapeDataString(original); 

Then use the data to create the hyperlink:

 <a href="http://www.example.com/100/%E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8">桋试公司</a> 

Many browsers will display Chinese characters in the address bar (see below), but based on my limited testing, it is not fully supported.

address bar with Chinese characters

NOTE. For Uri.EscapeDataString to work, iriParsing must be enabled.




EDIT

For those who want to generate Slugs URLs in C #, I recommend checking out this related question:

How does qaru generate urls optimized for SEO?

This is what I used for my project.

+11
Nov 12 '13 at 18:30
source share

One of the problems I ran into slugification (new word!) Is collisions. For example, if I have a blog post called "Stack-Overflow" and one is "Stack Overflow", the bars of these two names are the same. So my cork generator should usually use the database in some way. Perhaps that is why you do not see more general solutions.

+3
May 27 '10 at 12:40
source share

Here is my picture. It supports:

  • removal of diacritics (so we don’t just remove β€œinvalid” characters)
  • maximum length for the result (or before the removal of diacritics - "early truncation")
  • custom delimiter between normalized chunks
  • The result can be forced uppercase or lowercase.
  • custom list of supported unicode categories
  • custom list of valid character ranges
  • supports framework 2.0

the code:

 /// <summary> /// Defines a set of utilities for creating slug urls. /// </summary> public static class Slug { /// <summary> /// Creates a slug from the specified text. /// </summary> /// <param name="text">The text. If null if specified, null will be returned.</param> /// <returns> /// A slugged text. /// </returns> public static string Create(string text) { return Create(text, (SlugOptions)null); } /// <summary> /// Creates a slug from the specified text. /// </summary> /// <param name="text">The text. If null if specified, null will be returned.</param> /// <param name="options">The options. May be null.</param> /// <returns>A slugged text.</returns> public static string Create(string text, SlugOptions options) { if (text == null) return null; if (options == null) { options = new SlugOptions(); } string normalised; if (options.EarlyTruncate && options.MaximumLength > 0 && text.Length > options.MaximumLength) { normalised = text.Substring(0, options.MaximumLength).Normalize(NormalizationForm.FormD); } else { normalised = text.Normalize(NormalizationForm.FormD); } int max = options.MaximumLength > 0 ? Math.Min(normalised.Length, options.MaximumLength) : normalised.Length; StringBuilder sb = new StringBuilder(max); for (int i = 0; i < normalised.Length; i++) { char c = normalised[i]; UnicodeCategory uc = char.GetUnicodeCategory(c); if (options.AllowedUnicodeCategories.Contains(uc) && options.IsAllowed(c)) { switch (uc) { case UnicodeCategory.UppercaseLetter: if (options.ToLower) { c = options.Culture != null ? char.ToLower(c, options.Culture) : char.ToLowerInvariant(c); } sb.Append(options.Replace(c)); break; case UnicodeCategory.LowercaseLetter: if (options.ToUpper) { c = options.Culture != null ? char.ToUpper(c, options.Culture) : char.ToUpperInvariant(c); } sb.Append(options.Replace(c)); break; default: sb.Append(options.Replace(c)); break; } } else if (uc == UnicodeCategory.NonSpacingMark) { // don't add a separator } else { if (options.Separator != null && !EndsWith(sb, options.Separator)) { sb.Append(options.Separator); } } if (options.MaximumLength > 0 && sb.Length >= options.MaximumLength) break; } string result = sb.ToString(); if (options.MaximumLength > 0 && result.Length > options.MaximumLength) { result = result.Substring(0, options.MaximumLength); } if (!options.CanEndWithSeparator && options.Separator != null && result.EndsWith(options.Separator)) { result = result.Substring(0, result.Length - options.Separator.Length); } return result.Normalize(NormalizationForm.FormC); } private static bool EndsWith(StringBuilder sb, string text) { if (sb.Length < text.Length) return false; for (int i = 0; i < text.Length; i++) { if (sb[sb.Length - 1 - i] != text[text.Length - 1 - i]) return false; } return true; } } /// <summary> /// Defines options for the Slug utility class. /// </summary> public class SlugOptions { /// <summary> /// Defines the default maximum length. Currently equal to 80. /// </summary> public const int DefaultMaximumLength = 80; /// <summary> /// Defines the default separator. Currently equal to "-". /// </summary> public const string DefaultSeparator = "-"; private bool _toLower; private bool _toUpper; /// <summary> /// Initializes a new instance of the <see cref="SlugOptions"/> class. /// </summary> public SlugOptions() { MaximumLength = DefaultMaximumLength; Separator = DefaultSeparator; AllowedUnicodeCategories = new List<UnicodeCategory>(); AllowedUnicodeCategories.Add(UnicodeCategory.UppercaseLetter); AllowedUnicodeCategories.Add(UnicodeCategory.LowercaseLetter); AllowedUnicodeCategories.Add(UnicodeCategory.DecimalDigitNumber); AllowedRanges = new List<KeyValuePair<short, short>>(); AllowedRanges.Add(new KeyValuePair<short, short>((short)'a', (short)'z')); AllowedRanges.Add(new KeyValuePair<short, short>((short)'A', (short)'Z')); AllowedRanges.Add(new KeyValuePair<short, short>((short)'0', (short)'9')); } /// <summary> /// Gets the allowed unicode categories list. /// </summary> /// <value> /// The allowed unicode categories list. /// </value> public virtual IList<UnicodeCategory> AllowedUnicodeCategories { get; private set; } /// <summary> /// Gets the allowed ranges list. /// </summary> /// <value> /// The allowed ranges list. /// </value> public virtual IList<KeyValuePair<short, short>> AllowedRanges { get; private set; } /// <summary> /// Gets or sets the maximum length. /// </summary> /// <value> /// The maximum length. /// </value> public virtual int MaximumLength { get; set; } /// <summary> /// Gets or sets the separator. /// </summary> /// <value> /// The separator. /// </value> public virtual string Separator { get; set; } /// <summary> /// Gets or sets the culture for case conversion. /// </summary> /// <value> /// The culture. /// </value> public virtual CultureInfo Culture { get; set; } /// <summary> /// Gets or sets a value indicating whether the string can end with a separator string. /// </summary> /// <value> /// <c>true</c> if the string can end with a separator string; otherwise, <c>false</c>. /// </value> public virtual bool CanEndWithSeparator { get; set; } /// <summary> /// Gets or sets a value indicating whether the string is truncated before normalization. /// </summary> /// <value> /// <c>true</c> if the string is truncated before normalization; otherwise, <c>false</c>. /// </value> public virtual bool EarlyTruncate { get; set; } /// <summary> /// Gets or sets a value indicating whether to lowercase the resulting string. /// </summary> /// <value> /// <c>true</c> if the resulting string must be lowercased; otherwise, <c>false</c>. /// </value> public virtual bool ToLower { get { return _toLower; } set { _toLower = value; if (_toLower) { _toUpper = false; } } } /// <summary> /// Gets or sets a value indicating whether to uppercase the resulting string. /// </summary> /// <value> /// <c>true</c> if the resulting string must be uppercased; otherwise, <c>false</c>. /// </value> public virtual bool ToUpper { get { return _toUpper; } set { _toUpper = value; if (_toUpper) { _toLower = false; } } } /// <summary> /// Determines whether the specified character is allowed. /// </summary> /// <param name="character">The character.</param> /// <returns>true if the character is allowed; false otherwise.</returns> public virtual bool IsAllowed(char character) { foreach (var p in AllowedRanges) { if (character >= p.Key && character <= p.Value) return true; } return false; } /// <summary> /// Replaces the specified character by a given string. /// </summary> /// <param name="character">The character to replace.</param> /// <returns>a string.</returns> public virtual string Replace(char character) { return character.ToString(); } } 
+3
Sep 10 '15 at 6:52
source share



All Articles