.NET Regex for whitelisted characters

Consider an algorithm that should determine whether a string contains any characters outside a whitelist of allowed characters.

The whitelist is as follows:

-. abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ÇüéâäàäçêêèèÄÄÅæÆæÆööööÜÜÜðððõøýþÿ

Note: spaces and apostrophes are required to be included in this whitelist.

This would normally be a static method, but it will be converted to an extension method.

    private bool ContainsAllWhitelistedCharacters(string input)
    {
        string regExPattern = ""; // the whitelist
        return Regex.IsMatch(input, regExPattern);
    }

Edit:

Thanks to all responders for the performance comments. Performance is not a concern here; quality, readability, and maintainability are. Less code = less chance of defects, IMO.

Question:

What should be the whitelist regular expression pattern?

+4
4 answers

You can match against the whitelist using the following pattern:

 ^(['\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$ 

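Make it an extension method with

    // Must be declared inside a static class; requires using System.Text.RegularExpressions;
    public static bool IsValidCustom(this string value)
    {
        // Verbatim string (@"...") so that \- and \. reach the regex engine
        // instead of being rejected as invalid C# escape sequences.
        string regExPattern = @"^(['\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$";
        return Regex.IsMatch(value, regExPattern);
    }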

I cannot think of a simple way to write a maintainable range for the extended characters, since the order of the characters is not obvious.

+4

Why should it be a regular expression?

    private bool ContainsAllWhitelistedCharacters(string input)
    {
        string whitelist = "abcdefg...";
        foreach (char c in input)
        {
            if (whitelist.IndexOf(c) == -1)
                return false;
        }
        return true;
    }

There is no need to jump straight to regular expressions if you are not sure how to write the one you need, and you have not profiled this section of the code and found that you need the extra performance.
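The same non-regex idea can be sketched as an extension method, which is what the question ultimately asks for. The class name `WhitelistExtensions`, the method name `ContainsOnlyWhitelistedCharacters`, and the choice of a `HashSet<char>` are assumptions made for illustration; the whitelist string is taken from the regex pattern in the first answer, plus the apostrophe the question requires.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WhitelistExtensions
{
    // Assumed whitelist: apostrophe, hyphen, period, space, ASCII letters,
    // and the extended characters listed in the regex answer.
    private static readonly HashSet<char> Whitelist = new HashSet<char>(
        "'-. abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" +
        "ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ");

    public static bool ContainsOnlyWhitelistedCharacters(this string input)
    {
        // All() is true for an empty sequence, so "" counts as valid here.
        return input.All(c => Whitelist.Contains(c));
    }
}
```

With this in place, `"O'Brien".ContainsOnlyWhitelistedCharacters()` returns true and `"abc123".ContainsOnlyWhitelistedCharacters()` returns false. Unlike the `^(...)+$` pattern, this sketch accepts the empty string, so add an explicit length check if that matters.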

+5

I don’t know how the regex backend is implemented, but searching for anything other than your list, as in the following, might be the most efficient approach:

    private bool ContainsAllWhitelistedCharacters(string input)
    {
        // Negated character class: matches any character NOT in the list.
        Regex r = new Regex("[^ your list of chars ]");
        return !r.IsMatch(input);
    }
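As a rough sketch of this negated-class approach with the placeholder filled in: the character list below is copied from the first answer's pattern with the apostrophe added, and the class name `WhitelistRegex` and field name `Disallowed` are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistRegex
{
    // [^...] matches any single character that is NOT in the whitelist.
    // Built once and reused rather than constructed on every call.
    private static readonly Regex Disallowed = new Regex(
        @"[^'\-. a-zA-ZÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]");

    public static bool ContainsAllWhitelistedCharacters(string input)
    {
        // Valid when no disallowed character can be found anywhere in the input.
        return !Disallowed.IsMatch(input);
    }
}
```

Because this version looks for a single offending character, it treats the empty string as valid, whereas the anchored `^(...)+$` pattern does not.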
0

Note that I do not recommend this unless performance is an issue, but I wanted to point out that even if you precompile the regular expression, you can go quite a bit faster.

Compare:

    static readonly Regex r = new Regex(
        @"^(['\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑ" +
        "ÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$");

    public bool IsValidCustom(string value)
    {
        return r.IsMatch(value);
    }

with:

    private bool ContainsAllWhitelistedCharacters(string input)
    {
        foreach (var c in input)
        {
            switch (c)
            {
                // space, apostrophe, hyphen, period
                case '\u0020': case '\u0027': case '\u002D': case '\u002E':
                // A-Z
                case '\u0041': case '\u0042': case '\u0043': case '\u0044': case '\u0045': case '\u0046':
                case '\u0047': case '\u0048': case '\u0049': case '\u004A': case '\u004B': case '\u004C':
                case '\u004D': case '\u004E': case '\u004F': case '\u0050': case '\u0051': case '\u0052':
                case '\u0053': case '\u0054': case '\u0055': case '\u0056': case '\u0057': case '\u0058':
                case '\u0059': case '\u005A':
                // a-z
                case '\u0061': case '\u0062': case '\u0063': case '\u0064': case '\u0065': case '\u0066':
                case '\u0067': case '\u0068': case '\u0069': case '\u006A': case '\u006B': case '\u006C':
                case '\u006D': case '\u006E': case '\u006F': case '\u0070': case '\u0071': case '\u0072':
                case '\u0073': case '\u0074': case '\u0075': case '\u0076': case '\u0077': case '\u0078':
                case '\u0079': case '\u007A':
                // U+00C0 to U+00D6
                case '\u00C0': case '\u00C1': case '\u00C2': case '\u00C3': case '\u00C4': case '\u00C5':
                case '\u00C6': case '\u00C7': case '\u00C8': case '\u00C9': case '\u00CA': case '\u00CB':
                case '\u00CC': case '\u00CD': case '\u00CE': case '\u00CF': case '\u00D0': case '\u00D1':
                case '\u00D2': case '\u00D3': case '\u00D4': case '\u00D5': case '\u00D6':
                // U+00D8 to U+00DD, and U+00DF
                case '\u00D8': case '\u00D9': case '\u00DA': case '\u00DB': case '\u00DC': case '\u00DD':
                case '\u00DF':
                // U+00E0 to U+00F6
                case '\u00E0': case '\u00E1': case '\u00E2': case '\u00E3': case '\u00E4': case '\u00E5':
                case '\u00E6': case '\u00E7': case '\u00E8': case '\u00E9': case '\u00EA': case '\u00EB':
                case '\u00EC': case '\u00ED': case '\u00EE': case '\u00EF': case '\u00F0': case '\u00F1':
                case '\u00F2': case '\u00F3': case '\u00F4': case '\u00F5': case '\u00F6':
                // U+00F8 to U+00FF
                case '\u00F8': case '\u00F9': case '\u00FA': case '\u00FB': case '\u00FC': case '\u00FD':
                case '\u00FE': case '\u00FF':
                    continue;
            }
            return false;
        }
        return true; // empty string is true
    }

In very quick testing on a corpus of words with a pass rate of about 60%, I get a clear speedup with this approach.

This is actually no less readable than a regular expression without escape characters!
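A quick comparison like the one described could be set up along these lines. This is only a sketch: the class `WhitelistTiming`, the helper `TimeMs`, and the iteration count are hypothetical, and a `Stopwatch` loop is a crude measurement compared with a proper benchmarking harness.

```csharp
using System;
using System.Diagnostics;

public static class WhitelistTiming
{
    // Hypothetical helper: runs `check` over every word `iterations` times
    // and returns the total elapsed time in milliseconds.
    public static long TimeMs(Func<string, bool> check, string[] words, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            foreach (var word in words)
            {
                check(word);
            }
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

Each candidate is then passed in as a delegate over the same word list, for example `WhitelistTiming.TimeMs(w => r.IsMatch(w), words, 1000)` for the regex version and the equivalent call for the switch version, and the two totals are compared.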

0
