Surrogate pair error

I am working on a small project in F# that involves porting existing C# code to F#, and I seem to have run into a difference in how regular expressions are handled between the two languages (I am posting this in the hope of finding out that I am just doing something wrong).

This small function simply detects surrogate pairs using the regular expression trick described here. Here's the current implementation:

    open System.Text.RegularExpressions

    let isSurrogatePair input = Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]")

If I then run it against a string known to contain surrogate pairs, like this:

    let result = isSurrogatePair "𠮷野𠮷"
    printfn "%b" result

I get false in the FSI window.

If I use the equivalent C#:

    public bool IsSurrogatePair(string input)
    {
        return Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]");
    }

with the same input value, I (correctly) get true back.

Is this a genuine bug, or am I just doing something wrong in my F# implementation?

2 answers

It seems that the bug is in the way F# encodes escaped Unicode characters.
Here is the output from F# Interactive (note the last two results):

    > "\uD500".[0] |> uint16 ;;
    val it : uint16 = 54528us
    > "\uD700".[0] |> uint16 ;;
    val it : uint16 = 55040us
    > "\uD800".[0] |> uint16 ;;
    val it : uint16 = 65533us
    > "\uD900".[0] |> uint16 ;;
    val it : uint16 = 65533us

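The value 65533 is 0xFFFD, the Unicode replacement character, so the surrogate code units are being substituted rather than preserved. Fortunately, this workaround works: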

    > let s = new System.String( [| char 0xD800 |] )
    s.[0] |> uint16 ;;
    val s : System.String = " "
    val it : uint16 = 55296us

Based on this finding, I can build a corrected (or rather, worked-around) version of isSurrogatePair:

    let isSurrogatePair input =
        // Build each boundary character from its numeric value instead of a "\uXXXX" literal
        let chrToStr code = new System.String( [| char code |] )
        let regex = "[" + (chrToStr 0xD800) + "-" + (chrToStr 0xDBFF) + "][" + (chrToStr 0xDC00) + "-" + (chrToStr 0xDFFF) + "]"
        Regex.IsMatch(input, regex)

This version correctly returns true for your input.

I have just filed this issue on GitHub: https://github.com/fsharp/fsharp/issues/399


This seems to be a legitimate F# bug, no argument there. I just wanted to offer some alternative workarounds.


Don't embed the problematic characters in the string itself; specify them using the regex engine's own escape support. The regex pattern to match the Unicode code unit XXXX is \uXXXX, so just escape your backslashes or use a verbatim string:

    Regex.IsMatch(input, "[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]")
    // or
    Regex.IsMatch(input, @"[\uD800-\uDBFF][\uDC00-\uDFFF]")

Use the built-in regular expression support for Unicode blocks:

    // high surrogate followed by low surrogate
    Regex.IsMatch(input, @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}")

or for Unicode general categories:

    // 2 characters, each of which is half of a surrogate pair
    // (could give a false positive if both are, e.g., low surrogates)
    Regex.IsMatch(input, @"\p{Cs}{2}")
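
For what it's worth, here is a minimal sketch that runs the three patterns above against two inputs, so the difference between them can be seen side by side. The names patterns, sample and twoLows are my own, chosen for illustration, and the inputs are built from numeric values (via Char.ConvertFromUtf32 and the String constructor) so that no surrogate escapes appear in an F# string literal. I'd expect the first two patterns to accept only the valid pair and the \p{Cs}{2} pattern to accept both inputs, which is the false positive mentioned in the comment above.

```fsharp
open System
open System.Text.RegularExpressions

// Verbatim strings (@"...") keep the backslashes, so the regex engine,
// not the F# compiler, interprets the \uXXXX and \p{...} escapes.
let patterns =
    [ "code unit ranges", @"[\uD800-\uDBFF][\uDC00-\uDFFF]"
      "named blocks", @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}"
      "Cs category", @"\p{Cs}{2}" ]

// Illustrative inputs (assumptions, not from the original posts).
// U+20BB7 lies outside the BMP, so it is stored as one surrogate pair.
let sample = Char.ConvertFromUtf32 0x20BB7
// Two unpaired low surrogates: not a valid pair.
let twoLows = new String([| char 0xDC00; char 0xDC00 |])

for (name, pattern) in patterns do
    let onPair = Regex.IsMatch(sample, pattern)
    let onTwoLows = Regex.IsMatch(twoLows, pattern)
    printfn "%-16s valid pair: %b, two low surrogates: %b" name onPair onTwoLows
```

If a false positive on unpaired surrogates matters for your use case, the first two patterns look like the safer choice.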