Surrogate pair error

Question

I am working on a small project in F# that involves porting existing C# code to F#, and I seem to have run into a difference in how regular expressions are handled between the two languages (I am posting this in the hope of finding out that I am just doing something wrong).

This small function simply detects surrogate pairs using the regular expression trick described here. Here's the current implementation:

let isSurrogatePair input = Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]")

If I then run it against a string known to contain surrogate pairs, like this:

 let result = isSurrogatePair "𠮷野𠮷"
 printfn "%b" result

I get false in the FSI window.

If I use the equivalent C#:

 public bool IsSurrogatePair(string input)
 {
     return Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]");
 }

with the same input value, I (correctly) get true back.

Is this a real bug? Or am I just doing something wrong in my F# implementation?

regex .net unicode f# surrogate-pairs

Sven Grosen Mar 31 '15 at 2:05

2 answers

This does appear to be a legitimate F# bug, no argument there. I just wanted to offer some alternative workarounds.

Do not embed the problematic characters in the string itself; instead, specify them using the regex engine's own support for Unicode escapes. The regex pattern that matches the Unicode code unit XXXX is \uXXXX, so just escape the backslashes or use a verbatim string:

 Regex.IsMatch(input, "[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]")
 // or
 Regex.IsMatch(input, @"[\uD800-\uDBFF][\uDC00-\uDFFF]")

Use the built-in regular expression support for Unicode blocks:

 // high surrogate followed by low surrogate
 Regex.IsMatch(input, @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}")

or for Unicode categories:

 // 2 characters, each of which is half of a surrogate pair
 // (could give a false positive if both are, e.g., low surrogates)
 Regex.IsMatch(input, @"\p{Cs}{2}")
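The escape-based and block-based workarounds above can be tried side by side. The following is only a minimal sketch: the function names `isSurrogatePairEscaped` and `isSurrogatePairBlocks` and the `sample` value are made up for illustration, and the test string is built with `Char.ConvertFromUtf32` (assuming U+20BB7 for 𠮷) so that no surrogate ever appears in an F# string literal.

```fsharp
open System
open System.Text.RegularExpressions

// Verbatim string: the \uXXXX escapes reach the regex engine untouched,
// so the F# compiler never has to encode a lone surrogate.
let isSurrogatePairEscaped (input: string) =
    Regex.IsMatch(input, @"[\uD800-\uDBFF][\uDC00-\uDFFF]")

// Same check expressed with .NET's named Unicode blocks.
let isSurrogatePairBlocks (input: string) =
    Regex.IsMatch(input, @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}")

// Hypothetical sample input: U+20BB7 lies outside the BMP, so it is stored
// as a high/low surrogate pair in a .NET string.
let sample = Char.ConvertFromUtf32 0x20BB7 + "野"

printfn "%b" (isSurrogatePairEscaped sample)
printfn "%b" (isSurrogatePairBlocks sample)
printfn "%b" (isSurrogatePairEscaped "abc")
```

Both functions should agree with each other on any input, since the two named high-surrogate blocks together cover the same range as D800-DBFF.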

latkin Apr 1 '15 at 4:43

Fyodor Soikin · Accepted Answer · 2015-03-31T04:26:28+0000

It seems that the bug is in the way F# encodes Unicode escape sequences in string literals.
Here is a session from F# Interactive (note the last two results):

 > "\uD500".[0] |> uint16 ;;
 val it : uint16 = 54528us
 > "\uD700".[0] |> uint16 ;;
 val it : uint16 = 55040us
 > "\uD800".[0] |> uint16 ;;
 val it : uint16 = 65533us
 > "\uD900".[0] |> uint16 ;;
 val it : uint16 = 65533us

Fortunately, this workaround works:

 > let s = new System.String( [| char 0xD800 |] )
 s.[0] |> uint16 ;;
 val s : System.String = " "
 val it : uint16 = 55296us

Based on this finding, I can construct a corrected (or rather, worked-around) version of isSurrogatePair:

 let isSurrogatePair input =
     let chrToStr code = new System.String( [| char code |] )
     let regex = "[" + (chrToStr 0xD800) + "-" + (chrToStr 0xDBFF) + "][" + (chrToStr 0xDC00) + "-" + (chrToStr 0xDFFF) + "]"
     Regex.IsMatch(input, regex)

This version correctly returns true for your input.

I have just filed this issue on GitHub: https://github.com/fsharp/fsharp/issues/399
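If a regular expression is not strictly required, the same check can be written without any pattern string at all, which sidesteps the literal-encoding question entirely. This is a different technique from the one in the answer, shown only as a sketch: the name `containsSurrogatePair` is hypothetical, and it relies on the standard `Char.IsHighSurrogate` and `Char.IsLowSurrogate` methods.

```fsharp
open System

// Walk adjacent UTF-16 code units and look for a high surrogate
// immediately followed by a low surrogate.
let containsSurrogatePair (input: string) =
    input
    |> Seq.pairwise
    |> Seq.exists (fun (hi, lo) -> Char.IsHighSurrogate(hi) && Char.IsLowSurrogate(lo))

// Hypothetical usage: build the input from a code point (assumed U+20BB7)
// instead of typing a surrogate escape into a literal.
printfn "%b" (containsSurrogatePair (Char.ConvertFromUtf32 0x20BB7))
```

`Seq.pairwise` yields nothing for strings shorter than two characters, so empty and single-character inputs simply produce false.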
