Matching a specific Unicode char in Haskell regexp

Question


This problem occurs on Mac OS X!

I have the following string with a length of three characters:

"a\160b"

I want to match and replace the middle character.

Several approaches, such as

 ghci> :m +Text.Regex
 ghci> subRegex (mkRegex "\160") "a\160b" "X"
 "*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))
 ghci> subRegex (mkRegex "\\160") "a\160b" "X"
 "a\160b"

did not give the desired result.
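A note on the second attempt: in Haskell source, "\160" is a one-character string (code point 160, the no-break space), while "\\160" is four characters: a literal backslash followed by the digits 1, 6 and 0. The minimal sketch below, using only base, shows that difference. How the regex engine then interprets the four-character pattern is not covered here.

```haskell
import Data.Char (ord)

main :: IO ()
main = do
  -- One Char: the escape \160 denotes code point 160 (U+00A0).
  print (length "\160")      -- 1
  print (map ord "\160")     -- [160]
  -- Four Chars: an escaped backslash, then '1', '6', '0'.
  print (length "\\160")     -- 4
  print (map ord "\\160")    -- [92,49,54,48]
```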

How do I have to change the regexp or my environment to replace the "\160" with "X"?

The problem seems to be rooted in the locale / encoding settings.

 bash> locale
 LANG=
 LC_COLLATE="C"
 LC_CTYPE="UTF-8"
 LC_MESSAGES="C"
 LC_MONETARY="C"
 LC_NUMERIC="C"
 LC_TIME="C"
 LC_ALL=

I have already changed my .bashrc to export the following env-vars:

 bash> locale
 LANG="en_US.UTF-8"
 LC_COLLATE="en_US.UTF-8"
 LC_CTYPE="en_US.UTF-8"
 LC_MESSAGES="en_US.UTF-8"
 LC_MONETARY="en_US.UTF-8"
 LC_NUMERIC="en_US.UTF-8"
 LC_TIME="en_US.UTF-8"
 LC_ALL="en_US.UTF-8"

But that didn't change the behavior at all.


regex unicode haskell macos

Axel tetzlaff Feb 18 '11 at 23:24


2 answers

David powell · Answer 1 · 2011-02-24T04:26:59+0000

I was able to reproduce your problem by setting my locale (LANG) to 'en_US.UTF-8'. (I also use Mac OS X.)

 bash> export LANG=en_US.UTF-8
 bash> ghci
 GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
 Prelude> :m +Text.Regex
 Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
 "*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))

Setting your locale to "C" should fix the problem:

 bash> export LANG=C
 bash> ghci
 GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
 Prelude> :m +Text.Regex
 Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
 "aXb"

Unfortunately, I have no explanation why the locale is causing this problem.
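One plausible reading, offered as an assumption rather than a verified diagnosis: if the String backend hands the pattern to the C regex library as one byte per Char, then '\160' arrives as the lone byte 0xA0. That byte is fine in the "C" locale but is not a valid sequence on its own in UTF-8, where U+00A0 is encoded as the two bytes 0xC2 0xA0. The sketch below only illustrates that byte-level difference. The helper names truncateToBytes and utf8Small are hypothetical, and nothing here calls the regex library.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: keep only the low 8 bits of each Char,
-- i.e. one byte per character (ASSUMED to resemble what the
-- String backend passes to the C library).
truncateToBytes :: String -> [Word8]
truncateToBytes = map (fromIntegral . ord)

-- Hypothetical helper: UTF-8 encoding for code points below U+0800,
-- which is enough for this example.
utf8Small :: Char -> [Word8]
utf8Small c
  | n < 0x80  = [fromIntegral n]
  | n < 0x800 = [ fromIntegral (0xC0 .|. (n `shiftR` 6))
                , fromIntegral (0x80 .|. (n .&. 0x3F)) ]
  | otherwise = error "utf8Small: code point out of range for this sketch"
  where n = ord c

main :: IO ()
main = do
  print (truncateToBytes "a\160b")      -- [97,160,98]
  print (concatMap utf8Small "a\160b")  -- [97,194,160,98]
```

Under this reading, a byte sequence of 97, 160, 98 would be rejected by a UTF-8-aware regex compiler, which would match the "illegal byte sequence" error above.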

nominolo · Answer 2 · 2011-02-19T11:58:43+0000

Is there a specific reason why you want to use regular expressions rather than just map?

 replace :: Char -> Char
 replace '\160' = 'X'
 replace c = c
 test = map replace "a\160b" == "aXb"

Note that if you want to work with Unicode strings, it may be easier to use the text package, which is designed to handle Unicode and is more efficient than String for large strings.
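As a rough sketch of the same replacement with the text package (assuming it is installed), the version below uses Data.Text.replace for substring replacement and Data.Text.map for a per-character variant. The function names replaceNbsp and replaceNbspChar are made up for this illustration.

```haskell
import qualified Data.Text as T

-- Replace every occurrence of the one-character substring "\160" with "X".
-- T.replace takes: needle, replacement, haystack.
replaceNbsp :: T.Text -> T.Text
replaceNbsp = T.replace (T.pack "\160") (T.pack "X")

-- Per-character variant, mirroring the map-based solution on String.
replaceNbspChar :: T.Text -> T.Text
replaceNbspChar = T.map (\c -> if c == '\160' then 'X' else c)

main :: IO ()
main = do
  let input = T.pack "a\160b"
  print (replaceNbsp input)      -- "aXb"
  print (replaceNbspChar input)  -- "aXb"
```

Neither variant goes through the C regex library, so the locale setting should not matter here.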
