Unicode Regex in Scala REPL

I want to match words made up of Unicode letters (\p{L}).

The Scala REPL returns false for the following expression, while Java returns true (which is the correct behavior):

java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()

Both Java and Scala run on JRE 1.7:

System.getProperty("java.version") returns "1.7.0_60-ea"

What could be the reason?

2 answers

This is probably caused by an incompatible character encoding in the interpreter. For example, here is my output:

 scala> System.getProperty("file.encoding")
 res0: String = UTF-8

 scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
 res1: Boolean = true

So the solution should be to run scala with -Dfile.encoding=UTF-8. Note, however, this blog post (which is a bit old):

The only reliable way we found to set the default character encoding for Scala is to set $JAVA_OPTS before starting the application:

$ JAVA_OPTS="-Dfile.encoding=utf8" scala [...]

Just trying to set scala -Dfile.encoding=utf8 does not seem to do this. [...]
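To illustrate how an encoding mismatch could turn the result into false, here is a minimal sketch. It assumes the UTF-8 bytes of "ä" were decoded as ISO-8859-1; that charset is only an assumption for illustration, since the encoding actually in effect on the asker's machine is not known. The names letter, precomposed, utf8Bytes and misdecoded are made up for this example.

```scala
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern

val letter = Pattern.compile("\\p{L}")

// U+00E4, written as an escape so the source file encoding does not matter
val precomposed = "\u00e4"

// What a UTF-8 terminal would send for that character: bytes C3 A4
val utf8Bytes = precomposed.getBytes(StandardCharsets.UTF_8)

// Assumed mismatch: the same bytes decoded as ISO-8859-1 become two chars
val misdecoded = new String(utf8Bytes, StandardCharsets.ISO_8859_1)

println(letter.matcher(precomposed).matches()) // true
println(misdecoded.length)                     // 2
println(letter.matcher(misdecoded).matches())  // false: matches() needs the whole input to be one letter
```

Since matches() requires the entire input to match, a string that was silently decoded into two characters can no longer match a single \p{L}.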


This is probably not the case here, but it can happen: alternatively, your "ä" may be an "a" followed by a combining diaeresis (umlaut) mark, for example:

 scala> println("a\u0308")
 ä

 scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()
 res1: Boolean = false

This is sometimes a problem on systems that build accented characters from Unicode combining characters (I think OS X is one, at least in some versions). For more information, see Paul's question.
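If combining characters do turn out to be the cause, two possible workarounds are to normalize the input to NFC with java.text.Normalizer before matching, or to let the pattern accept trailing combining marks with \p{M}*. The following is a sketch only; the value names (letter, letterWithMarks, decomposed, composed) are made up for this example.

```scala
import java.text.Normalizer
import java.util.regex.Pattern

val letter = Pattern.compile("\\p{L}")
// Alternative pattern: one letter followed by any number of combining marks
val letterWithMarks = Pattern.compile("\\p{L}\\p{M}*")

// "a" followed by U+0308 COMBINING DIAERESIS (two chars)
val decomposed = "a\u0308"

// NFC composes the pair into the single code point U+00E4
val composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC)

println(letter.matcher(decomposed).matches())          // false
println(composed.length)                               // 1
println(letter.matcher(composed).matches())            // true
println(letterWithMarks.matcher(decomposed).matches()) // true
```

Normalizing keeps the pattern simple, while the \p{M}* variant also covers sequences that have no precomposed form.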


You can also "Enable the Unicode version of the predefined character classes and POSIX character classes", as described in the documentation for java.util.regex.Pattern and UNICODE_CHARACTER_CLASS.

This means that you can use character classes such as '\w' to match Unicode characters, for example:

 "(?U)\\w+".r.findFirstIn("pässi") 

In the regex above, the '(?U)' part is an embedded flag expression that turns on the UNICODE_CHARACTER_CLASS flag for the regular expression.

This flag is supported since Java 7.
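To make the effect of the flag visible, here is a small comparison of \w+ with and without it, plus the equivalent Pattern.compile call that passes the flag as a constant. The value names (word, asciiOnly, unicodeAware, compiled, m) are chosen for this sketch only.

```scala
import java.util.regex.Pattern

// "pässi", with the ä written as an escape to avoid source-encoding issues
val word = "p\u00e4ssi"

// Without the flag, \w only covers [a-zA-Z_0-9], so the match stops at "ä"
val asciiOnly = "\\w+".r.findFirstIn(word)

// With the embedded (?U) flag, \w also covers Unicode letters
val unicodeAware = "(?U)\\w+".r.findFirstIn(word)

// The same flag passed explicitly to Pattern.compile
val compiled = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
val m = compiled.matcher(word)

println(asciiOnly)    // Some(p)
println(unicodeAware) // Some(pässi)
println(m.matches())  // true
```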
