Why do I see different results for these two almost identical Ruby regex patterns, and why does one match what I think it shouldn't?

Question


Using Ruby 1.9.2, I have the following Ruby code in IRB:

> r1 = /^(?=.*[\d])(?=.*[\W]).{8,20}$/i
> r2 = /^(?=.*\d)(?=.*\W).{8,20}$/i
> a = ["password", "1password", "password1", "pass1word", "password 1"]
> a.each {|p| puts "r1: #{r1.match(p) ? "+" : "-"} \"#{p}\"".ljust(25) + "r2: #{r2.match(p) ? "+" : "-"} \"#{p}\""}

The result is the following:

r1: - "password"         r2: - "password"
r1: + "1password"        r2: - "1password"
r1: + "password1"        r2: - "password1"
r1: + "pass1word"        r2: - "pass1word"
r1: + "password 1"       r2: + "password 1"

1.) Why are the results different?

2.) Why does r1 match on lines 2, 3, and 4? Shouldn't the lookahead (?=.*[\W]) fail, since those examples contain no non-word characters?


ruby regex unicode character-class

Chris bloom Nov 26 '12 at 21:04


1 answer

matt · Accepted Answer · 2012-11-26T22:14:51+0000

This results from the interaction of two regular expression features with Unicode. \W matches every non-word character, and that set includes U+212A "KELVIN SIGN" (K) and U+017F "LATIN SMALL LETTER LONG S" (ſ). The /i flag then adds the lowercase versions of both, which are the "normal" characters k and s (U+006B "LATIN SMALL LETTER K" and U+0073 "LATIN SMALL LETTER S").

So it is the s in password that gets interpreted as a non-word character in some cases.
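The two code points can be checked directly. The snippet below is a minimal sketch, not part of the original answer; the names KELVIN_SIGN, LONG_S, non_word? and folds_to? are made up for illustration. It relies on Ruby's \w being ASCII-only by default, which is why \W accepts both characters. The case-insensitive comparison is only printed, since case folding may differ between Ruby versions.

```ruby
# Illustrative names; none of these come from the original answer.
KELVIN_SIGN = "\u212A" # KELVIN SIGN
LONG_S      = "\u017F" # LATIN SMALL LETTER LONG S

# True when the single character is matched by \W (no /i involved).
def non_word?(char)
  !!(char =~ /\A\W\z/)
end

# True when the character matches the given ASCII letter under /i.
def folds_to?(char, ascii_letter)
  !!(char =~ /\A#{Regexp.escape(ascii_letter)}\z/i)
end

[[KELVIN_SIGN, "k"], [LONG_S, "s"]].each do |char, ascii|
  puts format("U+%04X  non-word: %-5s  matches /%s/i: %s",
              char.ord, non_word?(char), ascii, folds_to?(char, ascii))
end
```

If both columns print true, the character is at once a member of \W and a case-insensitive equivalent of an ordinary ASCII letter, which is the combination the answer describes.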

Note that this only happens when \W is inside a character class (i.e. [\W]). Also, I can only reproduce this in irb; inside a stand-alone script it works as expected.
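To see which piece is responsible, the lookahead can be reduced to its core and run with and without the brackets and the /i flag. This is a rough diagnostic sketch; PROBES and probe are hypothetical names. The result for the bracketed /i case is expected to depend on the Ruby version and on how the code is run, so no output is claimed here.

```ruby
# Three reduced variants of the \W test used in the question's patterns.
PROBES = {
  'bare \W with /i' => /\W/i,
  '[\W] with /i'    => /[\W]/i,
  '[\W] without /i' => /[\W]/
}

# Returns a hash of label => true/false for the given string.
def probe(str)
  PROBES.each_with_object({}) do |(label, re), results|
    results[label] = !!(re =~ str)
  end
end

%w[password 1password].each do |word|
  probe(word).each do |label, hit|
    puts "#{word.ljust(10)} #{label.ljust(16)} #{hit}"
  end
end
```

Running it once in irb and once as a stand-alone script on the same interpreter is a quick way to compare the two environments.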

See the Ruby bug about this for more information.
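If the goal is only to validate the password, one possible way to avoid the issue is to drop the /i flag, since the pattern contains no literal letters, and to spell out the ASCII non-word class explicitly. A sketch under those assumptions (PASSWORD_RE and valid_password? are hypothetical names, not from the answer):

```ruby
# Same structure as the question's pattern, but without /i and with the
# non-word class written out as an explicit negated ASCII range.
PASSWORD_RE = /^(?=.*\d)(?=.*[^A-Za-z0-9_]).{8,20}$/

# True when the candidate has a digit, a non-word character,
# and is 8 to 20 characters long.
def valid_password?(candidate)
  !!(candidate =~ PASSWORD_RE)
end

["password", "1password", "password1", "pass1word", "password 1"].each do |p|
  puts "#{valid_password?(p) ? '+' : '-'} \"#{p}\""
end
```

With no case folding in play, only "password 1" should pass, which is the result the question expected from both patterns.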
