Subtracting a Regex Character Class Using PHP

Hi

I am trying to match UK postcodes using the pattern from http://interim.cabinetoffice.gov.uk/media/291370/bs7666-v2-0-xsd-PostCodeType.htm:

/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z-[CIKMOV]]{2}$/

I am using this in PHP, but it does not match the valid postcode OL13 0EF. The postcode does match if I remove the class subtraction -[CIKMOV].

I get the impression that I'm incorrectly subtracting a character class in PHP. I would really appreciate it if someone could correct my mistake.

Thanks in advance for your help.

Ross

+4
4 answers

Most regular expression flavors do not support character class subtraction. Instead, you can use a negative lookahead:

 /^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9](?!.?[CIKMOV])[A-Z]{2}$/
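
To see the lookahead approach from PHP, here is a minimal sketch using preg_match. The function name isPostcodeLookahead is hypothetical and only used for this illustration; the pattern is the one from this answer with the A-Z ranges written out.

```php
<?php
// Sketch: emulate "[A-Z] minus CIKMOV" with a negative lookahead in PCRE.
// isPostcodeLookahead is a made-up helper name, not part of any library.
function isPostcodeLookahead(string $postcode): bool
{
    // (?!.?[CIKMOV]) fails if either of the next two characters is C, I, K, M, O or V.
    $pattern = '/^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9](?!.?[CIKMOV])[A-Z]{2}$/';
    return preg_match($pattern, $postcode) === 1;
}

var_dump(isPostcodeLookahead('OL13 0EF')); // E and F are both allowed
var_dump(isPostcodeLookahead('OL13 0CF')); // C in the first letter position
var_dump(isPostcodeLookahead('OL13 0EC')); // C in the second letter position
```

The `.?` inside the lookahead is what lets a single assertion cover both trailing letters, so this form only fits a fixed repeat count of two.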
+7

If class subtraction is not supported, you can use negated classes to achieve subtraction.

Some examples: [^\D] = \d , [^[:^alpha:]] = [a-zA-Z]

Your problem can be solved in a similar way by using a negated POSIX character class inside a negated character class, for example [^a-z[:^alpha:]CIKMOV]

[^
a-z # not a-z
[:^alpha:] # not not A-Za-z
CIKMOV # not C,I,K,M,O,V
]

Edit: This also works and might be easier to read: [^[:^alpha:][:lower:]CIKMOV]

[^
[:^alpha:] # not not A-Za-z
[:lower:] # not a-z
CIKMOV # not C,I,K,M,O,V
]

The result is a character class that is A-Z without C, I, K, M, O, V: basically subtraction.

Here is a test of two different classes (in Perl):

 use strict;
 use warnings;

 my $match = '';

 # ANYOF[^\0-@CIKMOV[-\377!utf8::IsAlpha]
 for (0 .. 255) {
     if (chr($_) =~ /^[^a-z[:^alpha:]CIKMOV]$/) {
         $match .= chr($_);
         next;
     }
     $match .= ' ';
 }
 $match =~ s/^ +//;
 $match =~ s/ +$//;
 print "'$match'\n";

 $match = '';

 # ANYOF[^\0-@CIKMOV[-\377+utf8::IsDigit !utf8::IsWord]
 for (0 .. 255) {
     if (chr($_) =~ /^[^a-z\d\W_CIKMOV]$/) {
         $match .= chr($_);
         next;
     }
     $match .= ' ';
 }
 $match =~ s/^ +//;
 $match =~ s/ +$//;
 print "'$match'\n";

The output shows that, of the characters 0-255 checked, only A-Z minus CIKMOV matches (each non-matching character is printed as a space):
'AB DEFGH J L N PQRSTU WXYZ'
'AB DEFGH J L N PQRSTU WXYZ'
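
The same check can be sketched in PHP, since PCRE also accepts negated POSIX classes such as [:^alpha:] inside a bracket expression. The variable names $class and $matched are illustrative only, and the sketch assumes the default C locale and no /u modifier, so bytes above 127 are not treated as letters.

```php
<?php
// Sketch: build "A-Z minus CIKMOV" from a negated class and list what it matches.
// $class and $matched are illustrative names, not from the original answer.
$class = '[^[:^alpha:][:lower:]CIKMOV]';

$matched = '';
for ($code = 0; $code <= 255; $code++) {
    // Test each single byte on its own against the class.
    if (preg_match('/^' . $class . '$/', chr($code)) === 1) {
        $matched .= chr($code);
    }
}

echo $matched, PHP_EOL;
```

Under those assumptions this should print the 20 remaining uppercase letters, ABDEFGHJLNPQRSTUWXYZ.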

+5

PCRE does not support char class subtraction.

So, you can list all the uppercase letters except CIKMOV :

 ^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABDEFGHJLNPQRSTUWXYZ]{2}$

which can be shortened using ranges, taking care that no range spans an excluded letter (D-J, for example, would let I back in):

 ^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$
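
Hand-written ranges are easy to get subtly wrong, so a quick way to sanity-check one is to compare it with the fully listed class. The sketch below does that in PHP; lettersMatching is a hypothetical helper written for this example, and the two class strings are assumptions chosen to represent "A-Z without CIKMOV".

```php
<?php
// Sketch: check that a ranged class matches exactly the same letters as the full list.
// lettersMatching is a made-up helper; both class strings are illustrative.
function lettersMatching(string $class): string
{
    $out = '';
    foreach (range('A', 'Z') as $letter) {
        // Keep every uppercase letter that the class accepts on its own.
        if (preg_match('/^' . $class . '$/', $letter) === 1) {
            $out .= $letter;
        }
    }
    return $out;
}

$listed = lettersMatching('[ABDEFGHJLNPQRSTUWXYZ]');
$ranged = lettersMatching('[ABD-HJLNP-UW-Z]');

var_dump($listed === $ranged); // the two classes should agree across A-Z
```

If a range accidentally covers an excluded letter, the two strings differ and the comparison shows it immediately.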
+4

I think you will have to replace [A-Z-[CIKMOV]] with [ABD-HJLNP-UW-Z]. I don't think PHP supports character class subtraction. My alternative reads as "A, B, D to H, J, L, N, P to U, and W to Z".

+1
