Using regex to test comma usage

Question

How can I write a regular expression that flags improper use of a comma in a string? For example: 1. for non-numbers, no space before the comma and 1 space after it; 2. for numbers, commas are allowed if they are preceded by 1-3 digits and followed by 3 digits.

Some test cases:

Hello, World
hello,world => wrong
hello ,world => wrong
1,234 world
1,23 worlds => incorrect
1,2345 worlds => incorrect
hi,123 worlds => wrong
hi 1234,567 worlds => wrong
hi 12,34,567 worlds => wrong
(new test case) hello 1, 2 and 3 of the world
(new test case) hello $1,234 of the world
(new test case) hello $1,2345 worlds => wrong
(new test case) hello "1,234" worlds
(new test case) hi "1,23" worlds => wrong

So I thought I would use one regular expression to capture words with bad syntax, via (?![\S\D],[\S\D]) (meant to capture a non-space/non-digit, followed by a comma, followed by a non-space/non-digit), and join it with another regex to capture numbers with bad syntax, via (?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$)). Combining the two, I get

    preg_match_all("/(?![\S\D],[\S\D])|(?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$))/", $str, $syntax_result);

...but obviously this does not work. How should this be done?

================= EDIT =================

Thanks to Casimir et Hippolyte below, I got it to work! I updated his answer to handle more cases. I don't know if the syntax I added is the most efficient, but it works for now. I will update this when more corner cases come up.

    $pattern = <<<'LOD'
    ~
    (?: # this group contains allowed commas
        [\w\)]+,((?=[ ][\w\s\(\"]+)|(?=[\s]+)) # comma between words or line break
      |
        (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
    ) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
    | , # other commas
    ~mx
    LOD;

php regex

Alex Dec 08 '13 at 0:24

1 answer

Casimir et Hippolyte · Accepted Answer · 2013-12-08T00:53:30+0000

It is not watertight, but it should give you an idea of how to proceed:

    $pattern = <<<'LOD'
    ~
    (?: # this group contains allowed commas
        \w+,(?=[ ]\w+) # comma between words
      |
        (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
    ) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
    | , # other commas
    ~mx
    LOD;

    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    print_r($matches[0]);

The idea is to exclude the allowed commas from the match result, so that only the invalid commas are returned. The first non-capturing group acts as a list of the correct situations: whatever it matches is discarded. (You can easily add other cases.)
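As a quick sanity check, the pattern can be run over a few inputs modelled on the question's test cases. This is only a sketch: the wrapper name findBadCommas and the sample strings are assumptions made for illustration, while the pattern itself is the one from the answer, unchanged. The function returns the byte offsets of the commas that are not whitelisted.

```php
<?php
// Hypothetical wrapper around the answer's pattern (the function name is not from the original).
function findBadCommas(string $text): array
{
    $pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    \w+,(?=[ ]\w+) # comma between words
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    // Each match is [matched text, byte offset]; keep only the offsets.
    return array_column($matches[0], 1);
}

// Sample inputs (assumed for illustration).
$samples = [
    'Hello, World',        // no bad comma
    'hello,world',         // bad comma at offset 5
    'hello ,world',        // bad comma at offset 6
    '1,234 worlds',        // no bad comma
    '1,23 worlds',         // bad comma at offset 1
    'hi 12,34,567 worlds', // bad commas at offsets 5 and 8
];

foreach ($samples as $sample) {
    printf("%-22s => [%s]\n", $sample, implode(', ', findBadCommas($sample)));
}
```

An empty result means every comma in the string fell into one of the allowed situations.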

[^\PP,] means "any punctuation character except the comma", but you can replace this character class with a more explicit list of allowed characters, for example: [("']
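To see what the double negative in [^\PP,] actually accepts, it can be tried on single characters next to the explicit list. This is a small sketch; the helper names isPunctExceptComma and isInExplicitList are made up for the example.

```php
<?php
// Hypothetical helpers (names are assumptions): test one character against each class.
function isPunctExceptComma(string $ch): bool
{
    // \PP is "not punctuation"; negating it together with the comma
    // leaves "punctuation, but not the comma".
    return preg_match('~^[^\PP,]$~', $ch) === 1;
}

function isInExplicitList(string $ch): bool
{
    // The explicit alternative suggested in the answer: ( " '
    return preg_match('~^[("\']$~', $ch) === 1;
}

foreach (['"', '(', '.', ',', 'a', ' ', '$'] as $ch) {
    printf(
        "%s  property class: %s  explicit list: %s\n",
        var_export($ch, true),
        isPunctExceptComma($ch) ? 'yes' : 'no',
        isInExplicitList($ch) ? 'yes' : 'no'
    );
}
```

The dollar sign is a currency symbol rather than punctuation, so the property class rejects it, which is presumably why the pattern lists $ separately in [£$\s]. The explicit list is stricter: it rejects the full stop, which the property class accepts.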

You can find more information about (*SKIP) and (*FAIL) here and here.
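The same (*SKIP)(*FAIL) idiom is easier to see on a smaller pattern. The following sketch is not from the answer: the pattern and the function name digitsOutsideQuotes are assumptions chosen only to show the mechanics. The left branch consumes what must be ignored and then fails, and (*SKIP) stops the engine from retrying inside the text it just consumed; the right branch matches what is wanted.

```php
<?php
// Illustrative pattern (not from the answer): ignore double-quoted substrings,
// match runs of digits everywhere else.
function digitsOutsideQuotes(string $text): array
{
    preg_match_all('~"[^"]*"(*SKIP)(*FAIL)|\d+~', $text, $matches);
    return $matches[0];
}

print_r(digitsOutsideQuotes('a 1 "2 3" 4')); // the matches are "1" and "4"
```

In the comma pattern above, the "ignored" branch is the group of allowed commas and the "wanted" branch is the bare comma.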
