Why does this regex have 3 matches, not 5?

I wrote a fairly simple preg_match_all call in PHP:

    $fileName = 'A_DATED_FILE_091410.txt';
    $matches = array();
    preg_match_all('/[0-9][0-9]/', $fileName, $matches);
    print_r($matches);

My expected result:

    $matches = array(
        [0] => array(
            [0] => 09,
            [1] => 91,
            [2] => 14,
            [3] => 41,
            [4] => 10
        )
    )

What I got instead:

    $matches = array(
        [0] => array(
            [0] => 09,
            [1] => 14,
            [2] => 10
        )
    )

In this particular case the actual result is what I wanted, but I wonder why it does not match the other substrings. Also, is there a regular expression that would give me the expected result, and if so, what is it?

+4
4 answers

With a global match (which is what preg_match_all performs), after each match the regex engine continues searching the string from the end of the previous match.

In your case, the regex engine starts at the beginning of the string and advances to 0, since that is the first character that matches [0-9]. It then moves to the next position (9), and since that matches the second [0-9], it takes 09 as a match. When the engine continues matching (since it has not yet reached the end of the string), it advances its position again (to 1), and the process above repeats.
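To see the behaviour described above, you can ask preg_match_all for the offset of each match. This is only an illustrative sketch; the variable name $offsets is introduced here and is not part of the question.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();
// PREG_OFFSET_CAPTURE makes each match an array of (text, byte offset).
preg_match_all('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE);

$offsets = array();
foreach ($matches[0] as list($text, $offset)) {
    // Each match starts where the previous one ended, so offsets step by 2.
    echo "$text at offset $offset\n";
    $offsets[] = $offset;
}
// Prints:
// 09 at offset 13
// 14 at offset 15
// 10 at offset 17
```

The offsets increase by two each time, which is why the pairs 91 and 41 are never tried as starting points.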

See also: First Look at How a Regex Engine Works Internally


If you need to get every two-digit sequence, you can use preg_match with an offset to control where each search starts:

    $fileName = 'A_DATED_FILE_091410.txt';
    $allSequences = array();
    $matches = array();
    $offset = 0;

    while (preg_match('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE, $offset)) {
        list($match, $offset) = $matches[0];
        $allSequences[] = $match;
        $offset++; // since the match is 2 digits, we'll start the next match after the first
    }

Note that the offset returned with the PREG_OFFSET_CAPTURE flag is the start of the match.


There is another solution that gets all five matches without using offsets, but I am adding it here only as a curiosity; I probably would not use it myself in production code (it is a somewhat complicated regular expression, too). You can use a regex with a lookbehind that looks for a digit before the current position, and capture that digit inside the lookbehind (in general, lookarounds are not capturing):

 (?<=([0-9]))[0-9] 

Walking through this regex:

    (?<=        # open a positive lookbehind
      (         # open a capturing group
        [0-9]   # match 0-9
      )         # close the capturing group
    )           # close the lookbehind
    [0-9]       # match 0-9

Since lookarounds are zero-width and do not move the regex position, this regular expression will match 5 times: the engine advances to 9 (because that is the first position that satisfies the lookbehind assertion). Since 9 matches [0-9], the engine takes 9 as the match (but because we capture inside the lookbehind, it also captures 0!). Then the engine moves on to 1. Again, the lookbehind succeeds (capturing 9 in group 1), and 1 becomes the full match (and so on, until the engine reaches the end of the string).

When we pass this pattern to preg_match_all, we get an array that looks like this (using the PREG_SET_ORDER flag so that capture groups are grouped together with their full match):

    Array (
        [0] => Array ( [0] => 9 [1] => 0 )
        [1] => Array ( [0] => 1 [1] => 9 )
        [2] => Array ( [0] => 4 [1] => 1 )
        [3] => Array ( [0] => 1 [1] => 4 )
        [4] => Array ( [0] => 0 [1] => 1 )
    )

Please note that each "match" has its digits out of order! This is because the capture group in the lookbehind becomes backreference 1, while the entire match is backreference 0. We can put them back together in the correct order, though:

    preg_match_all('/(?<=([0-9]))[0-9]/', $fileName, $matches, PREG_SET_ORDER);
    $allSequences = array();
    foreach ($matches as $match) {
        $allSequences[] = $match[1] . $match[0];
    }
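A related variant, which is not the approach used in this answer, is to capture both digits inside a zero-width lookahead instead. Each overall match is then an empty string, and group 1 already holds the pair in the right order. A minimal sketch, reusing the file name from the question:

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();
// The lookahead consumes nothing, so the engine only advances one
// character between attempts and overlapping pairs are all captured.
preg_match_all('/(?=([0-9][0-9]))/', $fileName, $matches);

// $matches[0] holds five empty strings; the pairs are in group 1.
$allSequences = $matches[1];
print_r($allSequences);
```

This avoids the reassembly loop, at the cost of a pattern whose full match is always empty.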
+7

The search for the next match begins at the first character after the previous match. Therefore, once 09 has matched in 091410, the search for the next match starts at 1410.

+2

Also, is there a regular expression that would give me my expected result, and if so, what is it?

No, that will not work, because the same section of the string will not be matched twice. But you can do something like this:

    $i = 0;
    while (preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $i)) {
        $i = $matches[0][1]; /* + 1 in many cases */
    }

The above is unsafe for the general case. You can get stuck in an infinite loop, depending on the pattern. Also, you might not need [0][1] but something like [1][1] instead, again depending on the pattern.
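One way to sidestep the infinite loop mentioned above is to always restart one byte past the start of the previous match. The following is a minimal sketch under stated assumptions: the helper name find_overlapping is hypothetical, it only collects the full match ([0][0]), and it steps by bytes, so it assumes single-byte input rather than UTF-8.

```php
<?php
// Hypothetical helper: collect every match of $pattern in $subject,
// including matches that overlap an earlier one.
function find_overlapping($pattern, $subject)
{
    $found = array();
    $offset = 0;
    $length = strlen($subject);
    // Stop once the offset runs past the end of the subject.
    while ($offset <= $length
        && preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        $found[] = $m[0][0];
        // Always move at least one byte forward, even for empty matches.
        $offset = $m[0][1] + 1;
    }
    return $found;
}

$result = find_overlapping('/[0-9][0-9]/', 'A_DATED_FILE_091410.txt');
print_r($result);
```

Because the offset strictly increases on every iteration and is bounded by the string length, the loop terminates for any pattern.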

In this particular case, I think it would be much easier to do it yourself:

    $l = strlen($s);
    $prev_digit = false;
    for ($i = 0; $i < $l; ++$i) {
        if ($s[$i] >= '0' && $s[$i] <= '9') {
            if ($prev_digit) {
                /* found match */
            }
            $prev_digit = true;
        } else {
            $prev_digit = false;
        }
    }
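For completeness, here is a runnable version of the loop above with the "found match" placeholder filled in. It is only a sketch: the $pairs array and the $is_digit variable are names added for illustration, and $s is assumed to be the file name from the question.

```php
<?php
$s = 'A_DATED_FILE_091410.txt';
$pairs = array();
$prev_digit = false;

for ($i = 0, $l = strlen($s); $i < $l; ++$i) {
    $is_digit = ($s[$i] >= '0' && $s[$i] <= '9');
    if ($is_digit && $prev_digit) {
        // The previous and current characters form a two-digit pair.
        $pairs[] = $s[$i - 1] . $s[$i];
    }
    $prev_digit = $is_digit;
}

print_r($pairs);
```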
+1

Just for fun, another way to do this:

    <?php
    $fileName = 'A_DATED_FILE_091410.txt';
    $matches = array();
    preg_match_all('/(?<=([0-9]))[0-9]/', $fileName, $matches);
    $result = array();
    foreach ($matches[1] as $i => $behind) {
        $result[] = $behind . $matches[0][$i];
    }
    print_r($result);
    ?>
+1
