Optimized regular expression for N words around a given word (UTF-8)

Question

Optimized regular expression for N words around a given word (UTF-8)

I am trying to find an optimized regular expression to return N words (if available) around another to create a summary. The string is in UTF-8, so the definition of "words" is more than just [az]. A line that serves as a reference word may be in the middle of the word or not directly in space.

I already have the following that works, but it seems actually greedy and tantalizing when looking for more than 6-7 words around another:

/(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,4}lorem(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,4}/u

This is the PHP method I created for this, but I need help making the regex less greedy and work for any number of words.

 /** * Finds N words around a specified word in a string. * * @param string $string The complete string to look in. * @param string $find The string to look for. * @param integer $before The number of words to look for before $find. * @param integer $after The number of words to look for after $find. * @return mixed False if $find was not found and all the words around otherwise. */ private function getWordsAround($string, $find, $before, $after) { $matches = array(); $find = preg_quote($find); $regex = '(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,' . (int)$before . '}' . $find . '(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,' . (int)$after . '}'; if (preg_match("/$regex/u", $string, $matches)) { return $matches[0]; } else { return false; } }

If I had the following line $:

 "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, felis non vehicula suscipit, enim quam adipiscing turpis, eget rutrum eros velit non enim. Sed commodo cursus vulputate. Aliquam id diam sed arcu fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus luctus ante vel ligula eleifend condimentum. Donec a vulputate velit. Suspendisse velit risus, volutpat at dapibus vitae, viverra vel nulla."

And called getWordsAround($string, 'vitae', 8, 8) , I want to get the following result:

 "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, felis non vehicula suscipit,"

Thank you for helping regex guru.

+6

php regex pcre utf-8

lpfavreau Aug 27 '10 at 15:43

source share

5 answers

a'r · Answer 1 · 2010-08-27T16:16:30+0000

How to use a regular expression or some other method to split the input text into an array of words. Then skip the words with the loop looking for the target word. Once it is found, take the required fragment of the array, connect it and print.

To preserve the original spaces between words, you can include them at the end of each word.

In addition, this can be implemented as a stream parser, rather than splitting the entire string first.

wuputah · Answer 2 · 2010-08-27T16:50:01+0000

As mentioned earlier, the problem is a very large number of deviations. To solve this problem, I tried using lookbehind and lookahead to bind matching to a string. So I came up with:

 /consectetur(?<=((?:\S+\s+){0,8})\s*consectetur)\s*(?=((?:\S+\s+){0,8}))/

Unfortunately, this does not work, since variable lookbehind lengths are not supported in PCRE (or perl, for that matter). So we are left with:

 /consectetur\s*(?:\S+\s+){0,8}/

Which only captures the match string and up to 8 words after the match. However, if you use the PREG_OFFSET_CAPTURE flag , get the offset $match[0] , adjust to this point, cancel the line with strrev , get the first 0-8 words (using /\s*(?:\S+\s+){0,8}/ ), cancel the match and recombine:

 $s = "put test string here"; $matches = array(); if (preg_match('/consectetur\s*(?:\S+\s+){0,8}/', $s, $matches, PREG_OFFSET_CAPTURE)) { $before = strrev(substr($s, 0, $matches[0][1])); $before_match = array(); preg_match('/\s*(?:\S+\s+){0,8}/', $before, $before_match); echo strrev($before_match[0]) . $matches[0][0]; }

You can do this a little faster on very large lines by taking a safe subset of characters before the match, such as 100. Then you only change the 100-digit line.

All that said, a solution that does not use regular expressions may work better.

Artefacto · Answer 3 · 2010-08-28T03:21:21+0000

Here is an internal PHP function that does what you want. It is unbelievable that you will be able to surpass this performance on the user system.

This function cannot be used for UTF-8 functions, since "\ r", "\ n" and "" (and in general all ASCII characters) cannot be displayed as part of another character sequence. Therefore, if you pass valid UTF-8 data for both parameters, you should be fine. UTF-8 data reversal, since you usually change single character encodings (using strrev ), does mean problems, but this function does not.

 PHP_FUNCTION(surrounding_text) { struct circ_array { int *offsets; int cur; int size; } circ_array; long before; long after; char *haystack, *needle; int haystack_len, needle_len; int i, in_word = 0, in_match = 0; if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ssll", &haystack, &haystack_len, &needle, &needle_len, &before, &after) == FAILURE) return; if (needle_len == 0) { php_error_docref(NULL TSRMLS_CC, E_WARNING, "Cannot have empty needle"); return; } if (before < 0 || after < 0) { php_error_docref(NULL TSRMLS_CC, E_WARNING, "Number of words after and before should be non-negative"); return; } /* saves beggining of match and words before */ circ_array.offsets = safe_emalloc(before + 1, sizeof *circ_array.offsets, 0); circ_array.cur = 0; circ_array.size = before + 1; for (i = 0; i < haystack_len; i++) { if (haystack[i] == needle[in_match]) { in_match++; if (!in_word) { in_word = 1; circ_array.offsets[circ_array.cur % circ_array.size] = i; circ_array.cur++; } if (in_match == needle_len) break; /* found */ } else { int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r'; if (in_match) in_match = 0; if (is_sep) { if (in_word) in_word = 0; } else { /* not a separator */ if (!in_word) { in_word = 1; circ_array.offsets[circ_array.cur % circ_array.size] = i; circ_array.cur++; } } } } if (in_match != needle_len) { efree(circ_array.offsets); RETURN_FALSE; } /* find words after; in_word is 1 */ for (i++; i < haystack_len; i++) { int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r'; if (is_sep) { if (in_word) { if (after == 0) break; after--; in_word = 0; } } else { if (!in_word) in_word = 1; } } { char *result; int start, result_len; if (circ_array.cur < circ_array.size) start = circ_array.offsets[0]; else start = circ_array.offsets[circ_array.cur % circ_array.size]; result_len = i - start; result = emalloc(result_len + 1); memcpy(result, &haystack[start], result_len); result[result_len] = '\0'; efree(circ_array.offsets); RETURN_STRINGL(result, result_len, 0); } }

From my tests, the C function is 4 times faster than the wuputah version (and has no problem with strrev ).

Artefacto · Answer 4 · 2010-08-27T15:57:07+0000

This worked perfectly:

 (?:[^\s\r\n]*[\s\r\n]+){0,8}(?:[^\s\r\n]*)consectetur(?:[^\s\r\n]*)(?:[\s\r\n]+[^\s\r\n]*){0,8}

gives:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, felis non vehicleula suscipit,

The performance of this regex, however, is absolute shit. I really don't know how to make this more efficient without doing it without regular expressions.

The reason that performance is “absolute crap” for words near the end is because the engine tries to start the match on each character, and then advances several dozen characters until it finds out that at the end, it cannot find the line that you are looking and discarding everything.

NullUserException · Answer 5 · 2010-08-27T16:20:10+0000

The problem with using this regex is that it leads to a catastrophic reverse regex movement. The number of attempts increases exponentially with the size of the string, and this is not good. You may want to study atomic grouping to improve performance.

Alternatively, you can find the first occurrence of a given word and start looking back and forth for words to the desired length. Pseudo Code:

 $pos = strpos($find); $result = $find; foreach $word before $pos { $result = $word . $result; $count++ if ($count >= $target) break; } foreach $word after $pos { $result .= $word; $count++ if ($count >= $target) break; }

Of course, searching for words before and after, and processing partial lines can be really messy.

Optimized regular expression for N words around a given word (UTF-8)

More articles: