I am trying to find an optimized regular expression that returns N words (if available) around another word, in order to build a summary. The string is UTF-8, so the definition of a "word" is broader than just [a-z]. The string that serves as the reference word may sit in the middle of a word, or may not be directly surrounded by spaces.
I already have the following, which works, but it appears to be greedy and bogs down when looking for more than 6-7 words around the reference word:
/(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,4}lorem(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,4}/u
This is the PHP method I created for this, but I need help making the regex less greedy and work for any number of words.
/**
 * Finds N words around a specified word in a string.
 *
 * @param string  $string The complete string to look in.
 * @param string  $find   The string to look for.
 * @param integer $before The number of words to look for before $find.
 * @param integer $after  The number of words to look for after $find.
 * @return mixed False if $find was not found; otherwise the matched words around it.
 */
private function getWordsAround($string, $find, $before, $after)
{
    $matches = array();
    // Escape regex metacharacters, including the "/" delimiter used below.
    $find    = preg_quote($find, '/');
    $regex   = '(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,' . (int)$before . '}' .
        $find . '(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,' . (int)$after . '}';
    if (preg_match("/$regex/u", $string, $matches)) {
        return $matches[0];
    } else {
        return false;
    }
}
If I had the following $string:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, felis non vehicula suscipit, enim quam adipiscing turpis, eget rutrum eros velit non enim. Sed commodo cursus vulputate. Aliquam id diam sed arcu fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus luctus ante vel ligula eleifend condimentum. Donec a vulputate velit. Suspendisse velit risus, volutpat at dapibus vitae, viverra vel nulla."
And called getWordsAround($string, 'vitae', 8, 8), I would want to get the following result:
"Aliquam id diam sed arcu fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus luctus ante"
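One possible direction, offered as a sketch under assumptions rather than a verified fix: in the pattern above, each "before" group ends with an optional word fragment and the next group begins with a mandatory one, so a single word can be split between two adjacent groups in many ways, and the engine may try all of those splits before giving up. A variant in which every group consumes exactly one whole word plus its whitespace, with possessive quantifiers (`++`) so that a consumed word is never given back, should leave far fewer paths to backtrack through. The function name `wordsAround` is hypothetical, and `\S` / `\s` are used here as shorthand for the `[^\s\r\n]` / `[\s\r\n]` classes, since `\s` already covers `\r` and `\n`.

```php
<?php
/**
 * Hypothetical variant of getWordsAround(): returns up to $before words
 * before and up to $after words after the first match of $find.
 * Returns false when $find does not occur in $string.
 */
function wordsAround(string $string, string $find, int $before, int $after)
{
    // Escape regex metacharacters and the "/" delimiter in the search term.
    $find = preg_quote($find, '/');

    // Guard against negative counts, which would produce an invalid quantifier.
    $before = max(0, $before);
    $after  = max(0, $after);

    // Leading groups: one whole word plus its trailing whitespace each.
    // The possessive ++ stops the engine from re-splitting a word it has
    // already consumed.
    $regex = '(?:\S++\s++){0,' . $before . '}'
        // $find may sit inside a longer word, so allow word characters
        // on both sides of it.
        . '\S*' . $find . '\S*'
        // Trailing groups: whitespace followed by one whole word each.
        . '(?:\s++\S++){0,' . $after . '}';

    if (preg_match('/' . $regex . '/u', $string, $matches)) {
        return $matches[0];
    }

    return false;
}
```

Called as `wordsAround($string, 'vitae', 8, 8)`, this returns a window around the first occurrence of "vitae". Whether it is fast enough on long inputs would still need to be measured, since the engine continues to attempt a match at every starting position.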
Thanks for your help, regex gurus.