Match multiple terms within <body> tags

Question

I want to match any occurrence of a search query (or a list of search terms) within the <body> tags of a document. My current solution uses preg (inside a Joomla plugin):

    $pattern = '/matchthisterm/i';
    $article->text = preg_replace($pattern, "<span class=\"highlight\">\\0</span>", $article->text);

But this replaces matches everywhere in the HTML document, including inside tags, so I need to restrict matching to the text within the <body> tags first. Is this even the best way to achieve this?

EDIT: I have now used simplehtmldom, but I need help matching just the term. So far I have:

    $pattern = '/(matchthisterm)/i';
    $html = str_get_html($buffer);
    $es = $html->find('text');
    foreach ($es as $term) {
        // Match the terms within the text nodes
        if (preg_match($pattern, $term->plaintext)) {
            $term->outertext = '<span class="highlight">' . $term->outertext . '</span>';
        }
    }

This highlights the entire text node rather than just the matched term. Can I use preg_replace here?

SOLUTION:

    // Get the HTML and look at the text nodes
    $html = str_get_html($buffer);
    $es = $html->find('text');
    foreach ($es as $term) {
        // Replace the term within each text node
        $term->outertext = str_ireplace('matchthis', '<span class="highlight">matchthis</span>', $term->outertext);
    }

0

html regex

Jeepstone Apr 7 '10 at 8:06

3 answers

I agree that processing HTML with regex is not a good solution.

I just read the argument about why regex cannot parse HTML here: RegEx match open tags except XHTML self-contained tags

I completely agree with all of it, but the problem here is much simpler: we just need to know whether we are inside an HTML tag or not. We do not need to analyze the HTML structure, interpret the tree, or deal with mismatched tags and other errors. We only need to know that an HTML tag is whatever sits between < and >. I believe regex is a well-suited and consistent tool for that.

Just because we are dealing with HTML does not mean we must avoid regex. We need to focus on the real problem here, which I believe is not really about parsing HTML. We only need to know whether we are inside a tag or not. I hope I do not get too many downvotes for this, but I stand by my position.

I refer you to an earlier post of mine (where you linked to this topic): Select the text, except for the html tags

Along the same lines, and I hope we can agree on this, you are using preg_replace() where a simpler function like str_ireplace() would do. If you just need to replace a word (or a set of words) inside a string, case-insensitively, don't use a regex. Keep it simple. (I assume you did not simplify the replacement you are actually trying to do in order to explain your problem here.)
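As a rough illustration of the "are we inside a tag or not" idea, here is a minimal sketch. The helper name highlight_outside_tags is hypothetical and not taken from the linked post. It assumes the markup contains no literal > inside attribute values, comments or script blocks, and that the term contains no HTML special characters.

```php
<?php
// Hypothetical helper: highlight $term only in the text between tags.
function highlight_outside_tags(string $html, string $term): string
{
    // Split into alternating text and tag chunks. PREG_SPLIT_DELIM_CAPTURE
    // keeps the tags in the result so the document can be reassembled.
    $chunks = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $replacement = '<span class="highlight">' . $term . '</span>';

    foreach ($chunks as $i => $chunk) {
        // With a single capture group, odd indexes hold the tags: skip them.
        if ($i % 2 === 1) {
            continue;
        }
        // Plain case-insensitive replacement, no regex needed for the term.
        $chunks[$i] = str_ireplace($term, $replacement, $chunk);
    }

    return implode('', $chunks);
}

echo highlight_outside_tags(
    '<a href="/foo/matchthisterm/bar">see matchthisterm</a>',
    'matchthisterm'
);
// prints: <a href="/foo/matchthisterm/bar">see <span class="highlight">matchthisterm</span></a>
```

Note that str_ireplace() inserts the term exactly as it was passed in, so the original casing of the matched text is lost.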

+1

Savageman Apr 7 '10 at 22:15

I have not used preg, but I have done pattern matching in Perl, Java and ActionScript. If it works similarly, you need to escape special characters. For example, "\<span class... I also found a website that covers using preg, in case you have not come across it; it can be found here
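In PHP, escaping the special characters in a search term is what preg_quote() is for. The sketch below builds one case-insensitive pattern from a list of terms; the helper name build_terms_pattern and the sample terms are hypothetical, and the list is assumed to be non-empty.

```php
<?php
// Hypothetical helper: build one case-insensitive pattern from several terms.
function build_terms_pattern(array $terms): string
{
    // preg_quote() escapes regex metacharacters; the second argument also
    // escapes the delimiter used around the pattern.
    $escaped = array_map(
        function (string $term): string {
            return preg_quote($term, '/');
        },
        $terms
    );

    return '/(' . implode('|', $escaped) . ')/i';
}

$pattern = build_terms_pattern(['c++', 'node.js']);

// $0 is the text that actually matched, so its original casing is kept.
echo preg_replace(
    $pattern,
    '<span class="highlight">$0</span>',
    'Learn C++ and Node.js, not nodexjs'
);
// prints: Learn <span class="highlight">C++</span> and <span class="highlight">Node.js</span>, not nodexjs
```

Without the escaping, the "." in "node.js" would match any character and "c++" would not be a valid pattern at all.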

0

Kyra Apr 7 '10 at 8:22

bobince · Accepted Answer · 2010-04-07T08:33:28+0000

No, processing [X][HT]ML with regex is almost always a disaster. In the simplest case, for your example, this input:

 <a href="/foo/matchthisterm/bar">bof</a>

gives badly broken output:

 <a href="/foo/<span class="highlight">matchthisterm</span>/bar">bof</a>

The right way to do this is to use a proper HTML/XML parser (e.g. DOMDocument.loadHTML or simplehtmldom), then scan and replace the contents of each text node separately. Finally, serialize the HTML back into a string.
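The parse, scan and serialize steps could look roughly like the sketch below, using PHP's DOM extension. The helper name highlight_text_nodes is hypothetical. It assumes non-empty input that is ASCII or entity-encoded, since loadHTML() guesses the encoding when none is declared, and it returns only the contents of <body>.

```php
<?php
// Hypothetical helper: wrap each occurrence of $term found in text nodes
// in <span class="highlight">, leaving tags and attributes untouched.
function highlight_text_nodes(string $html, string $term): string
{
    if ($term === '') {
        return $html;
    }

    $doc = new DOMDocument();
    // libxml warns about sloppy or HTML5 markup; collect those quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html;
    }

    $pattern = '/(' . preg_quote($term, '/') . ')/i';
    $xpath = new DOMXPath($doc);
    $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';

    // Copy the node list first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        // Odd indexes of $parts are the matches, even indexes the text around them.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }

        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'highlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Put the HTML back into a string, one <body> child at a time.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlight_text_nodes(
    '<p><a href="/foo/matchthisterm/bar">bof MatchThisTerm</a></p>',
    'matchthisterm'
);
```

Because only text nodes are rewritten, the href attribute in the example stays intact while the link text gets wrapped.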

An alternative for highlighting search terms is to do it in JavaScript. Since the browser has already parsed the HTML into a DOM, this saves you the parsing step. See this question for an example.
