Regex to select all text between tags

What is the best way to select all text between two tags, for example the text between all the "pre" tags on a page?

+114
html regex html-parsing
Aug 23 2018-11-21T00:
14 answers

You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group (name a language for more specific instructions), but this relies on the simplistic assumption that you have very simple and valid HTML.

As other commenters have suggested, if you are doing something complex, use an HTML parser.
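
As a minimal JavaScript sketch of the pattern above (the helper name extractPre and the sample string are made up for illustration), the regex can be applied globally and the first capture group collected from each match. It assumes flat, valid markup with no attributes and each block on a single line:

```javascript
// Hypothetical helper: returns the contents of every <pre>...</pre> pair.
// Assumes simple, valid HTML with no attributes and no nested <pre> tags.
function extractPre(html) {
  const pattern = /<pre>(.*?)<\/pre>/g; // lazy group keeps each match short
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

const sample = "intro <pre>first</pre> middle <pre>second</pre> end";
console.log(extractPre(sample)); // [ 'first', 'second' ]
```

Anything beyond this simple case (attributes, nesting, line breaks inside the tag) is where a parser becomes the safer choice.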

+139
Aug 23 '11 at 21:00

The closing tag may be on another line, which is why \n needs to be added.

 <PRE>(.|\n)*?<\/PRE> 
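
A small JavaScript sketch of why the \n alternative matters; the sample text is made up. Note that it wraps the alternation in a non-capturing group so that the whole content lands in group 1, which is a slight variation on the pattern above, and it shows [\s\S] as a commonly used equivalent:

```javascript
// Sample input (made up): the closing tag sits on a later line.
const text = "<PRE>line one\nline two</PRE>";

// "." does not match "\n" by default, so this finds nothing.
const dotOnly = text.match(/<PRE>(.*?)<\/PRE>/);

// Alternating "." with "\n" lets the match cross line breaks.
const dotOrNewline = text.match(/<PRE>((?:.|\n)*?)<\/PRE>/);

// [\s\S] matches any character at all, newlines included.
const anyChar = text.match(/<PRE>([\s\S]*?)<\/PRE>/);

console.log(dotOnly);                          // null
console.log(dotOrNewline[1] === anyChar[1]);   // true
```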
+114
Jun 02 '13 at 7:57

This is what I would use.

 (?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|'~]| )+?(?=(</pre>)) 

Basically, what it does:

(?<=(<pre>)) The selection must be preceded by <pre>.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|~]| ) This is just the regular expression that I want to apply. In this case it selects a letter, a digit, a newline, or one of the special characters listed in the square brackets. The | symbol simply means "OR".

+? The plus sign selects one or more of the above, in any order. The question mark changes the default behavior from greedy to lazy.

(?=(</pre>)) The selection must be followed by </pre>.


Depending on your use case, you may need to add some modifiers, such as i or m:

  • i - case-insensitive
  • m - multiline search

Here I performed this search in Sublime Text, so I didn't have to use modifiers in my regex.

JavaScript did not support lookbehind

The above example should work well in languages such as PHP, Perl, and Java. JavaScript, however, did not support lookbehind when this was written (it was added in ES2018), so in older engines we have to forget about (?<=(<pre>)) and look for a workaround. One option is to match the opening tag as well and strip the leading <pre> from each result, as in the question "Regex match text between tags".

Also see the JavaScript regex documentation on non-capturing parentheses.
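
A hedged JavaScript sketch of both routes, using a made-up sample string: the capture-group workaround runs in any engine, while the lookbehind form assumes an engine that implements ES2018 lookbehind assertions (current Node.js and modern browsers do).

```javascript
const html = "<pre>alpha</pre> and <pre>beta</pre>";

// Workaround without lookbehind: match the tags, keep only the captured group.
const viaGroup = Array.from(html.matchAll(/<pre>(.*?)<\/pre>/g), (m) => m[1]);

// With lookbehind support, the tags are left out of the match itself.
const viaLookaround = html.match(/(?<=<pre>).*?(?=<\/pre>)/g);

console.log(viaGroup);      // [ 'alpha', 'beta' ]
console.log(viaLookaround); // [ 'alpha', 'beta' ]
```

The capture-group version avoids any string slicing after the match, which is why it is often preferred over trimming the tag off by hand.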

+17
Dec 01 '16 at 10:20

Use the pattern below to get the content between the elements. Replace [tag] with the actual element you want to extract the content from.

 <[tag]>(.+?)</[tag]> 

Sometimes tags have attributes, such as an anchor tag with href; in that case use the pattern below.

  <[tag][^>]*>(.+?)</[tag]> 
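
One way to fill in the [tag] placeholder programmatically, sketched in JavaScript. The helper name textBetween and the sample markup are hypothetical, and the \b after the tag name is an addition not present in the pattern above; it keeps a short name such as "p" from also matching "<pre>".

```javascript
// Hypothetical helper: builds the attribute-tolerant pattern for a given tag.
// Assumes `tag` is a plain name such as "a" or "pre" (no regex metacharacters).
function textBetween(html, tag) {
  // \b stops "p" from also matching "<pre ...>"; [^>]* skips any attributes.
  const pattern = new RegExp(`<${tag}\\b[^>]*>(.+?)</${tag}>`, "g");
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

const page = '<a href="/one">first</a> <a class="x" href="/two">second</a>';
console.log(textBetween(page, "a")); // [ 'first', 'second' ]
```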
+15
Nov 11 '15 at 17:14

You shouldn't try to parse HTML with regular expressions; see this question and how it turned out.

In the simplest terms, HTML is not a regular language, so you cannot fully parse it with regular expressions.

Having said that, you can parse subsets of HTML when there are no nested tags of the same kind. So as long as whatever sits between the opening and closing tag is not that same tag, this will work:

 // Single quotes matter: in a double-quoted PHP string, \1 is an octal escape.
 preg_match('/<(\w+)[^>]*>(.*?)<\/\1>/', $subject, $matches);
 // $matches = array(
 //   [0] => full matched string
 //   [1] => tag name
 //   [2] => tag content
 // )
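
To illustrate the limitation just described, here is the same pattern as a JavaScript sketch with two made-up inputs: one without same-name nesting, where the capture is correct, and one where a tag is nested inside itself, where the match stops at the first closing tag.

```javascript
// Same pattern as above: \1 must repeat the captured tag name.
const tagPattern = /<(\w+)[^>]*>(.*?)<\/\1>/;

// No same-name nesting: the content is captured correctly.
const flat = "<p class='x'>hello <b>world</b></p>".match(tagPattern);
console.log(flat[1], "|", flat[2]); // p | hello <b>world</b>

// Same tag nested inside itself: the match stops at the first closing tag.
const nested = "<div>a<div>b</div>c</div>".match(tagPattern);
console.log(nested[2]); // a<div>b
```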

The best way is to use a parser, such as the native DOMDocument, to load your HTML, then select your tag and get its content, which might look something like this:

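 $obj = new DOMDocument();
 $obj->loadHTML($html);                     // parse an HTML string
 $nodes = $obj->getElementsByTagName('el'); // DOMNodeList of matching elements
 $value = $nodes->item(0)->nodeValue;       // text of the first match (assumes one exists)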

And since this is a proper parser, it can handle nested tags and so on.

+6
Aug 23 2018-11-21T00:

To exclude the delimiting tags from the match:

 "(?<=<pre>)(.*?)(?=</pre>)" 
+5
Jul 04 '18 at 19:31

This seems to be the simplest regular expression of all the ones I have found:

 (?:<TAG>)([\s\S]*)(?:<\/TAG>) 
  1. Exclude the opening tag (?:<TAG>) from the captured group
  2. Capture any whitespace or non-whitespace characters ([\s\S]*)
  3. Exclude the closing tag (?:<\/TAG>) from the captured group
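
A short JavaScript sketch of how this pattern behaves when a made-up input contains more than one pair of tags. Because [\s\S]* is greedy, the capture runs to the last closing tag; the lazy variant [\s\S]*? is shown alongside on the assumption that one pair at a time is usually what is wanted. The non-capturing groups keep the tags out of group 1, but the full match still contains them.

```javascript
const html = "<TAG>one</TAG> middle <TAG>two</TAG>";

// Greedy, as in the pattern above: runs to the LAST closing tag.
const greedy = html.match(/(?:<TAG>)([\s\S]*)(?:<\/TAG>)/);

// Lazy variant: stops at the FIRST closing tag.
const lazy = html.match(/(?:<TAG>)([\s\S]*?)(?:<\/TAG>)/);

console.log(greedy[1]); // one</TAG> middle <TAG>two
console.log(lazy[1]);   // one
console.log(lazy[0]);   // <TAG>one</TAG>
```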
+4
Aug 30 '18 at 9:19

Try this:

 (?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>) 
+3
Oct. 23 '15 at 18:31

Since the accepted answer has no JavaScript code, adding some:

 var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
 // The callback runs once per match; g1 is the text captured between the tags.
 str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });

+2
Aug 28 '17 at 1:12

 preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches);

This regular expression selects the text between the tags, whether or not it spans new lines (it works with multi-line content).

+1
Oct. 16 '18 at 10:42

For multiple lines:

 <htmltag>(.+)((\s)+(.+))+</htmltag> 
0
Nov 16 '16 at 19:10

You can use Pattern pattern = Pattern.compile( "[^<'tagname'/>]" );

0
Feb 17 '17 at 15:10

I am using this solution:

 preg_match_all( '/<((?!<)(.|\n))*?\>/si', $content, $new); var_dump($new); 
0
Nov 29 '17 at 14:50
 <pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre> 
-4
Feb 26 '16 at 23:04


