Regex selects all text between tags

Question

What is the best way to select all text between two tags, for example the text between all the "pre" tags on a page?

+114

html regex html-parsing

basheps Aug 23 2018-11-21T00:

14 answers

The closing tag may be on a different line from the opening tag, which is why \n needs to be added to the pattern.

 <PRE>(.|\n)*?<\/PRE>
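
A minimal Python sketch of why the newline alternative matters. The sample string and variable names are made up for illustration. Many engines also offer a "dot matches newline" flag (re.DOTALL in Python), which should have the same effect as (.|\n) here.

```python
import re

# Hypothetical sample: the content and the closing tag sit on later lines.
html = "<PRE>line one\nline two\n</PRE>"

# Without a newline alternative, "." stops at the first line break: no match.
no_newline = re.search(r"<PRE>.*?</PRE>", html)

# The pattern from the answer: "." or "\n", repeated lazily.
with_alternation = re.search(r"<PRE>(.|\n)*?</PRE>", html)

# Same effect with the DOTALL flag, which lets "." match "\n" as well.
with_dotall = re.search(r"<PRE>.*?</PRE>", html, re.DOTALL)

print(no_newline)                          # None
print(with_alternation.group(0) == html)   # True
print(with_dotall.group(0) == html)        # True
```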

+114

zac Jun 02 '13 at 7:57

This is what I would use.

 (?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|'~]| )+?(?=(</pre>))

Basically, what it does:

(?<=(<pre>)) The match must be preceded by <pre>.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|~]| ) This is just the regular expression that I want to apply. In this case, it matches a letter, digit, newline, space, or one of the special characters listed in square brackets. The | symbol simply means “OR”.

+? The plus sign matches one or more of the above, in any order. The question mark changes the default behavior from greedy to lazy (non-greedy).

(?=(</pre>)) The match must be followed by </pre>.

Depending on your use case, you may need to add some modifiers, such as i or m:

i - case-insensitive
m - multiline search

Here I performed this search in Sublime Text, so I didn't have to use modifiers in my regex.
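
As a rough illustration of the lookbehind/lookahead idea outside an editor, here is a Python sketch. The sample string is invented, and [\s\S]+? stands in for the long character list purely for brevity; that swap is an assumption of this example, not the pattern above. re.IGNORECASE plays the role of the i modifier.

```python
import re

# Hypothetical input with one lowercase and one uppercase tag pair.
html = "intro <pre>first block</pre> middle <PRE>second\nblock</PRE> end"

# (?<=<pre>) requires "<pre>" just before the match,
# (?=</pre>) requires "</pre>" just after it; neither is part of the result.
pattern = re.compile(r"(?<=<pre>)[\s\S]+?(?=</pre>)", re.IGNORECASE)

blocks = pattern.findall(html)
print(blocks)  # ['first block', 'second\nblock']
```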

JavaScript does not support lookbehind

The above example should work well with languages such as PHP, Perl, Java... JavaScript, however, did not support lookbehind before ES2018, so on older engines we have to forget about using (?<=(<pre>)) and look for a workaround. One option is simply to strip the leading <pre> from each match, as described in "Regex, match the text between the tags".

Also see the JavaScript regex documentation on non-capturing parentheses.
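
One way to sketch the lookbehind-free workaround is a capturing group, shown here in Python with an invented sample string; the same pattern syntax should carry over to JavaScript. The tags are matched but only group 1 is kept, which gives the same result as slicing the tags off the full match.

```python
import re

html = "<pre>alpha</pre> and <pre>beta</pre>"

# (?:...) groups without capturing; (...) captures the part we want to keep.
pattern = re.compile(r"(?:<pre>)(.*?)(?:</pre>)", re.DOTALL)

for match in pattern.finditer(html):
    whole = match.group(0)   # includes the tags
    inner = match.group(1)   # content only
    # Slicing the tags off by length gives the same result as group 1.
    assert whole[len("<pre>"):-len("</pre>")] == inner

contents = [m.group(1) for m in pattern.finditer(html)]
print(contents)  # ['alpha', 'beta']
```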

+17

DevWL Dec 01 '16 at 10:20

Use the pattern below to get the content between the elements. Replace [tag] with the actual element you want to extract the content from.

 <[tag]>(.+?)</[tag]>

Sometimes tags have attributes, such as an anchor tag with href; in that case use the pattern below.

  <[tag][^>]*>(.+?)</[tag]>
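
A small Python sketch of filling in the [tag] placeholder. The helper name tag_contents and the sample markup are hypothetical. The \b after the tag name is an addition not present in the pattern above; it is there so that "a" does not also match "<abbr>".

```python
import re

def tag_contents(html, tag):
    """Return the text inside every <tag ...>...</tag> pair (non-nested only)."""
    name = re.escape(tag)
    # \b stops "a" from also matching "<abbr>"; [^>]* skips any attributes.
    pattern = rf"<{name}\b[^>]*>(.+?)</{name}>"
    return re.findall(pattern, html, re.DOTALL | re.IGNORECASE)

sample = '<a href="/one" class="x">First</a> <abbr>skip</abbr> <a>Second</a>'
print(tag_contents(sample, "a"))  # ['First', 'Second']
```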

+15

Shravan Ramamurthy Nov 11 '15 at 17:14

You should not try to parse HTML with regular expressions; see this question and how it turned out.

In simple terms, HTML is not a regular language, so you cannot fully parse it with regular expressions.

Having said that, you can parse subsets of HTML as long as the same tag is not nested inside itself. So, as long as whatever sits between the opening and closing tag is not that same tag, this will work:

 // Single quotes: in a double-quoted PHP string, \1 would be read as an octal escape.
 preg_match('/<([\w]+)[^>]*>(.*?)<\/\1>/', $subject, $matches);
 // $matches = array( [0] => full matched string, [1] => tag name, [2] => tag content )
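
The same backreference idea, sketched in Python with invented sample strings. The last lines show the limitation mentioned above: when a tag is nested inside itself, the lazy match stops at the first closing tag.

```python
import re

subject = '<p class="note">hello</p><em>world</em>'

# Group 1 is the tag name; \1 requires the closing tag to repeat it.
pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(subject)]
print(pairs)  # [('p', 'hello'), ('em', 'world')]

# Nested identical tags: the content is cut at the first </div>.
nested = "<div>a<div>b</div>c</div>"
cut_short = pattern.search(nested).group(2)
print(cut_short)  # a<div>b
```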

A better idea is to use a parser, such as the native DOMDocument, to load your HTML, then select your tag and get its inner content, which might look something like this:

 $obj = new DOMDocument();
 $obj->loadHTML($html);                       // parse an HTML string
 $nodes = $obj->getElementsByTagName('el');   // DOMNodeList of matching elements
 $value = $nodes->item(0)->nodeValue;         // text content of the first match

And since this is a proper parser, it will be able to handle nested tags, etc.
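
For comparison, a parser-based sketch in Python using the standard library's html.parser. The class name TagTextCollector and the sample markup are made up for this example. It collects the text of each outermost element of the chosen tag, so nesting is handled by counting depth rather than by the pattern.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside each outermost element with the given tag name."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0       # how many target elements are currently open
        self.current = []    # text pieces of the element being read
        self.results = []    # text of finished outermost elements

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)

parser = TagTextCollector("div")
parser.feed("<div>a<div>b</div>c</div><p>ignored</p><div>d</div>")
parser.close()
print(parser.results)  # ['abc', 'd']
```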

+6

sg3s Aug 23 2018-11-21T00:

To exclude the delimiting tags:

 "(?<=<pre>)(.*?)(?=</pre>)"

+5

Jean-Simon Collard Jul 04 '18 at 19:31

This seems to be the simplest regular expression of all the ones I've found:

 (?:<TAG>)([\s\S]*)(?:<\/TAG>)

Exclude the opening tag (?:<TAG>) from the captured group
Include any whitespace or non-whitespace characters ([\s\S]*) in the captured group
Exclude the closing tag (?:<\/TAG>) from the captured group
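
One caveat worth sketching in Python: [\s\S]* is greedy, so with more than one tag pair it runs from the first opening tag to the last closing tag. The sample string is invented, and the lazy variant ([\s\S]*?) is an addition for comparison, not part of the pattern above.

```python
import re

html = "<TAG>one</TAG> between <TAG>two</TAG>"

# Greedy: the group swallows everything up to the LAST closing tag.
greedy = re.findall(r"(?:<TAG>)([\s\S]*)(?:</TAG>)", html)

# Lazy: the group stops at the FIRST closing tag after each opening tag.
lazy = re.findall(r"(?:<TAG>)([\s\S]*?)(?:</TAG>)", html)

print(greedy)  # ['one</TAG> between <TAG>two']
print(lazy)    # ['one', 'two']
```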

+4

maqduni Aug 30 '18 at 9:19

Try this:

 (?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)

+3

Heriberto Rivera Oct. 23 '15 at 18:31

Since the accepted answer has no JavaScript code, adding some:

 var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
 str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });

+2

Shishir Arora Aug 28 '17 at 1:12

 preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches);

This regular expression selects everything between the tags, regardless of whether it continues on a new line (it works with multi-line content).

+1

Krishna thakor Oct. 16 '18 at 10:42

For multiple lines:

 <htmltag>(.+)((\s)+(.+))+</htmltag>

0

Dilip Nov 16 '16 at 19:10

You can use:

 Pattern pattern = Pattern.compile( "[^<'tagname'/>]" );

0

Ambrish Rajput Feb 17 '17 at 15:10

I am using this solution:

 preg_match_all( '/<((?!<)(.|\n))*?\>/si', $content, $new);
 var_dump($new);

0

T.Todua Nov 29 '17 at 14:50

 <pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>

-4

user5988518 Feb 26 '16 at 23:04

PyKing · Accepted Answer · 2011-08-23 21:00

You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group (specify a language for more specific instructions), but this assumes the simplistic notion that you have very simple and valid HTML.

As other commenters have suggested, if you are doing something complex, use an HTML parser.
