What does (\/?) do in the regular expression /<(\/?)(\w+)([^>]*?)>/, and is (\w+) redundant?

I think this regular expression is meant to match an HTML start tag.

var results = html.match(/<(\/?)(\w+)([^>]*?)>/);

I see that it must first match <, but then I am confused about what the capture (\/?) does. Am I right that ([^>]*?)> matches any character except >, zero or more times, followed by >? If so, why is the capture (\w+) needed? Isn't it already covered by ([^>]*?)?

+7
javascript regex
5 answers

Take it token by token:

  • / begins the regex literal
  • < matches a literal <
  • (\/?) captures zero or one (?) literal /, which is escaped as \/
  • (\w+) captures one or more word characters
  • ([^>]*?) lazily* captures zero or more (*?) characters that are not >
  • > matches a literal >
  • / ends the regex literal

*lazily: adding ? after a repetition quantifier makes it lazy, which means the regular expression matches the preceding token as few times as possible. See the documentation.
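To make the lazy/greedy difference concrete, here is a small sketch. The sample string and the simplified patterns are my own, chosen only for illustration; they are not the pattern from the question.

```javascript
// Illustrative sample string (an assumption, not from the question).
const sample = "<b>bold</b>";

// Greedy: .* grabs as much as possible, then backtracks to the LAST ">".
const greedy = sample.match(/<.*>/)[0];

// Lazy: .*? grabs as little as possible and stops at the FIRST ">".
const lazy = sample.match(/<.*?>/)[0];

// With [^>] the class can never cross a ">", so greedy already stops there.
const negated = sample.match(/<[^>]*>/)[0];

console.log(greedy);  // "<b>bold</b>"
console.log(lazy);    // "<b>"
console.log(negated); // "<b>"
```

Because [^>] cannot consume a >, the ? after * in ([^>]*?) should not change what gets matched here; it only changes how the engine gets there.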

So essentially this regular expression matches "<", followed by an optional "/", followed by one or more letters, digits or underscores, followed by zero or more characters that are not ">", and finally a ">".

Moreover, the token (\w+) is not redundant, since it ensures that at least one word character follows the < (and the optional /).

Remember that trying to parse HTML with regular expressions is usually a bad idea.
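A quick way to check this is to look at the three captures directly. The helper name describeTag and the sample inputs below are hypothetical, made up for this sketch; only the pattern comes from the question.

```javascript
// Same pattern as in the question; groups are numbered left to right.
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// Hypothetical helper: returns the three captures as an object, or null.
function describeTag(text) {
  const m = text.match(tagPattern);
  if (m === null) return null;
  return { closing: m[1] === "/", name: m[2], rest: m[3] };
}

console.log(describeTag('<a href="#">')); // closing: false, name: "a", rest: ' href="#"'
console.log(describeTag("</strong>"));    // closing: true, name: "strong", rest: ""
console.log(describeTag("<>"));           // null: (\w+) needs at least one word character
```

The last call is the interesting one: without (\w+), a bare <> would match.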

+4

Using the power of debuggex to create an image :)

 <(\/?)(\w+)([^>]*?)> 

It is evaluated as follows:

[Image: Debuggex diagram of the regular expression]

As you can see, it matches HTML tags (both opening and closing tags). The regular expression contains three capture groups, which capture the following:

  • (\/?) the optional / (present if this is a closing tag)
  • (\w+) the tag name
  • ([^>]*?) everything else up to the end of the tag (for example, attributes)

So it matches <a href="#">. Interestingly, it does not match <a data-fun="fun>nofun"> correctly, because it stops at the > inside the data-fun attribute, although (I think) > is valid in an attribute value.

Another funny thing: the tag-name capture does not cover all theoretically valid XHTML tag names. XHTML allows Letter | Digit | '.' | '-' | '_' | ':' | .. in names (source: XHTML specification). (\w+), however, does not match ., - or :. The imaginary <.foobar> tag will not match this regular expression. However, this should not have any real-life impact.

You see that parsing HTML using regexes is risky. You might be better off with an HTML parser.
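Here is a short sketch of that failure, using the same pattern and the data-fun example from above. The variable names are my own.

```javascript
const tagRe = /<(\/?)(\w+)([^>]*?)>/;

// The attribute value contains a ">", which [^>] refuses to cross.
const tricky = '<a data-fun="fun>nofun">';
const hit = tricky.match(tagRe);

console.log(hit[0]); // '<a data-fun="fun>'  (the match ends inside the attribute value)
console.log(hit[2]); // 'a'
console.log(hit[3]); // ' data-fun="fun'     (truncated attribute text)
```

The match still succeeds, so nothing signals that the captured attribute text is incomplete.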
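A small sketch of how the pattern treats names containing those characters. Apart from <.foobar>, the inputs are hypothetical and picked only to show the behaviour of (\w+).

```javascript
const nameRe = /<(\/?)(\w+)([^>]*?)>/;

// A name starting with "." gives (\w+) nothing to match, so there is no match.
const dotFirst = "<.foobar>".match(nameRe);
console.log(dotFirst); // null

// With ":" or "-" later in the name, the pattern still matches,
// but (\w+) stops early and the remainder lands in the third group.
const colon = "<svg:rect>".match(nameRe);
console.log(colon[2], colon[3]); // "svg" ":rect"

const dash = "<my-element>".match(nameRe);
console.log(dash[2], dash[3]);   // "my" "-element"
```

So for such names the second group may hold only a prefix of the tag name rather than the whole name.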

+4

(\/?) matches and captures the slash of a closing tag, for example </i> or </strong>, if you are familiar with them.

One more note: \w is really the character class [a-zA-Z_\d], so other characters, such as =, ", etc., do not match it, but they do match [^>]. And yes, you are right about that bit.
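To illustrate the difference between the two classes, here is a sketch that tests a few single characters against each. The list of characters is arbitrary and chosen by me.

```javascript
const wordChar = /^\w$/;                // the shorthand class
const explicitClass = /^[a-zA-Z_\d]$/;  // the spelled-out equivalent
const notGt = /^[^>]$/;                 // anything except ">"

const probes = ["a", "Z", "7", "_", "=", '"', " ", ">"];
const table = probes.map((ch) => ({
  ch,
  word: wordChar.test(ch),
  explicit: explicitClass.test(ch),
  notGt: notGt.test(ch),
}));

console.log(table);
```

The word and explicit columns agree for every probe, while =, " and the space are rejected by \w but accepted by [^>]; that is why attribute text can only be picked up by the third group.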

+3

To answer your last question, (\w+) and ([^>]*?) are not redundant. Both of them perform important functions in the expression.

This expression finds start or end tags.

(\/?) matches a /, but the ? makes it optional.

(\w+) matches word characters; here it is intended to match the tag name.

([^>]*?) is intended to match attributes.

So, if you have the string <div class="text">,

(\w+) in the expression will match div, and ([^>]*?) will match class="text" (including the space before it).
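Since the expression finds both start and end tags, a sketch with the global flag can list every tag in a string. The g flag, the variable names and the surrounding text "Hello" are my additions and not part of the original expression.

```javascript
// Same pattern with the g flag added so matchAll can walk the whole string.
const tagFinder = /<(\/?)(\w+)([^>]*?)>/g;
const source = '<div class="text">Hello</div>';

const foundTags = [...source.matchAll(tagFinder)].map((m) => ({
  kind: m[1] === "/" ? "end" : "start", // group 1: the optional slash
  name: m[2],                           // group 2: the tag name
  attributes: m[3].trim(),              // group 3: the rest, leading space removed
}));

console.log(foundTags);
// [ { kind: "start", name: "div", attributes: 'class="text"' },
//   { kind: "end",   name: "div", attributes: "" } ]
```

Each group contributes a separate piece of information, which is another way to see that none of them is redundant.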

+2

Demo (in Ruby, not in JavaScript, but it doesn't matter): http://www.rubular.com/r/bhw2O28qUr

To summarize, (\/?) captures the slash of end tags.

0