Regex: parsing <img> tags with src, width, height

You can respond that parsing HTML with regex is an absolutely bad idea, like this, and you would be right.

But in my case, the following HTML node is created by our own server, so we know it will always look like this; and since the regular expression will live in a mobile Android library, I do not want to use the Jsoup library.

What I want to parse: <img src="myurl.jpg" width="12" height="32">

What I need to match:

  • a regular img tag, capturing the value of the src attribute (written here as a Java string literal): <img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
  • the values of the width and height attributes: (width|height)\s*=\s*['"]([^'"]*)['"]*

So the first regular expression gives group 1 with the img URL, and the second gives two matches whose groups hold the attribute name and its value.
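For illustration, here is a minimal sketch of how the two expressions behave when applied separately to the sample tag. The class and method names (TwoPassImgDemo, findSrc, findSizes) are made up for this example and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class TwoPassImgDemo {
    // First expression from the question: group 1 is the src value.
    static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");
    // Second expression: group 1 is the attribute name, group 2 its value.
    static final Pattern SIZE_ATTR =
            Pattern.compile("(width|height)\\s*=\\s*['\"]([^'\"]*)['\"]*");

    /** Returns the src value of the first img tag, or null if none is found. */
    static String findSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Returns "name=value" strings, one per width/height attribute found. */
    static List<String> findSizes(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = SIZE_ATTR.matcher(html);
        while (m.find()) {
            result.add(m.group(1) + "=" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        System.out.println(findSrc(html));   // myurl.jpg
        System.out.println(findSizes(html)); // [width=12, height=32]
    }
}
```

This needs two passes over the input, which is what I would like to avoid.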

How can I combine both?

Desired output:

  • img url
  • width value
  • height value
+5
3 answers

To match any img tag with src, height and width attributes, which may appear in any order and are in fact optional, you can use

 "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3" 

See the regex demo and the IDEONE Java demo.

 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
 // Same pattern as above; group 4 is lazy and stops at the closing delimiter (\\3)
 Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
 Matcher matcher = pattern.matcher(s);
 while (matcher.find()) {
     if (!matcher.group(1).isEmpty()) { // Group 1 is non-empty only at the start of a new IMG tag
         System.out.println("\n--- NEW MATCH ---");
     }
     // Group 2 is the attribute name, group 4 its value
     System.out.println(matcher.group(2) + ": " + matcher.group(4));
 }

Regular expression details:

  • (<img\\b|(?!^)\\G) - Group 1, the leading boundary: matches either the start of an <img tag or the end of the previous successful match
  • [^>]*? - matches any attributes we are not interested in (0+ characters other than >, as few as possible, so the match stays inside the tag)
  • \\b(src|width|height)= - the whole word src, width or height (Group 2) followed by =
  • ([\"']?) - technical Group 3, which captures the attribute value delimiter (a quote, or nothing)
  • ([^>]*?) - Group 4, containing the attribute value (0+ characters other than >, as few as possible, up to the first occurrence of the delimiter matched by \\3)
  • \\3 - the closing attribute value delimiter, i.e. the same text that Group 3 captured (NOTE: if the delimiter can be empty, add (?=\\s|/?>) at the end of the pattern)
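To make the NOTE concrete, here is a small sketch of the pattern with the (?=\\s|/?>) lookahead appended, so that unquoted values are captured too. The class and method names are hypothetical, and the sketch reads only the first img tag it finds.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class UnquotedAttrDemo {
    // Same pattern as above plus the lookahead from the NOTE: the value must be
    // followed by whitespace or by the end of the tag (> or />).
    static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3(?=\\s|/?>)");

    /** Collects src/width/height of the first img tag, in order of appearance. */
    static Map<String, String> firstImgAttributes(String html) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // A non-empty group 1 after we already collected something
            // means a second img tag has started, so stop here.
            if (!m.group(1).isEmpty() && !attrs.isEmpty()) {
                break;
            }
            attrs.put(m.group(2), m.group(4));
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Mixed delimiters: unquoted, double-quoted and single-quoted values.
        String html = "<img src=myurl.jpg width=\"12\" height='32'>";
        System.out.println(firstImgAttributes(html)); // {src=myurl.jpg, width=12, height=32}
    }
}
```

Without the lookahead, an empty Group 3 lets the lazy Group 4 stop immediately, so an unquoted value would come back as an empty string.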

Logic:

  • Match the beginning of the img tag.
  • Then match everything inside it, but capture only the attributes we need.
  • Since we get multiple matches rather than multiple groups, we need to detect where each new img tag begins. This is done by checking that the first group is not empty ( if (!matcher.group(1).isEmpty()) ).
  • All that remains is to collect the matches into a list.
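The last step could be sketched roughly as follows: one map per img tag, gathered into a list. The class name ImgAttributeCollector and its collect method are assumptions made for this example, not an existing API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class ImgAttributeCollector {
    static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    /** Returns one name-to-value map per img tag that has at least one matched attribute. */
    static List<Map<String, String>> collect(String html) {
        List<Map<String, String>> tags = new ArrayList<>();
        Map<String, String> current = null;
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // A non-empty group 1 marks the start of a new img tag.
            if (!m.group(1).isEmpty() || current == null) {
                current = new LinkedHashMap<>();
                tags.add(current);
            }
            current.put(m.group(2), m.group(4));
        }
        return tags;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                + "<link src=\"/test/test.css\"/>"
                + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        // [{height=132, src=NEW_myurl.jpg, width=112}, {src=myurl.jpg, width=12, height=32}]
        System.out.println(collect(s));
    }
}
```

A LinkedHashMap keeps the attributes in the order they appear in the tag; a plain HashMap would do if the order does not matter.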
+2

You may need the following:

 "(?i)(src|width|height)=\"(.*?)\"" 



Update:

I misunderstood your question; you need something like:

 "(?i)<img\\s+src=\"(.*?)\"\\s+width=\"(.*?)\"\\s+height=\"(.*?)\">" 

Regex101 demo


Update 2

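The following expression returns each attribute as a separate match, so width can appear anywhere in the tag; note that, as written, src is only matched directly after <img and height only directly before the closing >: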

 "(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">" 

Regex101 Demo v2
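As a rough sketch of how this alternation would be consumed in Java: every find() call yields one attribute, and only one of the three groups is non-null per match. The class and method names are hypothetical, and the sketch assumes the input holds a single img tag (with several tags, later values would overwrite earlier ones).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class AlternationDemo {
    static final Pattern P = Pattern.compile(
            "(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">");

    /** Returns {src, width, height}; an entry stays null if that branch never matched. */
    static String[] parse(String html) {
        String[] out = new String[3];
        Matcher m = P.matcher(html);
        while (m.find()) {
            // Exactly one alternative took part in this match;
            // groups of the other alternatives are null.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    out[g - 1] = m.group(g);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] r = parse("<img src=\"myurl.jpg\" width=\"12\" height=\"32\">");
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // myurl.jpg 12 32
    }
}
```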

+1

If you want to combine both expressions, here is the answer.

 <img\s+src="([^"]+)"\s+width="([^"]+)"\s+height="([^"]+)" 

The sample I tested:

 <img src="rakesh.jpg" width="25" height="45"> 

Try it.
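A minimal Java sketch of using this single expression, assuming the attributes always arrive in the fixed order src, width, height and are double-quoted. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class FixedOrderImgDemo {
    // The expression above, escaped as a Java string literal.
    static final Pattern IMG = Pattern.compile(
            "<img\\s+src=\"([^\"]+)\"\\s+width=\"([^\"]+)\"\\s+height=\"([^\"]+)\"");

    /** Returns {src, width, height} for the first matching tag, or null if there is none. */
    static String[] parse(String html) {
        Matcher m = IMG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] r = parse("<img src=\"rakesh.jpg\" width=\"25\" height=\"45\">");
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // rakesh.jpg 25 45
    }
}
```

If the server ever emits the attributes in a different order, this returns null, so it only fits when the markup is fully under your control.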

0
