Regex: parsing <img> tags with src, width, height
You can reply that parsing HTML with regex is an absolutely bad idea, and you're right.
But in my case, the HTML node below is generated by our own server, so we know it will always look like this, and since the regular expression will be used in a mobile Android library, I do not want to use the Jsoup library.
What I want to parse: <img src="myurl.jpg" width="12" height="32">
What I need to extract:
- match a regular img tag and capture the value of the src attribute:
<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
- the values of the width and height attributes:
(width|height)\s*=\s*['"]([^'"]*)['"]*
So the first regular expression has group #1 holding the img URL, and the second regular expression produces two matches, each with subgroups holding the attribute name and its value.
How can I combine both?
Required output:
- img url
- width value
- height value
To match any img tag with src, height and width attributes, which can appear in any order and are in fact optional, you can use
"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"
Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
// Group 1: tag start (empty when continuing inside the same tag), group 2: attribute name, group 4: attribute value
Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (!matcher.group(1).isEmpty()) { // We have a new IMG tag
        System.out.println("\n--- NEW MATCH ---");
    }
    System.out.println(matcher.group(2) + ": " + matcher.group(4));
}
Regular expression details:
- (<img\\b|(?!^)\\G) - the starting boundary: either the beginning of an <img tag or the end of the previous successful match
- [^>]*? - matches any attributes we are not interested in (0+ characters other than >, to stay inside the tag)
- \\b(src|width|height)= - the whole word src=, width= or height=
- ([\"']?) - technical Group 3, capturing the attribute value delimiter
- ([^>]*?) - Group 4, containing the attribute value (0+ characters other than >, as few as possible, up to the first occurrence of the delimiter)
- \\3 - the attribute value delimiter captured in Group 3 (NOTE: if the delimiter can be empty, add (?=\\s|/?>) at the end of the pattern)
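The note about empty delimiters can be illustrated with a small sketch. It assumes the lookahead (?=\\s|/?>) is appended exactly as suggested; the class name UnquotedAttrDemo and the method extract are hypothetical names used only for this example. Without the lookahead, an unquoted value would be captured as an empty string, because the lazy Group 4 is satisfied with zero characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedAttrDemo {
    // Same pattern as above plus the trailing lookahead (?=\s|/?>), so a
    // value must run up to whitespace or to the end of the tag.
    static final Pattern ATTR = Pattern.compile(
        "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3(?=\\s|/?>)");

    // Returns "name=value" strings in the order they appear in the input.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            out.add(m.group(2) + "=" + m.group(4));
        }
        return out;
    }

    public static void main(String[] args) {
        // Mixed unquoted and quoted attribute values (hypothetical input)
        System.out.println(extract("<img src=myurl.jpg width=12 height=\"32\">"));
    }
}
```

Quoted values still work, since Group 3 then holds the quote and the lookahead only checks what follows the closing quote.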
Logic:
- Match the beginning of the img tag.
- Then match everything inside it, but capture only the attributes we need.
- Since we get multiple matches rather than groups, we need to find the boundary of each new img tag. This is done by checking that the first group is not empty (if (!matcher.group(1).isEmpty())).
- All that remains is to add a list to collect the matches.
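The last step, collecting the matches into a list, could look roughly like the sketch below. It reuses the pattern from this answer; the class name ImgAttrCollector, the method collect, and the choice of one Map per img tag are assumptions made for illustration, not part of the original demo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgAttrCollector {
    static final Pattern ATTR = Pattern.compile(
        "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    // Returns one map per <img> tag, holding whichever of src/width/height it has.
    static List<Map<String, String>> collect(String html) {
        List<Map<String, String>> images = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            if (!m.group(1).isEmpty()) {
                // Group 1 is "<img": a new tag starts here
                images.add(new LinkedHashMap<>());
            }
            // The first match always starts a tag, so the list is never empty here
            images.get(images.size() - 1).put(m.group(2), m.group(4));
        }
        return images;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                 + "<link src=\"/test/test.css\"/>"
                 + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        for (Map<String, String> img : collect(s)) {
            System.out.println(img);
        }
    }
}
```

A missing attribute simply has no key in the map, so callers can treat src, width and height as optional.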
You may need the following:
"(?i)(src|width|height)=\"(.*?)\""
Update:
I misunderstood your question; you need something like:
"(?i)<img\\s+src=\"(.*?)\"\\s+width=\"(.*?)\"\\s+height=\"(.*?)\">"
Update 2:
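The following expression is intended to capture the attributes of the img tag in any order, one attribute per match. Note that, as written, the alternation splits it into three independent branches: src is matched only directly after <img, and height only when its closing quote is immediately followed by >, so it is reliable only when src comes first and height comes last:
"(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">"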
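For attributes in a truly arbitrary order, one option is a two-pass approach, which is a different technique from the single expression above: first isolate each img tag, then search for the attributes inside that tag only. The sketch below assumes quoted attribute values, as in the question; the class name TwoPassImgParser, the method parse, and both patterns are illustrative assumptions, not taken from either answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassImgParser {
    // Pass 1: a whole <img ...> tag
    static final Pattern TAG = Pattern.compile("(?i)<img\\b[^>]*>");
    // Pass 2: a quoted src/width/height attribute inside that tag
    static final Pattern ATTR = Pattern.compile(
        "(?i)\\b(src|width|height)\\s*=\\s*[\"']([^\"']*)[\"']");

    // Returns one map per <img> tag; absent attributes have no entry.
    static List<Map<String, String>> parse(String html) {
        List<Map<String, String>> images = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attrs = new HashMap<>();
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                // Normalize the attribute name so SRC and src map to the same key
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), attr.group(2));
            }
            images.add(attrs);
        }
        return images;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                 + "<link src=\"/test/test.css\"/><img src='a.png'>";
        System.out.println(parse(s));
    }
}
```

Because the attribute search runs only on the text of one tag, the src of the link element is never picked up, and the order of attributes inside the img tag does not matter.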