I have a string containing a large HTML fragment and am trying to extract the link from its href="..." attribute. The anchor tag can take one of the following forms:
<a href="..." />
<a class="..." href="..." />
I have no problem with regex, but for some reason, when I use the following code:
String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
    // Get all groups for this match
    for (int i = 0; i <= m.groupCount(); i++) {
        String groupStr = m.group(i);
        System.out.println(groupStr);
    }
}
Can someone tell me what is wrong with my code? I have done this in PHP, but in Java I am doing something wrong. Instead of just the link, it prints the entire HTML string whenever I print the groups.
EDIT: Just to let everyone know which string I'm dealing with:
<a class="Wrap" href="item.php?id=43241"><input type="button"> <span class="chevron"></span> </a> <div class="menu"></div>
Every time I run the code, it prints the whole string instead of just the href value. That is the problem.
As for using jTidy: I am looking into it, but it would still be interesting to know what went wrong in this case.
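For anyone reproducing this, here is a minimal sketch of what is probably going on. It is a guess based on the sample string above, not a confirmed diagnosis. The quantifier in (.*) is greedy, and with Pattern.DOTALL it can also cross line breaks, so the match runs to the last double quote in the input rather than the first one after href=". A negated character class, [^"]*, stops at the first closing quote. The class name HrefExtractor and both method names are made up for illustration, and the sample string is hard-coded in place of getHTML().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the original code.
class HrefExtractor {

    // The pattern from the question: greedy (.*) plus DOTALL.
    private static final Pattern GREEDY =
            Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);

    // Alternative: match any run of characters that are not a double quote,
    // so the group ends at the first closing quote.
    private static final Pattern UP_TO_QUOTE =
            Pattern.compile("href=\"([^\"]*)\"");

    // Returns group 1 of the greedy pattern, or null if there is no match.
    static String extractHrefGreedy(String html) {
        Matcher m = GREEDY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Returns the value of the first href attribute, or null if there is none.
    static String extractHref(String html) {
        Matcher m = UP_TO_QUOTE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The sample string from the EDIT above, standing in for getHTML().
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\"> "
                + "<span class=\"chevron\"></span> </a> <div class=\"menu\"></div>";

        // Greedy version: runs on to the last double quote in the input.
        System.out.println(extractHrefGreedy(html));

        // Negated character class: stops at the first closing quote.
        System.out.println(extractHref(html));
    }
}
```

A reluctant quantifier, (.*?), should give the same result on this input. Either way this only covers double-quoted href values, so a real HTML parser such as jTidy is still the safer choice for arbitrary markup.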
java html regex html-parsing
Legend