Java: I have a large HTML string and need to extract the href="..." text

Question

I have a string containing a large HTML fragment and am trying to extract the link from its href="..." attribute. The href can appear in either of the following forms:

    <a href="..." />
    <a class="..." href="..." />

I have no problem with regex, but for some reason, when I use the following code:

    String innerHTML = getHTML();
    Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
    Matcher m = p.matcher(innerHTML);
    if (m.find()) {
        // Get all groups for this match
        for (int i = 0; i <= m.groupCount(); i++) {
            String groupStr = m.group(i);
            System.out.println(groupStr);
        }
    }

Can someone tell me what is wrong with my code? I did this in PHP, but in Java I am somehow doing something wrong... It prints the entire HTML line whenever I try to print the match.

EDIT: Just to let everyone know which line I'm dealing with:

 <a class="Wrap" href="item.php?id=43241"><input type="button"> <span class="chevron"></span> </a> <div class="menu"></div>

Every time I run the code, it prints the whole line ... This is the problem ...

And about using jTidy ... I am on it, but it would be interesting to know what went wrong in this case ...

+6

java html regex html-parsing

Legend Nov 03 '09 at 10:35


7 answers

Regex is great, but not suitable for this purpose. You would normally use a stack-based parser for this. Take a look at a Java HTML parser API such as jTidy.

+5

Balusc Nov 03 '09 at 22:45


There are two problems with the code you posted:

First, the .* in your regular expression is greedy. It matches as many characters as possible, so it runs up to the last " that can be found in the input. You can make the match non-greedy (reluctant) by changing it to .*?

Second, to get all the matches you need to keep calling Matcher.find rather than looping over groups. Groups give you access to each parenthesized section of the regular expression, whereas each call to find locates the next place where the entire regular expression matches.

Combining them together, you get the following code, which should do what you need:

    Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);
    Matcher m = p.matcher(innerHTML);
    while (m.find()) {
        System.out.println(m.group(1));
    }
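To see why the original pattern printed nearly the whole line, the sketch below runs the greedy and the reluctant pattern against the sample line from the question. The class name GreedyVsReluctant and the helper firstHref are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Sample line taken from the question.
    static final String SAMPLE =
        "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\"> "
        + "<span class=\"chevron\"></span> </a> <div class=\"menu\"></div>";

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstHref(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("greedy:    " + firstHref("href=\"(.*)\"", SAMPLE));
        System.out.println("reluctant: " + firstHref("href=\"(.*?)\"", SAMPLE));
    }
}
```

The greedy version captures everything up to the last quotation mark in the input (the one closing class="menu"), while the reluctant version stops at the first closing quote and yields item.php?id=43241.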

+4

Phil ross Nov 03 '09 at 10:48


Use the built-in parser. Something like:

    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    kit.read(reader, doc, 0);

    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
    while (it.isValid()) {
        SimpleAttributeSet s = (SimpleAttributeSet) it.getAttributes();
        String href = (String) s.getAttribute(HTML.Attribute.HREF);
        System.out.println(href);
        it.next();
    }

Or use ParserCallback:

    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.parser.*;
    import javax.swing.text.html.*;

    public class ParserCallbackText extends HTMLEditorKit.ParserCallback {

        public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
            if (tag.equals(HTML.Tag.A)) {
                String href = (String) a.getAttribute(HTML.Attribute.HREF);
                System.out.println(href);
            }
        }

        public static void main(String[] args) throws Exception {
            Reader reader = getReader(args[0]);
            ParserCallbackText parser = new ParserCallbackText();
            new ParserDelegator().parse(reader, parser, true);
        }

        static Reader getReader(String uri) throws IOException {
            // Retrieve from Internet.
            if (uri.startsWith("http:")) {
                URLConnection conn = new URL(uri).openConnection();
                return new InputStreamReader(conn.getInputStream());
            }
            // Retrieve from file.
            else {
                return new FileReader(uri);
            }
        }
    }

The reader may be a StringReader.
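Since the question already has the HTML in a String, a StringReader variant might look like the sketch below. The class name StringHrefParser and the method extractHrefs are illustrative and not part of any library; the callback collects the values into a list instead of printing them, and the unlikely IOException from an in-memory reader is wrapped so callers need no checked-exception handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class StringHrefParser {
    // Parses an in-memory HTML string and returns the href of every <a> start tag.
    static List<String> extractHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag.equals(HTML.Tag.A)) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {          // skip anchors without an href
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try (StringReader reader = new StringReader(html)) {
            // true = ignore any charset directive, as in the answer above
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"item.php?id=43241\">Item</a> "
            + "<a class=\"x\" href=\"/other\">Other</a></body></html>";
        System.out.println(extractHrefs(html));
    }
}
```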

+4

camickr Nov 03 '09 at 23:26


Another easy and reliable way to do this is with Jsoup.

    Document doc = Jsoup.connect("http://example.com/").get();
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println(link.attr("abs:href"));
    }

+3

surajz Dec 31 '11 at 1:53


You can use an HTML parser library. jTidy, for example, gives you an HTML DOM model from which you can extract all the "a" elements and read their "href" attribute.

+2

Lorenzo boccaccia Nov 03 '09 at 10:51


"href=\"(.*?)\"" should also work, but I think Kugel’s answer will work faster.

+1

Denis tulskiy Nov 03 '09 at 10:46


Kugel · Accepted Answer · 2009-11-03T22:42:17+0000

.*

This is a greedy operation that will accept any character, including quotation marks.

Try something like:

 "href=\"([^\"]*)\""

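As a rough sketch of how this negated-character-class pattern could be used to collect every link, the helper below applies it in a find loop. The names HrefExtractor and extractHrefs are illustrative, and the pattern only handles double-quoted values with no whitespace around the equals sign, so it is not a substitute for an HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // [^"]* cannot cross a quotation mark, so each match stops at the closing quote.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // one iteration per href="..." occurrence
            hrefs.add(m.group(1));  // group 1 is the text between the quotes
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"first.html\" /> "
            + "<a class=\"Wrap\" href=\"item.php?id=43241\" />";
        System.out.println(extractHrefs(html)); // prints [first.html, item.php?id=43241]
    }
}
```

Because the character class excludes the quote itself, no backtracking from the end of the input is needed, which is a plausible reason this form is expected to run faster than the reluctant .*? variant.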