Java: HTML parsing

Question

I have the HTML content below. The parts I'm looking for are the "img src" value and the element styled with "!important". Does Java provide any HTML parsing methods?

 <fieldset> <table cellpadding='0'border='0'cellspacing='0'style="clear :both"> <tr valign='top' ><td width='35' > <a href='http://mypage.rediff.com/android/32868898'class='space' onmousedown="return enc(this,'http://track.rediff.com/clickurl=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F3 868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" > <div style='width:25px;height:25px;overflow:hidden;'> <img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' vspace='0' /></div></a></td> <td><span> <a href='http://mypage.rediff.com/android/32868898' class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >Android </a> </span><span style='color:#000000 !important;'>android se updates...</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/>

java html-parsing

Faheem kalsekar Jan 6 '11 at 11:04

4 answers

Try NekoHTML. It is an HTML parsing library used by several higher-level testing tools such as HtmlUnit.

NekoHTML is a simple HTML scanner and tag balancer that lets application programmers parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make when writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags.
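To give an idea of what "standard XML interfaces" means in practice, here is a minimal sketch of the DOM side only. It does not use NekoHTML itself: it assumes the markup is already well-formed (which is what a tag balancer is meant to give you) and feeds it to the JDK's own DocumentBuilder. The class name DomImgSrc and the sample markup are made up for illustration. As far as I know NekoHTML can report element names in a different case than the source, so the tag name passed to getElementsByTagName may need adjusting there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads <img> src values through the standard W3C DOM API.
public class DomImgSrc {

    /** Returns the src attribute of every img element in a well-formed markup string. */
    public static List<String> imgSources(String wellFormedMarkup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wellFormedMarkup)));

            List<String> sources = new ArrayList<>();
            // getElementsByTagName walks the tree in document order.
            NodeList images = doc.getElementsByTagName("img");
            for (int i = 0; i < images.getLength(); i++) {
                Element img = (Element) images.item(i);
                sources.add(img.getAttribute("src"));
            }
            return sources;
        } catch (Exception e) {
            // Malformed input ends up here; real HTML usually needs a lenient parser first.
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample markup is an assumption, not the HTML from the question.
        String markup = "<div><img src=\"http://example.com/a.png\" width=\"25\"/></div>";
        System.out.println(imgSources(markup));
    }
}
```

The HTML in the question would not pass through DocumentBuilder directly (unescaped & in attribute values, for one), which is exactly the gap a lenient parser is supposed to close.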

skaffman Jan 6 '11 at 11:08

I have used jsoup. The library has a good selector syntax (http://jsoup.org/cookbook/extracting-data/selector-syntax), and for your problem you could use code like this:

 File input = new File("input.html");
 Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
 // Matches img elements whose src ends with ".png";
 // use "img[src]" instead to match every image that has a src attribute.
 Elements pngs = doc.select("img[src$=.png]");

Igor Jan 6 '11 at 11:20

I like to use Jericho: http://jericho.htmlparser.net/docs/index.html

It copes well with badly formed HTML, links that lead nowhere, and so on.

There are many examples on their site. You just get all the IMG tags and read their attributes to extract the ones that meet your needs.
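The "get all the IMG tags and read their attributes" idea can be sketched without any extra dependency. Note that this is not Jericho's API: as a stand-in it uses the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), which also touches on the original question of whether Java provides anything itself. The class name ImgSrcCollector and the sample HTML are hypothetical, and the bundled parser is fairly old, so treat this as a rough sketch rather than a replacement for a dedicated library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects img src values with the JDK's built-in HTML parser.
public class ImgSrcCollector {

    /** Returns the src attribute of every img tag the parser reports. */
    public static List<String> imgSources(String html) {
        final List<String> sources = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // img has no end tag, so the parser reports it as a "simple" tag.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        // Sample HTML is an assumption, not the markup from the question.
        String html = "<html><body><p><img src='http://example.com/a.png' width='25'></p></body></html>";
        System.out.println(imgSources(html));
    }
}
```

From here you can filter the collected values however you need, for example by host name or file extension.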

Folkslord Jan 21 '11 at 18:40

Jigar joshi · Accepted Answer · 2011-01-06T11:09:21+0000

 String value = Jsoup.parse(new File("d:\\1.html"), "UTF-8").select("img").attr("src");
 System.out.println(value); // http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb
 System.out.println(Jsoup.parse(new File("d:\\1.html"), "UTF-8").select("span[style$=important;]").first().text()); // android se updates...

See also: Jsoup, and "What are the pros and cons of the leading Java HTML parsers?"
