Java: HTML parsing

I have the HTML content below. From it I want to extract the src attribute of the img tag and the text of the span whose style contains "!important". Does Java provide any HTML parsing methods?

 <fieldset> <table cellpadding='0'border='0'cellspacing='0'style="clear :both"> <tr valign='top' ><td width='35' > <a href='http://mypage.rediff.com/android/32868898'class='space' onmousedown="return enc(this,'http://track.rediff.com/clickurl=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F3 868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" > <div style='width:25px;height:25px;overflow:hidden;'> <img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' vspace='0' /></div></a></td> <td><span> <a href='http://mypage.rediff.com/android/32868898' class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >Android </a> </span><span style='color:#000000 !important;'>android se updates...</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/> 
0
java html-parsing
4 answers
 import java.io.File;
 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;

 // Parse the saved page once, then query it with CSS selectors.
 Document doc = Jsoup.parse(new File("d:\\1.html"), "UTF-8");
 String value = doc.select("img").attr("src");
 System.out.println(value); // http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb
 System.out.println(doc.select("span[style$=important;]").first().text()); // android se updates...
  • Jsoup
  • What are the pros and cons of the leading Java HTML parsers?
+2

Try NekoHTML. It is an HTML parsing library used by various higher-level testing tools such as HtmlUnit.

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags.

+1

I used jsoup. The library has a good selector syntax (http://jsoup.org/cookbook/extracting-data/selector-syntax), and for your problem you can use code like this:

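 import java.io.File;
 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.select.Elements;

 File input = new File("input.html");
 Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
 // Matches img tags whose src ends in ".png"; use "img[src]" to match every img that has a src.
 Elements pngs = doc.select("img[src$=.png]");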
+1

I like to use Jericho: http://jericho.htmlparser.net/docs/index.html

It is robust against badly formed HTML, links that lead to inaccessible places, and so on.

There are many examples on their page. You just get all the IMG tags and parse their attributes to extract the ones that meet your needs.
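
As a rough sketch of that approach (collect every IMG tag, then read its attributes), here is a version that needs no extra dependency because it uses the JDK's own Swing HTML parser. This is not Jericho's API. The class name ImgSrcExtractor and the helper imgSources are made up for illustration, and the Swing parser targets old HTML (3.2), so a dedicated library will probably cope better with messy real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library mentioned in this thread.
class ImgSrcExtractor {

    // Returns the src attribute of every <img> tag found in the given HTML.
    static List<String> imgSources(String html) {
        List<String> sources = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // <img> has no end tag, so the parser reports it as a "simple" tag.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };

        try {
            // The last argument (ignoreCharSet = true) tells the parser to
            // ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        // A trimmed-down piece of the markup from the question.
        String html = "<div style='width:25px;height:25px;overflow:hidden;'>"
                + "<img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb'"
                + " width='25' vspace='0' /></div>";
        for (String src : imgSources(html)) {
            System.out.println(src);
        }
    }
}
```

Other attributes can be read the same way, for example attrs.getAttribute(HTML.Attribute.WIDTH) inside the same callback.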

+1