Extract text between two <hr> tags in CSS-less HTML

Question


Using Jsoup, what would be the best approach for retrieving text whose template is known ( [number]%%[number] ), but which sits on an HTML page that uses no CSS, no divs, spans, classes or any other type of identification (yup, an old HTML page that I don't control)?

The only thing that consistently identifies this text segment (and is guaranteed to remain that way) is that the HTML always looks like this (within a larger body of HTML):

 <hr> 2%%17 <hr>

(The numbers 2 and 17 are only examples. They can be any numbers, and in fact these are two variables that I need to reliably extract from this HTML page).

If this text were enclosed in, and uniquely identified by, a <span> or <div> , I would have no problem extracting it using Jsoup. The problem is that this is not the case, and the only way I can think of right now (which is not at all elegant) is to process the raw HTML through a regular expression.

Processing the raw HTML through a regex seems inefficient, because I have already parsed it into a DOM through Jsoup.

Suggestions?


java html-parsing jsoup

ef2011 Sep 2 '11 at 23:10


1 answer

BalusC · Accepted Answer · 2011-09-02T23:53:55+0000

How about this?

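 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;
 import org.jsoup.nodes.Node;
 import org.jsoup.select.Elements;

 Document document = Jsoup.connect(url).get();
 Elements hrs = document.select("hr");
 Pattern pattern = Pattern.compile("(\\d+%%\\d+)");

 for (Element hr : hrs) {
     // nextSibling() returns null when the <hr> is the last node in its parent.
     Node next = hr.nextSibling();
     if (next == null) continue;
     String textAfterHr = next.toString();
     Matcher matcher = pattern.matcher(textAfterHr);
     while (matcher.find()) {
         System.out.println(matcher.group(1)); // <-- There, your data.
     }
 }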
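Since the question asks for the two numbers as separate variables, one possible follow-up is to give each number its own capturing group. The sketch below uses only java.util.regex; the class name PairExtractor and the method extract are hypothetical helpers made up for illustration, and it assumes both numbers fit in an int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Jsoup.
final class PairExtractor {
    // One capturing group per number, separated by the literal "%%".
    private static final Pattern PAIR = Pattern.compile("(\\d+)%%(\\d+)");

    // Returns every {first, second} pair found in the given text.
    // Assumption: both numbers are small enough for Integer.parseInt.
    static List<int[]> extract(String text) {
        List<int[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            pairs.add(new int[] {
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2))
            });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Stand-in for the string obtained from hr.nextSibling().toString().
        for (int[] pair : extract(" 2%%17 ")) {
            System.out.println(pair[0] + " and " + pair[1]);
        }
    }
}
```

Inside the loop over the hr elements, textAfterHr could be passed to extract instead of being printed directly.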
