How to parse an HTML document using JSoup to get a list of links?

I am trying to parse http://www.craigslist.org/about/sites to build a set of text/links that a program can load dynamically. So far I have done this:

    Document doc = Jsoup.connect("http://www.craigslist.org/about/sites").get();
    Elements elms = doc.select("div.colmask"); // gets 7 countries

Below that tag are the div.state_delimiter and ul tags that I am trying to get with doc.select("div.state_delimiter,ul"). I set up an iterator, loop over it, and call iterator.next().outerHtml(); I see all the tags for each country.

How can I iterate over each div.state_delimiter, pull its text, and then walk down to the closing </ul>, which marks the end of that state's list of counties/cities (links and text)?

I played with this and could do it by assigning outerHtml() to a String and parsing the string manually, but I'm sure there is an easier way. I tried text() and also attr("div.state_delimiter"), but I think I got the pattern/procedure wrong. Can someone show me how to get the text of each div.state_delimiter and then everything in <li></li> under the following <ul></ul> for each state? I want to capture the http:// link and the text that goes with it as easily as possible.

1 answer

The <ul> containing the cities is the next sibling of the <div class="state_delimiter">. You can use Element#nextElementSibling() to grab it from that div. Here is a starting example:

    Document document = Jsoup.connect("http://www.craigslist.org/about/sites").get();
    Elements countries = document.select("div.colmask");

    for (Element country : countries) {
        System.out.println("Country: " + country.select("h1.continent_header").text());
        Elements states = country.select("div.state_delimiter");

        for (Element state : states) {
            System.out.println("\tState: " + state.text());
            Elements cities = state.nextElementSibling().select("li");

            for (Element city : cities) {
                System.out.println("\t\tCity: " + city.text());
            }
        }
    }
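The example above prints only the text of each city, while the question also asks for the links. Here is a minimal sketch of one way to collect them. It assumes each <li> wraps an <a href="..."> element, which has not been checked against the live page. The class name StateLinks, the sample markup, and the example.org base URI are all made up for illustration, so the snippet runs without a network connection. It also guards against nextElementSibling() returning null when a delimiter has no following element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StateLinks {

    // Hypothetical sample markup, assumed to mirror the structure described in the question.
    static final String SAMPLE_HTML =
        "<div class='colmask'>"
      + "<h1 class='continent_header'>US</h1>"
      + "<div class='state_delimiter'>Alabama</div>"
      + "<ul><li><a href='/auburn'>auburn</a></li><li><a href='/bham'>birmingham</a></li></ul>"
      + "<div class='state_delimiter'>Alaska</div>"
      + "<ul><li><a href='/anchorage'>anchorage</a></li></ul>"
      + "</div>";

    // Maps each state name to "text -> absolute URL" entries for the links in the <ul> that follows it.
    static Map<String, List<String>> linksByState(Document document) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Element state : document.select("div.state_delimiter")) {
            List<String> links = new ArrayList<>();
            Element list = state.nextElementSibling();
            // Skip delimiters that are not directly followed by a <ul>.
            if (list != null && list.tagName().equals("ul")) {
                for (Element anchor : list.select("li > a[href]")) {
                    // "abs:href" resolves the href against the document's base URI.
                    links.add(anchor.text() + " -> " + anchor.attr("abs:href"));
                }
            }
            result.put(state.text(), links);
        }
        return result;
    }

    public static void main(String[] args) {
        // The base URI is an assumption so that relative hrefs can be resolved offline.
        Document document = Jsoup.parse(SAMPLE_HTML, "http://example.org/");
        linksByState(document).forEach((state, links) -> System.out.println(state + ": " + links));
    }
}
```

To run this against the real page, replace the Jsoup.parse(...) call with Jsoup.connect("http://www.craigslist.org/about/sites").get(), which sets the base URI from the fetched URL.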

doc.select("div.state_delimiter,ul") does not do what you want. The comma is a selector group, so it returns every <div class="state_delimiter"> element and every <ul> element in the document as one flat list. Manually parsing the output with string functions does not make sense when you already have an HTML parser at hand.
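To see the difference, here is a small sketch comparing the comma group with the adjacent sibling combinator (+), which matches only a <ul> that immediately follows a delimiter div. The class name SelectorGroupDemo and the sample markup are hypothetical, chosen only so that the document contains an unrelated <ul>.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorGroupDemo {

    // Hypothetical markup: one unrelated <ul>, then a state delimiter followed by its city list.
    static final String SAMPLE_HTML =
        "<ul id='nav'><li>home</li></ul>"
      + "<div class='state_delimiter'>Alabama</div>"
      + "<ul><li>auburn</li></ul>";

    // Returns "tag:text" for every element matched by the CSS query.
    static List<String> describe(String html, String cssQuery) {
        Document document = Jsoup.parse(html);
        List<String> described = new ArrayList<>();
        for (Element element : document.select(cssQuery)) {
            described.add(element.tagName() + ":" + element.text());
        }
        return described;
    }

    public static void main(String[] args) {
        // Group: matches the delimiter div and both <ul> elements, including the unrelated one.
        System.out.println(describe(SAMPLE_HTML, "div.state_delimiter,ul"));
        // Adjacent sibling: matches only the <ul> directly after the delimiter div.
        System.out.println(describe(SAMPLE_HTML, "div.state_delimiter + ul"));
    }
}
```

The sibling selector gives the lists directly, but it loses the pairing with the state name, which is why iterating over the delimiter divs and calling nextElementSibling() is usually more convenient here.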
