Jsoup - extract text

I need to extract text from a node like this:

<div> Some text <b>with tags</b> might go here. <p>Also there are paragraphs</p> More text can go without paragraphs<br/> </div> 

And I need to build:

 Some text <b>with tags</b> might go here. Also there are paragraphs More text can go without paragraphs 

Element.text() returns all the text inside the div with every tag stripped. Element.ownText() returns only the text that is not inside a child element. Neither is what I need. Iterating over Element.children() skips the text nodes.
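To make the difference concrete, here is a minimal sketch that runs the three approaches against the markup above. It assumes jsoup is on the classpath; the class name TextVsOwnText and the helper div() are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextVsOwnText {
    // Same markup as in the question.
    static final String HTML = "<div> Some text <b>with tags</b> might go here."
            + " <p>Also there are paragraphs</p>"
            + " More text can go without paragraphs<br/> </div>";

    // Hypothetical helper: parse the markup and return the div.
    static Element div() {
        return Jsoup.parse(HTML).select("div").first();
    }

    public static void main(String[] args) {
        Element div = div();
        // text(): all text, descendants included, with every tag removed.
        System.out.println("text():    " + div.text());
        // ownText(): only the text that sits directly inside the div.
        System.out.println("ownText(): " + div.ownText());
        // children(): child elements only (<b>, <p>, <br>); text nodes are skipped.
        for (Element child : div.children()) {
            System.out.println("child:     <" + child.tagName() + ">");
        }
    }
}
```

None of the three calls yields text nodes and elements interleaved in document order, which is what the question is after.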

Is there a way to iterate over the contents of an element so that text nodes are included as well? For example:

  • Text node - Some text
  • Node <b> - with tags
  • Text node - might go here.
  • Node <p> - Also there are paragraphs
  • Text node - More text can go without paragraphs
  • Node <br> - <empty>
4 answers

Element.children() returns an Elements object, which is a list of Element instances only. If you look at the parent class, Node, you will see methods that give access to arbitrary nodes rather than just elements, such as Node.childNodes().

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;

    public class ChildNodesDemo {
        public static void main(String[] args) {
            String str = "<div>"
                    + " Some text <b>with tags</b> might go here."
                    + " <p>Also there are paragraphs</p>"
                    + " More text can go without paragraphs<br/>"
                    + "</div>";

            Document doc = Jsoup.parse(str);
            Element div = doc.select("div").first();

            // childNodes() returns text nodes and elements alike, in document order
            int i = 0;
            for (Node node : div.childNodes()) {
                i++;
                System.out.println(String.format("%d %s %s",
                        i, node.getClass().getSimpleName(), node.toString()));
            }
        }
    }

Result (exact whitespace depends on the jsoup version and its output settings):

    1 TextNode
    Some text
    2 Element <b>with tags</b>
    3 TextNode might go here.
    4 Element <p>Also there are paragraphs</p>
    5 TextNode More text can go without paragraphs
    6 Element <br/>
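
Building on childNodes(), the output the question asks for (keep <b>, flatten everything else) could be assembled roughly as follows. This is only a sketch: the class name KeepBoldOnly and the methods flatten() and demo() are hypothetical, and the rule "keep only <b> as markup" is an assumption read off the desired output in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class KeepBoldOnly {
    // Walks the direct child nodes in document order and flattens
    // everything except <b>, which is kept as markup.
    static String flatten(Element parent) {
        StringBuilder out = new StringBuilder();
        for (Node node : parent.childNodes()) {
            if (node instanceof TextNode) {
                out.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("b")) {
                    out.append(el.outerHtml()); // keep the tag
                } else {
                    out.append(el.text());      // drop the tag, keep its text
                }
            }
        }
        return out.toString();
    }

    // Hypothetical demo on the markup from the question.
    static String demo() {
        Document doc = Jsoup.parse("<div> Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs</p>"
                + " More text can go without paragraphs<br/> </div>");
        // Turn off pretty-printing so outerHtml() adds no whitespace of its own.
        doc.outputSettings().prettyPrint(false);
        return flatten(doc.select("div").first());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the direct children are inspected here, so a <b> nested inside the <p> would be flattened along with the paragraph.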
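Another option is to visit every element under body and read its own text nodes. Note that the text then comes out grouped by element rather than in document order. The println call is an assumption, since the original statement was cut off:

    for (Element el : doc.select("body").select("*")) {
        for (TextNode node : el.textNodes()) {
            System.out.println(node.text());
        }
    }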

Assuming you only need the text (no tags), my solution is below.
Output:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs

    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;

    public class StripTagsDemo {
        public static void main(String[] args) {
            String str = "<div>"
                    + " Some text <b>with tags</b> might go here."
                    + " <p>Also there are paragraphs.</p>"
                    + " More text can go without paragraphs<br/>"
                    + "</div>";

            Document doc = Jsoup.parse(str);
            Element div = doc.select("div").first();
            StringBuilder builder = new StringBuilder();
            stripTags(builder, div.childNodes());
            System.out.println("Text without tags: " + builder.toString());
        }

        /**
         * Strip tags from a List of type <code>Node</code>
         * @param builder StringBuilder : input and output
         * @param nodesList List of type <code>Node</code>
         */
        public static void stripTags(StringBuilder builder, List<Node> nodesList) {
            for (Node node : nodesList) {
                if (node instanceof TextNode) {
                    // text() gives the decoded text; toString() would return escaped HTML
                    builder.append(((TextNode) node).text());
                } else {
                    // recurse into the element's own child nodes
                    stripTags(builder, node.childNodes());
                }
            }
        }
    }
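
The same depth-first walk can also be delegated to jsoup's own traversal instead of hand-written recursion. A minimal sketch, assuming a jsoup version that provides Node.traverse(NodeVisitor); the class name VisitorText and the methods textOf() and demo() are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class VisitorText {
    // Collects the text of every text node below root, in document order.
    static String textOf(Element root) {
        final StringBuilder out = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when a node is first reached; only text nodes matter here.
                if (node instanceof TextNode) {
                    out.append(((TextNode) node).text());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return out.toString();
    }

    // Hypothetical demo on the markup used in the answer above.
    static String demo() {
        String html = "<div> Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs.</p>"
                + " More text can go without paragraphs<br/></div>";
        return textOf(Jsoup.parse(html).select("div").first());
    }

    public static void main(String[] args) {
        System.out.println("Text without tags: " + demo());
    }
}
```

Both methods of the visitor are implemented explicitly so the sketch does not depend on whether tail() has a default implementation in the jsoup version at hand.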

You can use TextNode for this purpose:

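    // Assumes doc is a parsed Document; textNodes() covers direct children only.
    List<TextNode> bodyTextNodes = doc.getElementById("content").textNodes();
    StringBuilder text = new StringBuilder();
    for (TextNode txNode : bodyTextNodes) {
        text.append(txNode.text());
    }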
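
One caveat with textNodes() is that it returns only the text nodes that are direct children of the element. The sketch below illustrates this on the markup from the question; the class name DirectTextNodes and the method directText() are made up for illustration, and a div is selected here instead of an element with id "content".

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class DirectTextNodes {
    // Returns the text of each text node that is a direct child of the first div.
    static List<String> directText(String html) {
        Element div = Jsoup.parse(html).select("div").first();
        List<String> parts = new ArrayList<>();
        for (TextNode textNode : div.textNodes()) {
            parts.add(textNode.text());
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<div> Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs</p>"
                + " More text can go without paragraphs<br/></div>";
        // The text inside <b> and <p> does not show up in this list.
        for (String part : directText(html)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

If the nested text is needed as well, the nodes have to be walked recursively or the element's childNodes() inspected one by one.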
