Jsoup - extract text

Question

I need to extract text from node as follows:

<div> Some text <b>with tags</b> might go here. <p>Also there are paragraphs</p> More text can go without paragraphs<br/> </div>

And I need to build:

 Some text <b>with tags</b> might go here. Also there are paragraphs More text can go without paragraphs

Element.text() returns the entire text content of the div with all tags stripped. Element.ownText() returns only the text that is not inside child elements. Neither is what I need. Iterating through children() skips text nodes.

Is there a way to iterate over the contents of an element so that text nodes are included as well? For example:

Text node - Some text
Node <b> - with tags
Text node - might go here.
Node <p> - Also there are paragraphs
Text node - More text can go without paragraphs
Node <br> - <empty>

+7

java iteration jsoup text-extraction

Eugene Retunsky Apr 16 '12 at 16:19

4 answers

    // Visit every element under <body> and print its direct text nodes.
    for (Element el : doc.select("body").select("*")) {
        for (TextNode node : el.textNodes()) {
            System.out.println(node.text());
        }
    }

+3

Charles Aug 13 '13 at 21:10

Assuming you only need the text (no tags), my solution is below.
Output:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs

    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;

    public class StripTags {

        public static void main(String[] args) {
            String str = "<div>"
                    + " Some text <b>with tags</b> might go here."
                    + " <p>Also there are paragraphs.</p>"
                    + " More text can go without paragraphs<br/>"
                    + "</div>";

            Document doc = Jsoup.parse(str);
            Element div = doc.select("div").first();
            StringBuilder builder = new StringBuilder();
            stripTags(builder, div.childNodes());
            System.out.println("Text without tags: " + builder.toString());
        }

        /**
         * Strip tags from a List of type <code>Node</code>
         * @param builder StringBuilder : input and output
         * @param nodesList List of type <code>Node</code>
         */
        public static void stripTags(StringBuilder builder, List<Node> nodesList) {
            for (Node node : nodesList) {
                String nodeName = node.nodeName();
                if (nodeName.equalsIgnoreCase("#text")) {
                    // Text node: append its content.
                    builder.append(node.toString());
                } else {
                    // Element: recurse into its children.
                    stripTags(builder, node.childNodes());
                }
            }
        }
    }

+1

John Zoetebier Dec 16 '14 at 20:21

You can use TextNode for this purpose:

    // textNodes() returns only the element's direct child text nodes.
    List<TextNode> bodyTextNode = doc.getElementById("content").textNodes();
    String html = "";
    for (TextNode txNode : bodyTextNode) {
        html += txNode.text();
    }

+1

Haydar Ghasemi Jul 21 '15 at 18:41

Vadim Ponomarev · Accepted Answer · 2012-04-16T20:45:27+0000

Element.children() returns an Elements object, which is a list of Element objects. If you look at the parent class, Node, you will see methods that give access to arbitrary nodes, not just elements, such as Node.childNodes().

    public static void main(String[] args) throws IOException {
        String str = "<div>"
                + " Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs</p>"
                + " More text can go without paragraphs<br/>"
                + "</div>";

        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;

        // childNodes() yields text nodes as well as elements.
        for (Node node : div.childNodes()) {
            i++;
            System.out.println(String.format("%d %s %s",
                    i,
                    node.getClass().getSimpleName(),
                    node.toString()));
        }
    }

Result:

    1 TextNode
     Some text
    2 Element <b>with tags</b>
    3 TextNode  might go here.
    4 Element <p>Also there are paragraphs</p>
    5 TextNode  More text can go without paragraphs
    6 Element <br/>
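
The distinction between child elements and child nodes is not specific to jsoup. As a rough sketch for comparison, here is the same walk written against the JDK's built-in org.w3c.dom API, so it runs without jsoup on the classpath. This is a swapped-in API, not jsoup: getChildNodes() plays the role that Node.childNodes() plays above. The class name ChildNodeWalk and the method describeChildren are made up for illustration, and the input is assumed to be well-formed XML (jsoup is far more tolerant of real-world HTML).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; uses the JDK DOM API, not jsoup.
public class ChildNodeWalk {

    /** Describes each direct child of the root element, text nodes included. */
    static List<String> describeChildren(String xml) {
        Node root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            // Assumption: callers pass well-formed XML; anything else is rejected.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        List<String> out = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            String text = child.getTextContent().trim();
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Skip whitespace-only text nodes between tags.
                if (!text.isEmpty()) {
                    out.add("Text node - " + text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                out.add("Node <" + child.getNodeName() + "> - "
                        + (text.isEmpty() ? "<empty>" : text));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<div> Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs</p>"
                + " More text can go without paragraphs<br/> </div>";
        describeChildren(xml).forEach(System.out::println);
    }
}
```

Run on the div from the question, this produces one line per child in document order, in the "Text node - ..." / "Node <b> - ..." shape the question asks for. With jsoup, the same branching can be done on the objects returned by Node.childNodes(), checking for TextNode versus Element.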
