How to preserve line breaks when using jsoup to convert HTML to plain text?

Question

I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {

     public String noTags(String str) {
         return Jsoup.parse(str).text();
     }

     public static void main(String args[]) {
         String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">"
                 + "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";
         NewClass text = new NewClass();
         System.out.println((text.noTags(strings)));
     }
 }

And I have the result:

 hello world yo googlez

But I want the line breaks preserved:

 hello world
 yo googlez

I looked at jsoup's TextNode#getWholeText(), but I cannot figure out how to use it.

If there is a <br> in the markup I am parsing, how can I get a line break in the resulting output?

+94

java jsoup

Billy Apr 12 2018-11-11T00:

15 answers

user121196 · Answer 1 · 2013-10-26 02:57

A solution that really preserves line breaks looks like this:

 public static String br2nl(String html) {
     if (html == null)
         return html;
     Document document = Jsoup.parse(html);
     // makes html() preserve linebreaks and spacing
     document.outputSettings(new Document.OutputSettings().prettyPrint(false));
     document.select("br").append("\\n");
     document.select("p").prepend("\\n\\n");
     String s = document.html().replaceAll("\\\\n", "\n");
     return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
 }

It satisfies the following requirements:

If the original HTML contains a newline (\n), it is preserved.
If the original HTML contains br or p tags, they are translated to a newline (\n).

Mirco Attocchi · Answer 2 · 2011-05-17 13:26

With

 Jsoup.parse("A\nB").text();

you get the output

 "A B"

and not

 A
 B

For this, I use:

 descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
 text = descrizione.replaceAll("br2n", "\n");
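
To see what the regular expression in this answer matches, here is a minimal sketch that uses only the JDK and no jsoup. The class name BrRegexDemo and the helper replaceBr are hypothetical names chosen for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BrRegexDemo {
    // Same pattern as in the answer: case-insensitive "<br", any further
    // characters except ">", then the closing ">".
    private static final Pattern BR = Pattern.compile("(?i)<br[^>]*>");

    /** Replaces every match of the pattern with the given string, taken literally. */
    static String replaceBr(String html, String replacement) {
        return BR.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "one<br>two<BR/>three<br class=\"x\" />four";
        // Each <br> variant is swapped for the temporary marker "br2n".
        System.out.println(replaceBr(html, "br2n"));
    }
}
```

Because the pattern only requires the tag to start with "<br", it would also match any other tag whose name begins with "br", which may or may not matter for your input.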

Paulius Z · Answer 3 · 2013-04-23 16:46

 Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We use this method here:

 public static String clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings)

By passing Whitelist.none(), we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false), we make sure that the output is not reformatted and that line breaks are preserved.

mkowa · Answer 4 · 2013-06-24 15:42

Try this with jsoup:

 public static String cleanPreserveLineBreaks(String bodyHtml) {
     // get pretty printed html with preserved br and p tags
     String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
     // get plain text with preserved line breaks by disabled prettyPrint
     return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
 }

popcorny · Answer 5 · 2013-08-01 08:53

You can traverse the given element.

 public String convertNodeToText(Element element) {
     final StringBuilder buffer = new StringBuilder();

     new NodeTraversor(new NodeVisitor() {
         boolean isNewline = true;

         @Override
         public void head(Node node, int depth) {
             if (node instanceof TextNode) {
                 TextNode textNode = (TextNode) node;
                 String text = textNode.text().replace('\u00A0', ' ').trim();
                 if (!text.isEmpty()) {
                     buffer.append(text);
                     isNewline = false;
                 }
             } else if (node instanceof Element) {
                 Element element = (Element) node;
                 if (!isNewline) {
                     if ((element.isBlock() || element.tagName().equals("br"))) {
                         buffer.append("\n");
                         isNewline = true;
                     }
                 }
             }
         }

         @Override
         public void tail(Node node, int depth) {
         }
     }).traverse(element);

     return buffer.toString();
 }

And for your code:

 String result = convertNodeToText(Jsoup.parse(html));

zeenosaur · Answer 6 · 2018-05-17 14:04

In Jsoup v1.11.2 we can now use Element.wholeText() .

Code example:

 String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works, but wholeText() also preserves the alignment of the text.

Green Beret · Answer 7 · 2014-07-24 04:53

 text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
 text = text.replaceAll("br2n", "\n");

works as long as the HTML itself does not contain "br2n".

So,

 text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();

works more reliably and is simpler.

abdolence · Answer 8 · 2016-06-05 12:59

This is my version of converting HTML to text (actually a modified version of user121196's answer).

It not only preserves line breaks but also formats the text, removes excessive line breaks and unescapes HTML entities, so you get a much better result from your HTML (in my case, the HTML comes from e-mail).

It was originally written in Scala, but you can easily port it to Java.

 def html2text( rawHtml : String ) : String = {

     val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
     htmlDoc.select("br").append("\\nl")
     htmlDoc.select("div").prepend("\\nl").append("\\nl")
     htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")

     org.jsoup.parser.Parser.unescapeEntities(
         Jsoup.clean(
             htmlDoc.html(),
             "",
             Whitelist.none(),
             new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
         ),false
     ).
     replaceAll("\\\\nl", "\n").
     replaceAll("\r","").
     replaceAll("\n\\s+\n","\n").
     replaceAll("\n\n+","\n\n").
     trim()
 }
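
For readers porting this to Java, the whitespace cleanup at the end of the chain can be tried on its own. The following is a minimal sketch of just those four steps, assuming the text has already been extracted; the class name TextCleanup and the method normalizeLineBreaks are hypothetical names, and no jsoup is involved.

```java
class TextCleanup {
    /**
     * Applies the same cleanup chain as the Scala version:
     * drop carriage returns, collapse newline-whitespace-newline runs,
     * cap consecutive newlines at two, then trim both ends.
     */
    static String normalizeLineBreaks(String text) {
        return text
                .replaceAll("\r", "")            // remove carriage returns
                .replaceAll("\n\\s+\n", "\n")    // newline, whitespace, newline -> one newline
                .replaceAll("\n\n+", "\n\n")     // at most one blank line in a row
                .trim();                         // strip leading/trailing whitespace
    }

    public static void main(String[] args) {
        String raw = "Hello\r\n\r\nworld\n \t \nbye\n";
        System.out.println(normalizeLineBreaks(raw));
    }
}
```

One thing to be aware of: since \s also matches newlines, the second step turns any run of three or more consecutive newlines into a single newline, while exactly two newlines are left alone.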

Malcolm Smith · Answer 9 · 2017-05-19 08:21

Based on the other answers and the comments on this question, it seems that most people coming here are really looking for a general solution that provides a nicely formatted plain-text representation of an HTML document. I know I was.

Fortunately, jsoup already provides a pretty comprehensive example of how to achieve this: HtmlToPlainText.java

The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.

To avoid link rot, here is Jonathan Hedley's solution in full:

 package org.jsoup.examples;

 import org.jsoup.Jsoup;
 import org.jsoup.helper.StringUtil;
 import org.jsoup.helper.Validate;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;
 import org.jsoup.nodes.Node;
 import org.jsoup.nodes.TextNode;
 import org.jsoup.select.Elements;
 import org.jsoup.select.NodeTraversor;
 import org.jsoup.select.NodeVisitor;

 import java.io.IOException;

 /**
  * HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
  * plain-text. That is divergent from the general goal of jsoup .text() methods, which is to get clean data from a
  * scrape.
  * <p>
  * Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
  * </p>
  * <p>
  * To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
  * <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
  * where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
  *
  * @author Jonathan Hedley, jonathan@hedley.net
  */
 public class HtmlToPlainText {
     private static final String userAgent = "Mozilla/5.0 (jsoup)";
     private static final int timeout = 5 * 1000;

     public static void main(String... args) throws IOException {
         Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
         final String url = args[0];
         final String selector = args.length == 2 ? args[1] : null;

         // fetch the specified URL and parse to a HTML DOM
         Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();

         HtmlToPlainText formatter = new HtmlToPlainText();

         if (selector != null) {
             Elements elements = doc.select(selector); // get each element that matches the CSS selector
             for (Element element : elements) {
                 String plainText = formatter.getPlainText(element); // format that element to plain text
                 System.out.println(plainText);
             }
         } else { // format the whole doc
             String plainText = formatter.getPlainText(doc);
             System.out.println(plainText);
         }
     }

     /**
      * Format an Element to plain-text
      * @param element the root element to format
      * @return formatted text
      */
     public String getPlainText(Element element) {
         FormattingVisitor formatter = new FormattingVisitor();
         NodeTraversor traversor = new NodeTraversor(formatter);
         traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node

         return formatter.toString();
     }

     // the formatting rules, implemented in a breadth-first DOM traverse
     private class FormattingVisitor implements NodeVisitor {
         private static final int maxWidth = 80;
         private int width = 0;
         private StringBuilder accum = new StringBuilder(); // holds the accumulated text

         // hit when the node is first seen
         public void head(Node node, int depth) {
             String name = node.nodeName();
             if (node instanceof TextNode)
                 append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
             else if (name.equals("li"))
                 append("\n * ");
             else if (name.equals("dt"))
                 append(" ");
             else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
                 append("\n");
         }

         // hit when all of the node children (if any) have been visited
         public void tail(Node node, int depth) {
             String name = node.nodeName();
             if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
                 append("\n");
             else if (name.equals("a"))
                 append(String.format(" <%s>", node.absUrl("href")));
         }

         // appends text to the string builder with a simple word wrap method
         private void append(String text) {
             if (text.startsWith("\n"))
                 width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
             if (text.equals(" ") &&
                     (accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
                 return; // don't accumulate long runs of empty spaces

             if (text.length() + width > maxWidth) { // won't fit, needs to wrap
                 String words[] = text.split("\\s+");
                 for (int i = 0; i < words.length; i++) {
                     String word = words[i];
                     boolean last = i == words.length - 1;
                     if (!last) // insert a space if not the last word
                         word = word + " ";
                     if (word.length() + width > maxWidth) { // wrap and reset counter
                         accum.append("\n").append(word);
                         width = word.length();
                     } else {
                         accum.append(word);
                         width += word.length();
                     }
                 }
             } else { // fits as is, without need to wrap text
                 accum.append(text);
                 width += text.length();
             }
         }

         @Override
         public String toString() {
             return accum.toString();
         }
     }
 }

Abhay Gupta · Answer 10 · 2017-09-08 19:38

Try this with jsoup:

 doc.outputSettings(new OutputSettings().prettyPrint(false));
 // select all <br> tags and append \n after that
 doc.select("br").after("\\n");
 // select all <p> tags and prepend \n before that
 doc.select("p").before("\\n");
 // get the HTML from the document, and retaining original new lines
 String str = doc.html().replaceAll("\\\\n", "\n");

Andy Res · Answer 11 · 2017-09-21 12:49

For more complex HTML, none of the above solutions work correctly; I was able to successfully perform the conversion while maintaining line breaks with:

 Document document = Jsoup.parse(myHtml);
 String text = new HtmlToPlainText().getPlainText(document);

(version 1.10.3)

manji · Answer 12 · 2011-04-12 20:08

Try the following:

 public String noTags(String str) {
     Document d = Jsoup.parse(str);
     TextNode tn = new TextNode(d.body().html(), "");
     return tn.getWholeText();
 }

Michael Bar-Sinai · Answer 13 · 2013-09-18 17:02

Use textNodes() to get a list of the text nodes, then join them with \n as the separator. Here is some Scala code that I use for this; a Java port should be easy:

 val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
                     .asScala.mkString("<br />\n")

Chris6647 · Answer 14 · 2014-01-25 18:48

 /**
  * Recursive method to replace html br with java \n. The recursive method ensures that the linebreaker can never end up pre-existing in the text being replaced.
  * @param html
  * @param linebreakerString
  * @return the html as String with proper java newlines instead of br
  */
 public static String replaceBrWithNewLine(String html, String linebreakerString) {
     String result = "";
     if (html.contains(linebreakerString)) {
         result = replaceBrWithNewLine(html, linebreakerString + "1");
     } else {
         result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace and html line breaks with java linebreak.
         result = result.replaceAll(linebreakerString, "\n");
     }
     return result;
 }

Call it with the HTML in question, containing the br tags, along with whatever string you want to use as a temporary placeholder for the newline. For example:

 replaceBrWithNewLine(element.html(), "br2n")

The recursion ensures that the string you use as the newline placeholder never actually occurs in the original HTML, because it keeps appending "1" until the placeholder string is no longer found in the HTML. It does not have the formatting problems that the Jsoup.clean methods seem to run into with special characters.
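
The placeholder search on its own can also be written as a loop. This is a minimal sketch of that one step, using only the JDK; the class name PlaceholderPicker and the method uniquePlaceholder are hypothetical names, not part of the answer or of jsoup.

```java
class PlaceholderPicker {
    /**
     * Returns a marker string that does not occur in the given HTML,
     * starting from base and appending "1" until there is no collision.
     */
    static String uniquePlaceholder(String html, String base) {
        String candidate = base;
        // The loop always ends: the candidate eventually becomes longer
        // than the HTML and can then no longer be contained in it.
        while (html.contains(candidate)) {
            candidate += "1";
        }
        return candidate;
    }

    public static void main(String[] args) {
        String html = "The token br2n already appears here<br>next line";
        System.out.println(uniquePlaceholder(html, "br2n"));
    }
}
```

The returned marker could then be used in place of the fixed "br2n" when substituting the br tags before parsing. The base string should be non-empty, since an empty string is contained in every input.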

Bevor · Answer 15 · 2016-05-31 18:14

Based on the answers of user121196 and Green Beret, combining the selects with the <pre>s, the only solution that works for me is:

 org.jsoup.nodes.Element elementWithHtml = ....
 elementWithHtml.select("br").append("<pre>\n</pre>");
 elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
 elementWithHtml.text();
