How to convert HTML to text while keeping line breaks

Question

How can I convert HTML to text while keeping the line breaks (produced by elements such as br, p, div, ...), possibly using NekoHTML or any decent enough HTML parser?

Example:

 Hello<br/>World

to:

 Hello\n World

java html

Eduardo Mar 25 '10 at 7:24

6 answers

jasop · Answer 1 · 2011-12-15 01:39

Here is the function I wrote to output the text (including line breaks) by iterating over the nodes with Jsoup.

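 import java.io.IOException;
 import java.io.InputStream;
 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;
 import org.jsoup.nodes.Node;
 import org.jsoup.nodes.TextNode;

 public static String htmlToText(InputStream html) throws IOException {
   // A null charset makes Jsoup detect the encoding (falling back to UTF-8).
   Document document = Jsoup.parse(html, null, "");
   Element body = document.body();
   return buildStringFromNode(body).toString();
 }

 private static StringBuilder buildStringFromNode(Node node) {
   StringBuilder buffer = new StringBuilder();
   if (node instanceof TextNode) {
     // Note: trim() also drops spaces between adjacent inline elements.
     buffer.append(((TextNode) node).text().trim());
   }
   // Children come first, then this node's own line break.
   for (Node childNode : node.childNodes()) {
     buffer.append(buildStringFromNode(childNode));
   }
   if (node instanceof Element) {
     String tagName = ((Element) node).tagName();
     if ("p".equals(tagName) || "br".equals(tagName)) {
       buffer.append("\n");
     }
   }
   return buffer;
 }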

weakish · Answer 2 · 2010-04-08 13:34

 w3m -dump -no-cookie input.html > output.txt

msw · Answer 3 · 2010-03-25 07:55

I found a relatively clever solution in html2text: THE ASCIINATOR, which does an admirable job of producing nroff-like output (for example, what man ls looks like when run in a terminal). It produces Markdown-style output, the same style Stack Overflow uses as input.

For moderately complex pages, such as this page, the output is somewhat scattered, because it tries hard to turn a non-linear layout into something linear. The output from less complex markup is pretty readable.

Kevin Reid · Answer 4 · 2010-03-25 11:20

If you don't mind hard-wrapped output designed for a monospace font, lynx -dump produces good plain text from HTML.

Blessed Geek · Answer 5 · 2010-04-13 06:20

HTML to text: I take this to mean that all HTML formatting, except line breaks, is to be discarded.

What I did for such a task was to use a regexp to detect any tag. If the tag is br or br /, a line break is inserted; otherwise the tag is discarded.

It works only for simple HTML pages. Tables will obviously be linearized.
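The regexp approach described above could be sketched roughly like this in Java. It is a minimal illustration, not my original code: the class name RegexHtmlToText, the method name convert and the exact pattern are assumptions made for the example, and it does not decode entities such as &amp;amp; or handle comments and script blocks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: br tags become line breaks, every other tag is dropped.
final class RegexHtmlToText {
    // Matches any opening or closing tag and captures its name in group 1.
    private static final Pattern TAG =
            Pattern.compile("<\\s*/?\\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    static String convert(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one.
            out.append(html, last, m.start());
            // <br>, <br/> and <BR /> all yield a line break.
            if ("br".equalsIgnoreCase(m.group(1))) {
                out.append('\n');
            }
            last = m.end();
        }
        // Copy whatever text follows the last tag.
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "Hello" and "World" on two separate lines.
        System.out.println(convert("Hello<br/>World"));
    }
}
```

As said, this is only reasonable for simple pages; a real parser is the safer choice for anything else.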

I was thinking about also detecting the text enclosed in the title tag, so that the converter automatically puts the title at the top of the page. That needs a bit more algorithm. But my time is better spent on ...

I have read about using the Google Data API to upload a document to Google Docs and then using the same API to download/export it as text. Or why text, when I could get PDF? But you need a Google account if you don't already have one.

Download / Export Google Docs Data

Google api docs data for java

Kyra · Answer 6 · 2010-04-07 06:05

Does it matter what language you use? You can always use pattern matching. Basically, you can replace the HTML tags that break a line (br, p, div, ...) with "\n" and remove all other tags. You can store the line-breaking tags in an array so they are easy to check as you go through the HTML file. Any other tags, and all end tags (/p, ...), can then be replaced with an empty string, which gives you your result.
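Since the question is tagged java, here is a rough sketch of that idea. The class name BreakTagStripper, the contents of BREAK_TAGS and the pattern are assumptions chosen for illustration; extend the set with whatever tags you want to treat as line-breaking.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: opening line-breaking tags become "\n", all other tags vanish.
final class BreakTagStripper {
    // The stored list of line-breaking tags (an assumed, incomplete set).
    private static final Set<String> BREAK_TAGS =
            new HashSet<>(Arrays.asList("br", "p", "div"));

    // Group 1 is "/" for an end tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<\\s*(/?)\\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    static String convert(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean endTag = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            // End tags and unknown tags are replaced with the empty string.
            String replacement = (!endTag && BREAK_TAGS.contains(name)) ? "\n" : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Note that a document starting with a p or div gets a leading "\n" this way, so you may want to trim the result.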
