Java library for html string trimming?

Question


I need to truncate an HTML string that was already sanitized by my application before being stored in the database; it contains only links, images and formatting tags. When presented to users, it needs to be shortened to give an overview of the content.

So I have to truncate HTML strings in Java so that

<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" /> <br/><a href="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />

when truncated does not return something like this

 <img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" /> <br/><a href="htt

but instead returns

 <img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" /> <br/>


java string sanitization

Rajat gupta Feb 17 '15 at 17:09


7 answers

Hoopje · Answer 1 · 2015-02-22T10:56:29+0000

Your requirements are a bit vague, even after reading all the comments. Given your example and explanation, I assume your requirements are as follows:

The input is a string consisting of (X)HTML tags. Your example does not contain any, but I assume that the input may contain text between tags.
In the context of your problem, we do not care about nesting. Thus, the input really is just text mixed with tags, where opening, closing, and self-closing tags are considered equivalent.
Tags may contain quoted values.
You want to truncate your string so that it is not cut in the middle of a tag. Thus, in the truncated string, each '<' character must have a corresponding '>' character.

I will give you two solutions: a simple one, which may be incorrect depending on exactly what the input looks like, and a more complicated one, which is correct.

First solution

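For the first solution, we first find the last '>' before the truncation point (this is the end of the last tag that was completely closed). Text that does not belong to any tag may follow this character, so we then look for the first '<' after the last closed tag. If that '<' lies before the truncation point, the cut falls inside a tag and we truncate at the '<'; otherwise we simply truncate at the requested size. In code: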

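    public static String truncate1(String input, int size) {
        if (input.length() <= size) return input;
        // Last '>' inside the kept part (index 'size' itself is already cut off,
        // so the search must start at size - 1).
        int pos = input.lastIndexOf('>', size - 1);
        // First '<' after the last completely closed tag.
        int pos2 = input.indexOf('<', pos);
        if (pos2 < 0 || pos2 >= size) {
            return input.substring(0, size);   // the cut falls in plain text
        } else {
            return input.substring(0, pos2);   // the cut falls inside a tag: drop that tag
        }
    }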

Of course, this solution does not take quoted strings into account: '<' and '>' characters may appear inside a quoted attribute value, in which case they should be ignored. I mention the solution anyway because you say that your input is sanitized, so maybe you can guarantee that quoted strings never contain '<' or '>'.

Second solution

To take quoted strings into account, we can no longer rely on the standard Java string methods; we must scan the input ourselves and keep track of whether we are inside a tag and inside a quoted string. If we encounter a '<' character outside a string, we remember its position, so that when we reach the truncation point we know where the last opened tag started. If this tag has not been closed, we truncate just before the start of this tag. In code:

    public static String truncate2(String input, int size) {
        if (input.length() < size) return input;
        int lastTagStart = 0;
        boolean inString = false;   // inside a double-quoted attribute value
        boolean inTag = false;      // between '<' and the matching '>'
        for (int pos = 0; pos < size; pos++) {
            switch (input.charAt(pos)) {
                case '<':
                    if (!inString && !inTag) {
                        lastTagStart = pos;
                        inTag = true;
                    }
                    break;
                case '>':
                    if (!inString) inTag = false;
                    break;
                case '\"':
                    if (inTag) inString = !inString;
                    break;
            }
        }
        // If the cut is not inside a tag, keep everything up to 'size'.
        if (!inTag) lastTagStart = size;
        return input.substring(0, lastTagStart);
    }
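The scanner in this answer only toggles on double quotes. If the sanitized input may also contain single-quoted attribute values, the same idea can be extended by remembering which quote character opened the string. The following is a minimal sketch under that assumption; the class name `QuoteAwareTruncate` and the sample input are made up for illustration and are not part of the answer.

```java
class QuoteAwareTruncate {

    /**
     * Returns at most 'size' characters of 'input' without cutting a tag in half.
     * Quoted attribute values (single or double quotes) may contain '<' and '>'.
     */
    static String truncate(String input, int size) {
        if (input.length() <= size) return input;
        int lastTagStart = 0;
        boolean inTag = false;
        char quote = 0;                       // 0 means "not inside a quoted value"
        for (int pos = 0; pos < size; pos++) {
            char c = input.charAt(pos);
            if (quote != 0) {
                if (c == quote) quote = 0;    // closing quote found
            } else if (inTag) {
                if (c == '"' || c == '\'') quote = c;
                else if (c == '>') inTag = false;
            } else if (c == '<') {
                inTag = true;
                lastTagStart = pos;
            }
        }
        // Cut inside a tag: drop the unfinished tag. Otherwise cut at 'size'.
        return inTag ? input.substring(0, lastTagStart) : input.substring(0, size);
    }

    public static void main(String[] args) {
        String html = "<b title='a>b'>bold</b><a href=\"x\">link</a>";
        System.out.println(truncate(html, 20)); // <b title='a>b'>bold
        System.out.println(truncate(html, 30)); // <b title='a>b'>bold</b>
    }
}
```

Like the original, this sketch ignores nesting, so an element such as `<b>` can be left open in the result.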

simbo1905 · Answer 2 · 2015-02-19T22:01:07+0000

An easy way to do this is to use the hotsax code, which parses HTML and lets you interact with the parser through the traditional low-level SAX XML API. (Note that this is not an XML parser: it parses poorly formed HTML and only lets you interact with it using the standard XML API.)

Here on GitHub, I created a working, quick-and-dirty sample project with a main class that parses your truncated example string:

    XMLReader parser = XMLReaderFactory.createXMLReader("hotsax.html.sax.SaxParser");
    final StringBuilder builder = new StringBuilder();

    ContentHandler handler = new DoNothingContentHandler() {
        StringBuilder wholeTag = new StringBuilder();
        boolean hasText = false;
        boolean hasElements = false;
        String lastStart = "";

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            String text = (new String(ch, start, length)).trim();
            wholeTag.append(text);
            hasText = true;
        }

        @Override
        public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
            if (!hasText && !hasElements && lastStart.equals(localName)) {
                builder.append("<" + localName + "/>");
            } else {
                wholeTag.append("</" + localName + ">");
                builder.append(wholeTag.toString());
            }
            wholeTag = new StringBuilder();
            hasText = false;
            hasElements = false;
        }

        @Override
        public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
            wholeTag.append("<" + localName);
            for (int i = 0; i < atts.getLength(); i++) {
                wholeTag.append(" " + atts.getQName(i) + "='" + atts.getValue(i) + "'");
                hasElements = true;
            }
            wholeTag.append(">");
            lastStart = localName;
            hasText = false;
        }
    };

    parser.setContentHandler(handler);
    //parser.parse(new InputSource(new StringReader("<div>this is the <em>end</em> my <br> friend <a href=\"whatever\">some link</a>")));
    parser.parse(new InputSource(new StringReader("<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n<br/><a href=\"htt")));
    System.out.println(builder.toString());

It outputs:

<img src='http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg'></img><br/>

It adds a </img>, but that is harmless for HTML, and it would be possible to tweak the code so that the output matches the input exactly if you think it is necessary.

The hotsax parser code is actually generated by yacc/flex-style compiler tools that run on the HtmlParser.y and StyleLexer.flex files, which define a low-level HTML grammar. Thus, you benefit from the work of the person who created this grammar; all you have to do is write fairly trivial code and test cases to collect the parsed fragments, as shown above. This is much better than trying to write your own regular expressions or, worse, a hand-coded string scanner to interpret the string, as that is very fragile.

Martin kersten · Answer 3 · 2015-02-23T20:46:47+0000

Now that I understand what you want, here is the easiest solution I could come up with.

Just work backward from the end of your substring toward the beginning until you find a '>'. This is the end of the last complete tag. This way you can be sure that in most cases you only have whole tags.

But what if there are '<' or '>' characters inside the text?

To be sure, keep searching backward until you find a '<' and check that it is part of a tag (you know the tag names, for example). Since you only have links, images and formatting tags, you can easily check this. If you find another '>' before you reach the '<' that starts the tag, that '>' is the new end of your string.

It is simple, correct, and should work for you.

If you are not sure whether strings/attributes may contain '<' or '>', you need to look for occurrences of '"' and '="' to determine whether you are inside a quoted string or not. (Remember that you might cut through attribute values.) But I think this is overkill. I have never come across an attribute with '<' and '>' in it, and in text they are usually escaped as &lt; and the like.
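One way to read the backward scan described in this answer is sketched below. It assumes, as the answer does, that neither text nor attribute values contain raw '<' or '>' characters (they are escaped as entities). The class name `BackwardScanTruncate` is hypothetical, and unlike the answer's description this variant keeps any plain text that follows the last complete tag.

```java
class BackwardScanTruncate {

    /** Cuts 'html' to at most 'size' characters, never in the middle of a tag. */
    static String truncate(String html, int size) {
        if (html.length() <= size) return html;
        // Walk backward from the last kept character.
        for (int i = size - 1; i >= 0; i--) {
            char c = html.charAt(i);
            if (c == '>') break;                        // last tag before the cut is complete
            if (c == '<') return html.substring(0, i);  // cut fell inside this tag: drop it
        }
        return html.substring(0, size);
    }

    public static void main(String[] args) {
        String prefix = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /> <br/>";
        String html = prefix + "<a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
        // Cutting 12 characters into the <a> tag yields only the img and br tags.
        System.out.println(truncate(html, prefix.length() + 12));
    }
}
```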

martin · Answer 4 · 2015-02-26T03:16:58+0000

I don't know the context of the problem the OP has to solve, but I'm not sure it makes sense to truncate HTML by the length of its source code rather than by the length of its visual representation (which, of course, can become arbitrarily complex).

Perhaps a combined solution would be useful, so that you do not penalize HTML with a lot of markup or long links, but still set a clear overall limit that cannot be exceeded. As others have already written, using a dedicated HTML parser such as JSoup allows you to process malformed or even invalid HTML.

The solution is loosely based on the JSoup Cleaner. It traverses the parse tree of the source and tries to recreate it in a destination tree, constantly checking whether a limit has been reached.

    import org.jsoup.nodes.*;
    import org.jsoup.parser.*;
    import org.jsoup.select.*;

    String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />"
            + "<br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
    //String html = "<b>foo</b>bar<p class=\"baz\">Some <img />Long Text</p><a href='#'>hello</a>";

    Document srcDoc = Parser.parseBodyFragment(html, "");
    srcDoc.outputSettings().prettyPrint(false);

    Document dstDoc = Document.createShell(srcDoc.baseUri());
    dstDoc.outputSettings().prettyPrint(false);

    Element dst = dstDoc.body();

    NodeVisitor v = new NodeVisitor() {
        private static final int MAX_HTML_LEN = 85;
        private static final int MAX_TEXT_LEN = 40;

        Element cur = dst;
        boolean stop = false;
        int resTextLength = 0;

        @Override
        public void head(Node node, int depth) {
            // ignore "body" element
            if (depth > 0) {
                if (node instanceof Element) {
                    Element curElement = (Element) node;
                    cur = cur.appendElement(curElement.tagName());
                    cur.attributes().addAll(curElement.attributes());
                    String resHtml = dst.html();
                    if (resHtml.length() > MAX_HTML_LEN) {
                        cur.remove();
                        throw new IllegalStateException("html too long");
                    }
                } else if (node instanceof TextNode) {
                    String curText = ((TextNode) node).getWholeText();
                    String resHtml = dst.html();
                    if (curText.length() + resHtml.length() > MAX_HTML_LEN) {
                        cur.appendText(curText.substring(0, MAX_HTML_LEN - resHtml.length()));
                        throw new IllegalStateException("html too long");
                    } else if (curText.length() + resTextLength > MAX_TEXT_LEN) {
                        cur.appendText(curText.substring(0, MAX_TEXT_LEN - resTextLength));
                        throw new IllegalStateException("text too long");
                    } else {
                        resTextLength += curText.length();
                        cur.appendText(curText);
                    }
                }
            }
        }

        @Override
        public void tail(Node node, int depth) {
            if (depth > 0 && node instanceof Element) {
                cur = cur.parent();
            }
        }
    };

    try {
        // Constructor-based API of older jsoup releases; newer releases provide
        // the static NodeTraversor.traverse(v, srcDoc.body()) instead.
        NodeTraversor t = new NodeTraversor(v);
        t.traverse(srcDoc.body());
    } catch (IllegalStateException ex) {
        System.out.println(ex.getMessage());
    }

    System.out.println(" in='" + srcDoc.body().html() + "'");
    System.out.println("out='" + dst.html() + "'");

In this example with a maximum length of 85, the result is:

    html too long
     in='<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"><br><a href="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"></a>'
    out='<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"><br>'

It also truncates correctly inside nested elements. For a maximum HTML length of 16, the result is:

    html too long
     in='<i>f<b>oo</b>b</i>ar'
    out='<i>f<b>o</b></i>'

For a maximum text length of 2, the result for a long link is:

    text too long
     in='<a href="someverylonglink"><b>foo</b>bar</a>'
    out='<a href="someverylonglink"><b>fo</b></a>'

Android Team · Answer 5 · 2015-02-26T08:13:50+0000

You can achieve this with the "JSOUP" library, an HTML parser.

You can download it from the link below.

Download JSOUP

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class HTMLParser {
        public static void main(String[] args) {
            String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
            Document doc = Jsoup.parse(html);
            doc.select("a").remove();
            System.out.println(doc.body().children());
        }
    }

Martin kersten · Answer 6 · 2015-02-21T21:45:43+0000

Well, whatever you do: there are two libraries, jSoup and HtmlParser, that I usually use. Please check them out. Also, I barely see XHTML in the wild anymore. These days it is more about HTML5 (which has no XHTML equivalent).

[Update]

I mention JSoup and HtmlParser because they are fault-tolerant in the way a browser is. Please check whether they suit you, as they cope very well with malformed and damaged HTML. Create a DOM from your HTML and write it back to a string: this should get rid of the damaged tags, and you can also filter the DOM yourself and remove even more content if you need to.

PS: I think that the decade of XML has finally (and thankfully) ended. Nowadays it is JSON that gets abused.

Martin kersten · Answer 7 · 2015-02-26T07:55:43+0000

A third approach, which I would consider a potential solution, is not to work with strings in the first place.

If I remember correctly, there are DOM tree representations that stay closely tied to the underlying string. Therefore, they are exact. I wrote one myself, but I think jSoup has such a mode. Since there are many parsers, you should be able to find one that actually does this.

With such a parser, you can easily see which tag spans from which position in the string to which other position. In fact, these parsers keep the document string and store only range information, for example the start and end positions inside the document, which avoids duplicating this information for nested nodes.

Therefore, you can find the outermost node for a given position, know exactly where it starts and ends, and easily decide whether this tag (including all its children) can be included in your fragment. This way you can output complete text nodes and so on, without the risk of presenting only partial tags or partial heading text.
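The range idea can be illustrated without any parser library. The sketch below is not the API of jSoup or any other parser: it is a hand-written scanner, with the hypothetical class name `RangeTruncate`, that records the end offset of every top-level node (a text run, or an element together with all its children) and keeps only whole nodes that fit. It assumes sanitized input: no comments, no raw '<' or '>' inside attribute values, and a small illustrative set of void elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RangeTruncate {

    // Elements that never have a closing tag (illustrative subset only).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("img", "br", "hr", "input"));

    /** End offsets (exclusive) of each top-level node in source order. */
    static List<Integer> topLevelEnds(String html) {
        List<Integer> ends = new ArrayList<>();
        int depth = 0;
        int pos = 0;
        while (pos < html.length()) {
            int next;
            if (html.charAt(pos) == '<') {
                int close = html.indexOf('>', pos);
                if (close < 0) break;                 // unterminated tag: ignore the rest
                next = close + 1;
                String tag = html.substring(pos + 1, close);
                if (tag.startsWith("/")) {
                    depth = Math.max(0, depth - 1);   // closing tag
                } else if (!tag.endsWith("/") && !VOID.contains(name(tag))) {
                    depth++;                          // opening tag that expects a close
                }
            } else {
                int lt = html.indexOf('<', pos);      // text run up to the next tag
                next = lt < 0 ? html.length() : lt;
            }
            if (depth == 0) ends.add(next);           // a top-level node is complete here
            pos = next;
        }
        return ends;
    }

    /** Extracts the lower-cased tag name from the text between '<' and '>'. */
    private static String name(String tag) {
        int i = 0;
        while (i < tag.length() && Character.isLetterOrDigit(tag.charAt(i))) i++;
        return tag.substring(0, i).toLowerCase();
    }

    /** Keeps the longest prefix of whole top-level nodes that fits into 'limit'. */
    static String truncate(String html, int limit) {
        int end = 0;
        for (int e : topLevelEnds(html)) {
            if (e > limit) break;
            end = e;
        }
        return html.substring(0, end);
    }

    public static void main(String[] args) {
        String html = "<b>foo</b>bar<a href=\"x\"><i>deep</i> link</a>tail";
        System.out.println(truncate(html, 30)); // <b>foo</b>bar
    }
}
```

Because a nested element is only counted once its outermost ancestor is closed, the result never contains an element without its closing tag, at the cost of dropping a whole subtree that does not fit.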

If you do not find a parser that suits you, you can ask me for advice.
