Jsoup clean method leaves &nbsp; entities

I tried to use this code to strip all HTML elements from a piece of text:

Jsoup.clean(preparedText, Whitelist.none()) 

Unfortunately, it does not remove the &nbsp; entity. I expected it to be replaced with a space, just as the middle dot entity (&middot;) is replaced with ·.

Should I use a different method to achieve this functionality?

java html jsoup
1 answer

From the Jsoup docs:

Whitelists determine which HTML (elements and attributes) a cleaner allows. Everything else is deleted.

Thus, the whitelist applies only to tags and attributes. &nbsp; is neither a tag nor an attribute; it is just the HTML entity encoding of a special character (the non-breaking space). If you want to convert entities to plain text, you can use, for example, the Apache Commons Lang library, or Jsoup's Parser.unescapeEntities method:

 System.out.println(Parser.unescapeEntities(doc.toString(), false)); 
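Note that unescaping turns &nbsp; into the non-breaking space character (U+00A0), not an ordinary space, so if a plain space is what you want, one more step may be needed. A minimal sketch of that step, assuming the input string has already been unescaped as above; the class name NbspNormalizer and the method toPlainSpaces are hypothetical helpers, not part of Jsoup:

```java
// Hypothetical helper (not part of Jsoup): replaces the non-breaking space
// left behind after entity unescaping with an ordinary space.
final class NbspNormalizer {

    // U+00A0 is the character that the &nbsp; entity decodes to.
    private static final char NBSP = '\u00A0';

    // Assumes the input has already been cleaned and unescaped,
    // e.g. with Jsoup.clean(...) followed by Parser.unescapeEntities(...).
    static String toPlainSpaces(String unescaped) {
        return unescaped.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        // Stand-in for text that came out of the unescape step.
        String unescaped = "Hello" + NBSP + "world";
        String normalized = toPlainSpaces(unescaped);
        System.out.println(normalized.equals("Hello world")); // prints "true"
    }
}
```

Doing the replacement after unescaping, rather than searching for the literal entity text, keeps it independent of how the entity was written in the source HTML.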

Addendum:

The translation from &middot; to "·" already happens when the HTML is parsed. It does not appear to be related to the clean method.
