RichTextArea is based on the browser's contentEditable feature. This means that the HTML tag soup you encounter will be specific to the platform, the source application, and the browser. When you say “optimize,” what is your ultimate goal? Which parts of the original formatting do you want to keep? Beyond trivial minification of the pasted HTML, any significant reduction in HTML complexity is likely to cost you visual fidelity.
Utilities such as HTML Tidy or one of its derivatives may help with minification. If your goal is to reduce HTML complexity, you can use HtmlUnit as a headless, server-side browser to render the pasted content in memory, and then extract the attributes you find useful from the HtmlUnit DOM. FWIW, this is also one way to make AJAX applications crawlable by search engines.
While reduced visual fidelity may be somewhat confusing for the user who pasted the content, it gives you the ability to unify the visual style of all pasted content. If you are building a site based on input from many users, this uniformity reduces the mental effort required to orient oneself in (i.e. make sense of) the content.
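To make the "unify the visual style" idea concrete, here is a rough sketch in plain Java. It is not anything RichTextArea or HtmlUnit provides: the class name PasteNormalizer, the normalize method, and the list of kept tags are all made up for illustration. It keeps a small allowlist of structural tags, drops every attribute (inline styles, classes), and unwraps everything else. It works on the raw string with a regular expression, so it will misbehave on attribute values containing ">" and it leaves the text inside script or style elements in place; a real DOM, such as the HtmlUnit one mentioned above, would be more robust. It is not a security sanitizer.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PasteNormalizer {
    // Hypothetical allowlist: tags that survive normalization (without attributes).
    private static final Set<String> KEPT_TAGS =
            Set.of("p", "br", "b", "i", "em", "strong", "ul", "ol", "li");

    // Matches an opening or closing tag.
    // Group 1 is the optional "/", group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    // Returns the input with attributes stripped from kept tags
    // and all other tags removed (their text content is left in place).
    public static String normalize(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            String replacement = KEPT_TAGS.contains(name)
                    ? "<" + m.group(1) + name + ">"
                    : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

You would call PasteNormalizer.normalize(...) on the HTML string before storing it, and let your site's stylesheet decide how the remaining tags look. Which tags to keep depends entirely on how much of the original formatting you decide matters.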
Bobv