The best solution I found was to use cyberneko to parse your string and do some "smart" SAX event handling.
cyberneko will parse your HTML even if it is invalid, which is the case for the vast majority of HTML that you are likely to encounter in the wild.
If you register a custom ContentHandler that ignores every event except character events and simply appends their text to a StringBuilder, you get a good first approximation with one annoying flaw: words separated by a block element are concatenated (for<br/>example => forexample).
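Here is a minimal sketch of that first approximation. The class name TextOnlyHandler and the extract helper are made up for illustration, and to keep the example dependency-free it drives the handler with the JDK's built-in SAX parser, which only accepts well-formed markup. With cyberneko you would register the same handler on its SAX parser instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps character data and ignores every other SAX event.
class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only the slice of the buffer that belongs to this event.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Assumption: JDK SAX parser as a stand-in, so the input must be well-formed.
    static String extract(String xhtml) {
        try {
            TextOnlyHandler handler = new TextOnlyHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Calling TextOnlyHandler.extract("<p>for<br/>example</p>") shows the flaw: the two words come back glued together.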
A better solution is to keep a list of all block-level elements and, in your ContentHandler, listen for startElement events. If the element is a block-level one, just append a space character to the StringBuilder.
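Something along these lines should do it. This is only a sketch: BlockAwareHandler, its extract helper and the element list are my own illustrative choices (the list is certainly incomplete), and the JDK SAX parser again stands in for cyberneko so the code runs without extra libraries.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: like a text-only handler, but separates block elements with a space.
class BlockAwareHandler extends DefaultHandler {
    // Illustrative, incomplete list of elements treated as "blocky".
    private static final Set<String> BLOCK_ELEMENTS = new HashSet<>(Arrays.asList(
            "br", "p", "div", "li", "ul", "ol", "table", "tr", "td", "th",
            "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"));

    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Lowercase the name so the match is case-insensitive
        // (a parser may report HTML element names in either case).
        if (BLOCK_ELEMENTS.contains(qName.toLowerCase(Locale.ROOT))) {
            text.append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Assumption: JDK SAX parser as a stand-in, so the input must be well-formed.
    static String extract(String xhtml) {
        try {
            BlockAwareHandler handler = new BlockAwareHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            // Trim the space contributed by the outermost block element.
            return handler.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Nested block elements each contribute a space, so you may want to collapse runs of whitespace afterwards.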
Please note that while this seems to work fine, it may not be ideal for your use case: <br/>, for example, is not turned into a line break. That should not be too much work to add if you need it.
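If you do need line breaks, one possible variation (untested against cyberneko, with made-up names and the JDK SAX parser as a stand-in, so the input has to be well-formed) is to special-case br in startElement:

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: br becomes a newline, other block elements become a space.
class LineBreakHandler extends DefaultHandler {
    // Illustrative, incomplete list of block elements (br is handled separately).
    private static final Set<String> BLOCK_ELEMENTS = new HashSet<>(Arrays.asList(
            "p", "div", "li", "ul", "ol", "table", "tr", "td", "th",
            "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"));

    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = qName.toLowerCase(Locale.ROOT);
        if (name.equals("br")) {
            text.append('\n');   // explicit line break
        } else if (BLOCK_ELEMENTS.contains(name)) {
            text.append(' ');    // keep neighbouring words apart
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Assumption: JDK SAX parser as a stand-in for cyberneko.
    static String extract(String xhtml) {
        try {
            LineBreakHandler handler = new LineBreakHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```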
Nicolas Rinaudo