Strip HTML tags from Scala String

I am developing a web application using Scala and the Lift framework. I have a database entry that contains an HTML perex (a short lead paragraph), for example:

<b>Hi all, this is perex</b> 

In one scenario I need to display this perex to the user without the HTML tags:

 Hi all, this is perex 

Can this be done in Scala? I searched Google, but without success.

Thanks for all the answers.

+8
string scala parsing lift strip-tags
3 answers

If the string is valid XML, you can use:

scala.xml.XML.loadString("<b>Hi all, this is perex</b>").text

If it is not valid XML, you can use scala.util.matching.Regex or an HTML parsing library such as jsoup (http://jsoup.org/).
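As a rough illustration of the regex route, here is a minimal sketch that uses only the standard library. The object name RegexTagStripper and the pattern are assumptions made for this example, not part of any library. The pattern is deliberately naive: it removes anything between "<" and ">", so it will also eat text such as "a < b and c > d", and it leaves entities like &amp; untouched.

```scala
import scala.util.matching.Regex

object RegexTagStripper {
  // Naive pattern: "<", then any run of characters other than ">", then ">".
  private val Tag: Regex = "<[^>]*>".r

  /** Removes everything that looks like a tag; entities are left as-is. */
  def stripTags(html: String): String = Tag.replaceAllIn(html, "")
}
```

Calling RegexTagStripper.stripTags("<b>Hi all, this is perex</b>") should give back the bare sentence. For anything beyond simple, trusted markup, a real HTML parser is probably the safer choice.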

+8

The best solution I found was to use cyberneko to parse your string and do some "smart" SAX event handling.

cyberneko will parse your HTML even if it is invalid, which is the case for the vast majority of HTML that you are likely to encounter in the wild.

If you register a custom ContentHandler that ignores all events except character events and simply appends them to a string builder, you get a good first approximation with one annoying flaw: words separated by a block element are concatenated (for<br/>example => forexample).

A better solution is to keep a list of all block-level elements and listen for startElement events in the ContentHandler. If the element is block-level, append a space character to the string builder.

Please note that while this seems to work fine, it may not be ideal for your use case. <br/>, for example, is not turned into a line break. This should not be too much work to add if necessary.
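The handler described above could look roughly like the following sketch. To stay self-contained it uses the JDK's own SAX parser, which only accepts well-formed markup; for real-world HTML you would pass the same handler to cyberneko's parser instead. The names TextExtractor, DefaultSpaced and extract are made up for this example, and the element list is only an assumed starting point.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

/** Collects character data and inserts a space whenever a listed element starts. */
class TextExtractor(spacedElements: Set[String]) extends DefaultHandler {
  private val sb = new StringBuilder

  override def startElement(uri: String, localName: String,
                            qName: String, attrs: Attributes): Unit = {
    // Parsers differ in the case they report, so compare lower-cased names.
    if (spacedElements.contains(qName.toLowerCase)) sb.append(' ')
  }

  override def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    sb.appendAll(ch, start, length)
  }

  /** Extracted text, trimmed and with runs of whitespace collapsed to one space. */
  def text: String = sb.toString.trim.replaceAll("\\s+", " ")
}

object TextExtractor {
  // Assumed list for illustration; extend it with any element that should separate words.
  val DefaultSpaced: Set[String] = Set("br", "div", "p", "li", "td", "h1", "h2", "h3")

  /** Demo only: parses well-formed markup with the JDK parser, not cyberneko. */
  def extract(xml: String): String = {
    val handler = new TextExtractor(DefaultSpaced)
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(new InputSource(new StringReader(xml)), handler)
    handler.text
  }
}
```

With this list, TextExtractor.extract("<p>for<br/>example</p>") keeps the two words apart. Turning <br/> into a real line break would only need a different character appended for that element.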

0

TagSoup should meet your requirement of parsing real-world HTML.

sbt dependency:

 libraryDependencies += "org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1" 

Code example

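    import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
    import scala.xml.{Elem, XML}
    import scala.xml.factory.XMLLoader

    object TagSoupXmlLoader {
      // TagSoup's SAX parser factory, which tolerates malformed HTML
      private val factory = new SAXFactoryImpl()

      // Returns a scala.xml loader backed by a fresh TagSoup parser
      def get(): XMLLoader[Elem] =
        XML.withSAXParser(factory.newSAXParser())
    }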

Usage:

    val root = TagSoupXmlLoader.get().load("http://www.google.com")
    println(root)
0