Remove some HTML tags with RegExp and Java

I want to remove HTML tags from a String. That part is easy, and I did it like this:

public String removerTags(String html)  
    {  
        return html.replaceAll("\\<(/?[^\\>]+)\\>", " ").replaceAll("\\s+", " ").trim();  
    }  

The problem is that I do not want to delete all tags. I want the tag

<span style="background-color: yellow"> (text) </span>

to remain unchanged in the string.

I use this span as a kind of "highlight" for search results in a web application that I am building with GWT.

I need to strip the other tags because, when the search returns text that contains HTML tags (indexing is done by Lucene), the markup can end up broken, and then appendHTML from safeHTMLBuilder cannot build the string.

Is there a clean way to do this?

Hugs.

+5
3 answers
+4

You can do something like this:

public static String filterHTMLTags(String html) {

    // save valid tags by rewriting them as {{...}};
    // the \\b keeps "b" from also matching <body> and "a" from matching <abbr>
    String stripped = html.replaceAll("(?i)\\<(\\s*/?(a|h\\d|b|i|em|cite|code|strong|pre|br)\\b.*?/?)\\>", "{{$1}}");
    // remove all remaining tags:
    stripped = stripped.replaceAll("\\<(/?[^\\>]+)\\>", " ");
    // restore valid tags:
    stripped = stripped.replaceAll("\\{\\{(.+?)\\}\\}", "<$1>");

    return stripped;
}

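First, the allowed tags are rewritten as "{{...}}" so that they no longer look like HTML tags. Then every remaining tag is replaced with " ", and finally the saved tags are restored. You can change the list of allowed tags in the first replaceAll: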

(a|h\d|b|i|em|cite|code|strong|pre|br)

Here "h\d" matches the heading tags "h1, h2, ...".

Example:

public static void main (String[] args) {

    String teste = " <b>test bold chars</b> <BR/> <div>test div</div> \n" +
            " link: <a href=\"test.html\">click here</a> <br />\n" +
            " <script>bad script</script> <notpermitted/>\n";

    System.out.println("teste: \n"+teste);
    System.out.println("\n\n\nstripped: \n"+filterHTMLTags(teste));
}
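
For the case in the question, the same save / strip / restore idea can be narrowed to the yellow highlight span alone. The following is only a sketch under some assumptions: the class name HighlightTagFilter, the method keepHighlightOnly and the "{{hl}}" placeholders are made up for illustration, the highlighted text is assumed not to contain a nested span, and the input is assumed not to contain the literal text "{{hl}}" or "{{/hl}}".

```java
import java.util.regex.Pattern;

class HighlightTagFilter {

    // Hypothetical placeholders; assumed never to occur in the input text.
    private static final String OPEN_MARK = "{{hl}}";
    private static final String CLOSE_MARK = "{{/hl}}";

    // The highlight span from the question, matched as a complete pair so
    // that no orphaned </span> is left behind. Tolerates extra whitespace,
    // an optional trailing ';' in the style, and "</ span>".
    private static final Pattern HIGHLIGHT = Pattern.compile(
            "(?is)<span\\s+style\\s*=\\s*\"background-color:\\s*yellow;?\\s*\"\\s*>"
            + "(.*?)</\\s*span\\s*>");

    // Any other tag, as in the question's original expression.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String keepHighlightOnly(String html) {
        // 1. save the highlight spans as placeholders
        String s = HIGHLIGHT.matcher(html).replaceAll(OPEN_MARK + "$1" + CLOSE_MARK);
        // 2. remove every remaining tag
        s = ANY_TAG.matcher(s).replaceAll(" ");
        // 3. restore the highlight span in a normalized form
        s = s.replace(OPEN_MARK, "<span style=\"background-color: yellow\">")
             .replace(CLOSE_MARK, "</span>");
        // 4. collapse whitespace, as the question's method does
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

Calling keepHighlightOnly on a paragraph that mixes <p>, <b> and the highlight span should leave only the text and the span. Matching the opening and closing tag together is a deliberate choice here: a span with any other attributes is stripped along with its closing tag, so the result stays balanced.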

Bye, -

+1

A library I have used for this in the past is OWASP AntiSamy.

This definitely allows whitelisting / blacklisting of tags. Perhaps worth a look.

