Regex problem for removing HTML tags

Question

In my Ruby application, I used the following method and regular expression to remove all HTML tags from a string:

str.gsub(/<\/?[^>]*>/,"")

This regex did just about everything I expected, except that all quotation marks were converted to &#8220; and all single quotes were changed to &#8221;.

What obvious thing am I missing to convert these messy codes back into their proper characters?

Edit: The problem occurs with or without regex, so I see that my problem has nothing to do with it. My question now is how to handle this formatting error and fix it. Thanks!

+5

string ruby regex encoding

btw Feb 12 '09 at 23:34

5 answers

vladr · Answer 1 · 2009-02-14T23:04:20+0000

Use CGI::unescapeHTML after performing the regular expression substitution:

require 'cgi'
CGI::unescapeHTML(str.gsub(/<\/?[^>]*>/,""))

See http://www.ruby-doc.org/core/classes/CGI.html#M000547

In the code snippet above, gsub removes all the HTML tags. Then unescapeHTML() converts all HTML entities (e.g. &lt;, &#8220;) back to their actual characters (<, quotation marks, etc.).

Regarding the other answer on this page, note that you will never be passed HTML such as

<tag attribute="<value>">2 + 3 < 6</tag>

(this is invalid HTML); what you may receive instead is:

<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>

The gsub call converts the above value to:

2 + 3 &lt; 6

And unescapeHTML will then output:

2 + 3 < 6
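For anyone who wants to try the two steps together, here is a minimal sketch. The helper name strip_tags_and_unescape and the sample string are made up for illustration; only gsub and CGI.unescapeHTML come from the answer above. It assumes the input is a UTF-8 string, so that numeric entities above 255 can be decoded.

```ruby
require 'cgi'

# Hypothetical helper (not part of the original answer): strip the tags
# first, then decode whatever HTML entities are left in the text.
def strip_tags_and_unescape(html)
  without_tags = html.gsub(/<\/?[^>]*>/, "")
  CGI.unescapeHTML(without_tags)
end

# Made-up sample input containing both named and numeric entities.
sample = '<p>She said &#8220;2 + 3 &lt; 6&#8221;</p>'
puts strip_tags_and_unescape(sample)
```

Note that &#8220; and &#8221; decode to the left and right curly double quotation marks (U+201C and U+201D), not to plain ASCII quotes.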

Sniggerfardimungus · Answer 2 · 2009-02-12T23:45:50+0000

You will run into additional problems when you see something like:

<doohickey name="<foobar>">

You want to apply something like:

gsub(/<[^<>]*>/, "")

... repeatedly, for as long as the pattern still matches.
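A single pass of this pattern removes only the inner <foobar> from the doohickey example and leaves the outer tag behind, so one way to use it is to repeat the substitution until nothing matches any more. A rough sketch of that idea follows; the helper name strip_nested_tags and the sample input are hypothetical, not from the answer.

```ruby
# Hypothetical helper: apply the pattern repeatedly until no tag is left.
def strip_nested_tags(html)
  result = html.dup
  # gsub! returns nil when it made no substitution, which ends the loop.
  nil while result.gsub!(/<[^<>]*>/, "")
  result
end

# Made-up input with an angle-bracketed value inside an attribute.
puts strip_nested_tags('<doohickey name="<foobar>">text</doohickey>')
```

The first pass strips the innermost bracket pairs; each later pass strips whatever became a complete pair as a result.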

Georg Schölly · Answer 3 · 2009-02-13T00:10:16+0000

, , " ".

, RegExp . , ?

, :
UTF-8 UTF-8 php.

lazyfly · Answer 4 · 2009-02-13T21:15:00+0000

, , , UTF-8, , , ( ) .

Tim · Answer 5 · 2009-02-12T23:40:29+0000

You can use a multi-pass approach to get the results you are looking for.

After running the initial regular expression, run another expression to convert &#8220; into quotation marks, and a further one to convert &#8221; into single quotes.
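A sketch of what such a multi-pass version might look like is below. The helper name strip_and_replace_quotes and the sample input are invented for illustration. One assumption to flag: the replacements used here are the characters these entities actually stand for (U+201C and U+201D, the curly double quotation marks), which you could swap for plain ASCII quotes if you prefer.

```ruby
# Hypothetical helper: one pass for the tags, then one pass per entity.
def strip_and_replace_quotes(str)
  stripped = str.gsub(/<\/?[^>]*>/, "")   # pass 1: remove the tags
  stripped
    .gsub("&#8220;", "\u201C")            # pass 2: left curly double quote
    .gsub("&#8221;", "\u201D")            # pass 3: right curly double quote
end

# Made-up sample input.
puts strip_and_replace_quotes('<b>&#8220;Hello&#8221;</b>')
```

This only handles the entities you list by hand, so any other entity in the input is left untouched.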
