Regex problem for removing HTML tags

In my Ruby application, I used the following method and regular expression to remove all HTML tags from a string:

str.gsub(/<\/?[^>]*>/,"")

This regex did just about everything I expected, except that all quotes were turned into &#8220; and all single quotes were turned into &#8221;.

What obvious thing am I missing to convert these messy codes back into their proper characters?

Edit: The problem occurs with or without regex, so I see that my problem has nothing to do with it. My question now is how to handle this formatting error and fix it. Thanks!

+5
5 answers

Use CGI::unescapeHTML after performing the regular expression substitution:

CGI::unescapeHTML(str.gsub(/<\/?[^>]*>/,""))

See http://www.ruby-doc.org/core/classes/CGI.html#M000547

In the code snippet above, gsub removes all the HTML tags. Then unescapeHTML() converts all HTML entities (e.g. &lt;, &#8220;) back to their actual characters (<, quotes, etc.).

Regarding the other post on this page, note that you should never be passed HTML such as

<tag attribute="<value>">2 + 3 < 6</tag>

(which is invalid HTML); what you may receive instead is:

<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>

The gsub call converts the above value to:

2 + 3 &lt; 6

And unescapeHTML will then produce:

2 + 3 < 6
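
The two steps above can be sketched as one small method. This is only an illustration: the helper name strip_tags_and_unescape is made up for this example, and it assumes the input is a UTF-8 string whose attribute values are already escaped, as in the example above.

```ruby
require "cgi"

# Hypothetical helper name, not part of any library.
# Step 1: remove anything that looks like a tag.
# Step 2: turn the remaining HTML entities back into characters.
def strip_tags_and_unescape(str)
  without_tags = str.gsub(/<\/?[^>]*>/, "")
  CGI.unescapeHTML(without_tags)
end

escaped = '<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>'
puts strip_tags_and_unescape(escaped)   # prints: 2 + 3 < 6

# Numeric entities such as &#8220; and &#8221; are decoded as well
# (for UTF-8 strings they become curly quotation marks).
quoted = "<p>&#8220;Hello&#8221;</p>"
puts strip_tags_and_unescape(quoted)
```

CGI.unescapeHTML only covers a handful of named entities (&amp;, &lt;, &gt;, &quot;, &apos;) plus numeric ones, so other named entities such as &nbsp; would pass through unchanged.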
+5

You will have additional problems when you see something like:

<doohickey name="<foobar>">

You want to apply something like:

gsub(/<[^<>]*>/, "")
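
To make the difference between the two patterns concrete, here is a small sketch. Repeating the substitution until nothing changes is an assumption on my part, not something stated in this answer, and the method name strip_nested_tags is made up for the example.

```ruby
# Matches only a tag that contains no other angle brackets inside it.
INNERMOST_TAG = /<[^<>]*>/

# Hypothetical helper: apply the pattern until the string stops changing,
# so tags that wrap other tag-like text are removed from the inside out.
def strip_nested_tags(str)
  previous = nil
  result = str.dup
  until result == previous
    previous = result
    result = result.gsub(INNERMOST_TAG, "")
  end
  result
end

input = '<doohickey name="<foobar>">text'

# The pattern from the question stops at the first ">" and leaves debris.
puts input.gsub(/<\/?[^>]*>/, "")    # prints: ">text

# One pass of the stricter pattern removes only the inner <foobar>.
puts input.gsub(INNERMOST_TAG, "")   # prints: <doohickey name="">text

# Repeating it removes the outer tag as well.
puts strip_nested_tags(input)        # prints: text
```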


+2


You can use a multi-pass approach to get the results you are looking for.

After running the tag-stripping regular expression, run one expression to convert &#8220; to quotation marks and another to convert &#8221; to single quotes.
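
A minimal sketch of such a second pass follows. The replacement characters are an assumption: &#8220; and &#8221; are the left and right curly double quotation marks (U+201C and U+201D), so this example maps both to a plain double quote. The constant and method names are made up for illustration.

```ruby
# Assumed mapping: both entities are curly double quotes, so both
# become a plain ASCII double quote here. Adjust to taste.
QUOTE_ENTITIES = {
  "&#8220;" => '"',
  "&#8221;" => '"'
}.freeze

# Hypothetical helper: first pass strips tags, second pass swaps the
# two quote entities using the hash form of String#gsub.
def strip_tags_then_fix_quotes(str)
  without_tags = str.gsub(/<\/?[^>]*>/, "")
  without_tags.gsub(/&#822[01];/, QUOTE_ENTITIES)
end

puts strip_tags_then_fix_quotes("<b>&#8220;Hello&#8221;</b>")   # prints: "Hello"
```

This only handles the two entities listed in the hash; any other entity in the text is left as it is.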

-3
