Remove HTML tags from string including &nbsp; in C#

Question

How do I remove all HTML tags, including &nbsp;, using regex in C#? My string looks like this:

"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

+80

string html c# regex

rampuriyaaa Oct 22 '13 at 16:56

9 answers

I took @Ravi Thapliyal's code and wrapped it in a method. It is simple and might not clean everything, but so far it does what I need.

    public static string ScrubHtml(string value)
    {
        // Remove tags and &nbsp; entities, then collapse runs of whitespace.
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
        var step2 = Regex.Replace(step1, @"\s{2,}", " ");
        return step2;
    }

+30

Don Rol Jul 31 '14 at 14:50

I have been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

    // Requires: using System; using System.Text; using System.Text.RegularExpressions; using System.Web;

    private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

    // Add characters that should not be removed to this regex.
    private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

    public static String UnHtml(String html)
    {
        html = HttpUtility.UrlDecode(html);
        html = HttpUtility.HtmlDecode(html);

        // Drop comments, scripts, and styles together with their contents.
        html = RemoveTag(html, "<!--", "-->");
        html = RemoveTag(html, "<script", "</script>");
        html = RemoveTag(html, "<style", "</style>");

        // Replace matches of these regexes with a space.
        html = _tags_.Replace(html, " ");
        html = _notOkCharacter_.Replace(html, " ");
        html = SingleSpacedTrim(html);

        return html;
    }

    // Removes every span from startTag through endTag (inclusive), case-insensitively.
    private static String RemoveTag(String html, String startTag, String endTag)
    {
        Boolean bAgain;
        do
        {
            bAgain = false;
            Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
            if (startTagPos < 0)
                continue;
            Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
            if (endTagPos <= startTagPos)
                continue;
            html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
            bAgain = true;
        } while (bAgain);
        return html;
    }

    // Collapses runs of spaces, tabs, and line breaks into one space and trims the result.
    private static String SingleSpacedTrim(String inString)
    {
        StringBuilder sb = new StringBuilder();
        Boolean inBlanks = false;
        foreach (Char c in inString)
        {
            switch (c)
            {
                case '\r':
                case '\n':
                case '\t':
                case ' ':
                    if (!inBlanks)
                    {
                        inBlanks = true;
                        sb.Append(' ');
                    }
                    continue;
                default:
                    inBlanks = false;
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString().Trim();
    }

+16

David S. Oct. 22 '13 at 17:14

 var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();

+4

MRP Jun 11 '14 at 6:27

    (<.+?>|&nbsp;)

will match any tag or &nbsp;.

    string regex = @"(<.+?>|&nbsp;)";
    var x = Regex.Replace(originalString, regex, "").Trim();

Then x is "hello".

0

Jonesopolis Oct 22 '13 at 17:08

Sanitizing an HTML document involves a lot of tricky details. This package may help: https://github.com/mganss/HtmlSanitizer

0

fantasticoder Jan 04 '16 at 19:54

HTML in its basic form is essentially XML. If the markup is well-formed, you can load your text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all tags in any form and also decodes character entities such as &lt; in one go. Note that &nbsp; is not one of XML's predefined entities, so it has to be replaced or declared before parsing.
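A minimal sketch of this XmlDocument idea, assuming the input is well-formed XML: every tag is closed and only XML's predefined entities appear. The class and method names (XmlTextExtractor, ExtractText) are made up for illustration. The string from the question would not parse as-is, because it contains unclosed <br> tags and &nbsp;.

```csharp
using System.Xml;

public static class XmlTextExtractor
{
    // Hypothetical helper: wraps the fragment in a synthetic root element so
    // that several sibling elements still form a single XML document.
    // Throws XmlException when the fragment is not well-formed XML.
    public static string ExtractText(string xhtmlFragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");

        // InnerText concatenates the text of all descendant nodes.
        return doc.DocumentElement?.InnerText ?? string.Empty;
    }
}
```

For example, XmlTextExtractor.ExtractText("<div>hello</div><div><br/></div>") returns "hello". For real-world HTML that is not well-formed, a dedicated HTML parser or one of the regex answers is the more practical route.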

0

nivs1978 May 16 '18 at 6:54

I used the code from @RaviThapliyal and @Don Rolling with a slight modification: &nbsp; should be replaced with a space rather than an empty string, so I added an extra step. It worked like a charm for me.

    public static string FormatString(string value)
    {
        var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();  // strip tags
        var step2 = Regex.Replace(step1, @"&nbsp;", " ");         // &nbsp; becomes a space
        var step3 = Regex.Replace(step2, @"\s{2,}", " ");         // collapse whitespace runs
        return step3;
    }

I wrote &nbsp without the semicolon because Stack Overflow was formatting it.
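A variation on the same idea, as a hedged sketch rather than anything taken from the answers above: instead of listing entities one by one, decode them all with System.Net.WebUtility.HtmlDecode after stripping the tags. Here tags are replaced with a space (not an empty string) so that text from adjacent blocks does not run together. The names HtmlText and StripAndDecode are made up for illustration.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: strips tags, decodes entities, normalises whitespace.
    public static string StripAndDecode(string html)
    {
        // Replace each tag with a space so neighbouring blocks stay separated.
        var noTags = Regex.Replace(html, @"<[^>]+>", " ");

        // Decode entities: &nbsp; becomes U+00A0, &amp; becomes &, and so on.
        var decoded = WebUtility.HtmlDecode(noTags);

        // In .NET, \s also matches U+00A0, so decoded non-breaking spaces
        // are collapsed together with ordinary whitespace.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Decoding after stripping matters: an escaped tag such as &lt;b&gt; ends up as the literal text <b> in the result instead of being removed as markup.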

0

Sabique A Khan Apr 09 '19 at 5:02

    (<([^>]+)>|&nbsp;)

You can check it here: https://regex101.com/r/kB0rQ4/1

-1

Ananth Ram Feb 10 '17 at 17:58

Ravi Thapliyal · Accepted Answer · 2013-10-22 17:08

If you can't use an HTML-parser-based solution to filter out the tags, here's a simple regex for it.

 string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make another pass through a regex filter that takes care of multiple spaces:

 string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
