Remove HTML tags from string, including &nbsp;, in C#
How can I remove all HTML tags, including &nbsp;, using a regex in C#? My string looks like this:

    "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

If you can't use an HTML parser to filter out the tags, here's a simple regex:
    string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make a second pass with another regex that collapses runs of whitespace into a single space, like this:
    string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

I took @Ravi Thapliyal's code and made a method out of it. It is simple and may not clean everything, but so far it does what I need.
    public static string ScrubHtml(string value)
    {
        // Remove tags and &nbsp; entities, then trim the ends.
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
        // Collapse runs of two or more whitespace characters into one space.
        var step2 = Regex.Replace(step1, @"\s{2,}", " ");
        return step2;
    }

I have been using this function for a while. It removes almost any messy HTML you can throw at it and leaves the text intact.
    // Needs: using System; using System.Text; using System.Text.RegularExpressions; using System.Web;
    // These members go inside a class of your choice.
    private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

    // Add characters that should not be removed to this regex.
    private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

    public static String UnHtml(String html)
    {
        html = HttpUtility.UrlDecode(html);
        html = HttpUtility.HtmlDecode(html);

        // Remove comments, scripts and styles together with their content.
        html = RemoveTag(html, "<!--", "-->");
        html = RemoveTag(html, "<script", "</script>");
        html = RemoveTag(html, "<style", "</style>");

        // Replace matches of these regexes with a space.
        html = _tags_.Replace(html, " ");
        html = _notOkCharacter_.Replace(html, " ");
        html = SingleSpacedTrim(html);

        return html;
    }

    // Removes every span from startTag through endTag, case-insensitively.
    private static String RemoveTag(String html, String startTag, String endTag)
    {
        Boolean bAgain;
        do
        {
            bAgain = false;
            Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
            if (startTagPos < 0)
                continue;
            Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
            if (endTagPos <= startTagPos)
                continue;
            html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
            bAgain = true;
        } while (bAgain);
        return html;
    }

    // Collapses each run of blanks (\r, \n, \t, space) into one space and trims the ends.
    private static String SingleSpacedTrim(String inString)
    {
        StringBuilder sb = new StringBuilder();
        Boolean inBlanks = false;
        foreach (Char c in inString)
        {
            switch (c)
            {
                case '\r':
                case '\n':
                case '\t':
                case ' ':
                    if (!inBlanks)
                    {
                        inBlanks = true;
                        sb.Append(' ');
                    }
                    continue;
                default:
                    inBlanks = false;
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString().Trim();
    }

Another one-liner, which also strips an unterminated tag at the end of the input and a few more entities:

    var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();

(<.+?>|&nbsp;) will match any tag or &nbsp;:
    string regex = @"(<.+?>|&nbsp;)";
    var x = Regex.Replace(originalString, regex, "").Trim();

Then x is hello.
Sanitizing an HTML document involves a lot of tricky details. This package can help: https://github.com/mganss/HtmlSanitizer
HTML in its basic form is close to XML. You can load your text into an XmlDocument object and read InnerText on the root element to extract the text. This removes all tags in any form and also decodes predefined entities such as &lt; in one go. Note that this only works when the markup is well-formed XML: an unclosed <br> or an undeclared entity such as &nbsp; makes the parser throw.
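A minimal sketch of the XmlDocument approach, assuming the input is well-formed XML (every tag closed, no undeclared entities such as &nbsp;). The names XmlTextExtractor and ExtractText are made up for illustration, and the fragment is wrapped in a synthetic root element so that several top-level elements can be loaded:

```csharp
using System;
using System.Xml;

public static class XmlTextExtractor
{
    // Hypothetical helper: returns the concatenated text of a well-formed XHTML fragment.
    // Throws XmlException if the fragment is not well-formed XML.
    public static string ExtractText(string xhtmlFragment)
    {
        var doc = new XmlDocument();
        // Wrap in a synthetic root so that multiple top-level elements are accepted.
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // Note the self-closed <br />; a bare <br> would make LoadXml throw.
        Console.WriteLine(ExtractText("<div>hello</div><div><br /></div><div>a &lt; b</div>"));
        // prints "helloa < b"
    }
}
```

InnerText simply concatenates the text nodes, so no separator is inserted between elements. For the string in the question, which contains bare <br> tags and &nbsp;, this approach would need the input cleaned up first.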
I used the code from @RaviThapliyal and @Don Rolling but made a slight modification. Their versions replace &nbsp with an empty string, whereas it should be replaced with a space, so I added an extra step. It worked like a charm for me.
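    public static string FormatString(string value)
    {
        // Remove the tags.
        var step1 = Regex.Replace(value, @"<[^>]+>", "");
        // Replace each &nbsp; entity with a regular space.
        var step2 = Regex.Replace(step1, @"&nbsp;", " ");
        // Collapse whitespace runs, then trim so that a leading or trailing &nbsp; leaves no stray space.
        var step3 = Regex.Replace(step2, @"\s{2,}", " ").Trim();
        return step3;
    }

I wrote &nbsp without the semicolon in the text above because Stack Overflow was formatting the entity away.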
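As a variation on the steps above, the entities can be decoded instead of matched literally. This is only a sketch: the names HtmlText and StripTags are made up here, and it assumes System.Net.WebUtility.HtmlDecode is acceptable in your project. Tags are replaced with a space, entities are decoded, and whitespace runs are collapsed; in .NET, \s also matches the non-breaking space that &nbsp; decodes to.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper, not taken from any of the answers above.
    public static string StripTags(string html)
    {
        // 1. Replace every tag with a space so adjacent blocks do not run together.
        var withoutTags = Regex.Replace(html, @"<[^>]+>", " ");
        // 2. Decode entities (&nbsp; becomes U+00A0, &amp; becomes &, and so on).
        var decoded = WebUtility.HtmlDecode(withoutTags);
        // 3. Collapse all whitespace runs, including U+00A0, and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<div>hello</div><div>&nbsp;</div><div><br></div>world"));
        // prints "hello world"
    }
}
```

Decoding after the tags are stripped means that text such as &lt;b&gt; survives as the literal characters <b> rather than being removed as a tag.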
This pattern also works: (<([^>]+)>|&nbsp;) You can check it here: https://regex101.com/r/kB0rQ4/1