How do you convert html to plain text?

Question

How do you convert html to plain text?

I have Html fragments stored in a table. Not whole pages, not tags, etc., Just basic formatting.

I would like to be able to display this Html only as text, without formatting, on this page (in fact, this is only the first 30-50 characters, but this is a simple bit).

How to put "text" inside this Html in a string as plain text?

So, this piece of code.

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

becomes:

Hello World. Is anyone there?

+77

html c # asp.net

Stuart Helwig Nov 13 '08 at 11:00

source share

17 answers

The free and open source htmlAgilityPack has, in one example, a method that converts HTML to plain text.

 var plainText = HtmlUtilities.ConvertToPlainText(string html);

Stream is an HTML string like

<b> Hello world! </b> <br/> <i> it's me! </I>

And you get a simple text result:

 hello world! it is me!

+82

Judah Gabriel Himango Jul 13 '09 at 19:17

source share

I could not use HtmlAgilityPack, so I wrote the second best solution for myself

 private static string HtmlToPlainText(string html) { const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<' const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR /> var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline); var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline); var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline); var text = html; //Decode html specific characters text = System.Net.WebUtility.HtmlDecode(text); //Remove tag whitespace/line breaks text = tagWhiteSpaceRegex.Replace(text, "><"); //Replace <br /> with line breaks text = lineBreakRegex.Replace(text, Environment.NewLine); //Strip formatting text = stripFormattingRegex.Replace(text, string.Empty); return text; }

+35

Ben Anderson May 6 '13 at 21:06

source share

HTTPUtility.HTMLEncode() designed to process HTML tags as strings. He takes care of all the heavy burdens for you. From the MSDN Documentation :

If characters, such as spaces and punctuation, are transmitted in an HTTP stream, they may be misinterpreted on the receiving side. HTML coding converts characters that are not allowed in HTML into character equivalents; HTML decoding overrides encoding. For example, when nested in a block of text, the characters < and > are encoded as < and > to send HTTP.

HTTPUtility.HTMLEncode() , details here :

 public static void HtmlEncode( string s, TextWriter output )

Using:

 String TestString = "This is a <Test String>."; StringWriter writer = new StringWriter(); Server.HtmlEncode(TestString, writer); String EncodedString = writer.ToString();

+20

George Stocker Nov 13 '08 at 13:42

source share

To add vfilby to the answer, you can simply do a RegEx replacement in your code; no new classes. In case other newcomers like me will stumble on this question.

 using System.Text.RegularExpressions;

Then...

 private string StripHtml(string source) { string output; //get rid of HTML tags output = Regex.Replace(source, "<[^>]*>", string.Empty); //get rid of multiple blank lines output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline); return output; }

+10

WEFX Mar 11 '11 at 18:14

source share

Three-step process for converting HTML to plain text

First, you need to install the Nuget package for HtmlAgilityPack. Secondly, create this class.

 public class HtmlToText { public HtmlToText() { } public string Convert(string path) { HtmlDocument doc = new HtmlDocument(); doc.Load(path); StringWriter sw = new StringWriter(); ConvertTo(doc.DocumentNode, sw); sw.Flush(); return sw.ToString(); } public string ConvertHtml(string html) { HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(html); StringWriter sw = new StringWriter(); ConvertTo(doc.DocumentNode, sw); sw.Flush(); return sw.ToString(); } private void ConvertContentTo(HtmlNode node, TextWriter outText) { foreach(HtmlNode subnode in node.ChildNodes) { ConvertTo(subnode, outText); } } public void ConvertTo(HtmlNode node, TextWriter outText) { string html; switch(node.NodeType) { case HtmlNodeType.Comment: // don't output comments break; case HtmlNodeType.Document: ConvertContentTo(node, outText); break; case HtmlNodeType.Text: // script and style must not be output string parentName = node.ParentNode.Name; if ((parentName == "script") || (parentName == "style")) break; // get text html = ((HtmlTextNode)node).Text; // is it in fact a special closing node output as text? if (HtmlNode.IsOverlappedClosingElement(html)) break; // check the text is meaningful and not a bunch of whitespaces if (html.Trim().Length > 0) { outText.Write(HtmlEntity.DeEntitize(html)); } break; case HtmlNodeType.Element: switch(node.Name) { case "p": // treat paragraphs as crlf outText.Write("\r\n"); break; } if (node.HasChildNodes) { ConvertContentTo(node, outText); } break; } } }

Using the above class with reference to Judah Himango answer

Thirdly, you need to create an object of the above class and use ConvertHtml(HTMLContent) to convert HTML to ConvertToPlainText(string html); text, not ConvertToPlainText(string html);

 HtmlToText htt=new HtmlToText(); var plainText = htt.ConvertHtml(HTMLContent);

+6

Abdulqadir_WDDN Oct 11 '17 at 6:38

source share

There is no method named "ConvertToPlainText" in HtmlAgilityPack, but you can convert the html string to a CLEAR string using

 HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(htmlString); var textString = doc.DocumentNode.InnerText; Regex.Replace(textString , @"<(.|n)*?>", string.Empty).Replace("&nbsp", "");

This works for me. BUT I DO NOT FIND A METHOD AFTER "ConvertToPlainText" IN "HtmlAgilityPack".

+5

Amine Mar 26 '13 at 9:22

source share

The easiest way I've found:

 HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll

The DLL can be found in the following folder:% ProgramFiles% \ Common Files \ microsoft shared \ Team Foundation Server \ 14.0 \

In VS 2015, the dll also requires a link to the Microsoft.TeamFoundation.WorkItemTracking.Common.dll file located in the same folder.

+4

Roman O Apr 19 '17 at 12:48 on

source share

It has a limitation in that it does not break long embedded spaces, but it is definitely portable and fits the layout like a web browser.

 static string HtmlToPlainText(string html) { string buf; string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" + "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" + "noscript|ol|output|p|pre|section|table|tfoot|ul|video"; string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*"; buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase); // Replace br tag to newline. buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase); // (Optional) remove styles and scripts. buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline); // Remove all tags. buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline); // Replace HTML entities. buf = WebUtility.HtmlDecode(buf); return buf; }

+4

jeiea May 16 '18 at 5:32

source share

I think the easiest way is to create an extension method of 'string' (based on what Richard Richard suggested):

 using System; using System.Text.RegularExpressions; public static class StringHelpers { public static string StripHTML(this string HTMLText) { var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase); return reg.Replace(HTMLText, ""); } }

Then just use this extension method for any 'string' variable in your program:

 var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></span>"; var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML comments to plain text so that it displays correctly in the crystal report and it works great!

+3

mik-t May 08 '12 at 9:52 pm

source share

If you have data that contains HTML tags and want to display it so that a person can see the tags, use HttpServerUtility :: HtmlEncode.

If you have data that has HTML tags and you want the user to see the displayed tags, then display the text as is. If the text is an entire web page, use IFRAME for it.

If you have data containing HTML tags and want to cut tags and just display plain text, use regular expression.

+2

Corey Trager Nov 13 '08 at 12:41

source share

I ran into a similar problem and found a better solution. Below code works perfect for me.

  private string ConvertHtml_Totext(string source) { try { string result; // Remove HTML Development formatting // Replace line breaks with space // because browsers inserts space result = source.Replace("\r", " "); // Replace line breaks with space // because browsers inserts space result = result.Replace("\n", " "); // Remove step-formatting result = result.Replace("\t", string.Empty); // Remove repeating spaces because browsers ignore them result = System.Text.RegularExpressions.Regex.Replace(result, @"( )+", " "); // Remove the header (prepare first by clearing attributes) result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*head([^>])*>","<head>", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"(<( )*(/)( )*head( )*>)","</head>", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, "(<head>).*(</head>)",string.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase); // remove all scripts (prepare first by clearing attributes) result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*script([^>])*>","<script>", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"(<( )*(/)( )*script( )*>)","</script>", System.Text.RegularExpressions.RegexOptions.IgnoreCase); //result = System.Text.RegularExpressions.Regex.Replace(result, // @"(<script>)([^(<script>\.</script>)])*(</script>)", // string.Empty, // System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"(<script>).*(</script>)",string.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase); // remove all styles (prepare first by clearing attributes) result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*style([^>])*>","<style>", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"(<( )*(/)( )*style( )*>)","</style>", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, "(<style>).*(</style>)",string.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase); // insert tabs in spaces of <td> tags result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*td([^>])*>","\t", System.Text.RegularExpressions.RegexOptions.IgnoreCase); // insert line breaks in places of <BR> and <LI> tags result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*br( )*>","\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*li( )*>","\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase); // insert line paragraphs (double line breaks) in place // if <P>, <DIV> and <TR> tags result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*div([^>])*>","\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*tr([^>])*>","\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"<( )*p([^>])*>","\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase); // Remove remaining tags like <a>, links, images, // comments etc - anything that enclosed inside < > result = System.Text.RegularExpressions.Regex.Replace(result, @"<[^>]*>",string.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase); // replace special characters: result = System.Text.RegularExpressions.Regex.Replace(result, @" "," ", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&bull;"," * ", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&lsaquo;","<", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&rsaquo;",">", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&trade;","(tm)", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&frasl;","/", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&lt;","<", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&gt;",">", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&copy;","(c)", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, @"&reg;","(r)", System.Text.RegularExpressions.RegexOptions.IgnoreCase); // Remove all others. More can be added, see // http://hotwired.lycos.com/webmonkey/reference/special_characters/ result = System.Text.RegularExpressions.Regex.Replace(result, @"&(.{2,6});", string.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase); // for testing //System.Text.RegularExpressions.Regex.Replace(result, // this.txtRegex.Text,string.Empty, // System.Text.RegularExpressions.RegexOptions.IgnoreCase); // make line breaking consistent result = result.Replace("\n", "\r"); // Remove extra line breaks and tabs: // replace over 2 breaks with 2 and over 4 tabs with 4. // Prepare first to remove any whitespaces in between // the escaped characters and remove redundant tabs in between line breaks result = System.Text.RegularExpressions.Regex.Replace(result, "(\r)( )+(\r)","\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, "(\t)( )+(\t)","\t\t", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, "(\t)( )+(\r)","\t\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase); result = System.Text.RegularExpressions.Regex.Replace(result, "(\r)( )+(\t)","\r\t", System.Text.RegularExpressions.RegexOptions.IgnoreCase); // Remove redundant tabs result = System.Text.RegularExpressions.Regex.Replace(result, "(\r)(\t)+(\r)","\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase); // Remove multiple tabs following a line break with just one tab result = System.Text.RegularExpressions.Regex.Replace(result, "(\r)(\t)+","\r\t", System.Text.RegularExpressions.RegexOptions.IgnoreCase); // Initial replacement target string for line breaks string breaks = "\r\r\r"; // Initial replacement target string for tabs string tabs = "\t\t\t\t\t"; for (int index=0; index<result.Length; index++) { result = result.Replace(breaks, "\r\r"); result = result.Replace(tabs, "\t\t\t\t"); breaks = breaks + "\r"; tabs = tabs + "\t"; } // That it. return result; } catch { MessageBox.Show("Error"); return source; }

}

You must first remove escape characters such as \ n and \ r, since they cause regular expressions to stop working as expected.

In addition, in order for the result line to display correctly in the text field, you may need to split it into parts and set the Text Lines property to the property instead of assigning it to the Text property.

this.txtResult.Lines = StripHTML (this.txtSource.Text) .Split ("\ r" .ToCharArray ());

Source: https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2

+2

LakshmiSarada Oct 17 '18 at 19:54

source share

Depends on what you mean by "html". The most difficult case would be full web pages. This is also the easiest to handle since you can use a text browser. See the Wikipedia article for a list of web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

0

mpez0 Nov 13 '08 at 12:46

source share

Here is my solution:

 public string StripHTML(string html) { var regex = new Regex("<[^>]+>", RegexOptions.IgnoreCase); return System.Web.HttpUtility.HtmlDecode((regex.Replace(html, ""))); }

Example:

 StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>"); // output -> Here is my solution:

0

Mehdi Dehghani Nov 03 '17 at 9:25

source share

I had the same question, just my html had a simple pre-known layout, for example:

 <DIV><P>abc</P><P>def</P></DIV>

In the end, I used this simple code:

 string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))

What are the findings:

 abc def

0

Karlas Apr 17 '18 at 13:45

source share

Not written, but use:

 using HtmlAgilityPack; using System; using System.IO; using System.Text.RegularExpressions; namespace foo { //small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs public static class HtmlToText { public static string Convert(string path) { HtmlDocument doc = new HtmlDocument(); doc.Load(path); return ConvertDoc(doc); } public static string ConvertHtml(string html) { HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(html); return ConvertDoc(doc); } public static string ConvertDoc(HtmlDocument doc) { using (StringWriter sw = new StringWriter()) { ConvertTo(doc.DocumentNode, sw); sw.Flush(); return sw.ToString(); } } internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) { foreach (HtmlNode subnode in node.ChildNodes) { ConvertTo(subnode, outText, textInfo); } } public static void ConvertTo(HtmlNode node, TextWriter outText) { ConvertTo(node, outText, new PreceedingDomTextInfo(false)); } internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) { string html; switch (node.NodeType) { case HtmlNodeType.Comment: // don't output comments break; case HtmlNodeType.Document: ConvertContentTo(node, outText, textInfo); break; case HtmlNodeType.Text: // script and style must not be output string parentName = node.ParentNode.Name; if ((parentName == "script") || (parentName == "style")) { break; } // get text html = ((HtmlTextNode)node).Text; // is it in fact a special closing node output as text? if (HtmlNode.IsOverlappedClosingElement(html)) { break; } // check the text is meaningful and not a bunch of whitespaces if (html.Length == 0) { break; } if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) { html = html.TrimStart(); if (html.Length == 0) { break; } textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true; } outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " "))); if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) { outText.Write(' '); } break; case HtmlNodeType.Element: string endElementString = null; bool isInline; bool skip = false; int listIndex = 0; switch (node.Name) { case "nav": skip = true; isInline = false; break; case "body": case "section": case "article": case "aside": case "h1": case "h2": case "header": case "footer": case "address": case "main": case "div": case "p": // stylistic - adjust as you tend to use if (textInfo.IsFirstTextOfDocWritten) { outText.Write("\r\n"); } endElementString = "\r\n"; isInline = false; break; case "br": outText.Write("\r\n"); skip = true; textInfo.WritePrecedingWhiteSpace = false; isInline = true; break; case "a": if (node.Attributes.Contains("href")) { string href = node.Attributes["href"].Value.Trim(); if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) { endElementString = "<" + href + ">"; } } isInline = true; break; case "li": if (textInfo.ListIndex > 0) { outText.Write("\r\n{0}.\t", textInfo.ListIndex++); } else { outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022 } isInline = false; break; case "ol": listIndex = 1; goto case "ul"; case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems endElementString = "\r\n"; isInline = false; break; case "img": //inline-block in reality if (node.Attributes.Contains("alt")) { outText.Write('[' + node.Attributes["alt"].Value); endElementString = "]"; } if (node.Attributes.Contains("src")) { outText.Write('<' + node.Attributes["src"].Value + '>'); } isInline = true; break; default: isInline = true; break; } if (!skip && node.HasChildNodes) { ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex }); } if (endElementString != null) { outText.Write(endElementString); } break; } } } internal class PreceedingDomTextInfo { public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) { IsFirstTextOfDocWritten = isFirstTextOfDocWritten; } public bool WritePrecedingWhiteSpace { get; set; } public bool LastCharWasSpace { get; set; } public readonly BoolWrapper IsFirstTextOfDocWritten; public int ListIndex { get; set; } } internal class BoolWrapper { public BoolWrapper() { } public bool Value { get; set; } public static implicit operator bool(BoolWrapper boolWrapper) { return boolWrapper.Value; } public static implicit operator BoolWrapper(bool boolWrapper) { return new BoolWrapper { Value = boolWrapper }; } } }

0

sobelito Apr 29 '18 at 10:16

source share

public static string StripTags2 (html string) {return html.Replace ("<", "<"). Replace (">", ">"); }

This way you avoid all the "<" and ">" in the string. Is this what you want?

-four

José Leal Nov 13 '08 at 12:37

source share

vfilby · Accepted Answer · 2008-11-13 12:44

If you're talking about disabling tags, this is relatively easy to go, unless you need to worry about things like <script> tags. If all you have to do is display the text without tags that you can execute with a regex:

 <[^>]*>

If you need to worry about <script> tags, etc., then you will need something more powerful than regular expressions because you need to track the state, which is more like free context grammar (CFG). Although you could do it using Left to Right or non-greedy matching.

If you can use regular expressions, there are many web pages with good information:

If you need more complex CFG behavior, I would suggest using a third-party tool, unfortunately, I do not know what is good to recommend.

How do you convert html to plain text?

More articles: