Unexpected results with a recursive C# method

I have a pretty simple method that recursively removes opening and closing HTML tags:

    class Program
    {
        static void Main(string[] args)
        {
            string s = FixHtml("<div><p>this is a <strong>test</strong></p></div>");
            Console.WriteLine(s);
        }

        private static string FixHtml(string s)
        {
            //Remove any outer <div>
            if (s.ToLower().StartsWith("<div>"))
            {
                FixHtml(s.Substring(5, s.Length - 5));
            }
            else if (s.ToLower().StartsWith("<p>"))
            {
                FixHtml(s.Substring(3, s.Length - 3));
            }
            else if (s.ToLower().EndsWith("</div>"))
            {
                FixHtml(s.Substring(0, s.Length - 6));
            }
            else if (s.ToLower().EndsWith("</p>"))
            {
                FixHtml(s.Substring(0, s.Length - 4));
            }

            return s;
        }
    }

The recursion does strip the <div> and <p> tags, but the final "return s" statement undoes all of that work and returns the string with the tags still in place!

Does anyone know why this is happening, and how to make it return the value I want, i.e. this is a <strong>test</strong>?

6 answers

In .NET, strings are immutable - so your method never actually changes the value it returns. Calls such as s.ToLower() and s.Substring(...) give you back a new string with the expected differences. The existing string s remains unchanged.

In addition, you never use the return value from your recursive calls.

Off the top of my head, try something like this:

    private static string FixHtml(string s)
    {
        if (s.ToLower().StartsWith("<div>"))
        {
            return FixHtml(s.Substring(5, s.Length - 5));
        }
        else if (s.ToLower().StartsWith("<p>"))
        {
            return FixHtml(s.Substring(3, s.Length - 3));
        }
        else if (s.ToLower().EndsWith("</div>"))
        {
            return FixHtml(s.Substring(0, s.Length - 6));
        }
        else if (s.ToLower().EndsWith("</p>"))
        {
            return FixHtml(s.Substring(0, s.Length - 4));
        }

        return s;
    }

Note that raw text processing is generally a poor way to process XML - for example, your code does not currently handle attributes, namespaces, whitespace inside a tag (<p >), etc.

Normally, I would say load it into a DOM (XmlDocument / XDocument for XHTML; HTML Agility Pack for HTML) - but actually I wonder if XSLT would be a good fit in this case...

For instance:

    // Requires: using System; using System.IO; using System.Xml; using System.Xml.Xsl;
    static void Main()
    {
        string xhtml = @"<div><p>this is a <strong>test</strong></p></div>";

        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load("strip.xslt");

        StringWriter sw = new StringWriter();
        using (XmlReader xr = XmlReader.Create(new StringReader(xhtml)))
        {
            xslt.Transform(xr, null, sw);
        }

        string newHtml = sw.ToString();
        Console.WriteLine(newHtml);
    }

With strip.xslt:

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
      <xsl:template match="strong|@*">
        <xsl:copy><xsl:apply-templates select="*|text()"/></xsl:copy>
      </xsl:template>
      <xsl:template match="*">
        <xsl:apply-templates select="*|text()"/>
      </xsl:template>
    </xsl:stylesheet>
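For comparison, the DOM route mentioned in this answer could look roughly like the sketch below. It is only an illustration under stated assumptions: the input must be well-formed XHTML, the class and method names (DomSketch, StripDivAndP) are made up for this example, and it unwraps every <div> and <p> in the fragment rather than only the outermost ones.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DomSketch
{
    // Hypothetical helper: unwraps every <div> and <p>, keeping their children.
    public static string StripDivAndP(string xhtml)
    {
        // Wrap in a dummy root so the outermost element can be unwrapped too.
        XElement root = XElement.Parse("<root>" + xhtml + "</root>",
                                       LoadOptions.PreserveWhitespace);

        // Snapshot the targets, then process them innermost-first (reverse
        // document order) so each element is still attached when replaced.
        var targets = root.Descendants()
            .Where(e => e.Name.LocalName == "div" || e.Name.LocalName == "p")
            .Reverse()
            .ToList();

        foreach (XElement e in targets)
        {
            e.ReplaceWith(e.Nodes());
        }

        // Serialize the remaining child nodes without the dummy root.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        Console.WriteLine(
            StripDivAndP("<div><p>this is a <strong>test</strong></p></div>"));
    }
}
```

Unlike the string-based version, this handles attributes and whitespace inside tags because the parser deals with them, but it will throw on input that is not well-formed XML.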
  • You do nothing with the strings returned by the nested calls.
  • When you "change" a string, a new object is created; the existing one is not modified (strings are immutable).
  • If you want to use a similar approach without relying on the return value, you can make the string parameter a 'ref' parameter, although the performance cost mentioned by others will still apply.
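A minimal sketch of the 'ref' variant mentioned in the last bullet, assuming the same four tags as the question. The class name RefSketch is hypothetical, and StringComparison.OrdinalIgnoreCase is swapped in for the original ToLower() calls:

```csharp
using System;

public static class RefSketch
{
    // Each branch assigns the new string back to the caller's variable via ref,
    // so no return value is needed.
    public static void FixHtml(ref string s)
    {
        if (s.StartsWith("<div>", StringComparison.OrdinalIgnoreCase))
        {
            s = s.Substring(5);
            FixHtml(ref s);
        }
        else if (s.StartsWith("<p>", StringComparison.OrdinalIgnoreCase))
        {
            s = s.Substring(3);
            FixHtml(ref s);
        }
        else if (s.EndsWith("</div>", StringComparison.OrdinalIgnoreCase))
        {
            s = s.Substring(0, s.Length - 6);
            FixHtml(ref s);
        }
        else if (s.EndsWith("</p>", StringComparison.OrdinalIgnoreCase))
        {
            s = s.Substring(0, s.Length - 4);
            FixHtml(ref s);
        }
    }

    public static void Main()
    {
        string html = "<div><p>this is a <strong>test</strong></p></div>";
        FixHtml(ref html);
        Console.WriteLine(html); // this is a <strong>test</strong>
    }
}
```

Every Substring call still allocates a new string; 'ref' only changes which variable ends up pointing at it.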

You need to add a return to each FixHtml call, as follows:

    private static string FixHtml(string s)
    {
        //Remove any outer <div>
        if (s.ToLower().StartsWith("<div>"))
        {
            return FixHtml(s.Substring(5, s.Length - 5));
        }
        else if (s.ToLower().StartsWith("<p>"))
        {
            return FixHtml(s.Substring(3, s.Length - 3));
        }
        else if (s.ToLower().EndsWith("</div>"))
        {
            return FixHtml(s.Substring(0, s.Length - 6));
        }
        else if (s.ToLower().EndsWith("</p>"))
        {
            return FixHtml(s.Substring(0, s.Length - 4));
        }

        return s;
    }

For this to work, you need to either use a StringBuilder or work with the new copy of the string that each FixHtml call produces. This is because strings are immutable in .NET.

You can look here to find out what immutable strings are.
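As a small illustration of what immutability means here (the class and method names are made up for this example): calling Substring and ignoring the result leaves the original string untouched, which is exactly what happens in the question's code.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Mirrors the bug: the new string from Substring is discarded.
    public static string IgnoreResult(string s)
    {
        s.Substring(5);
        return s;
    }

    // Mirrors the fix: the new string is actually used.
    public static string UseResult(string s)
    {
        return s.Substring(5);
    }

    public static void Main()
    {
        Console.WriteLine(IgnoreResult("<div>hello</div>")); // <div>hello</div>
        Console.WriteLine(UseResult("<div>hello</div>"));    // hello</div>
    }
}
```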


If you plan on doing this on a server, you should use a StringBuilder. The reason is that memory performance will be HORRENDOUS if you use plain strings. Effectively, every time you remove a tag from your string, you are copying the string. Your system will do this for each recursion (tag), so with HTML input of even a reasonable size you will use a huge amount of memory very quickly.

EDIT: Regarding Chris's comment, the previous statement holds if you are dealing with large strings. If you are parsing small chunks of HTML, using a StringBuilder is not so important. But I assumed that you are using this on a server in a web environment, where it could end up consuming very large pages.

Passing a StringBuilder (a reference type) also lets your function manipulate a mutable value, so at the end of your recursion, StringBuilder.ToString() will correctly output your mutated string.

If you upvote my solution, please also upvote the others who identified string immutability as your problem :).

I tried to answer your problem and also head off your next one, since this is a mistake that many have made before.

Also note that your code will die on <br/>

    // Requires: using System; using System.Text;
    private static string FixHtml(StringBuilder bldr)
    {
        // The length checks avoid an ArgumentOutOfRangeException on short input.
        if (bldr.Length >= 5 && String.Compare(bldr.ToString(0, 5), "<div>", true) == 0)
        {
            bldr.Remove(0, 5);
            return FixHtml(bldr);
        }
        else if (bldr.Length >= 3 && String.Compare(bldr.ToString(0, 3), "<p>", true) == 0)
        {
            bldr.Remove(0, 3);
            return FixHtml(bldr);
        }
        else if (bldr.Length >= 6 && String.Compare(bldr.ToString(bldr.Length - 6, 6), "</div>", true) == 0)
        {
            bldr.Remove(bldr.Length - 6, 6);
            return FixHtml(bldr);
        }
        else if (bldr.Length >= 4 && String.Compare(bldr.ToString(bldr.Length - 4, 4), "</p>", true) == 0)
        {
            bldr.Remove(bldr.Length - 4, 4);
            return FixHtml(bldr);
        }

        return bldr.ToString();
    }