I suggest using Tidy.NET to clean up the messy input.
Tidy.NET has a good API for getting a list of the problems in your "XML" (a TidyMessageCollection), and you can use that list to fix the text stream in memory. The simplest approach is to fix one error at a time, but I suspect that would not work well when there are many errors. Alternatively, you can correct the errors in reverse document order, so that the positions reported by the remaining messages stay valid while you apply corrections.
Here is an example to convert HTML input to XHTML:
using System.IO;
using System.Text;
using TidyNet; // namespace of the Tidy.NET assembly

Tidy tidy = new Tidy();

// Output options: strict XHTML, cleaned-up markup, no generator meta tag.
tidy.Options.DocType = DocType.Strict;
tidy.Options.DropFontTags = true;
tidy.Options.LogicalEmphasis = true;
tidy.Options.Xhtml = true;
tidy.Options.XmlOut = true;
tidy.Options.MakeClean = true;
tidy.Options.TidyMark = false;

// Receives the warnings and errors found while parsing.
TidyMessageCollection tmc = new TidyMessageCollection();
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();

// Copy the HTML into the input stream and rewind it before parsing.
byte[] byteArray = Encoding.UTF8.GetBytes("Put your HTML here...");
input.Write(byteArray, 0, byteArray.Length);
input.Position = 0;
tidy.Parse(input, output, tmc);

string result = Encoding.UTF8.GetString(output.ToArray());
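The reverse-order idea mentioned above can be sketched roughly as follows. This is only an illustration, not part of Tidy.NET: `TextFix` and `ReverseOrderFixer` are hypothetical names, and the sketch assumes you have already turned each reported problem into a character offset, a length, and a replacement string (Tidy.NET messages would first have to be mapped to offsets yourself), and that the fixes do not overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type, not part of Tidy.NET: one correction expressed as
// "replace Length characters starting at Offset with Replacement".
public sealed class TextFix
{
    public int Offset { get; }
    public int Length { get; }
    public string Replacement { get; }

    public TextFix(int offset, int length, string replacement)
    {
        Offset = offset;
        Length = length;
        Replacement = replacement;
    }
}

public static class ReverseOrderFixer
{
    // Applies non-overlapping fixes from the end of the text toward the start,
    // so an edit never shifts the offsets of the fixes still waiting to be applied.
    public static string Apply(string text, IEnumerable<TextFix> fixes)
    {
        var buffer = new StringBuilder(text);
        foreach (var fix in fixes.OrderByDescending(f => f.Offset))
        {
            buffer.Remove(fix.Offset, fix.Length);
            buffer.Insert(fix.Offset, fix.Replacement);
        }
        return buffer.ToString();
    }
}
```

Because the last fix in the document is applied first, every earlier offset still points at the same characters it did in the original text. Applying the same fixes front to back would require adjusting each later offset by the net length change of every fix before it.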