Losing the less-than sign in HtmlAgilityPack LoadHtml

I recently started experimenting with HtmlAgilityPack. I am not familiar with all of its options, and I think that I am doing something wrong.

I have a string with the following contents:

string s = "<span style=\"color: #0000FF;\"><</span>"; 

As you can see, the span contains a less-than sign. I process this string with the following code:

    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(s);

But when I take a quick look at the result, like this:

 htmlDocument.DocumentNode.ChildNodes[0].InnerHtml 

I see that the span is empty.

Which option do I need to set to preserve the less-than sign? I have already tried this:

    htmlDocument.OptionAutoCloseOnEnd = false;
    htmlDocument.OptionCheckSyntax = false;
    htmlDocument.OptionFixNestedTags = false;

but without success.

I know this is invalid HTML. I am using HtmlAgilityPack to fix invalid HTML and to HTML-encode stray less-than characters.

Please point me in the right direction. Thanks in advance.

+8
html c# html-agility-pack
5 answers

Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors through the ParseErrors property of the HtmlDocument class. So, if you run this code:

    string s = "<span style=\"color: #0000FF;\"><</span>";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);
    doc.Save(Console.Out);

    Console.WriteLine();
    Console.WriteLine();

    foreach (HtmlParseError err in doc.ParseErrors)
    {
        Console.WriteLine("Error");
        Console.WriteLine(" code=" + err.Code);
        Console.WriteLine(" reason=" + err.Reason);
        Console.WriteLine(" text=" + err.SourceText);
        Console.WriteLine(" line=" + err.Line);
        Console.WriteLine(" pos=" + err.StreamPosition);
        Console.WriteLine(" col=" + err.LinePosition);
    }

It displays the following (first the corrected markup, then the error details):

    <span style="color: #0000FF;"></span>

    Error
     code=EndTagNotRequired
     reason=End tag </> is not required
     text=<
     line=1
     pos=30
     col=31

So you can try to correct this error yourself, since you have all the necessary information (including line, column, and stream position), but the general process of fixing (as opposed to detecting) errors in HTML is very complicated.

+4

As mentioned in another answer, the best solution I found was to preprocess the HTML string and convert orphaned < characters to their encoded HTML value, &lt;.

 return Regex.Replace(html, "<(?![^<]+>)", "&lt;"); 
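
For illustration, here is a minimal sketch of how that one-liner behaves on the string from the question. The class and method names (OrphanLessThan, Escape) are hypothetical wrappers introduced only for this example; the pattern itself is the one shown above. The negative lookahead treats a < as orphaned when it is not followed by one or more non-< characters and then a >.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class OrphanLessThan
{
    // Replaces every '<' that does not start something shaped like a tag
    // (one or more non-'<' characters followed by '>') with "&lt;".
    public static string Escape(string html)
    {
        return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
    }
}
```

Calling OrphanLessThan.Escape(s) on the string from the question yields <span style="color: #0000FF;">&lt;</span>, which can then be passed to LoadHtml. One limitation worth knowing: because the pattern only checks whether a > appears before the next <, text such as 1 < 2 and 3 > 2 is left unchanged, so this is a heuristic rather than a complete fix.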
+3

Correct the markup, because your HTML string is not valid:

 string s = "<span style=\"color: #0000FF;\">&lt;</span>"; 
+2

Although it is true that this HTML is invalid, HtmlAgilityPack should still be able to parse it. On the web, people often forget to encode "<", and if HtmlAgilityPack is used in a crawler, it should expect bad HTML. I tested the example in IE, Chrome, and Firefox, and they all display the extra < as text.

I wrote the following method, which you can use to preprocess the HTML string and replace all "unclosed" '<' characters with "&lt;":

    // Requires: using System.Collections.Generic; using System.Text;
    static string PreProcess(string htmlInput)
    {
        // Index of the last unclosed '<' character, or -1 if the last '<' is closed.
        int lastLt = -1;

        // Positions of all the unclosed '<' characters.
        List<int> ltPositions = new List<int>();

        // Collect the unclosed '<' characters.
        for (int i = 0; i < htmlInput.Length; i++)
        {
            if (htmlInput[i] == '<')
            {
                if (lastLt != -1)
                    ltPositions.Add(lastLt);

                lastLt = i;
            }
            else if (htmlInput[i] == '>')
                lastLt = -1;
        }

        if (lastLt != -1)
            ltPositions.Add(lastLt);

        // If no unclosed '<' characters were found, just return the input string.
        if (ltPositions.Count == 0)
            return htmlInput;

        // Build the output string, replacing each unclosed '<' with "&lt;".
        StringBuilder htmlOutput = new StringBuilder(htmlInput.Length + 3 * ltPositions.Count);
        int start = 0;

        foreach (int ltPosition in ltPositions)
        {
            htmlOutput.Append(htmlInput.Substring(start, ltPosition - start));
            htmlOutput.Append("&lt;");
            start = ltPosition + 1;
        }

        htmlOutput.Append(htmlInput.Substring(start));
        return htmlOutput.ToString();
    }
+2

The string "s" is bad HTML. This version is correct:

    string s = "<span style=\"color: #0000FF;\">&lt;</span>";

0