Retrieve the first 100 characters of HTML content without removing tags

There are many questions about how to strip HTML tags, but not many about functions or methods that close them.

Here is the situation. I have a message summary of about 500 characters (including HTML tags), but I only want the first 100 characters. The problem is that if I truncate the message, the cut may land in the middle of an HTML tag, which breaks the markup.

Assuming html looks something like this:

<div class="bd">"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <br/>
 <br/>Some Dates: April 30 - May 2, 2010 <br/>
 <p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. <em>Duis aute irure dolor in reprehenderit</em> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. <br/>
 </p>
 For more information about Lorem Ipsum doemdloe, visit: <br/>
 <a href="http://www.somesite.com" title="Some Conference">Some text link</a><br/> 
</div>

How would I grab the first ~100 characters? Ideally, this would be the first ~100 characters of content (the text between the HTML tags), not counting the tags themselves.

I guess the best way to do this would be a recursive algorithm that keeps track of the open HTML tags and appends the closing tags for any that get cut off, but that might not be the best approach.

    // requires: using System.Text; using System.Text.RegularExpressions;

    /// <summary>
    /// Gets the first characters of the text in an html string without stripping tags
    /// </summary>
    /// <param name="htmlString">The html string, not encoded, pure html</param>
    /// <param name="length">The number of text characters to keep (tags are not counted)</param>
    /// <returns>The truncated html string, with all tags preserved</returns>
    public static string GetFirstCharacters(string htmlString, int length)
    {
        if (htmlString == null)
            return string.Empty;

        if(htmlString.Length < length)
            return htmlString;

        // regex to split the string into parts: tags and text runs
        var separateRegex = new Regex("([^>][^<>]*[^<])|[\\S]{1}");
        // regex to identify tags
        var tagsRegex = new Regex("^<[^>]+>$");

        // split the string into tags and text runs
        var matches = separateRegex.Matches(htmlString);

        // loop over the matches:
        // if it is a tag, append it to the result unchanged;
        // if it is text, append a substring of it (limited by the characters still allowed)
        var counter = 0;
        var sb = new StringBuilder();
        for (var i = 0; i < matches.Count; i++)
        {
            var m = matches[i].Value;

            // check if it a tag
            if (tagsRegex.IsMatch(m))
            {
                sb.Append(m);
            }
            else
            {
                var lengthToCut = length - counter;

                var sub = lengthToCut >= m.Length
                    ? m
                    : m.Substring(0, lengthToCut);

                counter += sub.Length;
                sb.Append(sub);
            }
        }

        return sb.ToString();
    }
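
To make the splitting step easier to follow, here is a minimal sketch of just that part. It assumes a simpler pattern than the one in the answer above (`<[^>]+>|[^<]+`, chosen for this illustration only), and `HtmlTokens` and `Split` are hypothetical names. Like any regex approach, it will misread a literal `<` or `>` inside text or attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HtmlTokens
{
    // A tag is '<' ... '>'; a text run is everything up to the next '<'.
    private static readonly Regex TokenRegex = new Regex("<[^>]+>|[^<]+");

    // Splits an html string into an ordered list of tags and text runs.
    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenRegex.Matches(html))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

For `<p>Hello <b>big</b></p>` this yields `<p>`, `Hello `, `<b>`, `big`, `</b>`, `</p>`. Only the text runs count against the character limit, while every tag is copied through, which is why elements that start after the cut-off point end up in the output as empty elements.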

If anyone can see any logical errors or inefficiencies, let me know.

I don't know if this is the best approach, but it seems to work. There are probably cases where it does not work, and it will most likely fail if the HTML is malformed.

    /// <summary>
    /// Get the first n text characters of some html, closing any tags left open.
    /// Requires: using System.Collections.Generic; using System.Text;
    /// </summary>
    private string truncateTo(string s, int howMany, string ellipsis) {

        // return the entire string if it is no longer than n characters
        if (s.Length <= howMany)
            return s;

        Stack<string> elements = new Stack<string>(); // names of the currently open tags
        int trueCount = 0;                            // characters seen outside of tags

        for (int i = 0; i < s.Length; i++) {
            if (s[i] == '<') {

                // find the end of the tag; an unterminated tag means malformed html
                int end = s.IndexOf('>', i + 1);
                if (end < 0)
                    break;

                string tag = s.Substring(i + 1, end - i - 1); // text between '<' and '>'

                if (tag.StartsWith("/")) {
                    // closing tag: take the previous element off the stack
                    if (elements.Count > 0)
                        elements.Pop();
                }
                else if (!tag.EndsWith("/") && !tag.StartsWith("!")) {
                    // opening tag (not self-closing, not a comment or doctype):
                    // the element name runs up to the first whitespace.
                    // note: a bare <br> without the slash is treated as an open element
                    int nameEnd = tag.IndexOfAny(new[] { ' ', '\t', '\r', '\n' });
                    elements.Push(nameEnd < 0 ? tag : tag.Substring(0, nameEnd));
                }

                i = end; // skip the rest of the tag; the for loop then steps past '>'
            }
            else {
                trueCount++;
                if (trueCount > howMany) {
                    // s[i] is the first character over the limit: cut just before it
                    StringBuilder sb = new StringBuilder(s.Substring(0, i));
                    sb.Append(ellipsis);
                    while (elements.Count > 0) {
                        sb.AppendFormat("</{0}>", elements.Pop());
                    }
                    return sb.ToString();
                }
            }
        }

        // fewer than n text characters (or malformed html): nothing to truncate
        return s;
    }

I used XmlReader and XmlWriter for this: https://gist.github.com/2413598

As mentioned above, you should probably run the input through SgmlReader or HtmlAgilityPack first to clean up the incoming markup.
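
For readers who cannot open the gist, here is a rough sketch of what an XmlReader/XmlWriter version might look like. It is not a copy of the linked code: `XmlTruncator` and `Truncate` are hypothetical names, and it assumes the input is already well-formed XML (XHTML). An unclosed `<br>` or an HTML-only entity such as `&nbsp;` will make `XmlReader` throw, which is why cleaning the input first matters.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class XmlTruncator
{
    // Copies nodes from reader to writer until maxChars text characters
    // have been written, then closes whatever elements are still open.
    public static string Truncate(string xhtml, int maxChars)
    {
        var output = new StringBuilder();
        var readerSettings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var writerSettings = new XmlWriterSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment,
            OmitXmlDeclaration = true
        };

        using (var reader = XmlReader.Create(new StringReader(xhtml), readerSettings))
        using (var writer = XmlWriter.Create(output, writerSettings))
        {
            int remaining = maxChars; // text characters we may still write
            int depth = 0;            // elements opened but not yet closed

            while (remaining > 0 && reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        bool isEmpty = reader.IsEmptyElement; // e.g. <br/>
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        if (isEmpty) writer.WriteEndElement();
                        else depth++;
                        break;
                    }
                    case XmlNodeType.Text:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                    {
                        string text = reader.Value;
                        if (text.Length > remaining) text = text.Substring(0, remaining);
                        writer.WriteString(text);
                        remaining -= text.Length;
                        break;
                    }
                    case XmlNodeType.EndElement:
                        writer.WriteEndElement();
                        depth--;
                        break;
                }
            }

            // Close every element that was still open at the cut-off point.
            while (depth > 0)
            {
                writer.WriteEndElement();
                depth--;
            }
        }

        return output.ToString();
    }
}
```

With this sketch, `XmlTruncator.Truncate("<div class=\"bd\">Lorem ipsum <em>dolor</em> sit amet<br/></div>", 14)` returns `<div class="bd">Lorem ipsum <em>do</em></div>`. Because the writer tracks open elements itself, there is no hand-written tag parsing to get wrong.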
