Getting a substring of text containing HTML tags

Question

Suppose you want the first 10 characters of the following:

"<p>this is paragraph 1</p><p>this is paragraph 2</p>"

The output would be:

"<p>this is"

The returned text contains an unclosed P tag. If this is rendered on a page, the open P tag will affect the subsequent content. Ideally, the output would close any unclosed HTML tags in the reverse order in which they were opened:

"<p>this is</p>"

I want a function that returns a substring of HTML while making sure that no tags are left unclosed.

string html asp.net

Shameem Apr 17 '09 at 7:18

5 answers

Rahul · Answer 1 · 2009-04-17T08:15:30+0000

You need to teach your code that the string is HTML or XML. Treating it as a plain string will not let you work with it the way you want. This means first parsing it into the appropriate structure and then working with that structure.

Use an XSL stylesheet

If your HTML is well-formed XML, load it into an XmlDocument and run it through an XSL stylesheet that does something like this (note that XPath string positions start at 1, not 0):

    <xsl:template match="p">
      <xsl:value-of select="substring(text(), 1, 10)" />
    </xsl:template>

Use an HTML parser

If it is not well-formed XML (as in your example, where a stray tag appears in the middle), you will need to use some kind of HTML parser, for example the HTML Agility Pack (see the question about C# HTML parsers).

Do not use regular expressions; HTML is too complex to parse reliably with them.
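As a rough illustration of the parser-based approach, here is a minimal sketch in Python using the standard library's html.parser (not the HTML Agility Pack mentioned above). The names TruncatingParser, html_substring and VOID_TAGS are hypothetical, and the void-tag list is an assumption. Note that this sketch counts text characters only, not the markup, which differs from the raw character count used in the question.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: elements that never take a closing tag (HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Copies markup to the output until `limit` text characters are emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []   # stack of currently open elements
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: copy it, nothing to track.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return  # ignore stray closing tags
        # Close everything opened since the matching start tag.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[: self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True

    def result(self):
        # Close whatever is still open, innermost element first.
        closing = "".join(f"</{t}>" for t in reversed(self.open_tags))
        return "".join(self.out) + closing


def html_substring(html, limit):
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    return parser.result()
```

With the question's input and a limit of 10 text characters, html_substring returns "<p>this is pa</p>": the text is cut after ten characters and the open paragraph is closed.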

Chuhukon · Answer 2 · 2011-04-06T18:35:49+0000

You can use the following static function. For a working example, check out: http://www.koodr.com/item/438c2e9c-62a8-45fc-9ca2-db1479f412e1. You can also turn this into an extension method.

    using System.Text;
    using System.Text.RegularExpressions;

    public static string HtmlSubstring(string html, int maxlength)
    {
        // Initialize regular expressions.
        string htmltag = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
        string emptytags = "<(\\w+)((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?></\\1>";

        // Match every HTML start or end tag; anything else is matched one character at a time.
        var expression = new Regex(string.Format("({0})|(.?)", htmltag));
        MatchCollection matches = expression.Matches(html);

        int i = 0;
        StringBuilder content = new StringBuilder();
        foreach (Match match in matches)
        {
            if (match.Value.Length == 1 && i < maxlength)
            {
                // A single text character: keep it while under the limit.
                content.Append(match.Value);
                i++;
            }
            else if (match.Value.Length > 1)
            {
                // The match is a tag: always keep it so that open tags stay balanced.
                content.Append(match.Value);
            }
        }

        // Remove element pairs that were left empty by the truncation.
        return Regex.Replace(content.ToString(), emptytags, string.Empty);
    }

Cerebrus · Answer 3 · 2009-04-17T07:34:18+0000

Your requirement is quite unclear, so most of this is guesswork. You also did not provide any code that would help clarify what you want to do.

One solution could be:

a. Find the text between the <p> and </p> tags. You can use the following regex to do this, or use a simple string search:

 \<p\>(.*?)\</p\>

b. In the found text, apply Substring() to extract the desired text.

c. Return the extracted text wrapped in <p> and </p> tags.
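The three steps above could be sketched in Python roughly as follows. The function name first_paragraph_prefix is hypothetical, and the sketch assumes the paragraph has no attributes on its <p> tag and contains no nested tags; otherwise the slice could cut through markup.

```python
import re


def first_paragraph_prefix(html, length):
    # Step a: non-greedy match of the first <p>...</p> pair.
    # DOTALL lets the paragraph text span several lines.
    match = re.search(r"<p>(.*?)</p>", html, flags=re.DOTALL)
    if match is None:
        return ""
    # Step b: take the desired prefix of the paragraph text.
    inner = match.group(1)[:length]
    # Step c: wrap the extracted text in <p> and </p> again.
    return "<p>" + inner + "</p>"
```

For example, first_paragraph_prefix("<p>this is paragraph 1</p><p>this is paragraph 2</p>", 7) gives "<p>this is</p>".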

Fenton · Answer 4 · 2009-04-17T07:26:05+0000

You can loop through the HTML string to detect angle brackets and build an array of tags, recording whether each one has a corresponding closing tag. The problem is that HTML allows non-closing tags such as img, br and meta, so you will need to know about those. You will also need rules to check the closing order, because simply matching opening tags with closing tags does not guarantee valid HTML: if you open a div, then a p, and then close the div before closing the p, that isn't valid.
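A minimal sketch of this bookkeeping in Python is shown below. The names unclosed_tags, closing_markup, TAG_RE and VOID_TAGS are hypothetical, the void-tag list is a partial assumption, and the regular expression is only used to pick out tags; it will be fooled by a ">" inside an attribute value.

```python
import re

# Assumption: a partial list of tags that never take a closing tag.
VOID_TAGS = {"img", "br", "meta", "hr", "input", "link"}

# Captures: optional leading "/", tag name, optional trailing "/".
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")


def unclosed_tags(html):
    """Return the stack of still-open tags; raise ValueError on a bad closing order."""
    stack = []
    for closing, name, self_closing in TAG_RE.findall(html):
        name = name.lower()
        if name in VOID_TAGS or self_closing:
            continue  # nothing to close later
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            # e.g. <div><p></div></p>: the div closes before the p
            raise ValueError(f"unexpected </{name}>")
    return stack


def closing_markup(stack):
    # Close the remaining tags in the reverse order in which they were opened.
    return "".join(f"</{name}>" for name in reversed(stack))
```

For a truncated fragment such as "<div><p>text", unclosed_tags returns ["div", "p"], and closing_markup of that stack is "</p></div>", which can be appended to the fragment.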

imxylz · Answer 5 · 2012-12-28T11:59:46+0000

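Try this code (Python 3.x):

    # Tags that never have a closing tag.
    notags = ('img', 'br', 'hr')

    def substring2(html, size):
        if len(html) <= size:
            return html
        result, tag, count = '', '', 0
        tags = []        # stack of currently open tags
        intag = False    # initialized here so text before the first '<' does not fail
        for c in html:
            result += c
            if c == '<':
                intag = True
            elif c == '>':
                intag = False
                tag = tag.split()[0]   # drop any attributes
                if tag[0] == '/':
                    # closing tag: pop the matching open tag
                    tag = tag.replace('/', '')
                    if tag not in notags:
                        tags.pop()
                else:
                    # opening tag: remember it unless it is self-closing
                    if tag[-1] != '/' and tag not in notags:
                        tags.append(tag)
                tag = ''
            else:
                if intag:
                    tag += c
                else:
                    # only characters outside tags count toward the limit
                    count += 1
                    if count >= size:
                        break
        # close the remaining open tags, innermost first
        while len(tags) > 0:
            result += '</{0}>'.format(tags.pop())
        return result

    s = '<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a> language</div>'
    print(s)
    for size in (30, 40, 55):
        print(substring2(s, size))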

Output

    <div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a> language</div>
    <div class="main">html <code>substring</code> function writte</div>
    <div class="main">html <code>substring</code> function written by <span>imxyl</span></div>
    <div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a></div>

More

See the GitHub code.

Another question.
