Split HTML after N words in python

Question

Is there a way to split a long HTML string after N words? Obviously, I could use:

' '.join(foo.split(' ')[:n])

to get the first n words of a plain text string, but with HTML this can split in the middle of a tag, and the result will not be valid HTML because the tags that were opened are never closed.

I need to do this on a Zope/Plone website. If there is something standard in those products that can do this, that would be perfect.

For example, let's say I have the text:

 <p>This is some text with a <a href="http://www.example.com/" title="Example link"> bit of linked text in it </a>. </p>

If I ask for it to be split after 5 words, it should return:

 <p>This is some text with</p>

And for 7 words:

 <p>This is some text with a <a href="http://www.example.com/" title="Example link"> bit </a> </p>

+7

python html zope plone

rjmunro Dec 11 '08 at 16:47

4 answers

I've heard that Beautiful Soup is very good at handling HTML. It will probably help you get valid HTML back out.

+3

recursive Dec 11 '08 at 16:58

I wanted to mention the basic HTMLParser that is built into Python. I'm not sure exactly what end result you are after, so it may or may not get you there; you would mostly be working with its handler methods.
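
To make the handler idea concrete, here is a minimal sketch using only the standard library (Python 3's html.parser). The names WordTruncator and truncate_html are made up for illustration, and the code assumes reasonably well-formed input. It copies tags through as it sees them, counts whitespace-separated words in the text, stops after N words, and then closes whatever is still open. Comments are dropped, and script/style content is not treated specially.

```python
import re
from html import escape
from html.parser import HTMLParser

# Elements that never get a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class WordTruncator(HTMLParser):
    """Hypothetical helper: copies markup until max_words words are seen."""

    def __init__(self, max_words):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.max_words = max_words
        self.words = 0
        self.out = []        # pieces of the truncated document
        self.open_tags = []  # stack of elements still waiting for a close tag
        self.done = max_words <= 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # original tag source
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy it, nothing to close later.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only honour a close tag that matches the innermost open element.
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.done:
            return
        found = list(re.finditer(r"\S+", data))
        remaining = self.max_words - self.words
        if len(found) < remaining:
            self.words += len(found)
        else:
            # Cut right after the last word we are allowed to keep.
            data = data[:found[remaining - 1].end()]
            self.done = True
        self.out.append(escape(data, quote=False))


def truncate_html(html_text, max_words):
    parser = WordTruncator(max_words)
    parser.feed(html_text)
    parser.close()  # flush any text the parser is still buffering
    closers = "".join("</%s>" % t for t in reversed(parser.open_tags))
    return "".join(parser.out) + closers
```

With the example text from the question, truncate_html(text, 5) gives <p>This is some text with</p>. For 7 words the result ends in " bit</a></p>", so the whitespace differs slightly from the output the question asks for, because the cut happens right after the last kept word.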

0

curtisk Dec 11 '08 at 17:07

You can use a combination of regular expressions and BeautifulSoup or Tidy (I prefer BeautifulSoup). The idea is simple: first strip all the HTML tags and find the nth word (n = 7 here). Then count how many times that word appears within the first n words, because only its last occurrence should be used as the truncation point.

Here is a snippet of code; it is a little dirty, but it works:

 # Python 2, using BeautifulSoup 3 and the tidy bindings
 import re
 from BeautifulSoup import BeautifulSoup
 import tidy

 def remove_html_tags(data):
     p = re.compile(r'<.*?>')
     return p.sub('', data)

 input_string = '<p>This is some text with a <a href="http://www.example.com/" '\
                'title="Example link">bit of linked text in it</a></p>'

 s = remove_html_tags(input_string).split(' ')[:7]

 ### Required to ensure that only the last occurrence of the nth word is
 # taken into account for truncating, because the nth word could be
 # 'a'/'and'/'is' etc., which may occur multiple times within n words.
 k = s.count(s[-1])
 j = -1
 for i in range(k):
     # Search from just past the previous hit so j stays an absolute index.
     j = input_string.find(s[-1], j + 1)
 ###

 output_string = input_string[:j + len(s[-1])]

 print "\nBeautifulSoup\n", BeautifulSoup(output_string)
 print "\nTidy\n", tidy.parseString(output_string)

The output is what you want:

 BeautifulSoup
 <p>This is some text with a <a href="http://www.example.com/" title="Example link">bit</a></p>

 Tidy
 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
 <html>
 <head>
 <meta name="generator" content=
 "HTML Tidy for Linux/x86 (vers 6 November 2007), see www.w3.org">
 <title></title>
 </head>
 <body>
 <p>This is some text with a <a href="http://www.example.com/" title="Example link">bit</a></p>
 </body>
 </html>

Hope this helps

Edit: A better regex:

 p = re.compile(r'<[^<]*?>')

0

JV. Dec 11 '08 at 18:24

Carl Meyer · Accepted Answer · 2008-12-11T18:03:44+0000

Take a look at the truncate_html_words function in the django.utils.text module. Even if you are not using Django, the code there does exactly what you want.
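
If you can't pull in Django, the general idea is easy to approximate on your own: walk over the string with a regex that yields either a tag or a word, keep a stack of open tags, stop after N words, and append the missing closing tags. The sketch below is not Django's code. It is a rough standalone version under simplifying assumptions (no ">" inside attribute values, no bare "<" in text), and the name truncate_html_regex is made up for illustration.

```python
import re

# A token is either something that looks like a tag, or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")
# Groups: closing slash, tag name, trailing slash of a self-closing tag.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>")

# Elements that never get a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def truncate_html_regex(html_text, max_words):
    """Cut html_text after max_words words and close any open tags."""
    if max_words <= 0:
        return ""
    open_tags = []  # stack of elements opened but not yet closed
    words = 0
    for m in TOKEN_RE.finditer(html_text):
        token = m.group(0)
        tag = TAG_RE.fullmatch(token)
        if tag:
            closing, name, self_closing = tag.group(1), tag.group(2).lower(), tag.group(3)
            if closing:
                # Only pop when the close tag matches the innermost element.
                if open_tags and open_tags[-1] == name:
                    open_tags.pop()
            elif not self_closing and name not in VOID_TAGS:
                open_tags.append(name)
        elif not token.startswith("<"):
            # Plain text token: count it as one word.
            words += 1
            if words == max_words:
                end = m.end()
                break
    else:
        # Ran out of input before reaching the limit: nothing to truncate.
        return html_text
    closers = "".join("</%s>" % t for t in reversed(open_tags))
    return html_text[:end] + closers
```

Unlike a parser-based approach, this slices the original string, so everything before the cut point is kept byte for byte. The trade-off is that the regex can be fooled by unusual markup.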
