BeautifulSoup FindAll

I have this XML:

    <article>
      <uselesstag></uselesstag>
      <topic>oil, gas</topic>
      <body>body text</body>
    </article>
    <article>
      <uselesstag></uselesstag>
      <topic>food</topic>
      <body>body text</body>
    </article>
    <article>
      <uselesstag></uselesstag>
      <topic>cars</topic>
      <body>body text</body>
    </article>

There are many, many unnecessary tags. I want to use BeautifulSoup to collect the text from each body tag, together with its related topic text, and create a new XML file.

I am new to Python, but I suspect something along these lines

    import re
    from BeautifulSoup import BeautifulSoup

    totstring = ""
    with open('reut2-000.sgm', 'r') as inF:
        for line in inF:
            # strip characters other than letters, digits, tags, whitespace, = ! "
            line = re.sub(r'[^0-9a-zA-Z<>/\s=!"]+', "", line)
            totstring += line

    soup = BeautifulSoup(totstring)
    for anchor in soup.findAll('body'):
        # Stick body and its topics in an associated array?
        pass

will work.

1) How do I do this? 2) Should I add a root node to the XML? Without one, isn't this invalid XML?

Many thanks

Edit:

What I want to end up with:

    <article>
      <topic>oil, gas</topic>
      <body>body text</body>
    </article>
    <article>
      <topic>food</topic>
      <body>body text</body>
    </article>
    <article>
      <topic>cars</topic>
      <body>body text</body>
    </article>

There are many, many unnecessary tags.

2 answers

OK, here is a solution.

Make sure you have "beautifulsoup4" installed: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Here is my code to get the body tags and topics:

    from bs4 import BeautifulSoup

    html_doc = """
    <article><topic>oil, gas</topic><body>body text</body></article>
    <article><topic>food</topic><body>body text</body></article>
    <article><topic>cars</topic><body>body text</body></article>
    """

    soup = BeautifulSoup(html_doc, "html.parser")
    bodies = [a.get_text() for a in soup.find_all('body')]
    topics = [a.get_text() for a in soup.find_all('topic')]
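To build the new XML file the question asks for, one possible continuation (a sketch; the variable names and the `<articles>` root are my own choices, using the stdlib `xml.etree.ElementTree`) is to iterate per `<article>` so each topic stays paired with its own body, which also answers question 2 by adding a root node:

```python
from xml.etree import ElementTree as ET
from bs4 import BeautifulSoup

html_doc = """
<article><topic>oil, gas</topic><body>body text</body></article>
<article><topic>food</topic><body>body text</body></article>
<article><topic>cars</topic><body>body text</body></article>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Build the new document under a single root so the result is well-formed XML
root = ET.Element("articles")
for article in soup.find_all("article"):
    new_article = ET.SubElement(root, "article")
    for name in ("topic", "body"):
        tag = article.find(name)
        if tag is not None:
            ET.SubElement(new_article, name).text = tag.get_text()

new_xml = ET.tostring(root, encoding="unicode")
print(new_xml)
```

Iterating article by article matters: two separate `find_all` passes (one for bodies, one for topics) can fall out of sync if any article is missing a tag.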

Another way to remove empty XML or HTML tags is to use a recursive function that searches for empty tags and removes them with `.extract()`. That way you do not need to manually list the tags you want to keep, and it also clears empty tags that are nested.

    from bs4 import BeautifulSoup
    import re

    nonwhite = re.compile(r'\S+', re.U)

    html_doc1 = """
    <article>
    <uselesstag2> <uselesstag1> </uselesstag1> </uselesstag2>
    <topic>oil, gas</topic>
    <body>body text</body>
    </article>
    <p>21.09.2009</p>
    <p> </p>
    <p1><img src="http://www.www.com/"></p1>
    <p></p>
    <!--- This article is about cars--->
    <article>
    <topic>cars</topic>
    <body>body text</body>
    </article>
    """

    def nothing_inside(thing):
        # find_all only passes tags to this filter, so comments/strings are skipped
        try:
            # keep <img> tags that have a non-empty src
            if thing.name == 'img' and thing['src'] != '':
                return False
            # keep the tag if it has any non-whitespace contents
            for item in thing.contents:
                if nonwhite.match(item):
                    return False
            return True
        except (AttributeError, KeyError, TypeError):
            # a child that is itself a tag raises TypeError in match();
            # treat the parent as non-empty for now and revisit next pass
            return False

    def scrub(thing):
        # loop as long as an empty tag exists: removing an empty tag
        # may leave its parent empty in turn (nested useless tags)
        while thing.find_all(nothing_inside, recursive=True):
            for emptytag in thing.find_all(nothing_inside, recursive=True):
                emptytag.extract()
        return thing

    soup = BeautifulSoup(html_doc1, "html.parser")
    print(scrub(soup))

Result:

    <article>
    <topic>oil, gas</topic>
    <body>body text</body>
    </article>
    <p>21.09.2009</p>
    <p1><img src="http://www.www.com/"/></p1>
    <!--- This article is about cars--->
    <article>
    <topic>cars</topic>
    <body>body text</body>
    </article>
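Note that `scrub()` leaves comments such as `<!--- This article is about cars--->` in place, since the filter only examines tags. If comments should be removed as well, a short sketch (not part of the original answer) using bs4's `Comment` type:

```python
from bs4 import BeautifulSoup, Comment

html_doc = "<article><!-- a comment --><topic>cars</topic><body>body text</body></article>"
soup = BeautifulSoup(html_doc, "html.parser")

# comments are NavigableString subclasses, so match them by type
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

cleaned = str(soup)
print(cleaned)
```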

Source: https://habr.com/ru/post/1411562/

