I have this XML:

    <article>
      <uselesstag></uselesstag>
      <topic>oil, gas</topic>
      <body>body text</body>
    </article>
    <article>
      <uselesstag></uselesstag>
      <topic>food</topic>
      <body>body text</body>
    </article>
    <article>
      <uselesstag></uselesstag>
      <topic>cars</topic>
      <body>body text</body>
    </article>
There are many, many unnecessary tags. I want to use BeautifulSoup to collect the text of every body tag together with the corresponding topic text, and build a new XML document from them.
I am new to Python, but I suspect something along the lines of
    import arff
    from xml.etree import ElementTree
    import re
    from StringIO import StringIO
    import BeautifulSoup
    from BeautifulSoup import BeautifulSoup

    totstring=""
    with open('reut2-000.sgm', 'r') as inF:
        for line in inF:
            string=re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+","", line)
            totstring+=string

    soup = BeautifulSoup(totstring)
    body = soup.find("body")
    for anchor in soup.findAll('body'):
will work.
1) How do I do this?
2) Should I add a root node to the XML? Without one, this is not valid XML, is it?
Many thanks
Edit:
What I want to end up with:

    <article>
      <topic>oil, gas</topic>
      <body>body text</body>
    </article>
    <article>
      <topic>food</topic>
      <body>body text</body>
    </article>
    <article>
      <topic>cars</topic>
      <body>body text</body>
    </article>
There are many, many unnecessary tags.
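
For illustration, here is a minimal sketch of the transformation I am after. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, and it assumes the input is otherwise well-formed once it is wrapped in a single root element. The root name `articles`, the `RAW` sample and the helper `strip_articles` are made up for this example. A real .sgm file may not be well-formed XML, in which case a more lenient parser would probably be needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input mirroring the fragment shown in the question.
RAW = """
<article> <uselesstag></uselesstag> <topic>oil, gas</topic> <body>body text</body> </article>
<article> <uselesstag></uselesstag> <topic>food</topic> <body>body text</body> </article>
<article> <uselesstag></uselesstag> <topic>cars</topic> <body>body text</body> </article>
"""

KEEP = ("topic", "body")  # child tags to carry over, in output order


def strip_articles(raw, root_tag="articles"):
    """Return a new root element whose <article> children hold only KEEP tags."""
    # A well-formed XML document has exactly one root element, so the
    # fragment is wrapped in an (assumed) root before parsing.
    source = ET.fromstring("<{0}>{1}</{0}>".format(root_tag, raw))
    result = ET.Element(root_tag)
    for article in source.findall("article"):
        new_article = ET.SubElement(result, "article")
        for tag in KEEP:
            child = article.find(tag)
            if child is not None:
                # Copy only the text; attributes and nested tags are dropped.
                ET.SubElement(new_article, tag).text = child.text
    return result


cleaned = strip_articles(RAW)
print(ET.tostring(cleaned, encoding="unicode"))
```

Wrapping everything in one root also covers question 2: the XML parser rejects several top-level article elements in a row, but accepts them once they sit under a single parent.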