    import os
    from bs4 import BeautifulSoup

    do = dir_with_original_files = 'C:\FOLDER'
    dm = dir_with_modified_files = 'C:\FOLDER'
    for root, dirs, files in os.walk(do):
        for f in files:
            print f.title()
            if f.endswith('~'):  # you don't want to process backups
                continue

            original_file = os.path.join(root, f)
            # You can keep the same name if you omit the next two lines.
            # They are in separate directories anyway. In that case, mf = f
            mf = f.split('.')
            mf = ''.join(mf[:-1]) + '_mod.' + mf[-1]
            modified_file = os.path.join(dm, mf)
            with open(original_file, 'r') as orig_f, \
                 open(modified_file, 'w') as modi_f:
                soup = BeautifulSoup(orig_f.read())
                for t in soup.find_all('table'):
                    # ***** this is fine for now, but how would I restrict it
                    # to find only the first element?
                    for child in t.find_all("table"):
                        child.REMOVE()  # ****** PROBLEM HERE ******
                # This is where you create your new modified file.
                modi_f.write(soup.prettify().encode(soup.original_encoding))
Hello to all,
I am trying to parse files with BeautifulSoup to clean them up a bit. I want to delete the first table nested anywhere inside another table, for example:
    <table>
      <tr>
        <td></td>
      </tr>
      <tr>
        <td><table></table> <----- this will be deleted</td>
      </tr>
      <tr>
        <td><table></table> --- this will remain here.</td>
      </tr>
    </table>
At the moment, my code finds all the tables nested inside each table, and I have written a placeholder .REMOVE() call to show what I want to do. How can I remove such an element, and how can I restrict the removal to the first nested table only?
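For reference, here is a minimal sketch of one way this might be done, not necessarily the best one. It assumes bs4 is installed and uses the standard-library "html.parser" backend. The sample markup, the id attributes, and the helper name remove_first_nested_table are made up for illustration. It relies on find(), which returns only the first matching descendant (or None), and on decompose(), which removes a tag from the tree and destroys it. It only handles the first top-level table.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the id attributes are only there so the
# result can be inspected afterwards.
html = """
<table>
  <tr><td></td></tr>
  <tr><td><table id="first"></table></td></tr>
  <tr><td><table id="second"></table></td></tr>
</table>
"""


def remove_first_nested_table(soup):
    """Remove the first <table> nested inside the first <table> in soup.

    Returns True if a nested table was removed, False otherwise.
    """
    outer = soup.find("table")      # first table in the document
    if outer is None:
        return False
    inner = outer.find("table")     # first descendant table, or None
    if inner is None:
        return False
    inner.decompose()               # drop the tag and its contents
    return True


soup = BeautifulSoup(html, "html.parser")
removed = remove_first_nested_table(soup)

# Collect the ids of the tables that are still in the tree.
remaining_ids = [t.get("id") for t in soup.find_all("table")]
```

If the removed table is still needed afterwards, extract() could be used in place of decompose(), since it removes the tag from the tree and returns it.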
Simon kely