If you want to avoid extracting the contents of script tags with BeautifulSoup,
nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)
will do this for you, getting the root's immediate children that are not script tags (a separate htmlDom.findAll(recursive=False, text=True) will get the strings that are immediate children of the root). To cover the whole tree, you need to do this recursively; for example, as a generator:
def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):
I use childGenerator (instead of findAll) so that I can get all the children in order and do my own filtering.
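For reference, here is a minimal sketch of the same recursive idea written against the newer bs4 package. It assumes bs4 is installed and uses the built-in "html.parser"; the names get_strings and skip are illustrative and not part of the original snippet or of the library. It may not match the original code line for line, but it follows the approach described above: walk the children in order, skip script tags entirely, recurse into other tags, and yield strings.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def get_strings(root, skip=("script",)):
    """Yield every string under root in document order, ignoring tags named in skip.

    get_strings and skip are hypothetical names chosen for this sketch.
    """
    for child in root.children:
        if isinstance(child, Tag):
            if child.name in skip:
                continue  # do not descend into <script> at all
            # Recurse into any other tag and pass its strings through.
            yield from get_strings(child, skip)
        elif isinstance(child, NavigableString):
            # Note: comments are NavigableString subclasses and are yielded too.
            yield str(child)


html = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script><p>Bye</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
strings = list(get_strings(soup))
print(strings)
```

Passing a different tuple, for example skip=("script", "style"), would filter out other tags the same way.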