How to remove text between <script> and </script> using python?

How can I remove the text between <script> and </script> using Python?

+5
9 answers

You can do this with BeautifulSoup (among other methods):

from bs4 import BeautifulSoup

# source is the HTML string to clean
soup = BeautifulSoup(source, "html.parser")
for item in soup.find_all('script'):
    item.extract()  # detach the <script> node from the tree

This actually removes the nodes from the HTML. If you want to keep the empty <script></script> tags, you will have to work with the attributes of item rather than just extracting it from the soup.

+25

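Are you trying to prevent XSS? Removing the <script> tags alone will not stop every attack! There are many other ways in (some of them quite creative); see the list at http://ha.ckers.org/xss.html. After reading it you should understand why stripping <script> tags with a regular expression is not robust enough. The Python library lxml has a function that cleans HTML so that it is safe to display.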

If you are sure that you only want to remove the <script> tags, this lxml code should work:

from lxml.html import parse

root = parse(filename_or_url).getroot()
for element in root.iter("script"):
    element.drop_tree()

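Note: the other answers here rely on regular expressions. For why you should not parse HTML that way, see the question "Using regular expressions to parse HTML: why not?"

Note 2: another SO question shows HTML that regular expressions cannot handle: "Can you provide some examples of why it is hard to parse XML and HTML with a regex?"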

+5

You can do this with the HTMLParser module (more involved) or with regular expressions:

import re
content = "asdf <script> bla </script> end"
x = re.search(r"<script>.*?</script>", content, re.DOTALL)
span = x.span()  # gives (5, 27)

stripped_content = content[:span[0]] + content[span[1]:]

EDIT: added re.DOTALL, thanks to tgray.
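
For comparison, here is a rough sketch of the HTMLParser route mentioned in this answer, using the Python 3 module html.parser. The class name ScriptStripper and the function strip_scripts are made up for this illustration. It assumes you only need tags, text, and entities copied through; comments and doctype declarations are dropped, so treat it as a starting point rather than a complete solution.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit the markup it is fed, leaving out <script> elements."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != "script":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside a <script> element arrives here and is skipped.
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_scripts(html):
    """Return html with <script> elements and their contents removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_scripts("asdf <script> bla </script> end"))
```

The parser lowercases tag names, so <SCRIPT> and tags with attributes are handled without extra work.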

0

<script> </script>, node?

src resig-?

0

Following the answers posted by Pev and wr, why not improve the regular expression, for example:

pattern = r"(?is)<script[^>]*>(.*?)</script>"
text = """<script>foo bar  
baz bar foo  </script>"""
re.sub(pattern, '', text)

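(?is) makes the pattern ignore case and lets . match newlines in the text. This version should also handle script tags with attributes.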
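
A small check of what the (?is) flags and the [^>]* part contribute. The sample HTML string is invented for this illustration.

```python
import re

# Pattern from the answer: (?i) ignores case, (?s) lets "." match newlines,
# and [^>]* allows attributes inside the opening tag.
pattern = r"(?is)<script[^>]*>(.*?)</script>"

# Invented sample input: mixed case, an attribute, and line breaks.
html = '<p>before</p><SCRIPT type="text/javascript">\nalert("hi");\n</Script><p>after</p>'

cleaned = re.sub(pattern, "", html)
print(cleaned)

# The same pattern without the flags does not match this input at all.
plain = r"<script[^>]*>(.*?)</script>"
unchanged = re.sub(plain, "", html)
print(unchanged == html)
```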

EDIT: for anything beyond simple cases, a real parser is the safer choice. The lxml solution is better; Beautiful Soup is another option.

If you want to keep the tags and remove only the text between them:

pattern = r"(?is)(<script[^>]*>)(.*?)(</script>)"
text = """<script>foo bar  
baz bar foo  </script>"""
re.sub(pattern, r'\1\3', text)  # raw string, so \1 and \3 are group references
0

Element Tree - . , ; - , ! ( )

0

I don't know Python well enough to give you a solution. But if you want to use this to sanitize user input, you have to be very careful. Removing what is between <script> and </script> does not catch everything. Perhaps you can look at existing solutions (I assume Django includes something like this).
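
A short illustration of that warning, using only the standard re module. The two input strings are invented examples of well-known tricks, not an exhaustive list.

```python
import re

SCRIPT_RE = re.compile(r"(?is)<script[^>]*>.*?</script>")

# Invented input 1: a <script> block plus an inline event handler.
with_handler = '<script>alert(1)</script><img src="x" onerror="alert(2)">'
after_handler = SCRIPT_RE.sub("", with_handler)
print(after_handler)  # the onerror attribute is still there

# Invented input 2: a tag split so that one removal pass reassembles it.
nested = "<scr<script></script>ipt>alert(3)</script>"
after_nested = SCRIPT_RE.sub("", nested)
print(after_nested)  # a complete <script> block reappears
```

Both results still contain executable JavaScript, which is why a dedicated sanitizer is the safer choice for user input.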

-1
example_text = "This is some text <script> blah blah blah </script> this is some more text."

import re
# Greedy groups: this handles a single <script> block on one line.
myre = re.compile(r"(^.*)<script>(.*)</script>(.*$)")
result = myre.match(example_text)
result.groups()
# ('This is some text ', ' blah blah blah ', ' this is some more text.')

# Text between <script> .. </script>
result.group(2)
# ' blah blah blah '

# Text outside of <script> .. </script>
result.group(1) + result.group(3)
# 'This is some text  this is some more text.'
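
One caveat with this pattern: the groups are greedy, so with more than one script block only the last block is separated out. A quick comparison with a non-greedy re.sub, using an invented input string:

```python
import re

# Invented input with two script blocks.
text = "one <script>a()</script> two <script>b()</script> three"

# Greedy pattern from the answer: group 1 swallows the first block.
greedy = re.compile(r"(^.*)<script>(.*)</script>(.*$)")
m = greedy.match(text)
outside_greedy = m.group(1) + m.group(3)
print(outside_greedy)

# Non-greedy substitution removes every block.
outside_lazy = re.sub(r"<script>.*?</script>", "", text)
print(outside_lazy)
```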
-1

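If you do not want to import any modules:

string = "<script> this is some js. begone! </script>"

# Find the tag boundaries with plain string methods.
start = string.find('<script>')
end = string.find('</script>', start)

if start != -1 and end != -1:
    # Drop everything from the opening tag through the closing tag.
    string = string[:start] + string[end + len('</script>'):]

print(string)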
-1
