Can I use BeautifulSoup to extract data from embedded JavaScript?

I want to scrape a block of data from a series of pages that have the data hidden in a JSON object inside a script tag. I'm pretty comfortable with BeautifulSoup, but I think I could be barking up the wrong tree trying to use it to get data out of JavaScript.

The structure of the pages is approximately:

...
<script>
  $(document).ready(function(){
    var data = $.data(graph_selector, [
         { data: charts.createData("Stuff I want")}
    ])};
</script>

The head and body have a million scripts each, but there is only one var data per page. I'm not sure how I would identify this specific <script> for BeautifulSoup, except by the presence of var data.

Can I do it? Or do I need another tool?

1 answer

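BeautifulSoup is an HTML parser; it does not parse JavaScript.

That said, you have several options:

  • Parse the JavaScript with a JavaScript parser, for example slimit: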

    from bs4 import BeautifulSoup
    from slimit import ast
    from slimit.parser import Parser
    from slimit.visitors import nodevisitor
    
    data = """
    <script>
        var data = $.data(graph_selector, [
             { data: charts.createData("Stuff I want")}
        ]);
    </script>
    """
    
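    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    soup = BeautifulSoup(data, 'html.parser')
    script = soup.find('script')
    
    # Parse the script body and pull out the first argument of createData(...).
    parser = Parser()
    tree = parser.parse(script.string)
    print(next(node.args[0].value for node in nodevisitor.visit(tree)
               if isinstance(node, ast.FunctionCall) and node.identifier.identifier.value == 'createData'))
    # prints "Stuff I want"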
    

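    Here, for simplicity, the sample contains a single script tag. On a page with many script tags, you would first need to find the right one, for example by checking its text.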

  • Use a regular expression (simpler, but more fragile, since it does not actually parse the JS code):

    import re
    from bs4 import BeautifulSoup
    
    data = """
    <script>
    $(document).ready(function() {
    var data = $.data(graph_selector, [{data: charts.createData("Stuff I want")}])};
    </script>
    """
    
    soup = BeautifulSoup(data, 'html.parser')
    script = soup.find('script')
    
    # Escape the dot so it matches a literal "." rather than any character.
    pattern = r'charts\.createData\("(.*?)"\)'
    print(re.search(pattern, script.string).group(1))  # prints "Stuff I want"
    
  • Use something that can execute JavaScript: selenium (a real browser), or the V8 engine through PyExecJS.
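
The question asks how to single out the one script tag that contains var data when the page has many scripts. A minimal sketch of one way to do that follows, assuming a reasonably recent bs4 in which find() accepts a string= argument (older releases call it text=). The sample page and the helper names find_data_script and extract_payload are made up for illustration, and the closing of the ready() callback is written as valid JavaScript rather than copied from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: several script tags, only one of which declares "var data".
html = """
<html><head>
<script>console.log("unrelated");</script>
<script>
  $(document).ready(function(){
    var data = $.data(graph_selector, [
         { data: charts.createData("Stuff I want")}
    ]);
  });
</script>
<script>var other = 1;</script>
</head><body></body></html>
"""

def find_data_script(markup):
    """Return the first <script> tag whose text contains 'var data', or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # A compiled regex passed as string= is matched against each tag's text.
    return soup.find("script", string=re.compile(r"\bvar\s+data\b"))

def extract_payload(markup):
    """Return the string passed to charts.createData(...), or None if absent."""
    script = find_data_script(markup)
    if script is None:
        return None
    match = re.search(r'charts\.createData\("(.*?)"\)', script.string)
    return match.group(1) if match else None

print(extract_payload(html))
```

Once the right tag is found, its text can be handed to either the slimit parser or the regular expression shown above. Both helpers return None instead of raising when nothing matches.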
