How to get web page resource content using Chrome remote debugging

I want to get the content of a web page resource from Python through the Chrome debugging protocol. In the protocol's Page domain I noticed the method getResourceContent, which takes the parameters frameId and url. I believe this method is what I need, so I did the following:

1. Started Chrome as a server: .\chrome.exe --remote-debugging-port=9222

2. Ran this Python test code:

    # coding=utf-8
    """
    chrome --remote-debugging api test
    """
    import json
    import requests
    import websocket
    import pdb


    def send():
        # First connection: navigate to the page
        geturl = requests.get('http://localhost:9222/json')
        websocketURL = json.loads(geturl.content)[0]['webSocketDebuggerUrl']
        request = {}
        request['id'] = 1
        request['method'] = 'Page.navigate'
        request['params'] = {"url": 'http://global.bing.com'}
        ws = websocket.create_connection(websocketURL)
        ws.send(json.dumps(request))
        res = ws.recv()
        ws.close()
        print(res)
        frameId = json.loads(res)['result']['frameId']
        print(frameId)

        # Second connection: request the resource content
        geturl = requests.get('http://localhost:9222/json')
        websocketURL = json.loads(geturl.content)[0]['webSocketDebuggerUrl']
        req = {}
        req['id'] = 1
        req['method'] = 'Page.getResourceContent'
        req['params'] = {"frameId": frameId, "url": 'http://global.bing.com'}
        header = ["User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"]
        pdb.set_trace()
        ws = websocket.create_connection(websocketURL, header=header)
        ws.send(json.dumps(req))
        ress = ws.recv()
        ws.close()
        print(ress)


    if __name__ == '__main__':
        send()

3. Page.navigate works fine; I got something like this: {"id": 1, "result": {"frameId": "8504,2"}}

4. When I call getResourceContent, an error occurs: {"error": {"code": -32000, "message": "Agent is not turned on."}, "id": 1}

I tried adding a User-Agent header, but it still does not work.

Thanks.

1 answer

The "Agent is not turned on" error message has nothing to do with the HTTP User-Agent header. It refers to an agent inside Chrome that must be enabled before the page content can be retrieved.

The term "agent" is a bit misleading, since the protocol documentation talks about domains that must be enabled for debugging (I suppose "agent" refers to how this is implemented inside Chrome).

So the question is: which domain must be enabled to access the contents of the page? In hindsight it is quite obvious: the Page domain must be enabled, since the method we call belongs to that domain. I only figured this out after stumbling upon this example.
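To make the idea of enabling a domain concrete, here is a minimal sketch of what such a request looks like on the wire, and how a reply like the one in the question can be recognized as an error. The helper names enable_request and is_error_reply are made up for illustration; only the "<Domain>.enable" method naming and the JSON shapes are taken from the protocol and from the output shown in the question.

```python
import json


def enable_request(domain, request_id):
    """Build the JSON text of a '<domain>.enable' command (hypothetical helper)."""
    return json.dumps({"id": request_id, "method": domain + ".enable", "params": {}})


def is_error_reply(raw_reply):
    """Return True if a raw reply carries an "error" member instead of a "result"."""
    return "error" in json.loads(raw_reply)


if __name__ == "__main__":
    # The text that would be sent over the websocket to enable the Page domain.
    print(enable_request("Page", 2))
```

Sending the string returned by enable_request("Page", ...) on the same connection, before calling methods of the Page domain, is the step that was missing in the question's script.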

As soon as I added the Page.enable request to the script to activate the Page domain, the error message disappeared. However, I ran into two other problems:

  • The websocket connection must be kept open between requests, because Chrome retains some state between calls (for example, whether the agent is enabled).
  • When navigating to http://global.bing.com/, the browser is redirected to http://www.bing.com/ (at least on my computer). As a result, Page.getResourceContent cannot retrieve the resource, because the requested resource http://global.bing.com/ is not available.
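For the first point, one possible approach is to wrap the single open connection in a small helper that numbers the requests and waits for the reply with the matching id. This is only a sketch: the class name DebugSession is hypothetical, and conn stands for any object with send() and recv() methods, such as the one returned by websocket.create_connection. The loop also skips messages without a matching id, since an enabled domain can push event notifications in between replies.

```python
import itertools
import json


class DebugSession:
    """Send commands over ONE open connection and match replies by id.

    `conn` is assumed to offer send(str) and recv() -> str, for example
    the object returned by websocket.create_connection(...).
    """

    def __init__(self, conn):
        self.conn = conn
        self._ids = itertools.count(1)  # a fresh id for every command

    def call(self, method, params=None):
        request_id = next(self._ids)
        self.conn.send(json.dumps(
            {"id": request_id, "method": method, "params": params or {}}))
        while True:
            message = json.loads(self.conn.recv())
            # Event notifications carry no "id"; skip them and keep waiting.
            if message.get("id") == request_id:
                return message
```

With such a helper, the script would create one connection, call Page.enable, Page.navigate and Page.getResourceContent through the same DebugSession, and close the connection only at the end.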

After fixing these issues, I was able to get the contents of the page. This is my code:

    # coding=utf-8
    """
    chrome --remote-debugging api test
    """
    import json
    import requests
    import websocket


    def send():
        # Set up the websocket connection and keep it open for all requests:
        geturl = requests.get('http://localhost:9222/json')
        websocketURL = json.loads(geturl.content)[0]['webSocketDebuggerUrl']
        ws = websocket.create_connection(websocketURL)

        # Navigate to global.bing.com:
        request = {}
        request['id'] = 1
        request['method'] = 'Page.navigate'
        request['params'] = {"url": 'http://global.bing.com'}
        ws.send(json.dumps(request))
        result = ws.recv()
        print("Page.navigate: " + result)
        frameId = json.loads(result)['result']['frameId']

        # Enable page agent:
        request = {}
        request['id'] = 1
        request['method'] = 'Page.enable'
        request['params'] = {}
        ws.send(json.dumps(request))
        print("Page.enable: " + ws.recv())

        # Retrieve resource contents (note the post-redirect URL):
        request = {}
        request['id'] = 1
        request['method'] = 'Page.getResourceContent'
        request['params'] = {"frameId": frameId, "url": 'http://www.bing.com'}
        ws.send(json.dumps(request))
        result = ws.recv()
        print("Page.getResourceContent: " + result)

        # Close websocket connection
        ws.close()


    if __name__ == '__main__':
        send()
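The URL http://www.bing.com is hard-coded above because that is where the redirect ended up on my machine. A more general option might be to ask Chrome which document is actually loaded in the frame. The sketch below assumes the Page.getResourceTree method and the result shape given in the protocol documentation (frameTree.frame.id and frameTree.frame.url); the function names and the sample data are illustrative only, and I have not tested this against every Chrome version.

```python
def main_frame_target(resource_tree_result):
    """Extract (frameId, url) of the top-level frame.

    `resource_tree_result` is assumed to be the "result" member of a
    Page.getResourceTree reply.
    """
    frame = resource_tree_result["frameTree"]["frame"]
    return frame["id"], frame["url"]


def resource_content_params(resource_tree_result):
    """Build the params for Page.getResourceContent from the resource tree."""
    frame_id, url = main_frame_target(resource_tree_result)
    return {"frameId": frame_id, "url": url}


if __name__ == "__main__":
    # Hand-written sample in the assumed shape, not real Chrome output.
    sample = {"frameTree": {"frame": {"id": "1.1", "url": "http://www.bing.com/"},
                            "resources": []}}
    print(resource_content_params(sample))
```

The tree only reflects what has been loaded at the moment of the call, so it would have to be requested after the navigation, including the redirect, has finished.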