Scrape the HTML from a webpage after its JSON / JavaScript has rendered, without Selenium, supporting POSTing

I am using the code below, which works fine, but it does not let the JavaScript run first, which means I do not get the HTML I actually want from the page.

I looked at dryscrape, but as far as I can tell it does not support POSTing (which I need, as you can see in the auto_login() function below), and the same goes for PyQt4.

The website has 4 bits of JSON, "listings", which build the page when it loads / renders. If I look at the page source it does not look pretty and I cannot easily find the material in it; however, if I "Inspect Element" on the page, the HTML looks perfect, and I can then parse it easily with BeautifulSoup.
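
Since the page is built from those JSON blobs, I suppose I could also try requesting them directly with the same logged-in session and skip rendering altogether. A rough, untested sketch (the endpoint URL here is made up; the real one would have to be read from the browser's Network tab, and SESSION / HEADERS are the ones from the script below):

    # Hypothetical: substitute the actual XHR URL the page requests for its "listings"
    LISTINGS_URL = "https://website.com/listed/public/gen1/listings.json"

    resp = SESSION.get(LISTINGS_URL, headers=HEADERS, verify=False)
    listings = resp.json()  # the JSON pieces that build the page
    for item in listings:
        print(item)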

I know that I can use Selenium, but that is not what I want to do, mainly because it launches a browser; I could use PhantomJS or PyVirtualDisplay to hide it, but that would only be a last resort.

    import requests
    from bs4 import BeautifulSoup

    HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
    }

    SESSION = requests.session()
    RESPONSE = None  # filled in by auto_login()

    email = "mail@mail.com"
    password = "password123"  # 'pass' is a reserved word in Python, so renamed


    def auto_login():
        """POST the login form and keep the response for parsing later."""
        global RESPONSE
        url = "https://website.com/log.php"
        payload = {
            "log": email,
            "pwd": password,
            "finish": "https://website.com/listed/public/gen1/",
        }
        RESPONSE = SESSION.post(url, data=payload, headers=HEADERS, verify=False)


    def process_html():
        """Parse the stored response body with BeautifulSoup."""
        return BeautifulSoup(RESPONSE.content, 'html.parser')


    def main():
        auto_login()
        PROCESSED_HTML = process_html()


    if __name__ == "__main__":
        main()
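
To illustrate: the login POST itself goes through, it is only the JavaScript-built parts that are missing from the response. A quick check (nothing more than a debugging sketch):

    auto_login()
    # Status code / final URL confirm the POST worked; RESPONSE.content is
    # still the unrendered source described above.
    print(RESPONSE.status_code, RESPONSE.url)
    print(len(RESPONSE.content))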

How can I render a JavaScript web page from my (modified) script, preferably with PyQt4 (dryscrape will not install correctly on my Windows 10 / Python 2.7 setup), without using Selenium?
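
For reference, the kind of PyQt4 / QtWebKit rendering I mean would look roughly like this (untested on my machine; it only does a plain GET of the finish URL, so the logged-in cookies from requests would still have to be carried over into Qt's network stack somehow):

    import sys
    from PyQt4.QtGui import QApplication
    from PyQt4.QtCore import QUrl
    from PyQt4.QtWebKit import QWebPage
    from bs4 import BeautifulSoup

    class Render(QWebPage):
        """Load a URL, let QtWebKit run its JavaScript, then keep the final HTML."""
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._load_finished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def _load_finished(self, result):
            # toHtml() returns the DOM *after* the page's JavaScript has run
            self.html = self.mainFrame().toHtml()
            self.app.quit()

    page = Render("https://website.com/listed/public/gen1/")
    # unicode() because toHtml() gives back a QString (Python 2.7)
    PROCESSED_HTML = BeautifulSoup(unicode(page.html), 'html.parser')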

Any ideas would be appreciated.

1 answer

I could not get dryscrape to work for me, so I cannot verify this; however, I think this idea / hack might work - minor changes may be required.

The idea is that you create a small local HTML file containing the form and the parameters you need to pass. The HTML file is generated from Python, so you can fill in the value of each parameter.

According to the docs, dryscrape can submit forms, so you load your local form, submit it, and it should take you where you need to go.

Something like this:

    import dryscrape

    # NB: 'pass' is a reserved word in Python, hence 'password' here
    payload = {'log': email,
               'pwd': password,
               'finish': 'https://website.com/listed/public/gen1/'}

    # Build a small local HTML form pre-filled with the login parameters
    html = '''
    <form action="https://website.com/log.php" method="POST">
        <input name="log" type="text" value="{log}">
        <input name="pwd" type="password" value="{pwd}">
        <input name="finish" type="text" value="{finish}">
        <input type="submit">
    </form>
    '''.format(**payload)

    with open('./temp.html', 'w') as hf:
        hf.write(html)

    sess = dryscrape.Session(base_url='./')  # maybe 'file://' is needed?
    q = sess.visit('/temp.html')
    q.form().submit()

    # Remainder of your code to follow...
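
To pick up where "Remainder of your code to follow..." leaves off, something along these lines should get the rendered page back into BeautifulSoup (again untested; body() is how dryscrape hands back the current page source after the JavaScript has run):

    from bs4 import BeautifulSoup

    # After the submit above, the session should be sitting on the post-login
    # page; body() returns the current DOM as a string.
    rendered_html = sess.body()
    PROCESSED_HTML = BeautifulSoup(rendered_html, 'html.parser')
    print(PROCESSED_HTML.title)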

Again, I cannot verify this myself, so some minor changes may be required.

Not very elegant, but just a thought ... :-)
