Scrape the HTML from a webpage after its JSON / JavaScript has rendered, without Selenium, supporting POSTing

I am using the code below, which works fine, but it does not let the JavaScript run first, which means I do not get the HTML I actually want from the page.

I looked at dryscrape, but as far as I can tell it does not support POSTing (which I need, as you can see in the auto_login() function below), and the same goes for PyQt4.

The website has 4 bits of JSON, "listings", which build the page when it loads / renders. If I look at the page source it does not look pretty and I cannot easily find the material in it; however, if I "Inspect Element" on the page, the HTML looks perfect, and I can then parse it easily with BeautifulSoup.
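
Since the page is built from those JSON blobs, I suppose I could also try requesting them directly with the same logged-in session and skip rendering altogether. A rough, untested sketch (the endpoint URL here is made up; the real one would have to be read from the browser's Network tab, and SESSION / HEADERS are the ones from the script below):

    # Hypothetical: substitute the actual XHR URL the page requests for its "listings"
    LISTINGS_URL = "https://website.com/listed/public/gen1/listings.json"

    resp = SESSION.get(LISTINGS_URL, headers=HEADERS, verify=False)
    listings = resp.json()  # the JSON pieces that build the page
    for item in listings:
        print(item)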

I know that I can use Selenium, but that is not what I want to do, mainly because it launches a browser; I could use PhantomJS or PyVirtualDisplay to hide it, but that would only be a last resort.

    import requests
    from bs4 import BeautifulSoup

    HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
    }

    SESSION = requests.session()
    RESPONSE = None  # filled in by auto_login()

    email = "mail@mail.com"
    password = "password123"  # 'pass' is a reserved word in Python, so renamed


    def auto_login():
        """POST the login form and keep the response for parsing later."""
        global RESPONSE
        url = "https://website.com/log.php"
        payload = {
            "log": email,
            "pwd": password,
            "finish": "https://website.com/listed/public/gen1/",
        }
        RESPONSE = SESSION.post(url, data=payload, headers=HEADERS, verify=False)


    def process_html():
        """Parse the stored response body with BeautifulSoup."""
        return BeautifulSoup(RESPONSE.content, 'html.parser')


    def main():
        auto_login()
        PROCESSED_HTML = process_html()


    if __name__ == "__main__":
        main()
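
To illustrate: the login POST itself goes through, it is only the JavaScript-built parts that are missing from the response. A quick check (nothing more than a debugging sketch):

    auto_login()
    # Status code / final URL confirm the POST worked; RESPONSE.content is
    # still the unrendered source described above.
    print(RESPONSE.status_code, RESPONSE.url)
    print(len(RESPONSE.content))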

How can I render a JavaScript web page from my (modified) script, preferably with PyQt4 (dryscrape will not install correctly on my Windows 10 / Python 2.7 setup), without using Selenium?
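
For reference, the kind of PyQt4 / QtWebKit rendering I mean would look roughly like this (untested on my machine; it only does a plain GET of the finish URL, so the logged-in cookies from requests would still have to be carried over into Qt's network stack somehow):

    import sys
    from PyQt4.QtGui import QApplication
    from PyQt4.QtCore import QUrl
    from PyQt4.QtWebKit import QWebPage
    from bs4 import BeautifulSoup

    class Render(QWebPage):
        """Load a URL, let QtWebKit run its JavaScript, then keep the final HTML."""
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._load_finished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def _load_finished(self, result):
            # toHtml() returns the DOM *after* the page's JavaScript has run
            self.html = self.mainFrame().toHtml()
            self.app.quit()

    page = Render("https://website.com/listed/public/gen1/")
    # unicode() because toHtml() gives back a QString (Python 2.7)
    PROCESSED_HTML = BeautifulSoup(unicode(page.html), 'html.parser')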

Any ideas would be appreciated.

1 answer

I could not get dryscrape to work for me, so I cannot verify this; however, I think this idea / hack might work - minor changes may be required.

The idea is that you create a small local HTML file containing the form and the parameters you need to pass. The HTML file is generated from Python, so you can fill in the value of each parameter.

According to the docs, dryscrape can submit forms, so you load your local form, submit it, and it should take you where you need to go.

Something like this:

    import dryscrape

    # NB: 'pass' is a reserved word in Python, hence 'password' here
    payload = {'log': email,
               'pwd': password,
               'finish': 'https://website.com/listed/public/gen1/'}

    # Build a small local HTML form pre-filled with the login parameters
    html = '''
    <form action="https://website.com/log.php" method="POST">
        <input name="log" type="text" value="{log}">
        <input name="pwd" type="password" value="{pwd}">
        <input name="finish" type="text" value="{finish}">
        <input type="submit">
    </form>
    '''.format(**payload)

    with open('./temp.html', 'w') as hf:
        hf.write(html)

    sess = dryscrape.Session(base_url='./')  # maybe 'file://' is needed?
    q = sess.visit('/temp.html')
    q.form().submit()

    # Remainder of your code to follow...
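
To pick up where "Remainder of your code to follow..." leaves off, something along these lines should get the rendered page back into BeautifulSoup (again untested; body() is how dryscrape hands back the current page source after the JavaScript has run):

    from bs4 import BeautifulSoup

    # After the submit above, the session should be sitting on the post-login
    # page; body() returns the current DOM as a string.
    rendered_html = sess.body()
    PROCESSED_HTML = BeautifulSoup(rendered_html, 'html.parser')
    print(PROCESSED_HTML.title)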

Again, I cannot verify this myself, so some minor changes may be required.

Not very elegant, but just a thought ... :-)
