Headless Chrome with Python stalls when trying to download a file

I use Python, Jupyter, Selenium WebDriver, and headless Chrome (Canary) on a Mac.

I wrote a script that crawls a very old website. To download a file from this website, I need to click through a few buttons that eventually lead to a button which, once clicked, downloads a CSV file.

The problem is that when headless Chrome tries to download the target file, it stalls and does nothing (i.e. the file is never downloaded), although the script itself terminates (and yes, I close the driver at the end of the script).

I tried:

  • Downloading other files (from different websites): headless Chrome seems to download them without any problems (I enabled downloads explicitly, since headless Chrome does not allow file downloads by default).
  • Taking screenshots of the pages to make sure it navigates correctly to the download page (and yes, the navigation is correct).
  • Modifying the user agent (it does use the user agent I expect).
  • Running the same code without the headless option: regular Chrome downloads the file successfully.
  • Changing the plugins and languages reported by the browser with driver.execute_script(js_that_changes_plugins_and_langs), but I'm not sure how to check whether it really takes effect (and it still doesn't work).
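
One way to check whether such a script actually took effect is to read the values back from the page itself. The sketch below assumes `driver` is the Selenium driver from the question; the helper names are made up for illustration. It relies only on `execute_script` handing back the value of a JavaScript `return` statement. Overrides applied through `execute_script` generally live in the current page only, so the check is probably worth repeating after each navigation.

```python
def read_navigator_fingerprint(driver):
    """Return what the page itself sees for the properties that were patched.

    `driver` is assumed to be a Selenium WebDriver (anything exposing
    `execute_script` works). The helper name is hypothetical.
    """
    script = (
        "return {"
        "  userAgent: navigator.userAgent,"
        "  languages: Array.from(navigator.languages || []),"
        "  pluginCount: navigator.plugins.length"
        "};"
    )
    return driver.execute_script(script)


def print_fingerprint(driver):
    """Print the values so they can be compared with a regular Chrome run."""
    for key, value in sorted(read_navigator_fingerprint(driver).items()):
        print("%s: %r" % (key, value))
```

Running `print_fingerprint(driver)` once in headless mode and once in regular Chrome shows which values still differ between the two.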

Problems I am facing:

  • I can't find a way to obtain just the final download URL, because it seems to contain unique identifiers generated along the way (they are issued when you open the home page and as you navigate between pages on the site), so it changes with every session.
  • The navigation URLs seem to come from an iframe inside the main page (and the same goes for the subsequent URLs), and I'm not sure how to inspect the JavaScript that creates it.
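
For the two problems above, one possible workaround is to let Selenium do the navigation and then fetch the file outside the browser. The sketch below is an outline under several assumptions, not a tested fix for this site: it supposes the download button is a link with an `href`, that it lives in the first `<iframe>`, and that the server accepts the browser's session cookies. The ID `the-button-name` is the placeholder from the question, and the helper names are hypothetical.

```python
import urllib.request


def cookie_header(selenium_cookies):
    """Turn the list of dicts from driver.get_cookies() into a Cookie header."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in selenium_cookies)


def find_download_url(driver, button_id="the-button-name"):
    """Look for the button inside the first iframe and return its href.

    Assumes the button is an <a> element; returns None if it has no href.
    """
    frames = driver.find_elements("tag name", "iframe")
    if frames:
        driver.switch_to.frame(frames[0])
    try:
        button = driver.find_element("id", button_id)
        return button.get_attribute("href")
    finally:
        # Always return to the top-level document.
        driver.switch_to.default_content()


def download_with_session(driver, url, target_path):
    """Fetch `url` with the browser's cookies and user agent, outside Chrome."""
    request = urllib.request.Request(url, headers={
        "Cookie": cookie_header(driver.get_cookies()),
        "User-Agent": driver.execute_script("return navigator.userAgent;"),
    })
    with urllib.request.urlopen(request) as response, open(target_path, "wb") as out:
        out.write(response.read())
```

If the button triggers the download through JavaScript rather than a plain link, `find_download_url` returns `None`, and the request would have to be identified in the browser's network log instead.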

I have no problem sharing the website address, but:

  • You need to go through ~6 clicks on different pages just to reach the last page, the one with the download button. These clicks are not intuitive, and it would take a lot of effort to explain how to get where I want.
  • The site is not in English, which makes it even harder to explain how to navigate it.

I need it to be headless rather than regular Chrome, since the machine on which we want to run the code is very weak and cannot run Chrome's graphical interface.

So my question is: does anyone know what the problem is? Or at least, how can I debug it?

This is more or less the code I am using:

 from selenium import webdriver
 from selenium.webdriver.chrome.options import Options


 def enable_download_in_headless_chrome(driver, download_dir):
     """
     there is currently a "feature" in chrome where headless does not allow file download:
     https://bugs.chromium.org/p/chromium/issues/detail?id=696481
     This method is a hacky work-around until the official chromedriver support for this.
     Requires chrome version 62.0.3196.0 or above.
     """
     # add missing support for chrome "send_command" to selenium webdriver
     driver.command_executor._commands["send_command"] = (
         "POST", '/session/' + driver.session_id + '/chromium/send_command')

     params = {'cmd': 'Page.setDownloadBehavior',
               'params': {'behavior': 'allow', 'downloadPath': download_dir}}
     command_result = driver.execute("send_command", params)
     print("response from browser:")
     for key in command_result:
         print("result:" + key + ":" + str(command_result[key]))


 chrome_options = webdriver.ChromeOptions()
 chrome_options.add_argument('headless')
 chrome_options.add_argument('no-sandbox')
 chrome_options.add_argument('disable-gpu')
 chrome_options.add_argument('remote-debugging-port=9222')
 chrome_options.add_argument('disable-popup-blocking')
 chrome_options.add_argument('enable-logging')

 download_dir = # some path here

 driver = webdriver.Chrome(chrome_options=chrome_options)
 enable_download_in_headless_chrome(driver, download_dir)

 ok_button = driver.find_element_by_id('the-button-name')
 ok_button.click()
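
One thing that may be worth ruling out with code like the above is the driver being closed before Chrome has finished writing the file. The following sketch polls `download_dir` until a file appears that is no longer a partial download. It assumes Chrome's usual `.crdownload` suffix for in-progress files and a directory that starts out empty; the helper name `wait_for_download` and the timeout values are illustrative choices, not part of the original script.

```python
import os
import time


def wait_for_download(download_dir, timeout=60.0, poll_interval=0.5):
    """Block until a finished file shows up in download_dir, or time out.

    Returns the sorted list of finished file names, or an empty list on
    timeout. Files ending in .crdownload are treated as still in progress
    (assumption: Chrome's default naming for partial downloads).
    """
    deadline = time.time() + timeout
    while True:
        names = os.listdir(download_dir)
        partial = [n for n in names if n.endswith(".crdownload")]
        finished = sorted(n for n in names if not n.endswith(".crdownload"))
        if finished and not partial:
            return finished
        if time.time() >= deadline:
            return []
        time.sleep(poll_interval)
```

It would be called after `ok_button.click()` and before the driver is closed, for example `wait_for_download(download_dir, timeout=120)`.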

Thanks for the help.

+8
python selenium-webdriver google-chrome-headless
3 answers

Since you have not given the URL you are downloading from, this is guesswork. Most likely, a mechanism similar to reCAPTCHA is used to prevent scraping. So make sure you are not hitting such a reCAPTCHA wall, and if you are, implement code that notifies you so that you can complete the manual step and gain access.

For JS, this solution was suggested by zavodnyuk here:

Try setting a custom User-Agent to a compatible one (for example, the one from your real browser). Capabilities: {'browserName': 'chrome', chromeOptions: {args: ["user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36", "--headless", "--disable-gpu"]}} worked for Selenium/Protractor in JS.
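
In Python the same idea might look roughly like the sketch below, which reuses the `ChromeOptions` approach from the question. The helper names are hypothetical, and the user-agent string is only the example from the quote; it should be replaced with the one your regular browser reports.

```python
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
)


def headless_arguments(user_agent=EXAMPLE_USER_AGENT):
    """Build the Chrome command-line switches used in the quoted JS config."""
    return ["--user-agent=" + user_agent, "--headless", "--disable-gpu"]


def make_driver(user_agent=EXAMPLE_USER_AGENT):
    """Create a headless Chrome driver with the custom user agent.

    Selenium is imported here so the argument helper above can be used
    and inspected without a browser installed.
    """
    from selenium import webdriver

    chrome_options = webdriver.ChromeOptions()
    for argument in headless_arguments(user_agent):
        chrome_options.add_argument(argument)
    # Older Selenium releases, as in the question, use chrome_options= here.
    return webdriver.Chrome(options=chrome_options)
```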

I hope this points you in the right direction, since there is little information about this for Python online.

EDIT based on comment1:

For basic debugging, I rely on print statements at the start of the candidate functions. Where I say print statement, it could equally be a line written to a log file. I avoid relying on fancy third-party packages, because most of the time I want to learn from the code; that takes a lot of time, but it is well worth it. For example, this is roughly how I debug:

 def header_inspect(self, ID, action, data):
     print('header_inspect, ID : %s\n, action : %s\nprocess-data : %s' % (ID, action, data))
0

I think there are too many moving parts. If you really need Selenium and everything else, that's fine. However, I would start with something as simple as possible.

In Python 2.7, I used mechanize; with it I was able to simulate all communication with the server. It is not the best option today, since Python 3.x is the way to go. I will describe how I approached such problems to give you a better picture, and then try to describe the possible tools.

A typical case was logging in, switching pages, flipping some switches and starting a download, or loading the content and processing it with Beautiful Soup. First you need to know what information is exchanged. Open the developer tools in your web browser and select the Network tab. You may know this already, but this step is a must, and I want this to be a general answer. Then do your normal work: log in and do the other things. Everything the server cares about has to be transmitted, so you can see it as network requests. mechanize was good because I could prepare a dict and send it as a POST request to the page. A typical mistake when writing the POST is posting to the address of the page you are on. If you visited index.html, you post to that page, while the server expects the data to be sent to add_user_data.html and redirects you afterwards. Things like a session identifier can be maintained through a header or a cookie; just look at the Network tab for the pattern.
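
To make the posting mistake concrete, here is a minimal sketch using only the Python 3 standard library. The host `example.invalid`, the field names, and the helper names are placeholders; `index.html` and `add_user_data.html` are the page names from the paragraph above. The request is only built, not sent, so the target URL and body can be inspected first.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_form_post(page_url, form_action, fields):
    """Build a POST aimed at the form's action, not at the page showing the form."""
    target = urllib.parse.urljoin(page_url, form_action)
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(target, data=body, method="POST")


def make_session_opener():
    """Return an opener that keeps cookies (e.g. a session ID) between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


request = build_form_post(
    "http://example.invalid/index.html",   # the page you visited
    "add_user_data.html",                  # where the form actually submits
    {"name": "alice", "role": "admin"},    # placeholder form fields
)
print(request.get_method(), request.full_url)
# prints: POST http://example.invalid/add_user_data.html
```

Sending every request through the same opener (`opener.open(request)`) lets the cookie jar carry the session identifier from the login response to the later download request.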

As I already wrote, Python 2.7 will be discontinued. At the time of writing, mechanize is not available for Python 3.x, so you need to use other tools. You can look for alternatives to mechanize and see what suits you. The typical answer is scrapy. It is a slightly different tool, used more for scraping web pages, so if you are planning something bigger, it may be the best option. If you need a one-off script, I would start with httpie. It is a command-line tool and Python package with good OS X support, form submission, and session management. I use it every day, although my server is stateless.

I would be happy to provide concrete examples, but that is not possible without information about the server. Can you dump your test session? Anonymize it, and I will provide a sample, or perhaps suggest another tool.

0

Without any specific information, the only advice we can give you is about how to find out what is happening.

How about proceeding step by step manually in headed (non-headless) mode for debugging? My bet is that your problem comes from automating the task, not from running headless.

Run your script with all the imports and function definitions (e.g. enable_download_in_headless_chrome), but without calling any of them yet. That is, run everything up to download_dir = # some path here, and then type the following in the Python shell:

 >>> driver = webdriver.Chrome(chrome_options=chrome_options) 

Now interact with the browser manually: open Chrome DevTools and go to the Console. Check whether any errors are displayed. Then continue and enter the rest of the commands:

 >>> enable_download_in_headless_chrome(driver, download_dir)
 >>> ...
 >>> ok_button.click()

What does it say?
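
Once the run is headless again there is no DevTools window to look at, but the console can still be read from the script. The sketch below assumes Chrome was started with browser logging enabled through the `loggingPrefs` capability (newer chromedriver releases expect the key `goog:loggingPrefs`); `driver.get_log('browser')` then returns a list of dicts with `level` and `message` keys. The helper name is hypothetical.

```python
def severe_console_messages(driver):
    """Return the messages of console entries Chrome logged as SEVERE.

    Assumes `driver` is a Selenium Chrome driver created with
    loggingPrefs = {'browser': 'ALL'} among its capabilities.
    """
    entries = driver.get_log("browser")
    return [e["message"] for e in entries if e.get("level") == "SEVERE"]
```

Calling it right after `ok_button.click()` shows whether the page raised JavaScript errors at the moment the download should have started.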

0
