Web scraping a JavaScript page with Python

I am trying to build a simple scraper that extracts a page's text without the HTML markup. I have achieved this goal, but I noticed that on some pages where JavaScript is loaded I don't get good results.

For example, if some JavaScript code adds text, I can't see it, because when I call

response = urllib2.urlopen(request) 

I get the original source text without the added content (because the JavaScript is executed on the client).

So, I'm looking for some ideas to solve this problem.

+150
python web-scraping urlopen
Nov 08
12 answers

EDIT 30/Dec/2017: This answer appears in Google search results, so I decided to update it. The old answer is still at the end.

dryscrape is no longer maintained, and the library its developers recommend instead is Python 2 only. I have found that using Selenium's Python library with PhantomJS as a web driver is fast and easy enough to get the job done.

After installing PhantomJS, make sure the phantomjs binary is available on your PATH:

 phantomjs --version # result: 2.1.1 

Example

To give an example, I created a sample page with the following HTML code (link):

    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>Javascript scraping test</title>
    </head>
    <body>
      <p id='intro-text'>No javascript support</p>
      <script>
        document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
      </script>
    </body>
    </html>

Without JavaScript it says: No javascript support, and with JavaScript: Yay! Supports javascript

Scraper without JS support:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get(my_url)
    soup = BeautifulSoup(response.text)
    soup.find(id="intro-text")
    # Result: <p id="intro-text">No javascript support</p>

JS Scraper:

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get(my_url)
    p_element = driver.find_element_by_id(id_='intro-text')
    print(p_element.text)
    # Result: 'Yay! Supports javascript'



Old answer: you can also use the dryscrape Python library to scrape JavaScript-driven websites.

JS Scraper:

    import dryscrape
    from bs4 import BeautifulSoup

    session = dryscrape.Session()
    session.visit(my_url)
    response = session.body()
    soup = BeautifulSoup(response)
    soup.find(id="intro-text")
    # Result: <p id="intro-text">Yay! Supports javascript</p>
+190
Oct 18 '14 at 2:03

We are not getting the correct results because any JavaScript-generated content needs to be rendered in the DOM. When we fetch an HTML page, we fetch the initial DOM, unmodified by JavaScript.

Therefore, we need to render the JavaScript content before crawling the page.

Since Selenium has already been mentioned many times in this thread (and its occasional slowness has been noted as well), I will list two other possible solutions.




Solution 1: This is a very good tutorial on how to use Scrapy to crawl JavaScript-generated content, and we are going to follow just that.

What we need:

  1. Docker installed on our machine. This is a plus over the other solutions up to this point, as it uses an OS-independent platform.

  2. Install Splash following the instructions listed for our respective OS.
    Quoting from the Splash documentation:

    Splash is a JavaScript rendering service. It is a lightweight web browser with an HTTP API implemented in Python 3 using Twisted and QT5.

    In essence, we will use Splash to render Javascript generated content.

  3. Start the Splash server: sudo docker run -p 8050:8050 scrapinghub/splash

  4. Install the scrapy-splash plugin: pip install scrapy-splash

  5. Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update settings.py:

    Go to your Scrapy project's settings.py and set these middlewares:

     DOWNLOADER_MIDDLEWARES = {
         'scrapy_splash.SplashCookiesMiddleware': 723,
         'scrapy_splash.SplashMiddleware': 725,
         'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
     }

    The URL of the Splash server (if you are using Windows or OS X, this should be the URL of the docker machine; see: How to get a Docker container's IP address from the host?):

     SPLASH_URL = 'http://localhost:8050' 

    And finally, you need to set these values as well:

     DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
     HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
  6. Finally, we can use SplashRequest:

    For a regular spider you have Request objects which you can use to open URLs. If the page you want to open contains JS-generated data, you should use SplashRequest (or SplashFormRequest) to render the page. Here is a simple example:

     import scrapy
     from scrapy_splash import SplashRequest

     class MySpider(scrapy.Spider):
         name = "jsscraper"
         start_urls = ["http://quotes.toscrape.com/js/"]

         def start_requests(self):
             for url in self.start_urls:
                 yield SplashRequest(
                     url=url,
                     callback=self.parse,
                     endpoint='render.html'
                 )

         def parse(self, response):
             for q in response.css("div.quote"):
                 quote = QuoteItem()
                 quote["author"] = q.css(".author::text").extract_first()
                 quote["quote"] = q.css(".text::text").extract_first()
                 yield quote

    SplashRequest renders the URL as HTML and returns the response, which you can use in the callback (parse) method.
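For completeness, here is a minimal sketch of the QuoteItem the spider above assumes, plus the command to run it. The field names simply mirror the ones used in parse; treat this as an assumption and adjust it to your own project (it would typically live in items.py and be imported into the spider).

    # items.py -- hypothetical minimal item matching the fields used in the spider above
    import scrapy

    class QuoteItem(scrapy.Item):
        author = scrapy.Field()
        quote = scrapy.Field()

You can then run the spider and dump the results with something like scrapy crawl jsscraper -o quotes.json.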




Solution 2: Let's call this experimental for the moment (May 2018)...
This solution is for Python 3.6 only (for now).

Do you know the requests module (well, who doesn't)?
Now it has a little web-crawling sibling: requests-HTML:

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

  1. Install requests-html: pipenv install requests-html

  2. Make a request to the page's URL:

     from requests_html import HTMLSession

     session = HTMLSession()
     r = session.get(a_page_url)
  3. Render the response to get the JavaScript-generated bits:

     r.html.render() 

Finally, the module seems to offer scraping capabilities.
Alternatively, we can try the well-documented approach of using BeautifulSoup with the r.html object we just rendered.
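Putting these steps together, a rough sketch of the second approach might look like the following (a_page_url is a placeholder, and the element id is only an illustration):

    from requests_html import HTMLSession
    from bs4 import BeautifulSoup

    session = HTMLSession()
    r = session.get(a_page_url)
    r.html.render()  # executes the page's JavaScript in a headless browser

    # r.html.html holds the rendered HTML source, which BeautifulSoup can parse as usual
    soup = BeautifulSoup(r.html.html, "lxml")
    print(soup.find(id="intro-text"))  # 'intro-text' is just an example id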

+49
May 30 '18 at 19:52

Perhaps selenium can do this.

    from selenium import webdriver
    import time

    driver = webdriver.Firefox()
    driver.get(url)
    time.sleep(5)
    htmlSource = driver.page_source
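If the fixed sleep feels fragile, a common refinement is to wait for a specific element instead; here is a rough sketch (the element id is a placeholder assumption):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get(url)
    # wait up to 10 seconds for the dynamically added element to appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "some-dynamic-element"))  # placeholder id
    )
    htmlSource = driver.page_source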
+42
Apr 14 '16 at 9:31

If you have ever used the Requests module for Python before: I recently found out that its developer has created a new module called Requests-HTML, which now also has the ability to render JavaScript.

You can visit https://html.python-requests.org/ to learn more about this module, or, if you are only interested in rendering JavaScript, you can go to https://html.python-requests.org/?#javascript-support to learn directly how to use the module to render JavaScript with Python.

Essentially, once you have correctly installed the Requests-HTML module, the following example, shown in the link above, demonstrates how you can use it to scrape a website and render the JavaScript it contains:

    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get('http://python-requests.org/')
    r.html.render()
    r.html.search('Python 2 will retire in only {months} months!')['months']
    # '<time>25</time>'  <- this is the result
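After render(), you are not limited to the search() template shown above; elements can also be selected with CSS selectors. A small example (the selector is only illustrative):

    # select an element by CSS selector from the rendered page
    about = r.html.find('#about', first=True)  # '#about' is just an example selector
    if about is not None:
        print(about.text)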

I recently learned about this from a YouTube video that demonstrates how the module works. Click here! to watch it.

+17
Apr 16 '18 at 19:40

This seems to be a good solution too, taken from a wonderful blog post.

    import sys
    from PyQt4.QtGui import *
    from PyQt4.QtCore import *
    from PyQt4.QtWebKit import *
    from lxml import html

    # Take this class for granted. Just use the result of the rendering.
    class Render(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            self.frame = self.mainFrame()
            self.app.quit()

    url = 'http://pycoders.com/archive/'
    r = Render(url)
    result = r.frame.toHtml()
    # This step is important: converting the QString to ASCII for lxml to process.
    # The following returns an lxml element tree.
    archive_links = html.fromstring(str(result.toAscii()))
    print archive_links
    # The following returns an array containing the URLs.
    raw_links = archive_links.xpath('//div[@class="campaign"]/a/@href')
    print raw_links
+15
Apr 10 '16 at 20:12

It sounds like the data you are really looking for can be accessed via a secondary URL called by some JavaScript on the primary page.

While you could try running JavaScript on the server to handle this, a simpler approach might be to load the page with Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can simply query that URL directly for the data you are interested in.
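As a sketch of what that looks like in practice, once the developer tools reveal the XHR endpoint you can request it directly; the URL, headers, and JSON structure below are purely illustrative assumptions:

    import requests

    # hypothetical endpoint discovered in the browser's network tab
    api_url = "https://example.com/api/items?page=1"
    response = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
    data = response.json()

    # iterate over whatever structure the endpoint actually returns
    for item in data.get("results", []):
        print(item)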

+12
Nov 08 '11 at 11:23

Selenium is best suited for scraping JS and Ajax content.

Check out this article on extracting data from the web with Python.

 $ pip install selenium 

Then download the Chrome web driver.

    from selenium import webdriver

    browser = webdriver.Chrome()
    browser.get("https://www.python.org/")
    nav = browser.find_element_by_id("mainnav")
    print(nav.text)

Easy, right?
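If you don't want a browser window popping up, a headless variant is possible; here is a rough sketch, assuming chromedriver is on your PATH:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without opening a window
    # on older Selenium versions the keyword may be chrome_options instead of options
    browser = webdriver.Chrome(options=options)
    browser.get("https://www.python.org/")
    print(browser.find_element_by_id("mainnav").text)
    browser.quit()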

+10
Jan 18 '18 at 18:45

You can also execute javascript using webdriver.

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get(url)
    driver.execute_script('document.title')

or store the value in a variable:

    result = driver.execute_script('var text = document.title ; return text')
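execute_script can also return structured values, which Selenium converts to Python types; a small illustrative example:

    # collect the href of every link on the rendered page (returned as a Python list)
    links = driver.execute_script(
        "return Array.from(document.querySelectorAll('a')).map(a => a.href);"
    )
    print(links[:5])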
+7
Mar 28 '17 at 16:45

I personally prefer using Scrapy and Selenium, dockerized in separate containers. This way you can install both with minimal hassle and crawl modern websites, which almost all contain JavaScript in one form or another. Here is an example:

Use scrapy startproject to create your scraper and write your spider; the skeleton can be as simple as this:

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'my_spider'
        start_urls = ['https://somewhere.com']

        def start_requests(self):
            yield scrapy.Request(url=self.start_urls[0])

        def parse(self, response):
            # do stuff with results, scrape items etc.
            # for now we're just checking everything worked
            print(response.body)

The real magic happens in middlewares.py. Overwrite two methods of the downloader middleware, __init__ and process_request, as follows:

    # import some additional modules that we need
    import os
    from copy import deepcopy
    from time import sleep

    from scrapy import signals
    from scrapy.http import HtmlResponse
    from selenium import webdriver


    class SampleProjectDownloaderMiddleware(object):

        def __init__(self):
            SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
            SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
            chrome_options = webdriver.ChromeOptions()
            # chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
            self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
                                           desired_capabilities=chrome_options.to_capabilities())

        def process_request(self, request, spider):
            self.driver.get(request.url)

            # sleep a bit so the page has time to load,
            # or monitor items on the page to continue as soon as the page is ready
            sleep(4)

            # if you need to manipulate the page content, like clicking and scrolling, do it here
            # self.driver.find_element_by_css_selector('.my-class').click()

            # you only need the now properly and completely rendered html from your page to get results
            body = deepcopy(self.driver.page_source)

            # copy the current url in case of redirects
            url = deepcopy(self.driver.current_url)

            return HtmlResponse(url, body=body, encoding='utf-8', request=request)

Remember to enable this middleware by uncommenting the following lines in the settings.py file:

    DOWNLOADER_MIDDLEWARES = {
        'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
    }

Next up, Docker. Create your Dockerfile from a lightweight image (I use python-alpine here), copy your project directory into it, and install the requirements:

    # Use an official Python runtime as a parent image
    FROM python:3.6-alpine

    # install some packages necessary for scrapy, and then curl because it's handy for debugging
    RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev

    WORKDIR /my_scraper

    ADD requirements.txt /my_scraper/

    RUN pip install -r requirements.txt

    ADD . /scrapers
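The requirements.txt referenced above is not shown in the answer; a plausible minimal version (an assumption, adjust to your own project) would contain at least:

    scrapy
    selenium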

And finally, combine all this in docker-compose.yaml :

    version: '2'

    services:
      selenium:
        image: selenium/standalone-chrome
        ports:
          - "4444:4444"
        shm_size: 1G

      my_scraper:
        build: .
        depends_on:
          - "selenium"
        environment:
          - SELENIUM_LOCATION=samplecrawler_selenium_1
        volumes:
          - .:/my_scraper
        # use this command to keep the container running
        command: tail -f /dev/null

Run docker-compose up -d. The first time you do this it will take a while to pull the latest selenium/standalone-chrome image and build your scraper image.

Once it's done, check that your containers are running with docker ps, and also check that the name of the Selenium container matches the environment variable we passed to our scraper container (here it was SELENIUM_LOCATION=samplecrawler_selenium_1).

Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh (for me the command was docker exec -ti samplecrawler_my_scraper_1 sh), cd into the right directory, and run your scraper with scrapy crawl my_spider.

All of this is on my GitHub page and you can get it from here.

+6
May 30 '18 at 19:21

You will want to use urllib, requests, BeautifulSoup, and the Selenium webdriver in your script for different parts of the page (to name a few).
Sometimes you'll get what you need with just one of these modules.
Sometimes you'll need two, three, or all of them.
Sometimes you'll need to switch off the JS in your browser.
Sometimes you'll need header information in your script.
No website can be scraped the same way, and no website can be scraped the same way forever without your crawler needing changes, usually after a few months. But they can all be scraped! Where there's a will, there's a way for sure.
If you need the scraped data into the future, just scrape everything you need and store it with pickle in .dat files.
Just keep searching for how to try what with these modules, and copy and paste your errors into Google.
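As a minimal sketch of that "save it for later" suggestion using pickle (scraped_items here stands for whatever structure your scraper produced):

    import pickle

    # write the scraped data to disk
    with open("scraped_data.dat", "wb") as f:
        pickle.dump(scraped_items, f)

    # read it back later
    with open("scraped_data.dat", "rb") as f:
        scraped_items = pickle.load(f)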

+5
Mar 28 '17 at 16:59

The combination of BeautifulSoup and Selenium works great for me.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from bs4 import BeautifulSoup as bs

    driver = webdriver.Firefox()
    driver.get("http://somedomain/url_that_delays_loading")
    try:
        # waits up to 10 seconds until the element is located; other wait conditions
        # such as visibility_of_element_located or text_to_be_present_in_element can be used
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "myDynamicElement")))
        html = driver.page_source
        soup = bs(html, "lxml")
        dynamic_text = soup.find_all("p", {"class": "class_name"})  # or other attributes, optional
    except TimeoutException:
        print("Couldn't locate element")

P.S. You can find more wait conditions here.

+5
May 29 '18 at 22:29

Using PyQt5

    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    import sys
    import bs4 as bs
    import urllib.request


    class Client(QWebEnginePage):
        def __init__(self, url):
            global app
            self.app = QApplication(sys.argv)
            QWebEnginePage.__init__(self)
            self.html = ""
            self.loadFinished.connect(self.on_load_finished)
            self.load(QUrl(url))
            self.app.exec_()

        def on_load_finished(self):
            self.html = self.toHtml(self.Callable)
            print("Load Finished")

        def Callable(self, data):
            self.html = data
            self.app.quit()

    # url = ""
    # client_response = Client(url)
    # print(client_response.html)
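Since bs4 is already imported in the snippet, a usage sketch would be to feed the rendered HTML into BeautifulSoup; the URL and element id below are placeholders:

    url = "https://example.com/js-page"          # placeholder URL
    client_response = Client(url)
    soup = bs.BeautifulSoup(client_response.html, "html.parser")
    print(soup.find(id="intro-text"))            # placeholder id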
+2
Jul 14 '18 at 16:44


