I personally prefer using Scrapy and Selenium, dockerized in separate containers. That way you can install both with minimal hassle, and you can also scrape modern websites, which almost all contain JavaScript in one form or another. Here is an example:
Use scrapy startproject to create your scraper and write a spider; the skeleton can be as simple as this:
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://somewhere.com']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0])

    def parse(self, response):
        # your parsing logic goes here
        pass
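If you are unsure what to put into parse, here is a minimal sketch of what it might yield; the CSS selectors and item fields are placeholders I made up for illustration, not part of the original project:

    def parse(self, response):
        # example only: grab the page title and every link target on the page
        yield {
            'title': response.css('title::text').get(),
            'links': response.css('a::attr(href)').getall(),
        }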
The real magic happens in middlewares.py. Override two methods of the downloader middleware, __init__ and process_request, as follows:
# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SampleProjectDownloaderMiddleware(object):

    def __init__(self):
        SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
        SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
        chrome_options = webdriver.ChromeOptions()
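The snippet above is cut off after the ChromeOptions line. A minimal sketch of how the rest of the middleware could look, assuming a Selenium 4 style Remote driver (on Selenium 3 you would pass desired_capabilities=chrome_options.to_capabilities() instead) and a fixed sleep; this is my guess at the missing part, not the original code:

        # continue __init__: connect to the remote Selenium container
        self.driver = webdriver.Remote(
            command_executor=SELENIUM_URL,
            options=chrome_options,
        )

    def process_request(self, request, spider):
        # let Selenium fetch and render the page, then hand the HTML back to Scrapy
        self.driver.get(request.url)
        sleep(2)  # crude wait for JavaScript to finish rendering
        body = deepcopy(self.driver.page_source)
        return HtmlResponse(self.driver.current_url, body=body,
                            encoding='utf-8', request=request)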
Remember to enable this middleware by uncommenting the following lines in the settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
}
Now for the Docker part. Create your Dockerfile from a lightweight base image (I use python:alpine here), copy your project directory into it, and install the requirements:
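The Dockerfile itself is not shown here; a minimal sketch under those assumptions (the base image tag, paths, requirements.txt name and the keep-alive CMD are all mine, not the original) could be:

FROM python:3.9-alpine

# copy the project in and install its dependencies
WORKDIR /my_scraper
COPY . /my_scraper
RUN pip install --no-cache-dir -r requirements.txt

# keep the container alive so you can docker exec into it and run the spider
CMD ["tail", "-f", "/dev/null"]

Note that Scrapy pulls in lxml and Twisted, so on Alpine you may need extra build packages (gcc, musl-dev, libxml2-dev, libxslt-dev) before the pip install succeeds.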
And finally, combine all of this in docker-compose.yaml:
version: '2'

services:
  selenium:
    image: selenium/standalone-chrome
    ports:
      - "4444:4444"
    shm_size: 1G

  my_scraper:
    build: .
    depends_on:
      - "selenium"
    environment:
      - SELENIUM_LOCATION=samplecrawler_selenium_1
    volumes:
      - .:/my_scraper
Run docker-compose up -d. If this is your first time doing this, it will take a while to pull the latest selenium/standalone-chrome image and build your scraper image.
After that, check that your containers are running with docker ps, and also check that the name of the Selenium container matches the value of the environment variable we passed to the scraper container (here it was SELENIUM_LOCATION=samplecrawler_selenium_1).
Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh (for me the command was docker exec -ti samplecrawler_my_scraper_1 sh), cd into the right directory, and run your scraper with scrapy crawl my_spider.
All of this is on my GitHub page and you can get it from there.