Background

I'm working on a project that needs Selenium to scrape a Google Group, because all of its content is rendered with JavaScript. Writing the scraper isn't that hard, but we want to run it on GitHub Actions.

Selenium needs two things to drive Chrome: the Chrome browser itself and chromedriver. chromedriver is just an executable binary, so I can put it in the repository and call it from the script. The problem is how to install the browser.

This Dockerfile shows how to install Chrome from Google's deb repository. I haven't tried it yet, because docker-machine was failing to run on my machine when I found it.

FROM python:3.8

# install google chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get -y update
RUN apt-get install -y google-chrome-stable

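# install chromedriver
# NOTE: LATEST_RELEASE on this bucket stops at ChromeDriver 114, so the driver
# may not match a newer google-chrome-stable installed above
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip https://chromedriver.storage.googleapis.com/$(curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE)/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/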

# set display port to avoid crash
ENV DISPLAY=:99

# upgrade pip
RUN pip install --upgrade pip

# install selenium
RUN pip install selenium
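
Inside an image like this one, the scraper would start Chrome through the locally installed chromedriver. Here is a minimal sketch of that, which I haven't tested against this image. The function names are made up for illustration, and the flags are the ones commonly used when Chrome runs in a container, not something the Dockerfile itself requires.

```python
def chrome_arguments():
    """Flags commonly passed to Chrome when it runs inside a container."""
    return [
        "--headless",               # no visible window, so no X server is needed
        "--no-sandbox",             # Chrome's sandbox usually fails when running as root
        "--disable-dev-shm-usage",  # avoid crashes caused by a small /dev/shm
    ]


def open_local_chrome():
    """Start Chrome through the chromedriver found on PATH (hypothetical helper)."""
    # Imported here so the module still loads where selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Calling `open_local_chrome()` returns a driver. Call `driver.quit()` when the scrape is done so that the browser process does not linger.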

I also found a docker-compose.yml. This is more convenient than the Dockerfile above: all we need to do is set command_executor to http://127.0.0.1:4444/wd/hub, and we no longer have to install the browser ourselves.

version: '2'
services:
  firefox:
    image: selenium/node-firefox:3.14.0-gallium
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - hub
    environment:
      HUB_HOST: hub

  chrome:
    image: selenium/node-chrome:3.14.0-gallium
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - hub
    environment:
      HUB_HOST: hub

  hub:
    image: selenium/hub:3.14.0-gallium
    ports:
      - '4444:4444'
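
With the hub from this compose file listening on port 4444, pointing the script at it could look roughly like the sketch below. The helper names are hypothetical, and I am assuming that the installed selenium client can talk to the 3.14.0 hub image.

```python
def hub_url(host="127.0.0.1", port=4444):
    """Build the command_executor URL for the hub published by docker-compose."""
    return f"http://{host}:{port}/wd/hub"


def open_remote_chrome(executor=None):
    """Ask the hub for a Chrome session (hypothetical helper)."""
    # Imported here so the module still loads where selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    return webdriver.Remote(
        command_executor=executor or hub_url(),
        options=options,
    )


def page_title(url):
    """Open url in a remote Chrome session and return the page title."""
    driver = open_remote_chrome()
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always release the session so the node is free for the next run.
        driver.quit()
```

For example, `page_title("https://groups.google.com/")` would load the page in the chrome node and return its title. Nothing browser-related has to be installed next to the script itself.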

GitHub Action