changedetection.io – be aware of potentially heavy traffic with selenium webdriver

changedetection.io with selenium webdriver is a very powerful tool for monitoring website changes. I just wanted to share a warning for others who may not be aware of the mighty traffic this great tool can generate with even a handful of urls.

Why use changedetection.io with selenium webdriver

A couple of days ago I had a use case where I wanted to generate some realistic traffic to a website I work on, to trigger a certain scenario in code. Nothing nefarious – curl just wouldn’t work for this test, as a real browser was needed (to execute JavaScript as well). And I couldn’t just open a browser and refresh the page 10 times, as the site stats data was being refreshed every 5 minutes (by design). That is why I needed a constant stream of realistic traffic (with JavaScript execution).

I initially wanted to use a selenium cluster/grid, but I didn’t have time to set it up properly. However, I already had a working docker-compose.yml for changedetection.io.

For those of you who don’t know – changedetection.io is a great little tool that tracks changes on websites automatically and sends notifications on change (it’s great for tracking Amazon urls, for example, if you are watching some items for price drops). And it supports selenium webdriver natively! So I figured it would do just fine as a replacement for selenium grid, as I only needed to load 10-12 urls into it (and indeed, it worked well and as advertised for this, except for the traffic fiasco).

I would typically roll it out on some remote server/vps with near unlimited traffic, but to save time (also, chrome webdriver hardware requirements are no joke, especially with concurrency) I decided to spin it up on a local 8c/16t Ryzen home server – and that was my mistake.

If you live in the US, internet providers typically cap residential broadband. Some refer to this as predatory data caps. For example, my home residential internet is capped at 1 TB of traffic a month, and most US providers have similar limits. Most consumers simply have no alternative internet providers to run to, which is what makes these caps possible. And I guess I just didn’t realize how much traffic my little test would generate lol.

My setup – 12 urls checked every 1 to 4 minutes

I set up changedetection.io to check only 12 urls, but to check them fairly often: every 1 to 4 minutes for each url.

That number and frequency were needed to trigger a certain scenario in the website I was working on at the moment.
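To put a rough number on that schedule, here is a minimal Python sketch. The function name is mine (purely illustrative), and it assumes every check is one full browser page load at a fixed interval, which is a simplification of how the checks are actually scheduled.

```python
MINUTES_PER_DAY = 24 * 60


def page_loads_per_day(url_count: int, interval_minutes: float) -> float:
    """Full page loads per day if every URL is rechecked at a fixed interval."""
    return url_count * MINUTES_PER_DAY / interval_minutes


slowest = page_loads_per_day(12, 4)  # every URL checked every 4 minutes
fastest = page_loads_per_day(12, 1)  # every URL checked every minute
print(f"{slowest:.0f} to {fastest:.0f} page loads per day")
# prints "4320 to 17280 page loads per day"
```

So even "only 12 urls" means thousands of full browser page loads a day, each one pulling in scripts, images and ads.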

Because the website I was testing is so heavily laden with ads (which means webdriver could hit multiple timeouts and hang for a while), it took some modifications to make sure that one webdriver could handle several requests in parallel.

I had to increase the max number of concurrent sessions to 16:

  #Increasing session concurrency per container
  - SE_NODE_MAX_SESSIONS=16
  - SE_NODE_OVERRIDE_MAX_SESSIONS=true

And also reduce the session request timeout and the retry interval – otherwise webdriver would be hanging due to heavy ads:

  # improve timeouts - otherwise ads gonna be loading forever:
  - SE_SESSION_REQUEST_TIMEOUT=15
  - SE_SESSION_RETRY_INTERVAL=2

Here’s the full changedetection.io docker-compose.yml setup I’ve used

Here’s the docker-compose.yml:

version: "3.7"

networks:
  default:
    external: true
    name: infranet

services:
  haproxy:
    image: haproxy:lts-alpine
    container_name: ${PROJECT_NAME}-haproxy
    logging:
      driver: "json-file"
      options:
          max-file: "5"
          max-size: "10m"
    restart: always
    depends_on:
      - changedetection
    ports:
      - ${HOST1}:${CNT1}
    volumes:
      - ${CERTS_DIR}:/certs
      - ${FS_DIR}/haproxy/config:/usr/local/etc/haproxy:ro

  changedetection:
    image: ghcr.io/dgtlmoon/changedetection.io
    container_name: ${PROJECT_NAME}-app
    logging:
      driver: "json-file"
      options:
          max-file: "5"
          max-size: "10m"
    restart: always
    environment:
      PUID: ${PUID}
      PGID: ${PGID}
      WEBDRIVER_URL: http://${PROJECT_NAME}-seleniumchrome:4444/wd/hub
    volumes:
      - ${FS_DIR}/datastore:/datastore

  browser-chrome:
    shm_size: 8gb
    container_name: ${PROJECT_NAME}-seleniumchrome
    hostname: ${PROJECT_NAME}-seleniumchrome
    image: selenium/standalone-chrome:latest
    environment:
      - VNC_NO_PASSWORD=1
      - SCREEN_WIDTH=1920
      - SCREEN_HEIGHT=1080
      - SCREEN_DEPTH=24
      #Increasing session concurrency per container
      - SE_NODE_MAX_SESSIONS=${SE_NODE_MAX_SESSIONS}
      - SE_NODE_OVERRIDE_MAX_SESSIONS=true
      # add timeouts - otherwise ads gonna be loading forever:
      - SE_SESSION_REQUEST_TIMEOUT=15
      - SE_SESSION_RETRY_INTERVAL=2
    volumes:
      # Workaround to avoid the browser crashing inside a docker container
      # See https://github.com/SeleniumHQ/docker-selenium#quick-start
      - /dev/shm:/dev/shm
    restart: always

And here’s the .env file I was using along with it:

#general settings
COMPOSE_PROJECT_NAME=changedetection
PROJECT_NAME=changedetection
TZ=America/New_York
FS_DIR=/mnt/1tb_drive/dockervolumes/changedetection
PGID=1000
PUID=1000
CURRENT_UID=1000:1000
ENV_NAME=hs-edge1

#PORTS
HOST1=31740
CNT1=80

#SELENIUM
SE_NODE_MAX_SESSIONS=16

#HAPROXY
CERTS_DIR=/mnt/1tb_drive/dockervolumes/dnsrobocert/letsencrypt/live/example.com

And haproxy config for the most curious – if you want to set up changedetection.io yourself:

global
    # Settings under global define process-wide security and performance tunings that affect HAProxy at a low level.

    # Max number of connections haproxy will accept
    maxconn 1024

    # Logging to stdout is preferred when running as a container.
    log stdout format raw local0

    # Only TLS version 1.2 and newer is allowed:
    ssl-default-bind-options ssl-min-ver TLSv1.2


defaults
    # Defaults here
    # As your configuration grows, using a defaults section will help reduce duplication. 
    # Its settings apply to all of the frontend and backend sections that come after it. 
    # You’re still free to override those settings within the sections that follow.
    
    # this updates different proxies (frontend, backend, and listen sections) to send messages 
    # to the logging mechanism/server(s) configured in the global section
    log global

    # Enable the more verbose HTTP log format, which includes more details
    option	httplog

    # Enable HTTP connection closing on the server side but support keep-alive with clients
    # (This provides the lowest latency on the client side (slow network) and the fastest session reuse on the server side)
    option  http-server-close
    # option 	httpclose
    # Don't use httpclose and http-server-close, httpclose will disable keepalive on the client side


    # Expect HTTP layer 7, rather than load-balance at layer 4 
    mode    http
    
    # A connection on which no data has been transferred will not be logged (such as monitor probes)
    option	dontlognull

    # Various response timeouts
    timeout connect 5s
    timeout client 20s
    timeout server 45s





frontend fe-app-combined
    mode tcp
    bind *:80
    tcp-request inspect-delay 2s
    tcp-request content accept if HTTP
    tcp-request content accept if { req.ssl_hello_type 1 }
    use_backend be-app-recirc-http if HTTP
    default_backend be-app-recirc-https

backend be-app-recirc-http
    mode tcp
    server loopback-for-http abns@app-haproxy-http send-proxy-v2

backend be-app-recirc-https
    mode tcp
    server loopback-for-https abns@app-haproxy-https send-proxy-v2

frontend fe-app-https
    mode http
    bind abns@app-haproxy-https accept-proxy ssl crt /certs/fullkeychain.pem alpn h2,http/1.1
    # whatever you need to do for HTTPS traffic
    default_backend be-app-real

frontend fe-app-http
    mode http
    bind abns@app-haproxy-http accept-proxy
    # whatever you need to do for HTTP traffic
    redirect scheme https code 301 if !{ ssl_fc }

backend be-app-real
    mode http
    balance roundrobin

    # Enable insertion of the X-Forwarded-For header to requests sent to servers
    option forwardfor
    #https://serverfault.com/questions/743842/add-haproxy-x-forwarded-host-request-header
    http-request set-header X-Forwarded-Host %[req.hdr(Host)]
    http-request set-header X-Forwarded-Port %[dst_port]
    http-request add-header X-Forwarded-Proto https if { ssl_fc }

    # Send these request to check health
    option httpchk
    http-check send meth GET uri / ver HTTP/1.1 hdr Host haproxy.local
    http-check expect status 200-399

    server app-backend1 changedetection-app:5000 check

Cox email overage alert

And then I just left it running – because “why not”, I thought.

I was jolted from my slumber by this Cox email alert just a couple of days later.

After initially blaming my YouTube addiction, or perhaps some other heavy usage, one glance at the router traffic analyzer pointed to my AMD Ryzen homelab server as the traffic hog. One look at iptraf then removed all doubts and pointed decisively to changedetection.io.

I turned off all my changedetection.io tests immediately afterwards, while screaming in horror and belated disbelief.

Results: Cox daily internet usage chart

I enabled the above selenium test some time on October 11th or 12th, and the traffic graph below shows a very evident jump around then.

The graph below is the result (not sure about the spike on 10/13 – it could be that I was messing around with my tests more heavily than usual).

The difference is fairly dramatic: it doubles or triples my usual (already heavy) daily traffic. At that rate I could easily cross my internet cap (which is 1 TB a month) in a little more than 10 days, and this is with just 12 urls being tested with chrome webdriver. I don’t know about you, but I certainly underestimated the traffic it takes to fully load a heavy website with heavy JavaScript and ads that then keep downloading more stuff.
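For a back-of-the-envelope feel of what those numbers imply, here is a small Python sketch. Everything in it is an assumption made for illustration: the helper names are made up, 1 TB is taken as 1000 GB, "a little more than 10 days" is modeled as 10.5 days, and all of the daily traffic is attributed to the test (it is really only part of it, so the per-load figures are upper bounds).

```python
GB_PER_TB = 1000  # assumption: decimal units; the ISP may count differently


def daily_gb_to_reach_cap(cap_gb: float, days: float) -> float:
    """Average daily traffic (GB) that uses up cap_gb in the given number of days."""
    return cap_gb / days


def mb_per_page_load(daily_gb: float, loads_per_day: float) -> float:
    """Average transfer per page load (MB) if loads_per_day loads add up to daily_gb."""
    return daily_gb * 1000 / loads_per_day


cap_gb = 1 * GB_PER_TB
# 10.5 days is a stand-in for "a little more than 10 days" (assumption)
daily_gb = daily_gb_to_reach_cap(cap_gb, days=10.5)
print(f"about {daily_gb:.0f} GB per day")

# 12 urls checked every 4 minutes vs. every minute (1440 minutes in a day)
for loads_per_day in (12 * 1440 / 4, 12 * 1440 / 1):
    mb = mb_per_page_load(daily_gb, loads_per_day)
    print(f"{loads_per_day:.0f} loads/day -> up to {mb:.1f} MB per load")
```

Under those assumptions it works out to roughly 95 GB a day, or somewhere between about 5 and 22 MB per page load at most, which is not a crazy figure for an ad-heavy page loaded in a real browser.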

I have no problem upgrading to an unlimited data package or paying an overage charge, but not everyone is that fortunate, and the amount of traffic one can incur from such a seemingly innocent test is staggering. Cox overage charges as of October 18th 2022 are $10 per 50 GB, so depending on when you catch it, your bill could bite you.
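Here is a quick Python sketch of how that pricing adds up. The function is hypothetical, and it assumes the charge is applied per started 50 GB block over a 1000 GB cap (rounding up); the actual billing rules may differ, so treat it as an estimate only.

```python
import math


def overage_charge_usd(used_gb: float, cap_gb: float = 1000,
                       block_gb: float = 50, block_price_usd: float = 10) -> float:
    """Estimated overage charge, assuming every started block over the cap is billed."""
    over_gb = max(0, used_gb - cap_gb)
    return math.ceil(over_gb / block_gb) * block_price_usd


# Hypothetical month: 1180 GB used, i.e. 180 GB over the cap -> 4 started blocks
print(overage_charge_usd(1180))  # prints 40
```

Staying under the cap costs nothing extra in this model, but a test left running for a full month could add up quickly.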

Lessons learned:

  • Selenium webdriver can generate a massive amount of traffic, especially if the website you are testing loads a lot of ads and has a lot of images. Make sure you use a real hosted remote server or a vps, as they typically come with an unlimited 1 Gbps link or at least some heavy traffic packages (8-10 TB). Otherwise, make sure your residential internet line is uncapped.
  • This also seems like a no-brainer issue to me: support legislation banning data caps.

Just figured I’d share my experience, and thanks for reading!
