# OpenWPM Crawler

Launch distributed OpenWPM crawls using Kubernetes Job workloads, or stand up a set of docker-compose services to run the crawl locally.

A Redis work queue is set up and loaded with the list of URLs to crawl.
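
For illustration, loading such a queue from Python might look like the following sketch. The queue name `crawl-queue` and the Redis address are assumptions for this example, not necessarily the values used by the deployment scripts:

```python
import redis

# Hypothetical illustration: push the sites to crawl onto a Redis list
# that acts as the shared work queue. The queue name "crawl-queue" is an
# assumption; check your deployment's configuration for the actual name.
r = redis.Redis(host="localhost", port=6379)
sites = ["http://example.com", "http://example.org"]
for site in sites:
    r.rpush("crawl-queue", site)
print(f"Queue length: {r.llen('crawl-queue')}")
```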

Containers running either locally or in the cloud execute the OpenWPM crawler.py script, which continuously fetches sites to visit and exits once no sites remain in the queue.
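
A minimal sketch of that worker pattern (not the actual crawler.py code) is shown below; the queue name `crawl-queue`, the timeout, and the `visit` helper are assumptions for illustration:

```python
import redis

def visit(url: str) -> None:
    # Placeholder for the actual OpenWPM visit logic.
    print(f"Visiting {url}")

r = redis.Redis(host="localhost", port=6379)
while True:
    # BLPOP blocks for up to `timeout` seconds waiting for new work.
    item = r.blpop("crawl-queue", timeout=30)
    if item is None:
        break  # queue drained; exit so the container can terminate
    _key, raw_url = item
    visit(raw_url.decode())
```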

## Preparations

To install all the required tools (using conda):

```bash
./install.sh
conda activate openwpm-crawler
```

## Run a crawl locally (using Kubernetes)

See ./deployment/local/README.md.

## Run a crawl in Google Cloud Platform

See ./deployment/gcp/README.md.

## Run a crawl locally (using docker-compose)

See ./deployment/local-compose/README.md. This is the simplest option, requiring only docker-compose, which ships with Docker on both Mac and Windows. Note, however, that behaviour may differ slightly from cloud crawls.

## Analyze crawl results

```bash
jupyter notebook
```

After launching Jupyter, open `analysis/Sample Analysis.ipynb` and choose Kernel -> Change Kernel -> openwpm-crawler from the menu.
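
As a starting point for your own analysis, crawl results can also be loaded directly with pandas. This is a minimal sketch assuming a local SQLite output file named `crawl-data.sqlite` containing a `site_visits` table; cloud crawls may write their data elsewhere or in other formats, so adjust the path and query to your deployment:

```python
import sqlite3

import pandas as pd

# Assumed output path and table name; verify against your crawl's
# actual storage configuration before running.
con = sqlite3.connect("crawl-data.sqlite")
visits = pd.read_sql_query("SELECT * FROM site_visits", con)
print(visits.head())
con.close()
```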