Scraping-orchestra

A master-slave scraping system based on Google App Engine

This repository showcases an approach for orchestrating, from a local process, a scraper deployed on Google App Engine. It is a workaround for the HTTP 429 Too Many Requests error: whenever the error shows up, the scraper is redeployed to get a new IP.
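The redeploy-on-429 idea can be sketched as a small control loop. This is a minimal illustration, not the code in master.py: the function name `orchestrate` and the injected `scrape` and `redeploy` callables are hypothetical. In a real setup, `redeploy` would presumably trigger a new App Engine deployment (for example by shelling out to `gcloud app deploy`), and `scrape` would send an HTTP request to the deployed scraper.

```python
from typing import Callable, Iterable, List, Tuple

TOO_MANY_REQUESTS = 429


def orchestrate(
    urls: Iterable[str],
    scrape: Callable[[str], Tuple[int, str]],  # hypothetical: returns (status, body)
    redeploy: Callable[[], None],              # hypothetical: redeploys the scraper
    max_redeploys: int = 3,
) -> List[str]:
    """Scrape every URL, redeploying the scraper whenever it reports HTTP 429."""
    results: List[str] = []
    redeploys = 0
    for url in urls:
        while True:
            status, body = scrape(url)
            if status != TOO_MANY_REQUESTS:
                # This sketch treats any status other than 429 as final.
                results.append(body)
                break
            if redeploys >= max_redeploys:
                raise RuntimeError(
                    f"still rate limited after {redeploys} redeploys"
                )
            # Rate limited: redeploy to get a new IP, then retry the same URL.
            redeploy()
            redeploys += 1
    return results
```

Passing the two operations in as callables keeps the loop testable without a network connection or a Google Cloud project. The `max_redeploys` cap is an added safeguard, not something the repository is known to implement.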

Medium article

Take a look at the article I published about this approach.

System architecture

[Figure: system architecture diagram]

Running locally

To test this locally, clone the repo and run:

1. pip install -r requirements.txt
2. python master.py (in one terminal)
3. gunicorn -b :8080 slave:app --timeout 360000 --preload (in a different terminal)
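In the gunicorn command, `slave:app` means "the WSGI callable named `app` in the module `slave`", `-b :8080` binds to port 8080, `--timeout` sets the worker timeout in seconds, and `--preload` loads the application before the workers are forked. The sketch below shows the general shape of such a callable using only the standard library. It is a hypothetical stand-in, not the contents of slave.py, which may well be built on a web framework instead.

```python
import json


def app(environ, start_response):
    """Minimal WSGI callable of the kind gunicorn can serve as `module:app`."""
    # Hypothetical behaviour: echo the request path back as JSON.
    path = environ.get("PATH_INFO", "/")
    payload = json.dumps({"path": path, "status": "ok"}).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(payload))),
        ],
    )
    return [payload]
```

Saved in a file named slave.py, a callable like this would be served by the gunicorn command above.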

The output of the master looks like this.

[Screenshot: console output of the master]
