juanluisrto/Scraping-orchestra
Scraping-orchestra

A scraping Master-slave system based on Google App Engine

This repository showcases an approach for orchestrating, from a local process, a scraper deployed on Google App Engine. The approach is a workaround for the HTTP 429 Too Many Requests error: whenever the error shows up, the scraper is redeployed so that it gets a new IP address.
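The redeploy-on-429 idea can be sketched as a small control loop on the master side. This is a minimal illustration, not the code in master.py: the function name `orchestrate`, its parameters, and the `max_redeploys` cap are all hypothetical, and the `fetch` and `redeploy` callables are injected so the sketch makes no assumptions about how the slave is reached or deployed.

```python
from typing import Callable, Dict, Iterable, Tuple

TOO_MANY_REQUESTS = 429  # HTTP status that triggers a redeploy


def orchestrate(
    urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]],
    redeploy: Callable[[], None],
    max_redeploys: int = 3,
) -> Dict[str, str]:
    """Scrape each URL through the slave, redeploying it on HTTP 429.

    fetch(url) is assumed to return (status_code, body) as reported by the
    slave. redeploy() is assumed to block until the slave is back up; in a
    real setup it might shell out to `gcloud app deploy`.
    """
    results: Dict[str, str] = {}
    redeploys = 0
    for url in urls:
        while True:
            status, body = fetch(url)
            if status != TOO_MANY_REQUESTS:
                # Any non-429 response is stored as-is; other error codes
                # are not handled specially in this sketch.
                results[url] = body
                break
            if redeploys >= max_redeploys:
                raise RuntimeError(f"still rate limited after {redeploys} redeploys")
            # Rate limited: redeploy the slave to get a new IP, then retry
            # the same URL.
            redeploy()
            redeploys += 1
    return results
```

The cap on redeploys is a design choice added here so the loop cannot spin forever if the target keeps rejecting requests even from fresh IPs.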

Medium article

Take a look at the article I published about this

System architecture

[Figure: system architecture diagram]

Running locally

To test this locally, clone the repo and run:

  • pip install -r requirements.txt
  • python master.py in one terminal
  • gunicorn -b :8080 slave:app --timeout 360000 --preload in a different terminal

The output of the master looks like this.

[Figure: sample output of the master process]
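In the gunicorn command above, `slave:app` tells gunicorn to import the module `slave` and serve the WSGI callable named `app`. The repository's actual slave.py is not reproduced here; the following is a hypothetical stand-in that uses only the standard library to show the shape such a callable takes. The `url` query parameter and the JSON response format are assumptions made for illustration.

```python
import json
from urllib.parse import parse_qs


def app(environ, start_response):
    """Minimal WSGI callable that gunicorn could serve as `slave:app`."""
    # Read the target URL from the query string, e.g. /?url=http://example.com
    query = parse_qs(environ.get("QUERY_STRING", ""))
    target = query.get("url", [""])[0]

    if not target:
        status = "400 Bad Request"
        payload = {"error": "missing url parameter"}
    else:
        # A real slave would fetch and parse `target` here, and pass any
        # HTTP 429 it receives back to the master.
        status = "200 OK"
        payload = {"url": target, "status": "accepted"}

    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

Saved as slave.py, a module like this would be picked up by the same `gunicorn -b :8080 slave:app` invocation.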
