Scrapy is an open-source framework for extracting data from websites. We use Scrapy to crawl a list of web pages (fact-checker websites) from a CSV dataset.
- Input: a CSV file (e.g. `data_out.csv`)
- Output: a JSON file
- The Scrapy script runs in batches of roughly 3000 URLs. Larger batches may result in failed HTTP requests, so divide the input data into smaller batches (see `news_spider.py` in the `scraper` directory).
- Filters news articles by language (e.g. `en`, `es`, `de`)
- Requires Python 3+ (tested on Python 3.8.6).
- The `venv` module (`python3-venv`) or `virtualenv` must be installed
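The batch-splitting step mentioned above can be sketched as follows. This is a minimal illustration, not code from `news_spider.py`; the output file names and the helper itself are assumptions, and only the ~3000-row default mirrors the batch size suggested in this README:

```python
import csv
from pathlib import Path


def split_csv(input_path, out_dir, batch_size=3000):
    """Split a URL CSV into batch files of at most `batch_size` rows each.

    Returns the number of batch files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(input_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # repeat the header in every batch file
        batch, n_batches = [], 0
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                n_batches += 1
                _write_batch(out_dir / f"batch_{n_batches:03d}.csv", header, batch)
                batch = []
        if batch:  # final, possibly smaller batch
            n_batches += 1
            _write_batch(out_dir / f"batch_{n_batches:03d}.csv", header, batch)
    return n_batches


def _write_batch(path, header, rows):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Each resulting batch file can then be fed to the crawler independently, keeping individual runs small enough to avoid the HTTP failures noted above.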
The easiest way is to run:

```shell
make build
```
In case this command fails, follow the detailed instructions below:
- Create a virtual environment:

  ```shell
  python3 -m venv .env
  ```

  or, if using the `virtualenv` package:

  ```shell
  virtualenv .env
  ```

- Activate the virtual environment:

  ```shell
  . .env/bin/activate
  ```
- Install pip dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- Place the input file in the `data` directory under `code/data_acquisition/scraper`.
- Set the batch size to a reasonable value, e.g. ~3000 (see `news_spider.py` in the `scraper` directory).
- Start crawling:

  ```shell
  make run-crawler
  ```

  or

  ```shell
  scrapy crawl news-crawler
  ```
- Output is saved in the `data` directory under `code/data_acquisition/scraper`.
- If `python3-venv` is not installed:

  ```shell
  sudo apt install python3-venv
  ```