News Scraping

Description

Scrapy is an open-source framework for extracting data from websites. We use Scrapy to crawl a list of webpages (fact-checker websites) from a CSV dataset.

Overview

Input: a CSV file (e.g. data_out.csv).
Output: a JSON file.
The Scrapy script runs in batches of about 3000 entries. Larger batches might result in failed HTTP requests, so divide the input data into smaller batches (see news_spider.py in the scraper directory).
News articles are filtered by language (e.g. en, sp, de).
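
The batching step described above can be sketched as follows. This is a minimal illustration, not the code in news_spider.py: the helper names read_urls and batched and the column name "url" are assumptions made for the example, and the real spider may split the input differently.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List

BATCH_SIZE = 3000  # batch size suggested in this README


def read_urls(csv_path: str, column: str = "url") -> Iterator[str]:
    """Yield non-empty URLs from one column of the input CSV.

    The column name "url" is an assumption; use the header of your dataset.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            url = (row.get(column) or "").strip()
            if url:
                yield url


def batched(items: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Group items into consecutive lists of at most `size` elements."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each list produced by batched(read_urls(path)) could then be crawled in a separate run, which keeps the number of requests per run close to the size that is reported to work.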

Prerequisites

Requires Python 3+ (tested on Python 3.8.6).
The python3-venv package or virtualenv needs to be installed.

Setup (Ubuntu)

The easy way is to run
make build
In case this command fails, follow the detailed instructions below:

Create virtual environment: python3 -m venv .env or virtualenv .env (if using virtualenv package)
Activate virtual environment: . .env/bin/activate
Install pip dependencies: pip install -r requirements.txt

Running the crawler

Place the input file in the data directory under code/data_acquisition/scraper.
Set the batch size to a reasonable value, such as ~3000 (see news_spider.py in the scraper directory).
Start crawling: make run-crawler or scrapy crawl news-crawler
The output is saved in the data directory under code/data_acquisition/scraper.
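
After a run, one way to sanity-check the result is to load the output and tally the records by language. The sketch below assumes that the output file holds a single JSON array of objects and that each object has a "language" field; both are assumptions for illustration, so adjust the field name to whatever the spider actually writes.

```python
import json
from collections import Counter
from typing import Dict, List


def load_articles(json_path: str) -> List[dict]:
    """Load the crawler output, assumed here to be one JSON array."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def count_by_language(articles: List[dict], field: str = "language") -> Dict[str, int]:
    """Count records per language code; "language" is a hypothetical field name."""
    return dict(Counter(article.get(field, "unknown") for article in articles))
```

A count that is far below the batch size, or a large "unknown" bucket, can indicate failed requests or records where the language was not detected.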

Troubleshooting

If python3-venv is not installed: sudo apt install python3-venv
