"A Web crawler, sometimes called spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for Web indexing (web spidering)." (wikipedia.org)
Finding a particular piece of information across differently structured data can still be a challenge in some cases. For instance, newspapers typically employ different structures for their articles, so the task of automating the extraction of structured information (e.g. the name of the author) must be adapted for each context.
Design a multilingual web crawler capable of reading differently structured data and extracting a particular piece of information.
Fact-checkers play a vital role by providing a highly reputable source of information for training fake-news detection models. However, the information is structured differently depending on the website. Apply this "intelligent crawler" to automatically extract structured data from fact-checkers.
- Targeting Reliable Sources
- References
- Definition of white list: reliable sources of information.
- International Fact-Checking Network’s code of principles
- Fields (url, name, language, country)
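The whitelist of reliable sources can be sketched as a simple typed record using the fields listed above (url, name, language, country). This is a minimal illustration, not the project's actual data model; the two example entries are IFCN signatories but are included only as placeholders.

```python
from dataclasses import dataclass

# Hypothetical whitelist entry for a trusted fact-checker,
# using the fields from the outline: url, name, language, country.
@dataclass(frozen=True)
class FactChecker:
    url: str
    name: str
    language: str  # ISO 639-1 code, e.g. "en"
    country: str   # ISO 3166-1 alpha-2 code, e.g. "US"

# Illustrative whitelist; real entries would come from the IFCN signatory list.
WHITELIST = [
    FactChecker("https://www.politifact.com", "PolitiFact", "en", "US"),
    FactChecker("https://www.aosfatos.org", "Aos Fatos", "pt", "BR"),
]

def is_whitelisted(url: str) -> bool:
    """Return True if the URL belongs to a whitelisted source."""
    return any(url.startswith(fc.url) for fc in WHITELIST)
```

Keeping the whitelist as data rather than code makes it easy to regenerate it from the IFCN signatory list.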
- (E) Data Extraction
- References
- Scrapy - pipeline.
- What are the crawling modalities?
- How to deal with pagination (automatically)?
- How to filter out non-informative web pages? - intelligent crawlers
- (T) Data Transformation
- References
- How to transform data into information?
- Goal: ML model for extraction of structured information from unstructured data (multilingual).
- Training: URL & its structured fields.
- BERT (multilingual) and RoBERTa language models.
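The training setup above (a URL plus its structured fields) can be cast as span labeling: the page text is annotated with the character spans of each known field, which is a common supervision format for fine-tuning a multilingual encoder such as BERT. The example record and field names below are hypothetical.

```python
import json

# Hypothetical training example: a page URL, its raw text, and the
# structured fields the model should learn to extract.
example = {
    "url": "https://example-factchecker.org/claims/123",
    "text": "Fact-check by Jane Doe, published 2024-01-01: the claim is false.",
    "fields": {"author": "Jane Doe", "date": "2024-01-01", "verdict": "false"},
}

def to_span_labels(example):
    """Convert (text, fields) into character-span labels, a typical
    input format for token-classification fine-tuning."""
    spans = []
    for label, value in example["fields"].items():
        start = example["text"].find(value)
        if start != -1:  # fields not present verbatim in the text are skipped
            spans.append({"label": label, "start": start, "end": start + len(value)})
    return spans

print(json.dumps(to_span_labels(example), indent=2))
```

Aligning fields to text by exact match is only a baseline; a real pipeline would need fuzzy matching for dates and names whose surface form differs per language.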
- (L) Data Load
- References
- Semantic Data Concepts.
- Suitable Ontology
- What are the storage technologies for an Open Knowledge Base/Graph?
- Normalization / Disambiguation
- Linked Data / DBpedia, etc.
- How to create a machine readable KB?
- How to use ML methods to extract information from unstructured data?
- How to link a KB to existing Open Data Repositories (e.g. dbpedia.org)
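A minimal sketch of the load step: serialize one extracted claim as RDF N-Triples, typed with schema.org's ClaimReview vocabulary and linked to an existing DBpedia resource. The subject URI namespace and the function itself are illustrative assumptions, not the project's actual schema.

```python
# Emit one claim as N-Triples, linking its spatial coverage to DBpedia.
# The https://example.org/kb/ namespace is a placeholder.
def to_ntriples(claim_id, claim_text, country_dbpedia):
    subject = f"<https://example.org/kb/claim/{claim_id}>"
    triples = [
        f'{subject} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/ClaimReview> .',
        f'{subject} <https://schema.org/claimReviewed> "{claim_text}" .',
        f'{subject} <https://schema.org/spatialCoverage> <http://dbpedia.org/resource/{country_dbpedia}> .',
    ]
    return "\n".join(triples)

print(to_ntriples("123", "The claim is false", "Brazil"))
```

Using shared vocabularies (schema.org) and shared identifiers (DBpedia resources) is what makes the KB machine-readable and linkable to existing Open Data repositories; a library such as rdflib would handle escaping and serialization in practice.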
Instructions to run the Scrapy crawler can be found in the Scrapy Crawler readme.
From the fact-checkers, we use only metadata publicly available on the internet, such as the claim, date, and source, amongst others.