TUB-NLP-OpenData/intelligent_crawler

Description

"A Web crawler, sometimes called spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for Web indexing (web spidering)." wikipedia.org

Problem

Finding a particular piece of information across differently structured data sources can still be a challenge. For instance, newspapers typically employ different structures for their articles, so the task of automatically extracting structured information - e.g. the name of the author - must be adapted to each context.

Goal

Design a multilingual web crawler capable of reading differently structured data and extracting a particular piece of information.

Use Case

Fact-checkers play a vital role by providing a highly reputable source of information for training fake-news detection models. However, this information is structured differently depending on the website. Apply this "intelligent crawler" to automatically extract structured data from fact-checkers.

Data Pipeline

  1. Targeting Reliable Sources
    1. References
    2. Definition of a white list: reliable sources of information (see the whitelist sketch after this list).
    3. International Fact-Checking Network’s code of principles
    4. Fields (url, name, language, country)
  2. (E) Data Extraction
    1. References
    2. Scrapy - pipeline (see the spider sketch after this list).
    3. What are the crawling modalities?
    4. How to deal with pagination (automatically)?
    5. How to filter out non-informative web pages? - intelligent crawlers
  3. (T) Data Transformation
    1. References
    2. How to transform data into information?
    3. Goal: an ML model for extracting structured information from unstructured, multilingual data (see the token-classification sketch after this list).
    4. Training: a URL & its structured fields.
    5. BERT - multilingual language model - RoBERTa.
  4. (L) Data Load
    1. References
    2. Semantic Data concepts.
    3. Suitable ontology
    4. What are the storage technologies for an Open Knowledge Base/Graph?
    5. Normalization / Disambiguation
    6. Linked Data / DBpedia, etc. (see the RDF sketch after this list).
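
As a minimal sketch of step 1, the snippet below shows one possible representation of the white list with the fields named above (url, name, language, country). The example entries and the CSV format are illustrative assumptions, not part of this repository.

```python
# whitelist.py - hypothetical white list of fact-checking sources.
# The fields match step 1 above (url, name, language, country); the
# entries and the CSV output format are illustrative assumptions.
import csv

WHITELIST = [
    {"url": "https://www.snopes.com", "name": "Snopes", "language": "en", "country": "US"},
    {"url": "https://correctiv.org", "name": "CORRECTIV", "language": "de", "country": "DE"},
]

def save_whitelist(path: str = "whitelist.csv") -> None:
    """Persist the white list so the crawler can read its start URLs from it."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "name", "language", "country"])
        writer.writeheader()
        writer.writerows(WHITELIST)

if __name__ == "__main__":
    save_whitelist()
```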
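
For step 2 (E), a hedged Scrapy spider sketch follows. The spider name, start URL, CSS selectors, and pagination rule are assumptions; the actual spiders live in the Scrapy project linked under "Running Intelligent Crawler" and differ per fact-checker.

```python
# factcheck_spider.py - illustrative sketch only, not the project's spider.
import scrapy


class FactCheckSpider(scrapy.Spider):
    name = "factcheck"                                        # hypothetical spider name
    start_urls = ["https://example-factchecker.org/claims"]   # placeholder URL

    def parse(self, response):
        # Follow each article link; the selector is an assumption.
        for href in response.css("article a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

        # Naive automatic pagination: follow the "next" link if present.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response):
        # Yield one raw item per article page for the downstream pipeline.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "date": response.css("time::attr(datetime)").get(),
        }
```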
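
For step 3 (T), one way to frame field extraction as a multilingual ML problem is token classification on top of a multilingual encoder such as XLM-RoBERTa. The sketch below only wires up an untrained classification head with an assumed label set; it would still need fine-tuning on pairs of page text and structured fields, as described in the training item above.

```python
# transform_sketch.py - illustrative token-classification setup for step 3 (T).
# The model name and label set are assumptions; the classification head is
# randomly initialised here and must be fine-tuned before it is useful.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-DATE", "I-DATE", "B-CLAIM", "I-CLAIM"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)

text = "Claim checked by Jane Doe on 2021-05-01."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, num_labels)

predicted = [LABELS[i] for i in logits.argmax(dim=-1)[0].tolist()]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print(list(zip(tokens, predicted)))
```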
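
For step 4 (L), the sketch below shows how an extracted claim could be expressed as RDF and linked to DBpedia using rdflib. The choice of schema.org's ClaimReview and the example resources are assumptions; selecting a suitable ontology and storage backend are exactly the open questions listed for this step.

```python
# load_sketch.py - illustrative RDF representation of one extracted claim.
# Ontology terms and example resources are assumptions, not the project's choice.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

SCHEMA = Namespace("https://schema.org/")
EX = Namespace("http://example.org/claims/")   # placeholder namespace

g = Graph()
g.bind("schema", SCHEMA)

claim = EX["claim-001"]
g.add((claim, RDF.type, SCHEMA.ClaimReview))
g.add((claim, SCHEMA.datePublished, Literal("2021-05-01")))
g.add((claim, SCHEMA.author, Literal("Jane Doe")))
# Link a mentioned entity to an existing Open Data repository (DBpedia).
g.add((claim, SCHEMA.about, URIRef("http://dbpedia.org/resource/Berlin")))

print(g.serialize(format="turtle"))
```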

Research Problems addressed here

  1. How to create a machine-readable KB?
    1. How to use ML methods to extract information from unstructured data?
    2. How to link a KB to existing Open Data repositories (e.g. dbpedia.org)? See the lookup sketch below.
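
As a sketch of the linking question (1.2), the snippet below queries the public DBpedia SPARQL endpoint for candidate resources matching an extracted entity label. The query and the example label are illustrative only; a real pipeline would add disambiguation on top of this lookup.

```python
# dbpedia_lookup.py - illustrative lookup of DBpedia URIs whose English label
# matches an entity string extracted from a fact-check article.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_candidates(label: str, limit: int = 5):
    """Return DBpedia resource URIs whose rdfs:label matches the given string."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT DISTINCT ?resource WHERE {{
            ?resource rdfs:label "{label}"@en .
        }} LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [row["resource"]["value"] for row in results["results"]["bindings"]]

if __name__ == "__main__":
    print(dbpedia_candidates("Berlin"))
```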

Running Intelligent Crawler

Instructions for running the Scrapy crawler can be found in the Scrapy Crawler readme.

Information Extracted

From the fact-checkers, we use only metadata publicly available on the internet, such as the claim, the date, and the source, amongst others.
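
For illustration, one extracted record might be shaped as follows; the exact field names are an assumption based on the metadata listed above.

```python
# Hypothetical shape of one extracted record; only publicly available
# metadata is kept, as described above.
record = {
    "claim": "Example claim text as published by the fact-checker.",
    "date": "2021-05-01",
    "source": "https://example-factchecker.org/claims/example",
    "language": "en",
}
```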

Related Projects