TUB-NLP-OpenData/intelligent_crawler

Description

"A Web crawler, sometimes called spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for Web indexing (web spidering)." wikipedia.org

Problem

Finding a particular piece of information across differently structured data sources can still be a challenge. For instance, newspapers typically employ different structures for their articles, so the task of automatically extracting structured information - e.g. the name of the author - must be adapted to each context.

Goal

Design a multilingual web crawler capable of reading differently structured data and extracting a particular piece of information.

Use Case

Fact-checkers play a vital role by providing a highly reputable source of information for training fake-news detection models. However, this information is structured differently depending on the website. Apply this "intelligent crawler" to automatically extract structured data from fact-checkers.

Data Pipeline

  1. Targeting Reliable Sources
    1. References
    2. Definition of a white list: reliable sources of information (see the whitelist sketch after this list).
    3. International Fact-Checking Network’s code of principles
    4. Fields (url, name, language, country)
  2. (E) Data Extraction
    1. References
    2. Scrapy - pipeline (see the spider sketch after this list).
    3. What are the crawling modalities?
    4. How to deal with pagination (automatically)?
    5. How to filter out non-informative web pages? - intelligent crawlers
  3. (T) Data Transformation
    1. References
    2. How to transform data into information?
    3. Goal: an ML model for extracting structured information from unstructured, multilingual data (see the token-classification sketch after this list).
    4. Training: a URL & its structured fields.
    5. BERT - multilingual language model - RoBERTa.
  4. (L) Data Load
    1. References
    2. Semantic Data concepts.
    3. Suitable ontology
    4. What are the storage technologies for an Open Knowledge Base/Graph?
    5. Normalization / Disambiguation
    6. Linked Data / DBpedia, etc. (see the RDF sketch after this list).
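
As a minimal sketch of step 1, the snippet below shows one possible representation of the white list with the fields named above (url, name, language, country). The example entries and the CSV format are illustrative assumptions, not part of this repository.

```python
# whitelist.py - hypothetical white list of fact-checking sources.
# The fields match step 1 above (url, name, language, country); the
# entries and the CSV output format are illustrative assumptions.
import csv

WHITELIST = [
    {"url": "https://www.snopes.com", "name": "Snopes", "language": "en", "country": "US"},
    {"url": "https://correctiv.org", "name": "CORRECTIV", "language": "de", "country": "DE"},
]

def save_whitelist(path: str = "whitelist.csv") -> None:
    """Persist the white list so the crawler can read its start URLs from it."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "name", "language", "country"])
        writer.writeheader()
        writer.writerows(WHITELIST)

if __name__ == "__main__":
    save_whitelist()
```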
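
For step 2 (E), a hedged Scrapy spider sketch follows. The spider name, start URL, CSS selectors, and pagination rule are assumptions; the actual spiders live in the Scrapy project linked under "Running Intelligent Crawler" and differ per fact-checker.

```python
# factcheck_spider.py - illustrative sketch only, not the project's spider.
import scrapy


class FactCheckSpider(scrapy.Spider):
    name = "factcheck"                                        # hypothetical spider name
    start_urls = ["https://example-factchecker.org/claims"]   # placeholder URL

    def parse(self, response):
        # Follow each article link; the selector is an assumption.
        for href in response.css("article a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

        # Naive automatic pagination: follow the "next" link if present.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response):
        # Yield one raw item per article page for the downstream pipeline.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "date": response.css("time::attr(datetime)").get(),
        }
```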
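
For step 3 (T), one way to frame field extraction as a multilingual ML problem is token classification on top of a multilingual encoder such as XLM-RoBERTa. The sketch below only wires up an untrained classification head with an assumed label set; it would still need fine-tuning on pairs of page text and structured fields, as described in the training item above.

```python
# transform_sketch.py - illustrative token-classification setup for step 3 (T).
# The model name and label set are assumptions; the classification head is
# randomly initialised here and must be fine-tuned before it is useful.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-DATE", "I-DATE", "B-CLAIM", "I-CLAIM"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)

text = "Claim checked by Jane Doe on 2021-05-01."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, num_labels)

predicted = [LABELS[i] for i in logits.argmax(dim=-1)[0].tolist()]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print(list(zip(tokens, predicted)))
```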
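
For step 4 (L), the sketch below shows how an extracted claim could be expressed as RDF and linked to DBpedia using rdflib. The choice of schema.org's ClaimReview and the example resources are assumptions; selecting a suitable ontology and storage backend are exactly the open questions listed for this step.

```python
# load_sketch.py - illustrative RDF representation of one extracted claim.
# Ontology terms and example resources are assumptions, not the project's choice.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

SCHEMA = Namespace("https://schema.org/")
EX = Namespace("http://example.org/claims/")   # placeholder namespace

g = Graph()
g.bind("schema", SCHEMA)

claim = EX["claim-001"]
g.add((claim, RDF.type, SCHEMA.ClaimReview))
g.add((claim, SCHEMA.datePublished, Literal("2021-05-01")))
g.add((claim, SCHEMA.author, Literal("Jane Doe")))
# Link a mentioned entity to an existing Open Data repository (DBpedia).
g.add((claim, SCHEMA.about, URIRef("http://dbpedia.org/resource/Berlin")))

print(g.serialize(format="turtle"))
```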

Research Problems addressed here

  1. How to create a machine-readable KB?
    1. How to use ML methods to extract information from unstructured data?
    2. How to link a KB to existing Open Data repositories (e.g. dbpedia.org)? See the lookup sketch below.
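
As a sketch of the linking question (1.2), the snippet below queries the public DBpedia SPARQL endpoint for candidate resources matching an extracted entity label. The query and the example label are illustrative only; a real pipeline would add disambiguation on top of this lookup.

```python
# dbpedia_lookup.py - illustrative lookup of DBpedia URIs whose English label
# matches an entity string extracted from a fact-check article.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_candidates(label: str, limit: int = 5):
    """Return DBpedia resource URIs whose rdfs:label matches the given string."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT DISTINCT ?resource WHERE {{
            ?resource rdfs:label "{label}"@en .
        }} LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [row["resource"]["value"] for row in results["results"]["bindings"]]

if __name__ == "__main__":
    print(dbpedia_candidates("Berlin"))
```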

Running Intelligent Crawler

Instructions for running the Scrapy crawler can be found in the Scrapy Crawler readme.

Information Extracted

From the fact-checkers, we use only metadata publicly available on the internet, such as the claim, the date, and the source, amongst others.
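
For illustration, one extracted record might be shaped as follows; the exact field names are an assumption based on the metadata listed above.

```python
# Hypothetical shape of one extracted record; only publicly available
# metadata is kept, as described above.
record = {
    "claim": "Example claim text as published by the fact-checker.",
    "date": "2021-05-01",
    "source": "https://example-factchecker.org/claims/example",
    "language": "en",
}
```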

Related Projects