Welcome to the Government Gazette text mining, cross-linking, and codification project (3gm for short), which applies Natural Language Processing methods and practices to Greek legislation.
This project aims to provide the most recent version of each law, i.e. an automated codex, via NLP methods and practices.
We live in a complex regulatory environment. As citizens, we obey government regulations from many authorities. As members of organized societies and groups, we must obey organizational policies and rules. As social beings, we are bound by conventions we make with others. As individuals, we are bound by personal rules of conduct. The sheer number and size of regulations can be daunting. We can agree on some general principles and, at the same time, disagree on how these principles apply to specific situations. To minimize such disagreements, regulators are often obliged to create numerous or very large regulations that deal with special cases.
In recent years, considerable attention has gathered around analyzing public sector texts with text mining methods, enabled by modern libraries, algorithms and practices and brought to the forefront by open source projects such as textblob, spaCy, SciPy, Tensorflow and NLTK. These collaborative efforts mark a shift towards more efficient machine understanding of natural language, which can be applied to public documents in order to provide useful tools for legislators. This emerging field is usually referred to as "Computational Law".
This project, developed under the auspices of the Google Summer of Code 2018 Program, extracts Government Gazette (ΦΕΚ) texts from the National Printing House (ET), cross-links them with each other and, finally, identifies and applies the amendments to the legal text, providing automatic codification of Greek legislation using methods and techniques of Natural Language Processing. This can reduce bureaucratic overhead and save considerable time for lawyers looking for the most recent versions of statutes in legal databases. Amendments are detected automatically and merged into the laws they modify, a procedure known as codification of the law. The resulting codified laws show the current text of a law at any point in time. Traditionally this has been done by hand; our aim was to automate it.
Finally, the laws are clustered into topics according to their content using an unsupervised machine learning model (Latent Dirichlet Allocation) to provide a more holistic representation of Greek legislation. For easier indexing, laws are also ranked with PageRank, which takes the interconnections between laws into account: the more references a legislative text receives from other texts, the more important it is considered.
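The ranking idea can be illustrated with a small sketch. The project itself relies on the PageRank implementation from networkx; the version below is a plain-Python power iteration written only for illustration, and the function name `pagerank`, its parameters, and the three law labels are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100, tol=1e-10):
    """Rank laws by incoming references using power iteration.

    links maps each law to the list of laws it refers to.
    """
    # Collect every law that appears as a source or as a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by laws that refer to nothing is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each law passes an equal share of its rank to every law it cites.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical reference graph: law B and law C both refer to law A.
references = {
    "law A": [],
    "law B": ["law A"],
    "law C": ["law A", "law B"],
}
ranks = pagerank(references)
most_referenced = max(ranks, key=ranks.get)
```

In this toy graph the law that receives the most references ends up with the highest score, which matches the intuition described above.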
Through the analysis, categorization and codification of Government Gazette documents, this project aims to reduce bureaucracy and make the management of public documents more efficient, which can save considerable time for lawyers and citizens.
A presentation of the project, given at FOSSCOMM 2018 at the University of Crete, is available here.
The project is hosted at 3gm.ellak.gr or openlaws.ellak.gr. A video presentation of the project is available here.
You can view the detailed timeline here. What has been done during the program can be found in the Final Progress Report.
The project met and exceeded its goals for Google Summer of Code 2018. Link
Google Summer of Code participant: Marios Papachristou (papachristoumarios)
Organization: GFOSS - Open Technologies Alliance
- Mentor: Diomidis Spinellis (dspinellis)
- Mentor: Sarantos Kapidakis
- Mentor: Marios Papachristou (papachristoumarios)
- Marios Papachristou (Original Developer)
- Theodore Papadopoulos
- Getting started
- Algorithms
- Datasets and Continuous Integration
- Documentation
- API Documentation
- RESTful API
- Help (for web application)
- Development
- The project is written in Python 3.x using the following libraries: spaCy, gensim, selenium, pdfminer.six, networkx, Flask_RESTful, Flask, pytest, numpy, pymongo, sklearn, pyocr, bs4, pillow and wand.
- The information is stored in MongoDB (document-oriented database schema) and is accessible through a RESTful API.
- The UI is based on Angular 7.
- Document parser can parse PDFs from Government Gazette Issues (see the data for examples). The documents are split into articles in order to detect amendments.
- Parser for existing laws.
- Named Entities for Legal Acts (e.g. Laws, Legislative Decrees etc.) encoded in regular expressions.
- Similarity analyzer using topic models for finding Government Gazette Issues that have the same topics.
- We use an unsupervised model to extract the topics and then group Issues by topic for cross-linking between Government Gazette documents. Topic modelling is done with the LDA algorithm, as illustrated in the Wiki Page. The source code is located at 3gm/topic_models.py.
- There is also a Doc2Vec approach.
- Documented end-2-end procedure at Project Wiki
- MongoDB Integration
- Fetching Tool for automated fetching of documents from ET
- Parallelized tool for batch conversion of documents with pdf2txt (for newer documents) or Google Tesseract 4.0 (for performing OCR on older documents), using pdfminer.six, tesseract and pyocr.
- Digitized archive of Government Gazette Issues from 1976 to today in PDF and plaintext format. Conversion of documents is done either via pdfminer.six or tesseract (for OCR on older documents).
- Web application written in Flask, located at 3gm/app.py and hosted at 3gm.ellak.gr.
- RESTful API written in flask-restful for providing versions of the laws.
- Unit tests integrated with Travis CI.
- Versioning system for laws with support for checkouts, rollbacks etc.
- Ranking of laws using PageRank provided by the networkx package.
- Summarization module using TextRank for providing summaries in the search results.
- Amendment Detection Algorithm. For example (taken from Greek Government Gazette):
Μετά το άρθρο 9Α του ν. 4170/2013, που προστέθηκε με το άρθρο 3 του ν. 4474/2017, προστίθεται άρθρο 9ΑΑ, ως εξής:
Main Body / Extract
Άρθρο 9ΑΑ
Πεδίο εφαρμογής και προϋποθέσεις της υποχρεωτικής αυτόματης ανταλλαγής πληροφοριών όσον αφορά στην Έκθεση ανά Χώρα
- Η Τελική Μητρική Οντότητα ενός Ομίλου Πολυεθνικής Επιχείρησης (Ομίλου ΠΕ) που έχει τη φορολογική της κατοικία στην Ελλάδα ή οποιαδήποτε άλλη Αναφέρουσα Οντότητα, σύμφωνα με το Παράρτημα ΙΙΙ Τμήμα ΙΙ, υποβάλλει την Έκθεση ανά Χώρα όσον αφορά το οικείο Φορολογικό Έτος Υποβολής Εκθέσεων εντός δώδεκα (12) μηνών από την τελευταία ημέρα του Φορολογικού Έτους Υποβολής Εκθέσεων του Ομίλου ΠΕ, σύμφωνα με το Παράρτημα ΙΙΙ Τμήμα ΙΙ.
The above text signifies the addition of an article to an existing law. We use a combination of heuristics and NLP from the spaCy package to detect the keywords (e.g. verbs, subjects etc.):
- Detect keywords for additions, removals, replacements etc.
- Detect the subject, which is in the nominative case in Greek. The subject is also one of a set of keywords such as article (άρθρο), paragraph (παράγραφος), period (εδάφιο), phrase (φράση) etc. These words have a subset relationship, which means that once the algorithm finds the subject it should also look up its predecessors, i.e. the larger units that contain it. This results in a structure like this:
- A Python dictionary is generated:
{'action': 'αντικαθίσταται', 'law': {'article': { '_id': '9AA', 'content': 'Πεδίο εφαρμογής και προϋποθέσεις της υποχρεωτικής αυτόματης ανταλλαγής πληροφοριών όσον αφορά στην Έκθεση ανά Χώρα 1. Η Τελική Μητρική Οντότητα ενός Ομίλου Πολυεθνικής Επιχείρησης (Ομίλου ΠΕ) που έχει τη φορολογική της κατοικία στην Ελλάδα ή οποιαδήποτε άλλη Αναφέρουσα Οντότητα, σύμφωνα με το Παράρτημα ΙΙΙ Τμήμα ΙΙ, υποβάλλει την Έκθεση ανά Χώρα όσον αφορά το οικείο Φορολογικό Έτος Υποβολής Εκθέσεων εντός δώδεκα (12) μηνών από την τελευταία ημέρα του Φορολογικού Έτους Υποβολής Εκθέσεων του Ομίλου ΠΕ, σύμφωνα με το Παράρτημα ΙΙΙ Τμήμα ΙΙ.'}, '_id': 'ν. 4170/2013'}, '_id': 14}
- The dictionary is then translated into a MongoDB operation (in this case, an insertion), and the resulting information is stored in the database.
For more information visit the corresponding Wiki Page
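As a rough illustration of the steps above, the sketch below detects the action keyword, the amended law and the article being added in the example sentence. It is not the project's implementation, which combines heuristics with spaCy: this version uses only regular expressions, and the keyword table, the function name `detect_amendment` and the shape of the returned dictionary are assumptions made for the example.

```python
import re

# Hypothetical keyword table: Greek verb form -> kind of amendment.
ACTIONS = {
    "προστίθεται": "insert",
    "αντικαθίσταται": "replace",
    "καταργείται": "delete",
}

# Reference to a law, e.g. "ν. 4170/2013".
LAW_RE = re.compile(r"ν\.\s*(\d+)/(\d{4})")
# Article keyword followed by its identifier, e.g. "άρθρο 9ΑΑ".
ARTICLE_RE = re.compile(r"άρθρο\s+(\d+\w*)")


def detect_amendment(sentence):
    """Return a dictionary describing the amendment, or None if none is found."""
    # Step 1: find the earliest action keyword in the sentence.
    hits = [(sentence.find(verb), verb) for verb in ACTIONS if verb in sentence]
    if not hits:
        return None
    position, verb = min(hits)

    # Step 2 (simplifying assumption): the amended law is the first law
    # mentioned, and the subject is the first article named after the verb.
    law = LAW_RE.search(sentence)
    article = ARTICLE_RE.search(sentence, position)
    if law is None or article is None:
        return None

    # Step 3: build a nested dictionary keyed by law and article.
    return {
        "action": verb,
        "kind": ACTIONS[verb],
        "law": {
            "_id": "ν. {}/{}".format(*law.groups()),
            "article": {"_id": article.group(1)},
        },
    }


sentence = (
    "Μετά το άρθρο 9Α του ν. 4170/2013, που προστέθηκε με το άρθρο 3 "
    "του ν. 4474/2017, προστίθεται άρθρο 9ΑΑ, ως εξής:"
)
result = detect_amendment(sentence)
print(result)
```

A regex-only approach like this breaks quickly on sentences with a different word order, which is one reason part-of-speech and case information from an NLP library is useful for this task.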
- Government Gazette Issues may not always follow guidelines.
- Improving heuristics.
- Gathering Information.
- Digitizing very old articles.
Development Mailing List: 3gm-dev@googlegroups.com
The project is open-sourced as part of the Google Summer of Code Program and Vision. It is released under the GNU GPLv3 license. For more information see LICENSE.