PTOIE

A pt-br OpenIE corpus and model.

Setup Project:

First install python 3.9 and run pip install poetry.
Open the project, set a poetry python interpreter and run poetry update on venv terminal.
After that, you need to wait for poetry to install all the needed packages.

Train and Pred Usage:

Training the model

For training the model, first get your data ready, with train, dev and test splits. run the following command(you can replace with your own parameters):

python3 train.py (max_epochs) (model_name) (train_file) (test_file) (dev_file)

example1:

python3 train.py 150 PTOIE datasets/saida_match PTOIE_train.txt PTOIE_test.txt PTOIE_dev.txt

Making predictions

For predicting, at first, you need a trained model, so if you didnt trained any model, back on the training step. If you have your trained model, just run:

Python3 predict.py (model_name) (sentence)

example2:

Python3 predict.py PTOIE "A Matemática é uma ciência que utiliza conceitos e técnicas para a formação de conhecimentos abstratos e concretos."

Working example

|  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  |
Extração:  Os cachorros são os melhores amigos do homem
Extração:  Os cachorros são mamiferos
|  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  |


|  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  |
|  ------------------------------------------------------------------------------------------------ MAIS INFO -------------------------------------------------------------------------------------------------  |
|  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  |
|  sentença:                                                                                                                                                                                                     |
|   "Os cachorros , que são mamiferos , são os melhores amigos do homem ."                                                                                                                                       |
|  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  |
|  extrações:                                                                                                                                                                                                    |
|   ["Os cachorros"/ARG0, "são"/V, "mamiferos"/ARG1, "são"/V, "os melhores amigos do homem"/ARG1]                                                                                                                |
|  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  |
|  probs:                                                                                                                                                                                                        |
|  [Span[0:2]: "Os cachorros" → ARG0 (0.8672), Span[4:5]: "são" → V (0.6802), Span[5:6]: "mamiferos" → ARG1 (0.3477), Span[7:8]: "são" → V (0.9333), Span[8:13]: "os melhores amigos do homem" → ARG1 (0.5932)]   |
|  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  |

Evaluation

You can run the eval script with the trained model:

python3 eval.py (model_dir) (output_txt_name) (corpus_dir) (train_path) (test_path) (dev_path)

example5:

python3 eval.py train_output/PTOIE PTOIE_eval datasets/saida_match PTOIE_train.txt PTOIE_test.txt PTOIE_dev.txt

output:

Results:
- F-score (micro) 0.6129
- F-score (macro) 0.6132
- Accuracy 0.4425

By class:
              precision    recall  f1-score   support

        ARG1     0.4731    0.5163    0.4938       153
           V     0.7453    0.7895    0.7668       152
        ARG0     0.5972    0.5621    0.5791       153

   micro avg     0.6038    0.6223    0.6129       458
   macro avg     0.6052    0.6226    0.6132       458
weighted avg     0.6049    0.6223    0.6129       458

Dataset tools

I made available some tools I wrote to convert and create a conll format dataset, feel free to use, some of them is not user friendly to use, but you can run the main.py to prepare a conll with the labels ready to train with:

The following command only works with the PTOIE dataset, if you want to use your own dataset, see explanation of example4.

python3 datasets/main.py (output_name) (json_dir) (input_dir) (test_size) (dev_size)

example3:

PTOIE saida_match/json_dump.json PTOIE/PTOIE.txt 0.1 0.1

If you want to create a conll of your own dataset, you need a json on the following format example:

{"0": 
  {"Id": 0,
   "sent": " A universidade \u00e9 a sede principal da Congrega\u00e7\u00e3o da Santa Cruz (embora n\u00e3o seja sua sede oficial, que fica em Roma).",
   "ext": [{"arg1": "A sede da congrega\u00e7\u00e3o da santa cruz  ", "rel": "   fica  ", "arg2": "  em Roma"}]},
"1": 
  {"Id": 1,
   "sent": " A universidade \u00e9 afiliada \u00e0 Congrega\u00e7\u00e3o da Santa Cruz (em latim Congregatio a Sancta Cruce, p\u00f3s-nominais abreviados \"CSC\").",
   "ext": [{"arg1": "A congrega\u00e7\u00e3o da santa cruz em latim  ", "rel": "   \u00e9  ", "arg2": "  Congregatio a Sancta Cruce"}]},

With the json file, run the following command

python3 datasets/main.py (output_name) (test_size) (dev_size) (json_dir) ""

example4:

python3 datasets/main.py PTOIE 0.1 0.1 saida_match/json_dump.json ""

Authors

A. Rios
B. Cabral
D. B. Claro
R. C. Araujo
M. Souza

How to cite

If you find this repo helpful, please consider citing:

@inproceedings{RIOS2024-TransAlign,
title={TransAlign: An Automated Corpus Generation through Cross-Linguistic Data Alignment for Open Information Extraction},
author={A. Rios and B. Cabral and D. B. Claro and R. C. Araujo and M. Souza},
booktitle={Proceedings of the International Conference on Computational Processing of Portuguese (PROPOR 2024)},
year={2024},
volume={1},
address={Santiago de Compostela},
organization={International Conference on Computational Processing of Portuguese (PROPOR 2024)}
}

A. Rios, B. Cabral, D. B. Claro, R. C. Araujo, M. Souza: TransAlign: An Automated Corpus Generation through Cross-Linguistic Data Alignment for Open Information Extraction. In: Proceedings of the International Conference on Computational Processing of Portuguese (PROPOR 2024), Santiago de Compostela, 2024, vol. 1.

Name		Name	Last commit message	Last commit date
Latest commit History 148 Commits
.idea		.idea
OIE		OIE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PTOIE

Setup Project:

Train and Pred Usage:

Training the model

Making predictions

Working example

Evaluation

Dataset tools

Authors

How to cite

About

Releases

Packages

Languages

alanrios2001/PORT_NOIE

Folders and files

Latest commit

History

Repository files navigation

PTOIE

Setup Project:

Train and Pred Usage:

Training the model

Making predictions

Working example

Evaluation

Dataset tools

Authors

How to cite

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages