Nusantara Schema Documentation

We have defined a set of lightweight, task-specific schemas to simplify programmatic access to common nusantara-nlp datasets. These schemas should be implemented for each dataset in addition to a schema that preserves the original dataset format.

Example Schema and Associated Tasks

Knowledge Base

Schema Template

This is a simple container format with minimal nesting that supports a range of common knowledge base construction / information extraction tasks.

  • Named entity recognition (NER)
  • Named entity disambiguation/normalization/linking (NED)
  • Event extraction (EE)
  • Relation extraction (RE)
  • Coreference resolution (COREF)
{
    "id": "ABCDEFG",
    "document_id": "XXXXXX",
    "passages": [...],
    "entities": [...],
    "events": [...],
    "coreferences": [...],
    "relations": [...]
}

Schema Notes

  • id fields appear at the top (i.e. document) level and in every sub-component (passages, entities, events, coreferences, relations). They can be set in any fashion that makes every id field in a dataset unique (including id fields in different splits like train/validation/test).
  • document_id should be a dataset provided document id. If not provided in the dataset, it can be set equal to the top level id.
  • offsets contain character offsets into the string that would be created from " ".join([passage["text"] for passage in passages])
  • offsets and text are always lists to support discontinuous spans. For continuous spans, they have the form offsets=[(lo,hi)], text=["text span"]. For discontinuous spans, they have the form offsets=[(lo1,hi1), (lo2,hi2), ...], text=["text span 1", "text span 2", ...].
  • The normalized sub-component may contain one or more links to database entity identifiers.
  • passages captures document structure such as named sections.
  • entities, events, coreferences, and relations may be empty depending on the dataset and specific task.
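
The offset conventions above can be illustrated with a minimal sketch. It assumes each passage's text is a list of strings (as in the Passages example), so the lists are flattened before joining, and it treats the upper bound of each offset pair as exclusive, which is what the examples in this document suggest. The helper names `document_text` and `span_texts` and the toy passages are hypothetical and not part of nusantara-nlp.

```python
from typing import Dict, List, Sequence, Tuple


def document_text(passages: List[Dict]) -> str:
    """Join every passage string with a single space (hypothetical helper).

    Assumes passage["text"] is a list of strings, so it is flattened first.
    """
    return " ".join(piece for passage in passages for piece in passage["text"])


def span_texts(text: str, offsets: Sequence[Tuple[int, int]]) -> List[str]:
    """Return one substring per (lo, hi) pair, treating hi as exclusive."""
    return [text[lo:hi] for lo, hi in offsets]


# Toy passages for illustration only; not taken from any dataset.
passages = [
    {"id": "1", "type": "title", "text": ["Heart anatomy."], "offsets": [[0, 14]]},
    {"id": "2", "type": "abstract", "text": ["Left and right ventricles"], "offsets": [[15, 40]]},
]
full_text = document_text(passages)

# A discontinuous span: "Left ... ventricles" skips over "and right".
left_ventricles = span_texts(full_text, [(15, 19), (30, 40)])
```

Because every offset indexes into the joined document string rather than into an individual passage, a span in the second passage starts counting after the first passage and the joining space.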

Passages

Passages capture document structure, such as the title and abstract sections of a PubMed abstract.

{
    "id": "0",
    "document_id": "227508",
    "passages": [
        {
            "id": "1",
            "type": "title",
            "text": ["Naloxone reverses the antihypertensive effect of clonidine."],
            "offsets": [[0, 59]],
        },
        {
            "id": "2",
            "type": "abstract",
            "text": ["In unanesthetized, spontaneously hypertensive rats the decrease in blood pressure and heart rate produced by intravenous clonidine, 5 to 20 micrograms/kg, was inhibited or reversed by nalozone, 0.2 to 2 mg/kg. The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence clonidine-suppressible binding of [3H]-dihydroergocryptine (1 nM). These findings indicate that in spontaneously hypertensive rats the effects of central alpha-adrenoceptor stimulation involve activation of opiate receptors. As naloxone and clonidine do not appear to interact with the same receptor site, the observed functional antagonism suggests the release of an endogenous opiate by clonidine or alpha-methyldopa and the possible role of the opiate in the central control of sympathetic tone."],
            "offsets": [[60, 1075]],
        },
    ],
}

Entities

"entities": [
    {
        "id": "3",
        "offsets": [[0, 8]],
        "text": ["Naloxone"],
        "type": "Chemical",
        "normalized": [{"db_name": "MESH", "db_id": "D009270"}]
    },
    ...
 ],

Events

"events": [
    {
        "id": "3",
        "type": "Reaction",
        "trigger": {
            "offsets": [[0,6]],
            "text": ["reacts"]
        },
        "arguments": [
            {
                "role": "theme",
                "ref_id": "5",
            },
            ...
        ],
    },
    ...
],

Coreferences

"coreferences": [
	{
	   "id": "32",
	   "entity_ids": ["1", "10", "23"],
	},
	...
]

Relations

"relations": [
    {
        "id": "100",
        "type": "chemical-induced disease",
        "arg1_id": "10",
        "arg2_id": "32",
        "normalized": []
    }
]
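
Since relations, coreferences, and event arguments all point at other sub-components by id, a quick consistency check can be useful when implementing a dataloader. The sketch below is one possible approach, not part of nusantara-nlp: the function names are hypothetical, it assumes that referenced ids resolve within the same document, and it only checks uniqueness inside a single document (dataset-wide uniqueness would need a set shared across documents).

```python
from typing import Dict, List


def collect_ids(doc: Dict) -> List[str]:
    """Gather the top-level id plus the id of every sub-component."""
    ids = [doc["id"]]
    for key in ("passages", "entities", "events", "coreferences", "relations"):
        ids.extend(item["id"] for item in doc.get(key, []))
    return ids


def find_problems(doc: Dict) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    ids = collect_ids(doc)
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        problems.append(f"duplicate ids: {duplicates}")
    known = set(ids)
    for rel in doc.get("relations", []):
        for key in ("arg1_id", "arg2_id"):
            if rel[key] not in known:
                problems.append(f"relation {rel['id']}: unknown {key} {rel[key]}")
    for coref in doc.get("coreferences", []):
        for entity_id in coref["entity_ids"]:
            if entity_id not in known:
                problems.append(f"coreference {coref['id']}: unknown entity {entity_id}")
    for event in doc.get("events", []):
        for arg in event["arguments"]:
            if arg["ref_id"] not in known:
                problems.append(f"event {event['id']}: unknown ref_id {arg['ref_id']}")
    return problems


# Toy document for illustration; the relation points at a missing id "9".
doc = {
    "id": "0",
    "document_id": "227508",
    "passages": [],
    "entities": [{"id": "1"}, {"id": "2"}],
    "events": [],
    "coreferences": [{"id": "3", "entity_ids": ["1", "2"]}],
    "relations": [
        {"id": "4", "type": "chemical-induced disease",
         "arg1_id": "1", "arg2_id": "9", "normalized": []}
    ],
}
problems = find_problems(doc)
```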

Question Answering

{
	"id": "0",
	"document_id": "24267510",
	"question_id": "55031181e9bde69634000014",
	"question": "Is RANKL secreted from the cells?",
	"type": "yesno",
	"choices": [],
	"context": "Osteoprotegerin (OPG) is a soluble secreted factor that acts as a decoy receptor for receptor activator of NF-\u03baB ligand (RANKL)",
	"answer": ["yes"],
}

Sequence Labeling

{
    "id": "0",
    "tokens": [
        "Seorang",
        "penduduk",
        "yang",
        "tinggal",
        "dekat",
        "tempat",
        "kejadian",
        "mengatakan",
        ",",
        "dia",
        "mendengar",
        "suara",
        "tabrakan",
        "yang",
        "keras",
        "dan",
        "melihat",
        "mobil",
        "ambulan",
        "membawa",
        "orang-orang",
        "yang",
        "berlumuran",
        "darah",
        "."
    ],
    "labels": [
        "B-NND",
        "B-NN",
        "B-SC",
        "B-VB",
        "B-JJ",
        "B-NN",
        "B-NN",
        "B-VB",
        "B-Z",
        "B-PRP",
        "B-VB",
        "B-NN",
        "B-NN",
        "B-SC",
        "B-JJ",
        "B-CC",
        "B-VB",
        "B-NN",
        "B-NN",
        "B-VB",
        "B-NN",
        "B-SC",
        "B-VB",
        "B-NN",
        "B-Z"
    ]
}
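
In this schema, tokens and labels are parallel lists, so each label applies to the token at the same index. A minimal sketch of consuming such an example follows; `token_label_pairs` is a hypothetical helper, and the length check reflects an assumption that the two lists must always align.

```python
from typing import Dict, List, Tuple


def token_label_pairs(example: Dict) -> List[Tuple[str, str]]:
    """Pair each token with its label, rejecting misaligned examples."""
    tokens, labels = example["tokens"], example["labels"]
    if len(tokens) != len(labels):
        raise ValueError(
            f"example {example['id']}: {len(tokens)} tokens but {len(labels)} labels"
        )
    return list(zip(tokens, labels))


# The first four tokens and labels of the example above.
example = {
    "id": "0",
    "tokens": ["Seorang", "penduduk", "yang", "tinggal"],
    "labels": ["B-NND", "B-NN", "B-SC", "B-VB"],
}
pairs = token_label_pairs(example)
```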

Textual Entailment

{
	"id": "0",
	"document_id": "NULL",
	"premise": "Pluto rotates once on its axis every 6.39 Earth days;",
	"hypothesis": "Earth rotates on its axis once times in one day.",
	"label": "neutral",
}

Text Pairs

{
	"id": "0",
	"document_id": "NULL",
	"text_1": "Am I over weight (192.9) for my age (39)?",
	"text_2": "I am a 39 y/o male currently weighing about 193 lbs. Do you think I am overweight?",
	"label": 1,
}

Text to Text

{
	"id": "0",
	"text_1": "Pleasing God doesn't mean that we must busy ourselves with a new set of \"spiritual\" activities\n",
	"text_2": "Menyenangkan Allah tidaklah berarti bahwa kita harus menyibukkan diri sendiri dengan berbagai aktivitas rohani\n",
	"text_1_name": "eng",
	"text_2_name": "ind"
}

Text

{
    "id": "0",
    "text": "meski masa kampanye sudah selesai , bukan berati habis pula upaya mengerek tingkat kedipilihan elektabilitas .",
    "labels": [
        "neutral"
    ]
}

Self-supervised pretraining

{
    "id": "0",
    "text": "Placeholder text. Will change to a real example soon."
}

Speech recognition

{
    "id": "01-001",
    "path": ".cache/huggingface/datasets/downloads/extracted/ecbf4ad46b3db9b85aa9108272c39dc75a268b4c0b92f2827866ef17dea97585/01/01-001.wav",
    "audio": {
        "path": ".cache/huggingface/datasets/downloads/extracted/ecbf4ad46b3db9b85aa9108272c39dc75a268b4c0b92f2827866ef17dea97585/01/01-001.wav",
        "array": array([-0.0005188 , -0.00018311, -0.00021362, ..., -0.00018311, -0.00033569, -0.00015259], dtype=float32),
        "sampling_rate": 16000
    },
    "text": "hai selamat pagi apa kabar",
    "speaker": "01",
    "metadata": {"speaker_age": 25, "speaker_gender": "female"}
}
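
The audio field carries both the decoded waveform and its sampling rate, which is enough to derive basic properties such as clip length. The sketch below assumes a mono, one-dimensional waveform and uses a plain Python list in place of the NumPy array shown above; `audio_duration_seconds` is a hypothetical helper, not part of the schema.

```python
from typing import Dict, Sequence


def audio_duration_seconds(audio: Dict) -> float:
    """Duration of a mono clip: number of samples divided by sampling rate."""
    samples: Sequence[float] = audio["array"]
    return len(samples) / audio["sampling_rate"]


# Two seconds of silence at 16 kHz, standing in for a decoded waveform.
audio = {"path": "example.wav", "array": [0.0] * 32000, "sampling_rate": 16000}
duration = audio_duration_seconds(audio)
```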