
HandongSF/KoreanUnificationParallelCorpus

 
 


Korean Parallel Corpus

  1. What is KPC?
  2. Data
  3. Translation Model Experiments
  4. Change Log
  5. Contributors
  6. Contact
  7. Citation

1) What is KPC?

Korean is the official language of both South Korea and North Korea. Although the two countries share the same language, the North Korean and South Korean varieties differ in vocabulary, grammar, and spelling. The ongoing separation between North Korea and South Korea has widened these differences, and the resulting language gap could become a major obstacle to communication after Korean reunification.

Therefore, it is important to conduct research on how to bridge the gap between the North Korean and South Korean languages. One example would be to develop a North and South Korean translator. However, it is difficult to find a North Korean language dataset that has a corresponding South Korean language dataset. The lack of a North Korean and South Korean parallel corpus hinders active investments in machine translation of the North Korean language.

To address this issue, the Korean Unification Parallel Corpus (KPC) repository has been created. Its main goal is to provide a high-quality North and South Korean parallel corpus and make it available to the public. The KPC also explains how to use the parallel corpus for research, particularly in the field of machine translation.

1-1) Sources

The dataset contains 130,738 rows drawn from classic novels and the Bible. The classic novels are Jane Eyre, The Red and the Black, Onggojip-jeon, Sukhyang-jeon, and Shimchung-jeon, as listed in section 2-1.

1-2) Data Selection

Criteria for Selection

  • The text must actually exist in both South Korea and North Korea.
  • The text must be accurately matchable as sentence pairs.

Data Acquisition

  • Bible: The Bible has been translated into many languages and is divided into chapters and verses whose content is consistent across translations, which makes it well suited to sentence matching.
  • Classic novels: Classic novels have been translated into many languages, and translations are available in both South Korean and North Korean.

2) Data

2-1) List

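| Category | Type | Book | Rows | Total Rows |
| --- | --- | --- | --- | --- |
| Classic Novels | Foreign | Jane Eyre | 60,331 | 94,459 (72%) |
| | | The Red and the Black | 34,128 | |
| | Korean | Onggojip-jeon | 988 | 6,293 (5%) |
| | | Sukhyang-jeon | 3,538 | |
| | | Shimchung-jeon | 1,767 | |
| Bible | - | - | | 29,986 (23%) |
| Total | - | - | | 130,738 (100%) |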

The dataset consists of classic novels and the Bible. The classic novel data comprises two foreign novels and three Korean novels, each based on a single publication from a North Korean publisher and multiple publications from South Korean publishers. In total, the classic novel data yields 100,752 North Korean-South Korean sentence pairs. The Bible data was collected in the same manner and yields 29,986 sentence pairs. Altogether, 130,738 parallel sentence pairs were constructed based on South Korean standards. The longest sentence has 286 characters and the shortest has 2.
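As a quick consistency check, the subtotals and percentages quoted above can be recomputed from the per-book row counts. The sketch below only uses the numbers from the list in this section; the `ROWS` dictionary and the `subtotals` helper are illustrative names, not part of the repository.

```python
# Row counts copied from the list in section 2-1.
ROWS = {
    ("Classic Novels", "Foreign", "Jane Eyre"): 60_331,
    ("Classic Novels", "Foreign", "The Red and the Black"): 34_128,
    ("Classic Novels", "Korean", "Onggojip-jeon"): 988,
    ("Classic Novels", "Korean", "Sukhyang-jeon"): 3_538,
    ("Classic Novels", "Korean", "Shimchung-jeon"): 1_767,
    ("Bible", "-", "-"): 29_986,
}


def subtotals(rows, depth):
    """Sum row counts grouped by the first `depth` elements of each key."""
    out = {}
    for key, n in rows.items():
        out[key[:depth]] = out.get(key[:depth], 0) + n
    return out


total = sum(ROWS.values())
by_category = subtotals(ROWS, 1)  # Classic Novels vs. Bible
by_type = subtotals(ROWS, 2)      # Foreign vs. Korean novels vs. Bible

for key, n in by_type.items():
    print(f"{' / '.join(key):<28} {n:>8,} ({n / total:.0%})")
print(f"{'Total':<28} {total:>8,}")
```

The recomputed figures agree with the text: 94,459 foreign and 6,293 Korean novel rows sum to 100,752, and adding the 29,986 Bible rows gives 130,738.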

2-2) Examples

| nk | sk |
| --- | --- |
| 안해는 남편앞에 무릎을 꿇고 그를 붙들어두려고 하면서 부르짖었다. | 부인은 남편 앞에 무릎을 꿇고 그를 붙잡으려고 애쓰면서 소리쳤다. |
| 나는 창가림을 드리우고 난로가에 되돌아왔다. | 나는 커튼을 내리고 난롯가로 되돌아갔다. |

3) Translation Model Experiments

3-1) Experimental Settings

Foundation Model

KoBART (Korean BART) was used as the foundation translation model. KoBART was developed by the SKT AI team.

Training

We trained a North Korean (NK) → South Korean (SK) translation model and a South Korean (SK) → North Korean (NK) translation model. Training used 90% of the 130,738 rows of classic novel and Bible data; the remaining 10% was held out as test data.

The data was split 9:1 into training and test sets. For the foreign novels, each book is based on a single publication from a North Korean publisher and multiple publications from South Korean publishers, so the same North Korean sentence is repeated once per South Korean publication. Care was therefore taken to ensure that no North Korean sentence in the test data also appears in the training data.

For Jane Eyre, a certain number of rows were randomly selected from the North Korean data, and the corresponding North-South Korean sentence pairs were extracted as test data, while the remaining sentence pairs were used as training data.
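The leakage-free split described above can be sketched as a grouped split: all pairs that share a North Korean sentence go to the same side. This is a minimal illustration, not the authors' script; the function name `split_by_nk_sentence`, the `(nk, sk)` tuple layout, and the seed are assumptions.

```python
import random
from collections import defaultdict


def split_by_nk_sentence(pairs, test_fraction=0.1, seed=0):
    """Split (nk, sk) pairs so no NK sentence appears in both train and test."""
    # Group every sentence pair under its North Korean sentence.
    groups = defaultdict(list)
    for nk, sk in pairs:
        groups[nk].append((nk, sk))

    # Sort first so the shuffle is reproducible for a given seed.
    nk_sentences = sorted(groups)
    random.Random(seed).shuffle(nk_sentences)

    target = round(test_fraction * len(pairs))
    train, test = [], []
    for nk in nk_sentences:
        # Fill the test set one whole group at a time until it reaches the
        # target size; every remaining group goes to the training set.
        bucket = test if len(test) < target else train
        bucket.extend(groups[nk])
    return train, test


# Toy data: 50 NK sentences, each paired with 3 SK renderings.
pairs = [(f"nk-{i}", f"sk-{i}-{j}") for i in range(50) for j in range(3)]
train, test = split_by_nk_sentence(pairs)
```

Because whole groups are moved, the test set can exceed the target by up to one group; with heavily repeated sentences the realized ratio is therefore only approximately 9:1.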

| | train | test |
| --- | --- | --- |
| Count | 117,665 | 13,073 |
| Size | 9.9MB | 961KB |

For the training process, the hyperparameters were set as follows.

| | NK → SK model | SK → NK model |
| --- | --- | --- |
| batch size | 4 | 4 |
| epoch | 8 | 8 |
| learning rate | 3e-5 | 3e-5 |
| optimizer | AdamW | AdamW |
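To give a sense of the training budget these settings imply, the sketch below records the hyperparameters in a small config object and derives the number of optimizer steps. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these details are stated in this README, and `TrainConfig` and `total_optimizer_steps` are illustrative names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the table above (same for both directions)."""
    batch_size: int = 4
    epochs: int = 8
    learning_rate: float = 3e-5
    optimizer: str = "AdamW"


def total_optimizer_steps(num_train_rows, cfg):
    """Steps over all epochs, assuming one optimizer step per batch."""
    steps_per_epoch = math.ceil(num_train_rows / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = TrainConfig()
steps = total_optimizer_steps(117_665, cfg)  # 117,665 training rows
print(f"{steps:,} optimizer steps")
```

Under these assumptions, each model runs 29,417 steps per epoch and 235,336 steps in total.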

3-2) Experimental Results

The evaluation metrics used on the test set are BLEU and BERTScore. The table below presents both scores for the North Korean (NK) → South Korean (SK) translation model and the South Korean (SK) → North Korean (NK) translation model.

Note: BERTScore computes precision, recall, and F1-score. For simplicity, only the F1-score is presented in the table.

| | NK → SK model | SK → NK model |
| --- | --- | --- |
| BLEU | 0.55 | 0.25 |
| BERTScore (F1) | 0.821 | 0.815 |
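For readers unfamiliar with BLEU, the following is a minimal corpus-level sketch on a 0 to 1 scale. It assumes whitespace tokenization, one reference per hypothesis, n-grams up to length 4, and no smoothing; this README does not state which BLEU implementation or tokenizer was used, so scores from this sketch are not expected to reproduce the table.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokens
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["a b c d e f"], ["a b c d e g"])
```

Tokenization has a large effect on BLEU for Korean, so any comparison with the reported scores would need the same tokenizer and BLEU variant as the original experiments.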

4) Change Log

[v1.0]
2024-03-27
A total of 130,738 North Korean-South Korean sentence pairs uploaded.

5) Contributors

6) Contact

7) Citation

If you use Korean Parallel Corpus (KPC), please cite the following paper and star this repository:

@inproceedings{chun2024paclic,
      title = "Bridging the Linguistic Divide: Developing a North-South Korean Parallel Corpus for Machine Translation",
      author = "Hannah H. Chun and Chanju Lee and Hyunkyoo Choi and Charmgil Hong",
      booktitle = "Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation",
      month = dec,
      year = "2024",
      address = "Tokyo, Japan",
      publisher = "Association for Computational Linguistics",
}

KPC is licensed under the GNU Free Documentation License (GFDL).
