On the WMT17 zh-en dataset, the result is not as good as on WMT14 en-de #108

Open
QiyaoHuang opened this issue Oct 13, 2021 · 2 comments

Comments

@QiyaoHuang

When I train on the wmt14 en-de dataset, I get a BLEU score of 24.5, which matches the score reported in the paper.
But when I train the model the same way on WMT17 zh-en, the BLEU score is only 7.0.

The WMT17 zh-en dataset I used:
http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz",
["training/news-commentary-v12.zh-en.en",
"training/news-commentary-v12.zh-en.zh"]]]
Why is the score so much lower, and what can I do to improve it?

@aseaday

aseaday commented Nov 12, 2021

How do you tokenize the Chinese corpus?

@QiyaoHuang
Author

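"How do you tokenize the Chinese corpus?"
I use the tokenization method provided in this project's template example, the same approach I used for wmt14 en-de.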
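
One possible factor, not confirmed anywhere in this thread: if the tokenization step relies on whitespace, it works for English and German but leaves an unsegmented Chinese sentence as a single token. The sketch below uses only the Python standard library to illustrate the difference. `whitespace_tokenize` and `char_level_tokenize` are hypothetical helpers written for this illustration, not functions from this project, and character-level splitting is only a simple stand-in for a dedicated Chinese word segmenter.

```python
import re

# Matches one character from the basic CJK Unified Ideographs block.
# Assumption: this range is sufficient for the illustration below.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def whitespace_tokenize(text):
    """Split on whitespace only, as is common for pre-tokenized en/de text."""
    return text.split()


def char_level_tokenize(text):
    """Hypothetical fallback: surround each CJK character with spaces, then split."""
    spaced = CJK_CHAR.sub(lambda m: f" {m.group(0)} ", text)
    return spaced.split()


en = "The economy is recovering ."
zh = "经济正在复苏。"

# English splits into words; the Chinese sentence stays one token.
print(whitespace_tokenize(en))
print(whitespace_tokenize(zh))

# Character-level splitting yields one token per ideograph.
print(char_level_tokenize(zh))
```

If the Chinese side of the corpus reaches the vocabulary or subword step as whole sentences like this, it may be worth checking a few lines of the preprocessed `.zh` file before retraining.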
