Add okapi task #3

Open · wants to merge 5 commits into dev
Conversation

liyier90 (Collaborator)

Add Okapi tasks:

  • arc
  • hellaswag
  • mmlu

for the following languages:

  • id
  • ta
  • vi
  • zh

Implemented a custom sampler to preserve mlmm-evaluation's behavior of drawing num_fewshot + 1 samples for few-shot examples.
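The sampling behavior described above can be sketched as follows. This is a minimal standalone illustration, not the actual harness sampler class; the function name and signature are assumptions:

```python
import random

def sample_fewshot(docs, eval_doc, num_fewshot, rng):
    """Draw few-shot examples the way mlmm-evaluation does.

    Draws num_fewshot + 1 candidates so the evaluation document itself
    can be dropped from the pool without running short, then truncates
    to num_fewshot examples.
    """
    drawn = rng.sample(docs, num_fewshot + 1)
    return [d for d in drawn if d is not eval_doc][:num_fewshot]

docs = [{"id": i} for i in range(100)]
fewshot = sample_fewshot(docs, docs[0], 5, random.Random(1234))
print(len(fewshot))  # -> 5
```

Drawing one extra candidate is what keeps the example count stable even when the evaluation document happens to be drawn into its own few-shot pool.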

Modifications to the mlmm-evaluation dev branch

Modifications to lm-evaluation-harness

Replicating results

m_hellaswag

Had to recreate the dataset at https://huggingface.co/datasets/aisingapore/m_hellaswag because the zh subset does not work in the existing dataset on Hugging Face (https://huggingface.co/datasets/alexandrainst/m_hellaswag).

Command

lm-evaluation-harness

task_name="hellaswag_mlmm_${lang}"
lm_eval \
    --model hf \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True" \
    --tasks "${task_name}" \
    --device cuda:0 \
    --batch_size 1 \
    --output_path "results/${task_name}.json" \
    --verbosity DEBUG \
    --log_samples  

mlmm-evaluation

task_name="hellaswag_${lang}"
python main.py \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True,use_accelerate=True" \
    --tasks "${task_name}" \
    --output_path "results/${task_name}.json" \
    --no_cache

Output

| Task | Framework | Metric | Value | Stderr |
|------|-----------|--------|-------|--------|
| hellaswag_mlmm_id | lm-evaluation-harness | acc | 0.2680910457375993 | 0.004590128916219268 |
| hellaswag_mlmm_id | lm-evaluation-harness | acc_norm | 0.295361820914752 | 0.004727325016374767 |
| hellaswag_id | mlmm-evaluation | acc | 0.2680910457375993 | 0.004590128916219268 |
| hellaswag_id | mlmm-evaluation | acc_norm | 0.295361820914752 | 0.004727325016374767 |
| hellaswag_mlmm_ta | lm-evaluation-harness | acc | 0.24533460121240935 | 0.004691448887127335 |
| hellaswag_mlmm_ta | lm-evaluation-harness | acc_norm | 0.282182336859622 | 0.0049070711061398155 |
| hellaswag_ta | mlmm-evaluation | acc | 0.24533460121240935 | 0.004691448887127335 |
| hellaswag_ta | mlmm-evaluation | acc_norm | 0.282182336859622 | 0.0049070711061398155 |
| hellaswag_mlmm_vi | lm-evaluation-harness | acc | 0.27406679764243613 | 0.004660205855905225 |
| hellaswag_mlmm_vi | lm-evaluation-harness | acc_norm | 0.29971621916612096 | 0.004786529226748529 |
| hellaswag_vi | mlmm-evaluation | acc | 0.27406679764243613 | 0.004660205855905225 |
| hellaswag_vi | mlmm-evaluation | acc_norm | 0.29971621916612096 | 0.004786529226748529 |
| hellaswag_mlmm_zh | lm-evaluation-harness | acc | 0.2691560543924023 | 0.004607779535508573 |
| hellaswag_mlmm_zh | lm-evaluation-harness | acc_norm | 0.30110079861860567 | 0.0047658515880019915 |
| hellaswag_zh | mlmm-evaluation | acc | 0.2691560543924023 | 0.004607779535508573 |
| hellaswag_zh | mlmm-evaluation | acc_norm | 0.30110079861860567 | 0.0047658515880019915 |
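Checking the two frameworks' numbers against each other can be done programmatically. A minimal sketch, assuming the flat {metric: value} dicts have already been extracted from each framework's results JSON (the exact JSON layouts differ between the two tools):

```python
import math

def metrics_match(a, b, rel_tol=1e-9):
    """Return True if two {metric: value} dicts agree within tolerance."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)

# hellaswag_id numbers from the table above.
lm_eval_res = {"acc": 0.2680910457375993, "acc_norm": 0.295361820914752}
mlmm_res = {"acc": 0.2680910457375993, "acc_norm": 0.295361820914752}
print(metrics_match(lm_eval_res, mlmm_res))  # -> True
```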

m_arc

The task on the lm-evaluation-harness main branch does not have num_fewshot set.
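If parity with mlmm-evaluation is wanted, the shot count can be pinned in the task YAML. A hypothetical fragment (the key layout follows lm-evaluation-harness task configs, but the value shown is illustrative and should be taken from mlmm-evaluation's source):

```yaml
# Hypothetical fragment: pin the shot count explicitly in the task config.
task: arc_mlmm_id
num_fewshot: 25  # illustrative value; match mlmm-evaluation's setting
```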

Command

lm-evaluation-harness

task_name="arc_mlmm_${lang}"
lm_eval \
    --model hf \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True" \
    --tasks "${task_name}" \
    --device cuda:0 \
    --batch_size 1 \
    --output_path "results/${task_name}.json" \
    --verbosity DEBUG \
    --log_samples  

mlmm-evaluation

task_name="arc_${lang}"
python main.py \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True,use_accelerate=True" \
    --tasks "${task_name}" \
    --output_path "results/${task_name}.json" \
    --no_cache

Output

| Task | Framework | Metric | Value | Stderr |
|------|-----------|--------|-------|--------|
| arc_mlmm_id | lm-evaluation-harness | acc | 0.18632478632478633 | 0.011388161139389627 |
| arc_mlmm_id | lm-evaluation-harness | acc_norm | 0.2341880341880342 | 0.01238614525915231 |
| arc_id | mlmm-evaluation | acc | 0.18632478632478633 | 0.011388161139389627 |
| arc_id | mlmm-evaluation | acc_norm | 0.2341880341880342 | 0.01238614525915231 |
| arc_mlmm_ta | lm-evaluation-harness | acc | 0.2075306479859895 | 0.012005756657930959 |
| arc_mlmm_ta | lm-evaluation-harness | acc_norm | 0.2530647985989492 | 0.012871065809204451 |
| arc_ta | mlmm-evaluation | acc | 0.2075306479859895 | 0.012005756657930959 |
| arc_ta | mlmm-evaluation | acc_norm | 0.2530647985989492 | 0.012871065809204451 |
| arc_mlmm_vi | lm-evaluation-harness | acc | 0.19401709401709402 | 0.011565799456270379 |
| arc_mlmm_vi | lm-evaluation-harness | acc_norm | 0.22393162393162394 | 0.0121927158461456 |
| arc_vi | mlmm-evaluation | acc | 0.19401709401709402 | 0.011565799456270379 |
| arc_vi | mlmm-evaluation | acc_norm | 0.22393162393162394 | 0.0121927158461456 |
| arc_mlmm_zh | lm-evaluation-harness | acc | 0.21367521367521367 | 0.011988664332858041 |
| arc_mlmm_zh | lm-evaluation-harness | acc_norm | 0.23846153846153847 | 0.01246372437380267 |
| arc_zh | mlmm-evaluation | acc | 0.21367521367521367 | 0.011988664332858041 |
| arc_zh | mlmm-evaluation | acc_norm | 0.23846153846153847 | 0.01246372437380267 |

m_mmlu

The task on the lm-evaluation-harness main branch uses first_n few-shot sampling, which differs from mlmm-evaluation's implementation.
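The difference between the two sampling strategies can be illustrated with a minimal sketch (function names are mine, not the harness's API, and the seed value is illustrative):

```python
import random

def first_n_sampler(docs, num_fewshot):
    # lm-evaluation-harness main-branch style: deterministic prefix.
    return docs[:num_fewshot]

def random_sampler(docs, num_fewshot, seed=1234):
    # mlmm-evaluation style: seeded random draw without replacement.
    return random.Random(seed).sample(docs, num_fewshot)

docs = list(range(10))
print(first_n_sampler(docs, 3))  # -> [0, 1, 2]
print(random_sampler(docs, 3))   # seeded, so stable across runs
```

Because the two strategies select different example sets, scores are only directly comparable when both frameworks use the same strategy, which is what the custom sampler in this PR provides.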

Command

lm-evaluation-harness

task_name="mmlu_mlmm_${lang}"
lm_eval \
    --model hf \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True" \
    --tasks "${task_name}" \
    --device cuda:0 \
    --batch_size 1 \
    --output_path "results/${task_name}.json" \
    --verbosity DEBUG \
    --log_samples  

mlmm-evaluation

task_name="mmlu_${lang}"
python main.py \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True,use_accelerate=True" \
    --tasks "${task_name}" \
    --output_path "results/${task_name}.json" \
    --no_cache

Output

| Task | Framework | Metric | Value | Stderr |
|------|-----------|--------|-------|--------|
| mmlu_mlmm_id | lm-evaluation-harness | acc | 0.24982825738493245 | 0.0037823828185448768 |
| mmlu_mlmm_id | lm-evaluation-harness | acc_norm | 0.27127700175559116 | 0.0038846516351237013 |
| mmlu_id | mlmm-evaluation | acc | 0.24982825738493245 | 0.0037823828185448768 |
| mmlu_id | mlmm-evaluation | acc_norm | 0.27127700175559116 | 0.0038846516351237013 |
| mmlu_mlmm_ta | lm-evaluation-harness | acc | 0.23364083110612985 | 0.012005756657930959 |
| mmlu_mlmm_ta | lm-evaluation-harness | acc_norm | 0.25096991119924134 | 0.004025954925253322 |
| mmlu_ta | mlmm-evaluation | acc | 0.23364083110612985 | 0.003929153519886986 |
| mmlu_ta | mlmm-evaluation | acc_norm | 0.25096991119924134 | 0.004025954925253322 |
| mmlu_mlmm_vi | lm-evaluation-harness | acc | 0.24291838922064002 | 0.003929153519886986 |
| mmlu_mlmm_vi | lm-evaluation-harness | acc_norm | 0.265426427805849 | 0.0038636832594341887 |
| mmlu_vi | mlmm-evaluation | acc | 0.24291838922064002 | 0.011565799456270379 |
| mmlu_vi | mlmm-evaluation | acc_norm | 0.265426427805849 | 0.0038636832594341887 |
| mmlu_mlmm_zh | lm-evaluation-harness | acc | 0.23691606532472465 | 0.003705863972342719 |
| mmlu_mlmm_zh | lm-evaluation-harness | acc_norm | 0.2614508165590581 | 0.0038299294639917887 |
| mmlu_zh | mlmm-evaluation | acc | 0.23691606532472465 | 0.003705863972342719 |
| mmlu_zh | mlmm-evaluation | acc_norm | 0.2614508165590581 | 0.0038299294639917887 |

@YeowTong left a comment:

Datasets version 2.20.0 removes the default value for `trust_remote_code`, which is now required for all datasets with custom code.

  File "/scratch/users/nus/ytyeo/envs/testOkapi/lib/python3.10/site-packages/datasets/load.py", line 133, in resolve_trust_remote_code
    raise ValueError(
ValueError: The repository for aisingapore/m_hellaswag contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/aisingapore/m_hellaswag.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.

Comment on lines +5 to +6
dataset_kwargs:
revision: dev
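One way to address this is to pass the flag through the task's dataset_kwargs; a hypothetical fragment extending the snippet above:

```yaml
dataset_kwargs:
  revision: dev
  trust_remote_code: true
```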

liyier90 (Collaborator, Author) replied:
Fixed

@liyier90 liyier90 requested a review from YeowTong July 2, 2024 02:18