Add okapi task #3

Open · wants to merge 5 commits into dev
Conversation

liyier90 (Collaborator)

Add Okapi tasks:

  • arc
  • hellaswag
  • mmlu

for the following languages:

  • id
  • ta
  • vi
  • zh

Implemented a custom sampler to preserve mlmm-evaluation's behavior of drawing num_fewshot + 1 samples for few-shot examples.
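The sampling behavior described above can be sketched as follows. This is a minimal standalone illustration, not the actual harness sampler class; the function name and signature are assumptions:

```python
import random

def sample_fewshot(docs, eval_doc, num_fewshot, rng):
    """Draw few-shot examples the way mlmm-evaluation does.

    Draws num_fewshot + 1 candidates so the evaluation document itself
    can be dropped from the pool without running short, then truncates
    to num_fewshot examples.
    """
    drawn = rng.sample(docs, num_fewshot + 1)
    return [d for d in drawn if d is not eval_doc][:num_fewshot]

docs = [{"id": i} for i in range(100)]
fewshot = sample_fewshot(docs, docs[0], 5, random.Random(1234))
print(len(fewshot))  # -> 5
```

Drawing one extra candidate is what keeps the example count stable even when the evaluation document happens to be drawn into its own few-shot pool.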

Modifications to the mlmm-evaluation dev branch

Modifications to lm-evaluation-harness

Replicating results

m_hellaswag

Had to recreate the dataset at https://huggingface.co/datasets/aisingapore/m_hellaswag because the zh subset does not work in the existing dataset on Hugging Face (https://huggingface.co/datasets/alexandrainst/m_hellaswag).

Command

lm-evaluation-harness

task_name="hellaswag_mlmm_${lang}"
lm_eval \
    --model hf \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True" \
    --tasks "${task_name}" \
    --device cuda:0 \
    --batch_size 1 \
    --output_path "results/${task_name}.json" \
    --verbosity DEBUG \
    --log_samples  

mlmm-evaluation

task_name="hellaswag_${lang}"
python main.py \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True,use_accelerate=True" \
    --tasks "${task_name}" \
    --output_path "results/${task_name}.json" \
    --no_cache

Output

| Task | Framework | Metric | Value | Stderr |
|------|-----------|--------|-------|--------|
| hellaswag_mlmm_id | lm-evaluation-harness | acc | 0.2680910457375993 | 0.004590128916219268 |
| hellaswag_mlmm_id | lm-evaluation-harness | acc_norm | 0.295361820914752 | 0.004727325016374767 |
| hellaswag_id | mlmm-evaluation | acc | 0.2680910457375993 | 0.004590128916219268 |
| hellaswag_id | mlmm-evaluation | acc_norm | 0.295361820914752 | 0.004727325016374767 |
| hellaswag_mlmm_ta | lm-evaluation-harness | acc | 0.24533460121240935 | 0.004691448887127335 |
| hellaswag_mlmm_ta | lm-evaluation-harness | acc_norm | 0.282182336859622 | 0.0049070711061398155 |
| hellaswag_ta | mlmm-evaluation | acc | 0.24533460121240935 | 0.004691448887127335 |
| hellaswag_ta | mlmm-evaluation | acc_norm | 0.282182336859622 | 0.0049070711061398155 |
| hellaswag_mlmm_vi | lm-evaluation-harness | acc | 0.27406679764243613 | 0.004660205855905225 |
| hellaswag_mlmm_vi | lm-evaluation-harness | acc_norm | 0.29971621916612096 | 0.004786529226748529 |
| hellaswag_vi | mlmm-evaluation | acc | 0.27406679764243613 | 0.004660205855905225 |
| hellaswag_vi | mlmm-evaluation | acc_norm | 0.29971621916612096 | 0.004786529226748529 |
| hellaswag_mlmm_zh | lm-evaluation-harness | acc | 0.2691560543924023 | 0.004607779535508573 |
| hellaswag_mlmm_zh | lm-evaluation-harness | acc_norm | 0.30110079861860567 | 0.0047658515880019915 |
| hellaswag_zh | mlmm-evaluation | acc | 0.2691560543924023 | 0.004607779535508573 |
| hellaswag_zh | mlmm-evaluation | acc_norm | 0.30110079861860567 | 0.0047658515880019915 |
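Checking the two frameworks' numbers against each other can be done programmatically. A minimal sketch, assuming the flat {metric: value} dicts have already been extracted from each framework's results JSON (the exact JSON layouts differ between the two tools):

```python
import math

def metrics_match(a, b, rel_tol=1e-9):
    """Return True if two {metric: value} dicts agree within tolerance."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)

# hellaswag_id numbers from the table above.
lm_eval_res = {"acc": 0.2680910457375993, "acc_norm": 0.295361820914752}
mlmm_res = {"acc": 0.2680910457375993, "acc_norm": 0.295361820914752}
print(metrics_match(lm_eval_res, mlmm_res))  # -> True
```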

m_arc

The task on the lm-evaluation-harness main branch does not have num_fewshot set.
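If parity with mlmm-evaluation is wanted, the shot count can be pinned in the task YAML. A hypothetical fragment (the key layout follows lm-evaluation-harness task configs, but the value shown is illustrative and should be taken from mlmm-evaluation's source):

```yaml
# Hypothetical fragment: pin the shot count explicitly in the task config.
task: arc_mlmm_id
num_fewshot: 25  # illustrative value; match mlmm-evaluation's setting
```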

Command

lm-evaluation-harness

task_name="arc_mlmm_${lang}"
lm_eval \
    --model hf \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True" \
    --tasks "${task_name}" \
    --device cuda:0 \
    --batch_size 1 \
    --output_path "results/${task_name}.json" \
    --verbosity DEBUG \
    --log_samples  

mlmm-evaluation

task_name="arc_${lang}"
python main.py \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True,use_accelerate=True" \
    --tasks "${task_name}" \
    --output_path "results/${task_name}.json" \
    --no_cache

Output

| Task | Framework | Metric | Value | Stderr |
|------|-----------|--------|-------|--------|
| arc_mlmm_id | lm-evaluation-harness | acc | 0.18632478632478633 | 0.011388161139389627 |
| arc_mlmm_id | lm-evaluation-harness | acc_norm | 0.2341880341880342 | 0.01238614525915231 |
| arc_id | mlmm-evaluation | acc | 0.18632478632478633 | 0.011388161139389627 |
| arc_id | mlmm-evaluation | acc_norm | 0.2341880341880342 | 0.01238614525915231 |
| arc_mlmm_ta | lm-evaluation-harness | acc | 0.2075306479859895 | 0.012005756657930959 |
| arc_mlmm_ta | lm-evaluation-harness | acc_norm | 0.2530647985989492 | 0.012871065809204451 |
| arc_ta | mlmm-evaluation | acc | 0.2075306479859895 | 0.012005756657930959 |
| arc_ta | mlmm-evaluation | acc_norm | 0.2530647985989492 | 0.012871065809204451 |
| arc_mlmm_vi | lm-evaluation-harness | acc | 0.19401709401709402 | 0.011565799456270379 |
| arc_mlmm_vi | lm-evaluation-harness | acc_norm | 0.22393162393162394 | 0.0121927158461456 |
| arc_vi | mlmm-evaluation | acc | 0.19401709401709402 | 0.011565799456270379 |
| arc_vi | mlmm-evaluation | acc_norm | 0.22393162393162394 | 0.0121927158461456 |
| arc_mlmm_zh | lm-evaluation-harness | acc | 0.21367521367521367 | 0.011988664332858041 |
| arc_mlmm_zh | lm-evaluation-harness | acc_norm | 0.23846153846153847 | 0.01246372437380267 |
| arc_zh | mlmm-evaluation | acc | 0.21367521367521367 | 0.011988664332858041 |
| arc_zh | mlmm-evaluation | acc_norm | 0.23846153846153847 | 0.01246372437380267 |

m_mmlu

The task on the lm-evaluation-harness main branch uses first_n few-shot sampling, which differs from mlmm-evaluation's implementation.
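The difference between the two sampling strategies can be illustrated with a minimal sketch (function names are mine, not the harness's API, and the seed value is illustrative):

```python
import random

def first_n_sampler(docs, num_fewshot):
    # lm-evaluation-harness main-branch style: deterministic prefix.
    return docs[:num_fewshot]

def random_sampler(docs, num_fewshot, seed=1234):
    # mlmm-evaluation style: seeded random draw without replacement.
    return random.Random(seed).sample(docs, num_fewshot)

docs = list(range(10))
print(first_n_sampler(docs, 3))  # -> [0, 1, 2]
print(random_sampler(docs, 3))   # seeded, so stable across runs
```

Because the two strategies select different example sets, scores are only directly comparable when both frameworks use the same strategy, which is what the custom sampler in this PR provides.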

Command

lm-evaluation-harness

task_name="mmlu_mlmm_${lang}"
lm_eval \
    --model hf \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True" \
    --tasks "${task_name}" \
    --device cuda:0 \
    --batch_size 1 \
    --output_path "results/${task_name}.json" \
    --verbosity DEBUG \
    --log_samples  

mlmm-evaluation

task_name="mmlu_${lang}"
python main.py \
    --model_args "pretrained=facebook/opt-125m,trust_remote_code=True,use_accelerate=True" \
    --tasks "${task_name}" \
    --output_path "results/${task_name}.json" \
    --no_cache

Output

| Task | Framework | Metric | Value | Stderr |
|------|-----------|--------|-------|--------|
| mmlu_mlmm_id | lm-evaluation-harness | acc | 0.24982825738493245 | 0.0037823828185448768 |
| mmlu_mlmm_id | lm-evaluation-harness | acc_norm | 0.27127700175559116 | 0.0038846516351237013 |
| mmlu_id | mlmm-evaluation | acc | 0.24982825738493245 | 0.0037823828185448768 |
| mmlu_id | mlmm-evaluation | acc_norm | 0.27127700175559116 | 0.0038846516351237013 |
| mmlu_mlmm_ta | lm-evaluation-harness | acc | 0.23364083110612985 | 0.012005756657930959 |
| mmlu_mlmm_ta | lm-evaluation-harness | acc_norm | 0.25096991119924134 | 0.004025954925253322 |
| mmlu_ta | mlmm-evaluation | acc | 0.23364083110612985 | 0.003929153519886986 |
| mmlu_ta | mlmm-evaluation | acc_norm | 0.25096991119924134 | 0.004025954925253322 |
| mmlu_mlmm_vi | lm-evaluation-harness | acc | 0.24291838922064002 | 0.003929153519886986 |
| mmlu_mlmm_vi | lm-evaluation-harness | acc_norm | 0.265426427805849 | 0.0038636832594341887 |
| mmlu_vi | mlmm-evaluation | acc | 0.24291838922064002 | 0.011565799456270379 |
| mmlu_vi | mlmm-evaluation | acc_norm | 0.265426427805849 | 0.0038636832594341887 |
| mmlu_mlmm_zh | lm-evaluation-harness | acc | 0.23691606532472465 | 0.003705863972342719 |
| mmlu_mlmm_zh | lm-evaluation-harness | acc_norm | 0.2614508165590581 | 0.0038299294639917887 |
| mmlu_zh | mlmm-evaluation | acc | 0.23691606532472465 | 0.003705863972342719 |
| mmlu_zh | mlmm-evaluation | acc_norm | 0.2614508165590581 | 0.0038299294639917887 |

@YeowTong left a comment:

Datasets version 2.20.0 removes the default value for `trust_remote_code`, which is now required for all datasets with custom code.

  File "/scratch/users/nus/ytyeo/envs/testOkapi/lib/python3.10/site-packages/datasets/load.py", line 133, in resolve_trust_remote_code
    raise ValueError(
ValueError: The repository for aisingapore/m_hellaswag contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/aisingapore/m_hellaswag.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.

Comment on lines +5 to +6
dataset_kwargs:
revision: dev
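One way to address this is to pass the flag through the task's dataset_kwargs; a hypothetical fragment extending the snippet above:

```yaml
dataset_kwargs:
  revision: dev
  trust_remote_code: true
```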

liyier90 (Collaborator, Author) replied:
Fixed

@liyier90 liyier90 requested a review from YeowTong July 2, 2024 02:18