
WIP: Evaluate 1B-Instruct on GSM8k #82

Draft · rasdani wants to merge 8 commits into main
Conversation

@rasdani commented on Oct 21, 2024

Based on https://github.com/xjdr-alt/entropix/blob/70B/entropix/eval_main.py

This is still WIP: for a correct comparison, `apply_chat_template()` must be implemented on the `CustomLLaMAModel` so it can use the `gsm8k_cot_llama` task (see the sketch below).

See also the official docs for gsm8k in lm-evaluation-harness.

I will run the evaluation overnight without applying the chat template and without using multi-turn few-shot.
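For reference, here is a minimal sketch of what `apply_chat_template()` could look like on the `CustomLLaMAModel`. This is an assumption on my part, not code from this branch: it simply delegates to the HF tokenizer's chat template, which mirrors what lm-eval's own `HFLM` does when `--apply_chat_template` is set.

```python
# Hypothetical sketch (names/defaults are assumptions, not code from this branch).
from typing import Dict, List

from transformers import AutoTokenizer


class CustomLLaMAModel:
    def __init__(self, pretrained: str = "meta-llama/Llama-3.2-1B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained)

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # chat_history is a list of {"role": ..., "content": ...} dicts,
        # e.g. the few-shot turns produced by --fewshot_as_multiturn.
        # Render them into a single prompt string ending with the
        # assistant header so the model continues as the assistant.
        return self.tokenizer.apply_chat_template(
            chat_history,
            tokenize=False,
            add_generation_prompt=True,
        )
```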

@rasdani (author) commented on Oct 24, 2024

OK, I fixed a bug and ran both vanilla Llama-3.2-1B-Instruct and the entropix sampler with this branch's settings on 200 samples of GSM8k.

Here are the results. @xjdr-alt

Without sampler, took ~5 min:
[results screenshot]

With entropix sampler, took ~10 min:
[results screenshot]

These are preliminary results, of course. I will run the full benchmark on the main branch soon.

@rasdani (author) commented on Oct 24, 2024

I ran the full GSM8k benchmark.

Without sampler, took ~30 min:
[results screenshot]

With entropix sampler, took more than 4 hours:
[results screenshot]

@rasdani (author) commented on Oct 28, 2024

After applying the chat template, the results differ significantly. I benchmarked the 3B-Instruct model and, without the sampler, was able to reproduce Meta's original result from their blog post.
https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

Without sampler, took ~30 min at batch size 1:
[results screenshot]

With entropix sampler, took ~10 hours:
[results screenshot]

Both were benchmarked on a single 4090.

Is there anything wrong with this branch? @xjdr-alt

@rasdani (author) commented on Oct 28, 2024

I ran the vanilla model with:

```bash
lm-eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct --tasks gsm8k_cot_llama --batch_size 1 --log_samples --output_path logs/ --apply_chat_template --fewshot_as_multiturn
```

and the entropix sampler with:

```bash
python entropix/eval_main.py
```

Both with lm-evaluation-harness version `lm_eval==0.4.5`.
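For anyone scripting this instead of using the CLI, the vanilla run above should be roughly equivalent to the following `lm_eval.simple_evaluate()` call. This is a sketch against `lm_eval==0.4.5`; output handling is simplified compared to the CLI's `--output_path`/`--log_samples`.

```python
import json

import lm_eval

# Programmatic equivalent of the lm-eval CLI invocation above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
    tasks=["gsm8k_cot_llama"],
    batch_size=1,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
    log_samples=True,
)

# Print just the aggregated metrics (per-sample logs are also in `results`).
print(json.dumps(results["results"], indent=2))
```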
