
WIP: Evaluate 1B-Instruct on GSM8k #82

Draft · rasdani wants to merge 8 commits into main
Conversation

@rasdani commented on Oct 21, 2024

Based on https://github.com/xjdr-alt/entropix/blob/70B/entropix/eval_main.py

This is still WIP: for a correct comparison, `apply_chat_template()` must be implemented on the `CustomLLaMAModel` so it can use the `gsm8k_cot_llama` task (see the sketch below).

See also the official docs for gsm8k in lm-evaluation-harness.

I will run the evaluation overnight without applying the chat template and without using multi-turn few-shot.
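For reference, here is a minimal sketch of what `apply_chat_template()` could look like on the `CustomLLaMAModel`. This is an assumption on my part, not code from this branch: it simply delegates to the HF tokenizer's chat template, which mirrors what lm-eval's own `HFLM` does when `--apply_chat_template` is set.

```python
# Hypothetical sketch (names/defaults are assumptions, not code from this branch).
from typing import Dict, List

from transformers import AutoTokenizer


class CustomLLaMAModel:
    def __init__(self, pretrained: str = "meta-llama/Llama-3.2-1B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained)

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # chat_history is a list of {"role": ..., "content": ...} dicts,
        # e.g. the few-shot turns produced by --fewshot_as_multiturn.
        # Render them into a single prompt string ending with the
        # assistant header so the model continues as the assistant.
        return self.tokenizer.apply_chat_template(
            chat_history,
            tokenize=False,
            add_generation_prompt=True,
        )
```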

@rasdani (author) commented on Oct 24, 2024

OK, I fixed a bug and ran both vanilla Llama-3.2-1B-Instruct and the entropix sampler with this branch's settings on 200 samples of GSM8k.

Here are the results. @xjdr-alt

Without sampler, took ~5 min:
[results screenshot]

With entropix sampler, took ~10 min:
[results screenshot]

These are preliminary results, of course. I will run the full benchmark on the main branch soon.

@rasdani (author) commented on Oct 24, 2024

I ran the full GSM8k benchmark.

Without sampler, took ~30 min:
[results screenshot]

With entropix sampler, took more than 4 hours:
[results screenshot]

@rasdani (author) commented on Oct 28, 2024

After applying the chat template, the results differ significantly. I benchmarked the 3B-Instruct model and, without the sampler, was able to reproduce Meta's original result from their blog post.
https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

Without sampler, took ~30 min at batch size 1:
[results screenshot]

With entropix sampler, took ~10 hours:
[results screenshot]

Both were benchmarked on a single 4090.

Is there anything wrong with this branch? @xjdr-alt

@rasdani (author) commented on Oct 28, 2024

I ran the vanilla model with:

```bash
lm-eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct --tasks gsm8k_cot_llama --batch_size 1 --log_samples --output_path logs/ --apply_chat_template --fewshot_as_multiturn
```

and the entropix sampler with:

```bash
python entropix/eval_main.py
```

Both with lm-evaluation-harness version `lm_eval==0.4.5`.
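For anyone scripting this instead of using the CLI, the vanilla run above should be roughly equivalent to the following `lm_eval.simple_evaluate()` call. This is a sketch against `lm_eval==0.4.5`; output handling is simplified compared to the CLI's `--output_path`/`--log_samples`.

```python
import json

import lm_eval

# Programmatic equivalent of the lm-eval CLI invocation above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
    tasks=["gsm8k_cot_llama"],
    batch_size=1,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
    log_samples=True,
)

# Print just the aggregated metrics (per-sample logs are also in `results`).
print(json.dumps(results["results"], indent=2))
```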
