Are there any other metrics we should care about beyond `pass@k`? #8

findmyway · 2024-02-19T14:04:31Z

Maintainer

`pass@k` is not always smooth, so we may also monitor the test-case pass rate to measure LLMs' ability to handle corner cases.
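To make the difference between the two metrics concrete, here is a minimal sketch. It assumes the commonly used unbiased estimator `1 - C(n-c, k) / C(n, k)` for `pass@k`, and the function names `pass_at_k` and `case_pass_rate` are made up for illustration, not taken from this repo.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them passed every test."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def case_pass_rate(results: Sequence[bool]) -> float:
    """Fraction of individual test cases passed by a single sample."""
    return sum(results) / len(results) if results else 0.0
```

A sample that passes 3 of 4 test cases contributes nothing to `c` in `pass_at_k`, but gets `case_pass_rate([True, True, False, True]) == 0.75`, which is the finer-grained signal mentioned above.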
GPT-4 seems to do CoT by default, whereas most open-source LLMs generate a code snippet first and then follow it with an explanation. We may list the results with CoT separately.
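One rough way to split results by response style is to check whether a response opens with a code fence or with prose. This is only a heuristic sketch: it assumes responses use Markdown code fences, and `leads_with_code` is a hypothetical helper name.

```python
FENCE = "`" * 3  # Markdown code fence marker (assumed response format)


def leads_with_code(response: str) -> bool:
    """True if the first non-blank line of the response opens a code fence."""
    for line in response.splitlines():
        if line.strip():
            return line.lstrip().startswith(FENCE)
    return False  # empty response: nothing to classify
```

Responses where this returns `False` start with prose, so they could be reported in the separate CoT group.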
Several LLMs changed the function signature in the generated results. This is not what we want in most cases.
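A signature change could be flagged automatically with something like the sketch below. It assumes the prompt and the completion both contain a long-form `function name(args)` definition, and it only handles simple signatures (no parentheses inside the argument list). `julia_signature` and `signature_changed` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Matches the first long-form Julia definition: `function name(arg1, arg2)`.
# Assumption: the argument list itself contains no parentheses.
_SIG = re.compile(r"^\s*function\s+([^\s(]+)\s*\(([^)]*)\)", re.MULTILINE)


def julia_signature(code: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Return (name, args) of the first `function` definition, or None."""
    m = _SIG.search(code)
    if m is None:
        return None
    args = tuple(a.strip() for a in m.group(2).split(",") if a.strip())
    return m.group(1), args


def signature_changed(prompt: str, completion: str) -> bool:
    """True if the first function signature differs between the two texts."""
    return julia_signature(prompt) != julia_signature(completion)
```

Note that this also reports a change when the model drops type annotations, which may or may not be worth counting separately.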
Given that Python code still dominates the training data, we should monitor the percentage of Julia- and Python-specific characters.
- For Julia, we may be interested in `do`, `|>`, `∉`, `÷`, `@`, `!`, `.+`, `.=`, `.(`, `function`, and also many built-in functions.
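As a starting point, the markers listed above could be counted with plain string matching. This is a sketch, not a tokenizer: comments and string literals are counted too, `@` also shows up in Python decorators, and `!` is counted only when it is not part of `!=`. The name `julia_marker_counts` is made up for illustration.

```python
import re
from collections import Counter

# Symbols from the list above, counted as raw substrings.
JULIA_SYMBOLS = ["|>", "∉", "÷", "@", ".+", ".=", ".("]
# Keywords from the list above, counted as whole words only.
JULIA_KEYWORDS = ["do", "function"]


def julia_marker_counts(code: str) -> Counter:
    """Count occurrences of Julia-flavoured markers in generated code."""
    counts = Counter()
    for sym in JULIA_SYMBOLS:
        counts[sym] = code.count(sym)
    # `!` but not `!=`, which is common to Julia and Python alike.
    counts["!"] = len(re.findall(r"!(?!=)", code))
    for kw in JULIA_KEYWORDS:
        counts[kw] = len(re.findall(rf"\b{kw}\b", code))
    return counts
```

Dividing the total by the length of the generated code would give a rough density that can be compared across LLMs.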
We may also compare the types of errors produced by different LLMs.
Some LLMs learned to use external packages beyond built-in functions. We might also be interested in when and how to encourage such behavior.
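To see how often this happens, one could scan generated code for `using`/`import` statements and keep the packages that are not part of the standard library. The sketch below is an assumption-heavy heuristic: `KNOWN_STDLIB` is a deliberately partial list that would need extending, and `external_packages` is a hypothetical helper name.

```python
import re

# Partial list of modules shipped with Julia (assumption: extend as needed).
KNOWN_STDLIB = {
    "Base", "Core", "LinearAlgebra", "Statistics",
    "Random", "Dates", "Printf", "Test",
}

_IMPORT = re.compile(r"^\s*(?:using|import)\s+(.+)$", re.MULTILINE)


def external_packages(code: str) -> set:
    """Top-level packages loaded via `using`/`import` that are not in KNOWN_STDLIB."""
    found = set()
    for m in _IMPORT.finditer(code):
        clause = m.group(1).split(":")[0]  # drop `Pkg: name1, name2` selectors
        for part in clause.split(","):
            name = part.strip()
            if not name or name.startswith("."):
                continue  # skip empty parts and relative imports like `.Foo`
            root = name.split()[0].split(".")[0]  # `A.B as C` -> `A`
            if root not in KNOWN_STDLIB:
                found.add(root)
    return found
```

Tracking this set per LLM and per task would show which models reach for packages and on which kinds of problems.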