Commit 27ac834: benchmark section

slobentanzer committed Feb 7, 2024 · 1 parent db155f4
Showing 1 changed file with 5 additions and 24 deletions: content/20.results.md
@@ -68,35 +68,16 @@

Secondly, we aim to create benchmark datasets that are complementary to the existing […]
Thirdly, we aim to prevent leakage of the benchmark data into the training data of the models, which is a known issue in general-purpose benchmarks, also called memorisation or contamination [@doi:10.48550/arXiv.2310.18018].
To achieve this goal, we implemented an encrypted pipeline that contains the benchmark datasets and is only accessible to the workflow that executes the benchmark (see Methods).
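
As an illustration of this approach, the sketch below encrypts a benchmark dataset with a symmetric key that is available only to the benchmark workflow (for instance, as a CI secret).
This is a minimal sketch using the `cryptography` package; the file names and key-handling details are assumptions, not the pipeline's actual implementation (see Methods).

```python
# Minimal sketch: keep benchmark data encrypted at rest so that it cannot
# leak into model training corpora; only the benchmark workflow holds the key.
# Assumptions: file names and key handling are illustrative, not the actual pipeline.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # stored as a workflow secret, never committed
fernet = Fernet(key)

# Encrypt the dataset; only the encrypted file is published.
with open("benchmark_dataset.yaml", "rb") as f:
    encrypted = fernet.encrypt(f.read())
with open("benchmark_dataset.yaml.enc", "wb") as f:
    f.write(encrypted)

# Inside the benchmark workflow, the key decrypts the data just in time.
decrypted = fernet.decrypt(encrypted)
```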

Current results confirm the prevailing opinion of OpenAI's leading role in LLM performance (Figure).
Since the benchmark datasets were created to specifically cover functions relevant to BioChatter's application domain, the benchmark results are primarily a measure of the LLMs' usefulness in our applications.
However, they generally reflect the results from general-purpose benchmarks (refs).
OpenAI's GPT models (gpt-4 and gpt-3.5-turbo) lead by some margin on overall performance and consistency, but several open-source models reach high performance in specific tasks (Figure), with Meta's LLaMA2 models in second position.
Of note, performance in open-source models appears to depend on their quantisation level, i.e., the number of bits used to represent the model's parameters: for models that offer quantisation options, 4- and 5-bit versions perform best, while 2- and 8-bit versions perform worse (Figure).
For instance, LLaMA2 in its 70B (70 billion parameters) version outperforms the smaller 7B and 13B variants, but not at all quantisation levels: the 2- and 3-bit quantisations of the 70B model perform worse than the 7B model, while the 4-bit quantised 70B model performs best among all open-source models, roughly doubling the performance of the 3-bit 70B model.
The Mixtral 8x7B model (46.7 billion parameters), a generally well-performing current open-source model, performs worse than all LLaMA2 models in our benchmark.
We continuously extend benchmark data and BioChatter functionalities, add new models, and update the benchmark correspondingly ([https://biochatter.org/benchmark/](https://biochatter.org/benchmark/)).
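
Quantised open-source models of this kind can be run locally, for example via `llama-cpp-python` with GGUF model files, where the quantisation level is baked into the model file.
The following is a minimal sketch under that assumption; file names are illustrative, and this is not necessarily the exact setup used for the benchmark.

```python
# Minimal sketch: load a quantised LLaMA2 model with llama-cpp-python.
# Swapping the GGUF file (e.g. Q4_K_M = 4-bit, Q2_K = 2-bit) swaps the
# quantisation level; file names are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b-chat.Q4_K_M.gguf",  # 4-bit quantisation
    n_ctx=2048,  # context window size
)

response = llm(
    "Which genes are associated with Alzheimer's disease?",
    max_tokens=256,
)
print(response["choices"][0]["text"])
```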

To evaluate the benefit of BioChatter functionality, we compare the performance of models with and without the use of BioChatter's prompt engine for KG querying.
The models without prompt engine still have access to the BioCypher schema definition, which details the KG structure, but they do not use the multi-step procedure available through BioChatter.
Consequently, the models without prompt engine show a lower performance in creating correct queries than the models with prompt engine (Figure).
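
In practice, this multi-step procedure is accessed through BioChatter's prompt engine; a usage sketch is shown below.
Class and argument names follow the BioChatter documentation at the time of writing and may differ between versions; the schema file path is an assumption.

```python
# Sketch: generate a KG query via BioChatter's prompt engine.
# The engine reads the BioCypher schema definition and walks the model
# through entity, relationship, and property selection before composing
# the final query.
from biochatter.prompts import BioCypherPromptEngine

prompt_engine = BioCypherPromptEngine(
    schema_config_or_info_path="schema_config.yaml",  # BioCypher schema (assumed path)
)

query = prompt_engine.generate_query(
    question="Which genes are associated with mucoviscidosis?",
    query_language="Cypher",
)
print(query)
```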

### Knowledge Graphs
