[PDSM] Medical Testcases for benchmarking #157

Merged (98 commits, Jul 18, 2024)
Changes from 52 commits

Commits (98):
5a8749e - change iterations (Apr 2, 2024)
4ecca83 - I love python (SturmCamper, Apr 5, 2024)
b3b18a9 - Local-Test-Remove in conftest.py (SturmCamper, Apr 5, 2024)
d79cb67 - Working project version (marlis-en, Apr 21, 2024)
813e2c1 - First pytest not working (marlis-en, Apr 21, 2024)
ddd8bf7 - 1 test works (SeraLunatic, Apr 21, 2024)
41ddf2a - test name (SeraLunatic, Apr 21, 2024)
29a62b8 - second test (SeraLunatic, Apr 21, 2024)
3a38516 - added Physikum questions (SeraLunatic, Apr 23, 2024)
fb7c431 - regex search (SeraLunatic, Apr 23, 2024)
0f772de - lowercase back in, oops (SeraLunatic, Apr 23, 2024)
fb7905d - ADD Emergency Testcase (Apr 27, 2024)
827aad8 - CHANGE: regex and not regex in one runtest and csv (SeraLunatic, Apr 29, 2024)
7b1d785 - REFACTOR test names changed (SeraLunatic, Apr 29, 2024)
3356c9e - ADDED new regex EEG-Questions (SeraLunatic, Apr 30, 2024)
392da50 - ADDED new yes_no EEG-Questions (SeraLunatic, Apr 30, 2024)
a0ca30a - Merge pull request #1 from mehizli/datascience_fragen (ytehran, May 14, 2024)
9b4cf61 - Merge pull request #2 from mehizli/testcase_emergency (ytehran, May 14, 2024)
a3788b2 - test (May 14, 2024)
8dcc567 - Merge branch 'develop' into meli-test-file (May 14, 2024)
8b1321d - indent fixes (May 14, 2024)
84abb7d - Merge pull request #3 from mehizli/meli-test-file (ytehran, May 14, 2024)
4caf83a - FIXED Encoding Bug (SeraLunatic, May 15, 2024)
886c336 - Merge pull request #4 from mehizli/bugfix/data_not_found (SeraLunatic, May 15, 2024)
1768961 - Added new testcases for mental diseases (marlis-en, May 15, 2024)
372e3c8 - test cases fixes (May 15, 2024)
d731dba - Merge branch 'develop' into meli-test-file (May 15, 2024)
e5be05f - test cases indent fixes (May 15, 2024)
20183cb - added English translations (May 20, 2024)
b177a7d - test for oncology (May 20, 2024)
3dd8930 - ADD translation for Physikum (SeraLunatic, May 22, 2024)
5e2d180 - ADD translation EEG (SeraLunatic, May 22, 2024)
a53a8be - REFACTORED and ADDED questions for mental diseases (marlis-en, May 22, 2024)
0a4b75a - WIP wrong_answer csv (SeraLunatic, May 22, 2024)
d3c6fce - BUGFIX eeg_answer (SeraLunatic, May 22, 2024)
c853b72 - ADD writing csv with expected, wrong and failure_group (SeraLunatic, May 22, 2024)
f3a1597 - ADD regex failure_groups (SeraLunatic, May 22, 2024)
4d67b3c - ADD regex failure_groups and better synonym tracker (SeraLunatic, May 22, 2024)
64e27d4 - Merge pull request #5 from mehizli/meli-test-file (ytehran, May 22, 2024)
7ccd4f4 - Merge pull request #6 from mehizli/translate_quest (ytehran, May 22, 2024)
83b4905 - Merge pull request #7 from mehizli/testcase_mental_diseases (ytehran, May 22, 2024)
ad7a0da - Merge pull request #8 from mehizli/develop (ytehran, May 22, 2024)
b48d03f - Merge branch 'wrong_answers' into develop (SeraLunatic, May 22, 2024)
861282a - Merge pull request #9 from mehizli/develop (ytehran, May 22, 2024)
2b4f6ad - Translate the emergency cases and update to the new schema (May 22, 2024)
9f4e039 - New cardiology cases in German (May 24, 2024)
f871db0 - Case translated into English (May 24, 2024)
74b821c - Merge pull request #10 from mehizli/meli-test-file (ytehran, May 24, 2024)
eb73efb - Merge branch 'develop' into cardio_update (May 24, 2024)
4a4fd38 - Refactor and clean code (May 24, 2024)
4b53bb0 - Merge pull request #11 from mehizli/cardio_update (ytehran, May 24, 2024)
d4c433d - Merge pull request #12 from mehizli/develop (ytehran, May 24, 2024)
72c7429 - ADDED new testcase dermatology (marlis-en, May 27, 2024)
be9cdc9 - formatting (slobentanzer, Jun 4, 2024)
14490f1 - revert doubly run tests (slobentanzer, Jun 4, 2024)
5c10b8b - revert non-skipping (slobentanzer, Jun 4, 2024)
2714b9a - comment out models for test purposes (slobentanzer, Jun 10, 2024)
49e4e41 - Merge branch 'main' into pr/ytehran/157 (slobentanzer, Jun 10, 2024)
018191e - pre-commit (slobentanzer, Jun 10, 2024)
5328bd9 - correct function name (slobentanzer, Jun 10, 2024)
af82910 - Merge branch 'main' into pr/ytehran/157 (slobentanzer, Jun 10, 2024)
c137e4c - change nomenclature: `wrong_result` -> `failure_mode` (slobentanzer, Jun 10, 2024)
b1125cb - record single scores, calculate standard deviation (slobentanzer, Jun 10, 2024)
fffadab - delete venv/.env in gitignore (Jun 11, 2024)
8477af7 - delete test results (slobentanzer, Jun 11, 2024)
4caed92 - Merge branch 'main' of https://github.com/mehizli/biochatter_pdsm int… (slobentanzer, Jun 11, 2024)
5cb9e52 - one iteration for dev (slobentanzer, Jun 11, 2024)
b3e2908 - bring back text extraction results (slobentanzer, Jun 11, 2024)
be8562e - rename `correctness` to `medical_exam`, more specific (slobentanzer, Jun 11, 2024)
d80efe6 - run once on openhermes (slobentanzer, Jun 11, 2024)
b24b8d3 - replace case separator (colon) (slobentanzer, Jun 11, 2024)
bf3808f - reset results (slobentanzer, Jun 11, 2024)
6ce37f0 - run once on representative models (slobentanzer, Jun 11, 2024)
8aeaaae - REFACTOR testcases mental diseases and dermatology (marlis-en, Jun 14, 2024)
b73944e - little fixes + consistent question format (Jun 15, 2024)
c57d0ee - Refine the categorize_failures method (Jun 16, 2024)
f5984db - Merge remote-tracking branch 'origin/develop' into develop (Jun 16, 2024)
0651824 - Merge branch 'develop' into promt-optimization (Jun 16, 2024)
dedc67f - Refactor to split single and multiple choice (Jun 16, 2024)
6ff88de - Merge pull request #14 from mehizli/promt-optimization (ytehran, Jun 16, 2024)
37d0916 - Merge pull request #15 from mehizli/develop (ytehran, Jun 16, 2024)
66293a9 - first batch of LLMs run on med exam (slobentanzer, Jun 18, 2024)
a1f4807 - another batch of models run (slobentanzer, Jun 19, 2024)
ac5845f - another batch of open models run (slobentanzer, Jun 20, 2024)
112bd8d - add Jupyter notebook for analysis and graphs (SturmCamper, Jun 25, 2024)
e9bcc7b - Remove due to deprecation (just throwing errors while benchmarking) (SturmCamper, Jun 25, 2024)
f8393e8 - Jupyter notebook for simple graph generation (SturmCamper, Jun 25, 2024)
99260b6 - Added documentation of functions (marlis-en, Jun 25, 2024)
770802c - Documentation of the data analysis (marlis-en, Jun 25, 2024)
2cb47c9 - Added Lang/cat graph (SturmCamper, Jun 25, 2024)
70f7cc8 - Merge branch 'stats_graphs' of https://github.com/mehizli/biochatter_… (SturmCamper, Jun 25, 2024)
f40cae7 - Merge branch 'main' into main (slobentanzer, Jul 2, 2024)
ee0482b - Merge branch 'main' into main (slobentanzer, Jul 2, 2024)
c1490eb - Merge pull request #16 from mehizli/develop (ytehran, Jul 15, 2024)
40086de - Changes to fix PR (Jul 15, 2024)
c7a4695 - Fixed the calculation of the std (Jul 15, 2024)
e753aec - Merge branch 'stats_graphs' of github.com:mehizli/biochatter_pdsm int… (Jul 15, 2024)
ad2070d - Merge pull request #18 from mehizli/stats_graphs (ytehran, Jul 16, 2024)
.gitignore (1 change: 1 addition, 0 deletions)

@@ -4,6 +4,7 @@ dist/
__pycache__/
.venv
.pytest_cache
venv/.env
slobentanzer marked this conversation as resolved.
.env
*.mp3
.cache
benchmark/benchmark_utils.py (138 changes: 138 additions, 0 deletions)
@@ -1,6 +1,8 @@
import pytest

import pandas as pd
import re
from nltk.corpus import wordnet
from datetime import datetime


@@ -30,6 +32,9 @@ def benchmark_already_executed(
"""
task_results = return_or_create_result_file(task)

# ensure the failure-group csv exists (create it if missing)
return_or_create_wrong_result_file(task)

if task_results.empty:
return False

@@ -96,6 +101,47 @@
return results


def return_or_create_wrong_result_file(task: str):
Contributor: This is not an intuitive name for the function. I had trouble understanding what it is for just from reading the code (what is a "wrong result file"?). I would suggest naming the entire process something like "failure mode identification" such that it is intuitively clear what is happening. What you are doing is saving responses in case of a failure.

Reply (Contributor): I changed the naming in c137e4c.

"""
Returns the wrong result file for the task or creates it if it does not exist.

Args:
task (str): The benchmark task, e.g. "biocypher_query_generation"

Returns:
pd.DataFrame: The wrong result file for the task
"""
file_path = get_wrong_result_file_path(task)
try:
results = pd.read_csv(file_path, header=0)
except (pd.errors.EmptyDataError, FileNotFoundError):
results = pd.DataFrame(
columns=[
"model_name",
"subtask",
"wrong_answer",
"expected_answer",
"failure_groups",
"md5_hash",
"datetime",
]
)
results.to_csv(file_path, index=False)
return results


def get_wrong_result_file_path(task: str) -> str:
"""Returns the path to the wrong result file.

Args:
task (str): The benchmark task, e.g. "biocypher_query_generation"

Returns:
str: The path to the wrong result file
"""
return f"benchmark/results/{task}_failure_groups.csv"


def write_results_to_file(
model_name: str,
subtask: str,
@@ -126,6 +172,98 @@ def write_results_to_file(
results.to_csv(file_path, index=False)


def write_wrong_results_to_file(
model_name: str,
subtask: str,
wrong_answer: str,
expected_answer: str,
failure_groups: str,
md5_hash: str,
file_path: str,
):
"""Writes the wrong benchmark results for the subtask to the result file.

Args:
model_name (str): The model name, e.g. "gpt-3.5-turbo"
subtask (str): The benchmark subtask test case, e.g. "entities"
wrong_answer (str): The wrong answer given to the subtask
expected_answer (str): The expected answer for the subtask
failure_groups (str): The failure group, e.g. "Wrong count of words"
md5_hash (str): The md5 hash of the test case
file_path (str): The path to the result file
"""
results = pd.read_csv(file_path, header=0)
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
new_row = pd.DataFrame(
[[model_name, subtask, wrong_answer, expected_answer, failure_groups, md5_hash, now]],
columns=results.columns,
)
results = pd.concat([results, new_row], ignore_index=True).sort_values(
by=["model_name", "subtask"]
)
results.to_csv(file_path, index=False)
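
For orientation, a minimal usage sketch tying the helpers above together; the task name, subtask, answers, and hash are hypothetical placeholders, not values from this PR:

# Hypothetical usage sketch; all values below are placeholders.
task = "medical_exam"
return_or_create_wrong_result_file(task)  # make sure the failure-group CSV exists
write_wrong_results_to_file(
    model_name="gpt-3.5-turbo-0125",
    subtask="cardiology",
    wrong_answer="a headache",
    expected_answer="a",
    failure_groups=categorize_failures("a headache", "a"),  # -> "Partial Match"
    md5_hash="0" * 32,  # placeholder hash
    file_path=get_wrong_result_file_path(task),
)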


def categorize_failures(wrong_answer, expected_answer, regex=False):

if not regex:

# Check if the answer is right, but the case sensitivity was wrong (e.g. a / A)
if wrong_answer.lower() == expected_answer.lower():
return "Case Sensitivity"

# Check if some of the answer is right (e.g. "a headache" instead of "a")
elif wrong_answer in expected_answer or expected_answer in wrong_answer:
return "Partial Match"

# Check if the format of the answer is wrong, but the answer otherwise is right (e.g. "a b" instead of "ab")
elif re.sub(r'\s+', '', wrong_answer.lower()) == re.sub(r'\s+', '', expected_answer.lower()):
return "Format Error"

# Check if the answer is a synonym with nltk (e.g. Illness / Sickness)
elif is_synonym(wrong_answer, expected_answer):
return "Synonym"

# Check if the format of the answer is wrong due to numerical or alphabetic differences (e.g. "123" vs "one two three")
elif re.search(r'\w+', wrong_answer) and re.search(r'\w+', expected_answer) and any(char.isdigit() for char in wrong_answer) != any(char.isdigit() for char in expected_answer):
return "Format Error"

# Check for a partial match ignoring case
elif wrong_answer.lower() in expected_answer.lower() or expected_answer.lower() in wrong_answer.lower():
return "Partial Match / Case Sensitivity"

# Else the answer may be completely wrong
else:
return "Other"

else:
# Check if all the words in wrong_answer are expected but some of the expected are missing
if all(word in expected_answer for word in wrong_answer.split()):
return "Words Missing"

# Check if some words in wrong_answer are incorrect (present in wrong_answer but not in expected_answer)
#elif any(word not in expected_answer for word in wrong_answer.split()):
# return "Incorrect Words"

# Check if the entire wrong_answer is completely different from the expected_answer
else:
return "Entire Answer Incorrect"


def is_synonym(word1, word2):
# yes/no style answers (including German "ja"/"nein") are not treated as synonyms
if word2 in ("yes", "no", "ja", "nein"):
return False

synsets1 = wordnet.synsets(word1)
synsets2 = wordnet.synsets(word2)

for synset1 in synsets1:
for synset2 in synsets2:
if synset1.wup_similarity(synset2) is not None:
return True
return False
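
To make the branching in categorize_failures concrete, a few illustrative calls; the answers are hypothetical, and the "Synonym" case assumes the corrected is_synonym above plus a downloaded NLTK WordNet corpus:

# Illustrative calls; expected outputs follow the branching above.
print(categorize_failures("A", "a"))               # Case Sensitivity
print(categorize_failures("a headache", "a"))      # Partial Match
print(categorize_failures("a b", "ab"))            # Format Error
print(categorize_failures("sickness", "illness"))  # Synonym (via WordNet)
print(categorize_failures("xyz", "abc"))           # Other
print(categorize_failures("nausea", "nausea vomiting", regex=True))  # Words Missing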


# TODO should we use SQLite? An online database (REDIS)?
def get_result_file_path(file_name: str) -> str:
"""Returns the path to the result file.
benchmark/conftest.py (25 changes: 18 additions, 7 deletions)
@@ -1,29 +1,28 @@
import os

import requests
from dotenv import load_dotenv
Contributor: Why do we need dotenv? The implementation seems a bit hacky ;)

Contributor (author): Might be a bit of a workaround, but this was the only way we could implement the key loading. We have some questions; more later by mail.

Contributor: @ytehran could you or another team member address this? It prevents me from merging.

from xinference.client import Client
import pytest

import numpy as np
import pandas as pd

from biochatter.prompts import BioCypherPromptEngine
from benchmark.load_dataset import get_benchmark_dataset
from .load_dataset import get_benchmark_dataset
from biochatter.llm_connect import GptConversation, XinferenceConversation
from .benchmark_utils import benchmark_already_executed

# how often should each benchmark be run?
N_ITERATIONS = 5
N_ITERATIONS = 1

# which dataset should be used for benchmarking?
BENCHMARK_DATASET = get_benchmark_dataset()

# which models should be benchmarked?
OPENAI_MODEL_NAMES = [
"gpt-3.5-turbo-0613",
"gpt-3.5-turbo-0125",
"gpt-4-0613",
"gpt-4-0125-preview",
"gpt-3.5-turbo-0125"
#"gpt-4-0613"
]

XINFERENCE_MODELS = {
@@ -148,7 +147,7 @@
for quantization in XINFERENCE_MODELS[model_name]["quantization"]
]

BENCHMARKED_MODELS = OPENAI_MODEL_NAMES + XINFERENCE_MODEL_NAMES
BENCHMARKED_MODELS = OPENAI_MODEL_NAMES #+ XINFERENCE_MODEL_NAMES
BENCHMARKED_MODELS.sort()

# Xinference IP and port
@@ -233,6 +232,9 @@ def conversation(request, model_name):
prompts={},
correct=False,
)
# delete first dots if venv is in project env
Contributor: As mentioned, I would not do this. Rather use .venv and configure your setup to use this env reliably. Environment setup is on the user and should not be mandated by the code.

cus_path = os.getcwd() + "../../venv/bin/.env"
load_dotenv(cus_path)
conversation.set_api_key(
os.getenv("OPENAI_API_KEY"), user="benchmark_user"
)
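
For reference, a sketch of the direction the reviewer suggests: resolve a project-level .env with python-dotenv's find_dotenv instead of hard-coding a venv-relative path. This assumes the .env sits at the repository root and is not part of this PR:

# Sketch only: discover .env by walking up from the current directory.
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())  # silently does nothing if no .env is found
conversation.set_api_key(os.getenv("OPENAI_API_KEY"), user="benchmark_user")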
Expand Down Expand Up @@ -304,6 +306,9 @@ def evaluation_conversation():
prompts={},
correct=False,
)
# delete first dots if venv is in project env
cus_path = os.getcwd() + "../../venv/bin/.env"
Contributor: @ytehran please have someone remove this so we can merge this PR.

load_dotenv(cus_path)
conversation.set_api_key(os.getenv("OPENAI_API_KEY"), user="benchmark_user")
return conversation

@@ -396,6 +401,12 @@ def pytest_generate_tests(metafunc):
"test_data_text_extraction",
data_file["text_extraction"],
)
if "test_data_correctness" in metafunc.fixturenames:
metafunc.parametrize(
"test_data_correctness",
data_file["correctness"],
)
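
For readers unfamiliar with the hook: pytest_generate_tests runs at collection time and parametrizes any fixture named in a test's signature. A minimal, self-contained sketch of the same pattern, with hypothetical data:

# Stand-alone illustration of the parametrization pattern used above (hypothetical data).
CASES = [("What is 2+2?", "4"), ("What is the capital of France?", "Paris")]

def pytest_generate_tests(metafunc):
    if "test_data_correctness" in metafunc.fixturenames:
        metafunc.parametrize("test_data_correctness", CASES)

def test_answer_is_string(test_data_correctness):
    question, expected = test_data_correctness
    assert isinstance(expected, str)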



@pytest.fixture