
[Bug]: performance improvements in map_query_to_entities() #1275

Open
3 tasks done
mmaitre314 opened this issue Oct 13, 2024 · 0 comments
Labels
bug Something isn't working triage Default label assignment, indicates new issue needs reviewed by a maintainer


Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

While profiling to find opportunities to speed up queries, I found that local search spends a significant amount of time in get_entity_by_key(): 7s out of 20s for 49K entities, with another 9s spent waiting for GPT-4o to generate the response. That method does an O(N) scan of the entity list, even though, at least in the default case where embedding_vectorstore_key == EntityVectorStoreKey.ID, it could do an O(1) lookup in the entity dictionary. In a quick test, replacing matched = get_entity_by_key(...) with matched = all_entities_dict.get(result.document.id) effectively eliminated those 7s. In the general O(N) full-scan case, since value is constant within get_entity_by_key(), the calls to isinstance(), is_valid_uuid(), and replace() could be hoisted out of the loop to reduce the hot spot.

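A minimal sketch of the two proposed changes. The names here (Entity, all_entities_dict, get_entity_by_key_scan) are illustrative stand-ins, not the actual graphrag internals:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    id: str
    title: str

# Illustrative stand-ins for graphrag's internal structures.
entities = [Entity(id=str(i), title=f"entity-{i}") for i in range(50_000)]
all_entities_dict = {e.id: e for e in entities}  # built once, up front

def get_entity_by_key_scan(entities, key, value):
    # Current shape of the lookup: an O(N) scan over the full entity list
    # for every single match returned by the vector store.
    for e in entities:
        if getattr(e, key) == value:
            return e
    return None

def get_entity_fast(entity_id):
    # Proposed fast path when embedding_vectorstore_key == EntityVectorStoreKey.ID:
    # a single O(1) dictionary lookup instead of the O(N) scan.
    return all_entities_dict.get(entity_id)

def get_entity_by_key_hoisted(entities, key, value):
    # General O(N) case: `value` never changes inside the loop, so any
    # normalization of it (isinstance checks, UUID detection, replace())
    # can be computed once before the loop instead of per iteration.
    normalized = value.replace("-", "") if isinstance(value, str) else value
    for e in entities:
        attr = getattr(e, key)
        if (attr.replace("-", "") if isinstance(attr, str) else attr) == normalized:
            return e
    return None
```

With 20 vector-store matches against 49K entities, the dict lookup turns 20 × O(N) scans into 20 × O(1) lookups, which is consistent with the ~7s disappearing in the quick test.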
Steps to reproduce

Run local search with 50K entities.

Expected Behavior

Most of the query time is spent on generating the AI summary.

GraphRAG Config Used

# Imports assumed from the graphrag 0.3.x module layout (matching the paths in the profile below):
import tiktoken

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.structured_search.local_search.mixed_context import LocalSearchMixedContext
from graphrag.query.structured_search.local_search.search import LocalSearch

token_encoder = tiktoken.get_encoding("cl100k_base")

local_search = LocalSearch(
    llm = ChatOpenAI(
        azure_ad_token_provider=azure_ad_token_provider,
        model="gpt-4o",
        api_base="https://<snip>.openai.azure.com",
        api_version="2024-02-15-preview",
        api_type=OpenaiApiType.AzureOpenAI,
        max_retries=1,
    ),
    context_builder = LocalSearchMixedContext(
        community_reports = reports,
        text_units = text_units,
        entities = entities,
        relationships = relationships,
        covariates = None,
        entity_text_embeddings = description_embedding_store,
        embedding_vectorstore_key = EntityVectorStoreKey.ID,
        text_embedder = OpenAIEmbedding(
            azure_ad_token_provider=azure_ad_token_provider,
            api_base="https://<snip>.openai.azure.com",
            api_type=OpenaiApiType.AzureOpenAI,
            api_version="2024-02-15-preview",
            model="text-embedding-ada-002",
            deployment_name="text-embedding-ada-002",
            max_retries=1,
        ),
        token_encoder = token_encoder,
    ),
    token_encoder = token_encoder,
    llm_params={
        "max_tokens": 2_000,
        "temperature": 0.0,
    },
    context_builder_params={
        "text_unit_prop": 0.5,
        "community_prop": 0.1,
        "conversation_history_max_turns": 5,
        "conversation_history_user_turns_only": True,
        "top_k_mapped_entities": 10,
        "top_k_relationships": 10,
        "include_entity_rank": True,
        "include_relationship_weight": True,
        "include_community_rank": False,
        "return_candidate_context": False,
        "embedding_vectorstore_key": EntityVectorStoreKey.ID,
        "max_tokens": 125_000,
    },
    response_type="single paragraph",
)

Logs and screenshots

Output from Python cProfile:

        10433793 function calls (10421805 primitive calls) in 19.684 seconds

   Ordered by: cumulative time
   List reduced from 1189 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   19.683   19.683 xxx\envs\graphrag\Lib\site-packages\graphrag\query\structured_search\local_search\search.py:141(search)
        1    0.000    0.000    9.382    9.382 xxx\envs\graphrag\Lib\site-packages\graphrag\query\llm\oai\chat_openai.py:63(generate)
        1    0.002    0.002    9.381    9.381 xxx\envs\graphrag\Lib\site-packages\graphrag\query\llm\oai\chat_openai.py:186(_generate)
       22    0.001    0.000    8.824    0.401 xxx\envs\graphrag\Lib\site-packages\httpcore\_sync\http11.py:216(_receive_event)
       19    0.001    0.000    8.790    0.463 xxx\envs\graphrag\Lib\site-packages\httpcore\_backends\sync.py:122(read)
       19    0.000    0.000    8.789    0.463 xxx\envs\graphrag\Lib\ssl.py:1225(recv)
       19    0.000    0.000    8.788    0.463 xxx\envs\graphrag\Lib\ssl.py:1094(read)
       19    8.788    0.463    8.788    0.463 {method 'read' of '_ssl._SSLSocket' objects}
        1    0.002    0.002    7.905    7.905 xxx\envs\graphrag\Lib\site-packages\graphrag\query\context_builder\entity_extraction.py:35(map_query_to_entities)
       20    1.161    0.058    6.948    0.347 xxx\envs\graphrag\Lib\site-packages\graphrag\query\input\retrieval\entities.py:15(get_entity_by_key)
        2    0.000    0.000    6.019    3.009 xxx\envs\graphrag\Lib\site-packages\openai\_base_client.py:1263(post)
        2    0.000    0.000    6.019    3.009 xxx\envs\graphrag\Lib\site-packages\openai\_base_client.py:940(request)
        2    0.000    0.000    6.018    3.009 xxx\envs\graphrag\Lib\site-packages\openai\_base_client.py:962(_request)
        2    0.000    0.000    6.008    3.004 xxx\envs\graphrag\Lib\site-packages\httpx\_client.py:891(send)
        2    0.000    0.000    6.007    3.003 xxx\envs\graphrag\Lib\site-packages\httpx\_client.py:942(_send_handling_auth)
        2    0.000    0.000    6.004    3.002 xxx\envs\graphrag\Lib\site-packages\httpx\_client.py:976(_send_handling_redirects)
        2    0.000    0.000    6.004    3.002 xxx\envs\graphrag\Lib\site-packages\httpx\_client.py:1013(_send_single_request)
        2    0.000    0.000    6.000    3.000 xxx\envs\graphrag\Lib\site-packages\httpx\_transports\default.py:217(handle_request)
        2    0.000    0.000    5.999    2.999 xxx\envs\graphrag\Lib\site-packages\httpcore\_sync\connection_pool.py:159(handle_request)
        2    0.000    0.000    5.997    2.998 xxx\envs\graphrag\Lib\site-packages\httpcore\_sync\connection.py:67(handle_request)
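For reference, a profile like the one above can be captured with the standard-library cProfile/pstats modules. The profiled function below is a placeholder for the actual local_search.search(...) call:

```python
import cProfile
import io
import pstats

def run_query():
    # Placeholder workload standing in for local_search.search("...").
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
run_query()
profiler.disable()

# Sort by cumulative time and print the top 20 entries, matching the
# "Ordered by: cumulative time" / "List reduced ... to 20" output above.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(20)
print(stream.getvalue())
```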

Additional Information

  • GraphRAG Version: 0.3.6
  • Operating System: Windows 11
  • Python Version: 3.12
  • Related Issues:
@mmaitre314 mmaitre314 added bug Something isn't working triage Default label assignment, indicates new issue needs reviewed by a maintainer labels Oct 13, 2024