
[Issue]: Error in Community Report Extraction – GraphRAG Indexing Pipeline #1224

Open · praman1870 opened this issue Sep 26, 2024 · 4 comments
Labels: awaiting_response, stale

praman1870 commented Sep 26, 2024

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

I encountered an issue during the final stage of the GraphRAG indexing pipeline where the create_final_community_reports step failed, but the knowledge graph was successfully created. The error appears to be related to an unsupported response_format with the OpenAI model.

Steps to reproduce

  1. Installed GraphRAG via pip install graphrag.
  2. Ran the indexing pipeline using the following command: python -m graphrag.index --root .
  3. The pipeline progressed successfully through stages like:
  • create_base_text_units
  • create_base_extracted_entities
  • create_summarized_entities
  • create_final_entities
  • create_final_nodes
  • create_final_relationships
  • create_final_communities
  4. It failed at the create_final_community_reports step.

GraphRAG Config Used


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat
  model: gpt-4
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: ${OPENAI_API_BASE}
  api_version: "2024-06-01"
  # organization: <organization_id>
  deployment_name: ${OPENAI_DEPLOYMENT_NAME}
  response_format: "json"
  tokens_per_minute: 10000 # set a leaky bucket throttle
  requests_per_minute: 60 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  # target: required # or all
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding
    model: text-embedding-ada-002
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    deployment_name: "text-embedding-ada-002-ea"
    response_format: "json"
    tokens_per_minute: 10000 # set a leaky bucket throttle
    requests_per_minute: 60 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

❌ create_final_community_reports
None
❌ Errors occurred during the pipeline run, see logs for more details.
{
  "type": "error",
  "data": "Community Report Extraction Error",
  "stack": "Traceback (most recent call last):
    ...
    openai.BadRequestError: Error code: 400 - {'error': {'message': \"Invalid parameter: 'response_format' of type 'json_object' is not supported with this model.\", 'type': 'invalid_request_error', 'param': 'response_format', 'code': None}}"
}
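
For context on the 400 above: on Azure OpenAI, JSON mode (response_format of type json_object) is only accepted by newer chat model versions — the original gpt-4 0613 rejects it, while gpt-4 turbo (1106 and later) accepts it. A minimal sketch to test the deployment directly, outside GraphRAG, reusing the environment variable names from the config above (adjust if your .env defines different ones):

```python
# Minimal check of whether the Azure deployment accepts JSON mode,
# independent of GraphRAG. The environment variable names match the
# settings.yaml above; adjust if your .env differs.
import os

from openai import AzureOpenAI, BadRequestError

client = AzureOpenAI(
    api_key=os.environ["GRAPHRAG_API_KEY"],
    azure_endpoint=os.environ["OPENAI_API_BASE"],
    api_version="2024-06-01",
)

try:
    response = client.chat.completions.create(
        model=os.environ["OPENAI_DEPLOYMENT_NAME"],  # the gpt-4 deployment
        # JSON mode requires the word "json" to appear in the prompt.
        messages=[{"role": "user", "content": 'Reply with a JSON object like {"ok": true}.'}],
        response_format={"type": "json_object"},
    )
    print("JSON mode supported:", response.choices[0].message.content)
except BadRequestError as err:
    # The same 400 as the pipeline: the deployed model *version*
    # (not the model family) does not support json_object.
    print("JSON mode rejected:", err)
```

If this script reproduces the 400, the problem is the deployed model version rather than GraphRAG itself.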

Additional Information

  • GraphRAG Version: v0.3.6
  • Operating System: Microsoft Windows 10 Enterprise
  • Python Version: 3.12.0
natoverse (Collaborator) commented
Are you still seeing this error? "json_object" is definitely supported so this seems like it was either a temporary glitch, or there is something else going on. Can you upload your indexing-engine.log?
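
If the check sketched earlier shows the deployed gpt-4 version predates JSON mode (e.g., 0613), one possible workaround is to stop GraphRAG from requesting it: model_supports_json is the setting GraphRAG consults before sending response_format. A sketch against the settings.yaml above (the explicit response_format override is not part of the default settings template):

```yaml
llm:
  # ... other settings unchanged ...
  model: gpt-4
  model_supports_json: false  # GraphRAG will parse JSON from plain text
                              # instead of requesting JSON mode
  # response_format: "json"   # remove the explicit override
```

Alternatively, redeploy the model as a version that supports JSON mode (gpt-4 1106/turbo or later) and keep model_supports_json: true.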

praman1870 (Author) commented

> Are you still seeing this error? "json_object" is definitely supported so this seems like it was either a temporary glitch, or there is something else going on. Can you upload your indexing-engine.log?

Thank you for your response. Yes, I’m still facing the same "json_object" error. I’ve attached the indexing-engine.log and logs.json files.
indexing-engine.log
logs.json

github-actions (bot) commented

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.
