
Performance issue with GraphRag on large file processing (7GB) – slow load time and verb function not being triggered #1277

Open
9prodhi opened this issue Oct 13, 2024 · 1 comment

9prodhi (Contributor) commented Oct 13, 2024

I am using GraphRag to process a large file (~7GB). While processing works fine for smaller files (in the MB range), the workflow slows dramatically on the larger file: it takes a long time just to load, and after more than an hour the workflow still hasn't reached the verb's execution.

Here are the details of the issue:

Small File Processing:

  • Small files load quickly and the verb functions are called as expected.

Large File Processing:

  • Loading the ~7GB file takes a very long time, and after an hour of waiting the verb function (nomic_embed) has still not been called (a chunked-loading sketch follows the specs below).

System specs:

  • I am using a machine with 128 GB of RAM.
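
One generic mitigation would be to stream the input in bounded chunks rather than load it whole. A minimal sketch, assuming the input is a CSV; the path and chunk size are illustrative placeholders, not values from my actual setup:

import pandas as pd

# Stream the ~7GB input in bounded chunks instead of reading it all at
# once; each chunk is an independent DataFrame that can be processed and
# persisted before the next one is read, so peak memory stays bounded.
CHUNK_ROWS = 100_000  # illustrative; tune to available memory

for i, chunk in enumerate(pd.read_csv("large_input.csv", chunksize=CHUNK_ROWS)):
    # process/embed the chunk here, then write results incrementally
    chunk.to_parquet(f"embed_chunk_{i:05d}.parquet")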

Although the verb function has not yet been called for the larger file, I would also like to ask about optimizing performance for large-file processing. Here's the relevant code snippet I am using:

import logging
from enum import Enum
from typing import Any, cast
import pandas as pd
import io
from datashaper import (
    AsyncType,
    TableContainer,
    VerbCallbacks,
    VerbInput,
    derive_from_rows,
    verb,
)
from graphrag.index.bootstrap import bootstrap
from graphrag.index.cache import PipelineCache
from graphrag.index.storage import PipelineStorage
from graphrag.index.llm import load_llm
from graphrag.llm import CompletionLLM
from graphrag.config.enums import LLMType

@verb(name="nomic_embed")
async def nomic_embed(
    input: VerbInput,
    cache: PipelineCache,
    storage: PipelineStorage,
    callbacks: VerbCallbacks,
    column: str,
    id_column: str,
    to: str,
    async_mode: AsyncType = AsyncType.AsyncIO,
    num_threads: int = 108,
    batch_size: int = 150000,
    output_file: str = "embed_results.parquet",
    **kwargs,
):
    ...  # function body omitted in the original snippet

I am using the num_threads and batch_size parameters to parallelize the nomic_embed verb and reduce processing time for large files.
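
For illustration, a hedged sketch of how the verb body might fan rows out through datashaper's derive_from_rows (the scheduling_type and num_threads keywords follow how other GraphRag verbs invoke it, but that signature is an assumption here; embed_texts is a hypothetical helper, not a GraphRag API):

@verb(name="nomic_embed")
async def nomic_embed(
    input: VerbInput,
    cache: PipelineCache,
    storage: PipelineStorage,
    callbacks: VerbCallbacks,
    column: str,
    id_column: str,
    to: str,
    async_mode: AsyncType = AsyncType.AsyncIO,
    num_threads: int = 108,
    batch_size: int = 150000,
    output_file: str = "embed_results.parquet",
    **kwargs,
) -> TableContainer:
    source = cast(pd.DataFrame, input.get_input())

    async def embed_row(row):
        # hypothetical helper: embed a single text value (batching,
        # caching, and retries would live inside embed_texts)
        return await embed_texts([row[column]], cache)

    # fan the rows out across async workers; callbacks surface progress
    results = await derive_from_rows(
        source,
        embed_row,
        callbacks,
        scheduling_type=async_mode,
        num_threads=num_threads,
    )
    source[to] = results
    source.to_parquet(output_file)  # persist results for recovery
    return TableContainer(table=source)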

Is there a recommended approach, or are there additional parameters I should consider, for processing large files with GraphRag?

PassStory commented

When building the graph, the most time-consuming part seems to be the LLM calls. Even though the code already uses asynchronous requests, the time cost is still significant. I tried to modify the code to send batched requests to the LLM, but the pipeline involves multiple layers of API calls, which makes this difficult to implement. I'm also curious whether the dataset sizes the authors used in their experiments were only laboratory-scale.
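
As a generic illustration (not GraphRag's actual internals), capping concurrency with a semaphore around the individual calls is one common way to batch such requests; call_llm below is a hypothetical stand-in for the real completion function:

import asyncio

async def call_llm(prompt: str) -> str:
    # placeholder for the real completion call
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def run_batched(prompts: list[str], max_concurrency: int = 16) -> list[str]:
    # cap the number of in-flight requests so a huge batch doesn't
    # overwhelm the API, while still overlapping network latency
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> str:
        async with sem:
            return await call_llm(prompt)

    return list(await asyncio.gather(*(one(p) for p in prompts)))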
