
Performance issue with GraphRag on large file processing (7GB) – slow load time and verb function not being triggered #1277

Open
9prodhi opened this issue Oct 13, 2024 · 1 comment

9prodhi (Contributor) commented Oct 13, 2024

I am using GraphRag to process a large file (~7GB). While processing works fine for smaller files (in the MB range), the workflow slows dramatically on the larger file: it takes a long time just to load, and after more than an hour the workflow still hasn't reached the verb's execution.

Here are the details of the issue:

Small File Processing:

  • Small files load quickly and the verb functions are called as expected.

Large File Processing:

  • Loading the ~7GB file takes a very long time, and after an hour of waiting the verb function (nomic_embed) has still not been called (a chunked-loading sketch follows the specs below).

System specs:

  • I am using a machine with 128 GB of RAM.
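
One generic mitigation would be to stream the input in bounded chunks rather than load it whole. A minimal sketch, assuming the input is a CSV; the path and chunk size are illustrative placeholders, not values from my actual setup:

import pandas as pd

# Stream the ~7GB input in bounded chunks instead of reading it all at
# once; each chunk is an independent DataFrame that can be processed and
# persisted before the next one is read, so peak memory stays bounded.
CHUNK_ROWS = 100_000  # illustrative; tune to available memory

for i, chunk in enumerate(pd.read_csv("large_input.csv", chunksize=CHUNK_ROWS)):
    # process/embed the chunk here, then write results incrementally
    chunk.to_parquet(f"embed_chunk_{i:05d}.parquet")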

Although the verb function has not yet been called for the larger file, I would also like to ask about optimizing performance for large-file processing. Here's the relevant code snippet I am using:

import logging
from enum import Enum
from typing import Any, cast
import pandas as pd
import io
from datashaper import (
    AsyncType,
    TableContainer,
    VerbCallbacks,
    VerbInput,
    derive_from_rows,
    verb,
)
from graphrag.index.bootstrap import bootstrap
from graphrag.index.cache import PipelineCache
from graphrag.index.storage import PipelineStorage
from graphrag.index.llm import load_llm
from graphrag.llm import CompletionLLM
from graphrag.config.enums import LLMType

@verb(name="nomic_embed")
async def nomic_embed(
    input: VerbInput,
    cache: PipelineCache,
    storage: PipelineStorage,
    callbacks: VerbCallbacks,
    column: str,
    id_column: str,
    to: str,
    async_mode: AsyncType = AsyncType.AsyncIO,
    num_threads: int = 108,
    batch_size: int = 150000,
    output_file: str = "embed_results.parquet",
    **kwargs,
):
    ...  # function body omitted in the original snippet

I am using the num_threads and batch_size parameters to parallelize the nomic_embed verb and reduce processing time for large files.
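
For illustration, a hedged sketch of how the verb body might fan rows out through datashaper's derive_from_rows (the scheduling_type and num_threads keywords follow how other GraphRag verbs invoke it, but that signature is an assumption here; embed_texts is a hypothetical helper, not a GraphRag API):

@verb(name="nomic_embed")
async def nomic_embed(
    input: VerbInput,
    cache: PipelineCache,
    storage: PipelineStorage,
    callbacks: VerbCallbacks,
    column: str,
    id_column: str,
    to: str,
    async_mode: AsyncType = AsyncType.AsyncIO,
    num_threads: int = 108,
    batch_size: int = 150000,
    output_file: str = "embed_results.parquet",
    **kwargs,
) -> TableContainer:
    source = cast(pd.DataFrame, input.get_input())

    async def embed_row(row):
        # hypothetical helper: embed a single text value (batching,
        # caching, and retries would live inside embed_texts)
        return await embed_texts([row[column]], cache)

    # fan the rows out across async workers; callbacks surface progress
    results = await derive_from_rows(
        source,
        embed_row,
        callbacks,
        scheduling_type=async_mode,
        num_threads=num_threads,
    )
    source[to] = results
    source.to_parquet(output_file)  # persist results for recovery
    return TableContainer(table=source)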

Is there a recommended approach, or are there additional parameters I should consider, for processing large files with GraphRag?

PassStory commented

When building the graph, the most time-consuming part seems to be the LLM calls. Even though the code already uses asynchronous requests, the time cost is still significant. I tried to modify the code to send batched requests to the LLM, but the pipeline involves multiple layers of API calls, which makes this difficult to implement. I'm also curious whether the dataset sizes the authors used in their experiments were only laboratory-scale.
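
As a generic illustration (not GraphRag's actual internals), capping concurrency with a semaphore around the individual calls is one common way to batch such requests; call_llm below is a hypothetical stand-in for the real completion function:

import asyncio

async def call_llm(prompt: str) -> str:
    # placeholder for the real completion call
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def run_batched(prompts: list[str], max_concurrency: int = 16) -> list[str]:
    # cap the number of in-flight requests so a huge batch doesn't
    # overwhelm the API, while still overlapping network latency
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> str:
        async with sem:
            return await call_llm(prompt)

    return list(await asyncio.gather(*(one(p) for p in prompts)))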
