I am using GraphRag to process a large file (~7GB). While the processing works fine for smaller files (in MB range), the workflow experiences significant delays when handling the larger file. The file takes a long time to load, and after over an hour, the workflow hasn't reached the verb's execution.
Here are the details of the issue:
Small File Processing:
Small files load quickly and the verb functions are called as expected.
Large File Processing:
Loading a ~7GB file takes a very long time, and after one hour of waiting, the verb function (nomic_embed) has not been called.
System specs:
I am using a machine with 128 GB of RAM.
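One likely cause of the long load time is reading the entire ~7 GB file into memory in one shot before the pipeline starts. A minimal sketch of chunked loading with pandas, assuming a CSV-like source (`iter_chunks` and the chunk size are illustrative, not part of GraphRag):

```python
import pandas as pd


def iter_chunks(source, chunksize: int = 500_000):
    """Yield DataFrame chunks instead of materializing the whole file.

    `source` can be a path or any file-like object pandas accepts.
    """
    for chunk in pd.read_csv(source, chunksize=chunksize):
        yield chunk


def split_to_parquet(source, out_prefix: str, chunksize: int = 500_000) -> int:
    """Stream chunks out as parquet partitions so downstream steps can
    process them independently. Returns the total row count."""
    total_rows = 0
    for i, chunk in enumerate(iter_chunks(source, chunksize)):
        chunk.to_parquet(f"{out_prefix}_{i}.parquet")
        total_rows += len(chunk)
    return total_rows
```

Splitting the input into partitions up front also gives the indexer smaller units of work, which tends to matter more than raw RAM once files reach the multi-gigabyte range.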
Although the verb function is not being called for larger files yet, I would also like to ask about optimizing performance for large file processing. Here's the relevant code snippet I am using:
import logging
from enum import Enum
from typing import Any, cast
import io

import pandas as pd
from datashaper import (
    AsyncType,
    TableContainer,
    VerbCallbacks,
    VerbInput,
    derive_from_rows,
    verb,
)

from graphrag.index.bootstrap import bootstrap
from graphrag.index.cache import PipelineCache
from graphrag.index.storage import PipelineStorage
from graphrag.index.llm import load_llm
from graphrag.llm import CompletionLLM
from graphrag.config.enums import LLMType


@verb(name="nomic_embed")
async def nomic_embed(
    input: VerbInput,
    cache: PipelineCache,
    storage: PipelineStorage,
    callbacks: VerbCallbacks,
    column: str,
    id_column: str,
    to: str,
    async_mode: AsyncType = AsyncType.AsyncIO,
    num_threads: int = 108,
    batch_size: int = 150000,
    output_file: str = "embed_results.parquet",
    **kwargs,
) -> TableContainer:
    ...  # verb body omitted in the original post
I am using the num_threads and batch_size parameters to parallelize the nomic_embed verb and reduce the processing time for large files.
Are there any recommended approaches or additional parameters I should consider for processing large files with GraphRag?
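For what it's worth, with embedding workloads the bottleneck is usually the number of in-flight API calls rather than thread count, so bounding concurrency with an asyncio semaphore is a common pattern. A hedged sketch, where `embed_batch` is a hypothetical stand-in for the real embedding call:

```python
import asyncio
from typing import Any


async def embed_batch(batch: list[str]) -> list[Any]:
    # Hypothetical stand-in for the real embedding API call.
    await asyncio.sleep(0)
    return [len(text) for text in batch]


async def embed_all(
    texts: list[str], batch_size: int, max_concurrency: int
) -> list[Any]:
    """Split texts into batches and embed them with bounded concurrency,
    preserving input order."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run(batch: list[str]) -> list[Any]:
        async with sem:
            return await embed_batch(batch)

    batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    results = await asyncio.gather(*(run(b) for b in batches))
    return [item for batch in results for item in batch]
```

The semaphore keeps memory and rate-limit pressure bounded no matter how many batches the 7 GB input produces, which a raw `num_threads` knob does not guarantee.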
When building the graph, the most time-consuming part appears to be the LLM calls. Even though the code uses asynchronous methods throughout, the time cost is still significant. I attempted to switch the LLM to batch mode, but the data flows through multiple layers of API calls, which makes that difficult to implement. I'm curious whether the data sizes the authors used in their experiments were only laboratory-scale.
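One way to claw back time without restructuring those layered API calls is to cache completions so duplicate prompts hit the LLM only once. A minimal in-memory sketch; the wrapped `llm_call` is a hypothetical placeholder, not GraphRag's actual LLM interface:

```python
import asyncio


class CachedLLM:
    """Wrap an async LLM call with an in-memory cache so duplicate
    prompts are only sent to the API once."""

    def __init__(self, llm_call):
        self._llm_call = llm_call
        self._cache: dict[str, str] = {}
        self._lock = asyncio.Lock()

    async def complete(self, prompt: str) -> str:
        async with self._lock:
            if prompt in self._cache:
                return self._cache[prompt]
        result = await self._llm_call(prompt)
        async with self._lock:
            self._cache[prompt] = result
        return result
```

GraphRag already ships a pipeline cache for this purpose, so before adding a wrapper like this it is worth checking that the built-in cache is actually enabled for your run.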