500 Server Error when parsing some PDF #465

Open
AmrAhmedElagoz opened this issue Sep 30, 2024 · 0 comments

Describe the bug
INFO:__main__:Detected file type: application/pdf
INFO:__main__:Sending request to https://api.unstructured.io/general/v0/general
INFO:__main__:Headers: {'Accept': 'application/json', 'unstructured-api-key': 'xxx'}
INFO:__main__:File being sent: Mahmoud_Gamal_Resume.pdf
INFO:__main__:Response status code: 500
INFO:__main__:Response headers: {'Date': 'Sun, 29 Sep 2024 21:51:15 GMT', 'Content-Type': 'application/json', 'Content-Length': '47', 'Connection': 'keep-alive', 'server': 'uvicorn'}
ERROR:__main__:500 Internal Server Error: {"detail":"'6114cee903d6a72fa0370b97d042b71c'"}
ERROR:__main__:Error details: {
  "detail": "'6114cee903d6a72fa0370b97d042b71c'"
}
HTTP error occurred: 500 Server Error: Internal Server Error for url: https://api.unstructured.io/general/v0/general
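
For anyone triaging this, the only information in the 500 response is the `detail` field, which holds a quoted 32-character hex string. A small sketch of pulling that token out of the body for searching logs or other issues follows. The helper names `extract_detail_token` and `looks_like_md5_hex` are hypothetical, and reading the token as an MD5-style hash (or the quoting as a stringified Python exception) is only a guess based on its shape, not something the API documents.

```python
import json
import re


def extract_detail_token(body: str):
    """Return the bare token inside the `detail` field, or None if absent."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    detail = payload.get("detail")
    if not isinstance(detail, str):
        return None
    # The server wraps the value in an extra pair of single quotes; drop them.
    return detail.strip("'\"")


def looks_like_md5_hex(token: str) -> bool:
    """True if the token is exactly 32 lowercase hex characters."""
    return re.fullmatch(r"[0-9a-f]{32}", token) is not None


# Response body copied from the log above.
body = '{"detail":"\'6114cee903d6a72fa0370b97d042b71c\'"}'
token = extract_detail_token(body)
print(token, looks_like_md5_hex(token))
```

The shape check is only a heuristic for grouping similar reports; it does not say what the server was looking up when it failed.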

To Reproduce
A minimal script to reproduce the error:

import requests
import json
import logging
import magic

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def is_pdf(file_path):
    mime = magic.Magic(mime=True)
    file_type = mime.from_file(file_path)
    logger.info(f"Detected file type: {file_type}")
    return file_type == "application/pdf"

def parse_pdf(api_key, file_path):
    if not is_pdf(file_path):
        return "Error: The provided file is not a PDF."

    url = "https://api.unstructured.io/general/v0/general"

    headers = {
        "Accept": "application/json",
        "unstructured-api-key": api_key
    }

    try:
        with open(file_path, "rb") as file:
            files = {"files": (file_path, file, "application/pdf")}
            
            logger.info(f"Sending request to {url}")
            logger.info(f"Headers: {headers}")
            logger.info(f"File being sent: {file_path}")
            
            response = requests.post(url, headers=headers, files=files)

        logger.info(f"Response status code: {response.status_code}")
        logger.info(f"Response headers: {response.headers}")

        response.raise_for_status()

        return response.json()
    except requests.exceptions.HTTPError as http_err:
        if response.status_code == 500:
            logger.error(f"500 Internal Server Error: {response.text}")
            try:
                error_details = response.json()
                logger.error(f"Error details: {json.dumps(error_details, indent=2)}")
            except json.JSONDecodeError:
                logger.error("Could not parse error response as JSON")
        return f"HTTP error occurred: {http_err}"
    except requests.exceptions.RequestException as err:
        return f"An error occurred: {err}"
    except Exception as e:
        return f"An unexpected error occurred: {e}"

def main():
    api_key = "YOUR_API_KEY_HERE"
    file_path = "Mahmoud_Gamal_Resume.pdf"

    result = parse_pdf(api_key, file_path)
    print(result)

if __name__ == "__main__":
    main()
  • Filetype: PDF
  • Any additional API parameters: No
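
One note for anyone reusing the script: it logs the request headers verbatim, so the API key ends up in the log output (I replaced it with 'xxx' by hand in the log I pasted). A possible way to mask it before logging is sketched below. `redact_headers` is a hypothetical helper, not part of `requests` or the Unstructured client, and the header name is the one used in my script.

```python
def redact_headers(headers, secret_keys=("unstructured-api-key",)):
    """Return a copy of `headers` with secret values masked for logging."""
    secret = {key.lower() for key in secret_keys}
    return {
        name: ("***" if name.lower() in secret else value)
        for name, value in headers.items()
    }


headers = {
    "Accept": "application/json",
    "unstructured-api-key": "YOUR_API_KEY_HERE",
}
# Log the masked copy; keep sending the original dict with the request.
print(redact_headers(headers))
```

In the script this would mean logging `redact_headers(headers)` instead of `headers`, while still passing the unmodified `headers` to `requests.post`.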

Environment:

  • Ubuntu 22.04
  • SDK

Additional context
I attached one of the PDFs that produces this error. As a side note, LlamaParse handled this same PDF without problems.
Mahmoud_Gamal_Resume.pdf
