status command should show if OCR has completed #17

Open
simonw opened this issue Jun 30, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@simonw
Owner

simonw commented Jun 30, 2022

This is actually quite difficult.

It turns out the textract-output/JOB_ID folder is created empty, early on in the process. Files called 1, 2 and so on are then added to it - but not all at once, so the existence of files in that folder doesn't necessarily mean that the OCR process has completed for that job ID.

simonw added the enhancement label Jun 30, 2022
@simonw
Owner Author

simonw commented Jun 30, 2022

I think the only reliable way of telling if OCR has completed is to call inspect-job:

But that's quite expensive, because it also returns the first page of JSON - which could be ~1MB of data.

I think the most efficient way to do this would be to call the expensive API to check completion for each job, then update the .s3-ocr.json file for that key to cache the fact that OCR has completed, so later status checks can skip the API call.
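A minimal sketch of that caching idea, not the project's actual implementation. It assumes `textract` and `s3` are boto3 clients, that the `.s3-ocr.json` file for a key holds a JSON object with a `job_id` field, and that a new `ocr_complete` field (a hypothetical name) is acceptable in that file. `MaxResults=1` is passed in the hope of keeping the Textract response small.

```python
import json


def check_and_cache(textract, s3, bucket, key):
    """Return True if OCR has finished for `key`, caching a positive result.

    Assumptions: `key + ".s3-ocr.json"` exists and contains a "job_id" field;
    "ocr_complete" is a hypothetical field added by this sketch.
    """
    cache_key = key + ".s3-ocr.json"
    body = s3.get_object(Bucket=bucket, Key=cache_key)["Body"].read()
    info = json.loads(body)

    # Cached answer: no Textract call needed.
    if info.get("ocr_complete"):
        return True

    # The expensive call. Only the JobStatus field is used here.
    response = textract.get_document_text_detection(
        JobId=info["job_id"], MaxResults=1
    )
    if response["JobStatus"] != "SUCCEEDED":
        return False

    # Record completion so later status checks skip Textract entirely.
    info["ocr_complete"] = True
    s3.put_object(
        Bucket=bucket, Key=cache_key, Body=json.dumps(info).encode("utf-8")
    )
    return True
```

Only a positive result is cached, since a job that is still in progress has to be checked again later.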

@simonw
Owner Author

simonw commented Jun 30, 2022

Another option: add a file called key.pdf.s3-ocr-complete.json indicating that OCR has finished. That way we don't need to GET each individual file to check status - we can check the status of everything just by listing all keys in the bucket.
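A rough sketch of how a listing-only status check could look under that marker-file design. It assumes `s3` is a boto3 S3 client and that documents end in `.pdf`; the function name `ocr_status` and the `MARKER_SUFFIX` constant are made up for illustration.

```python
MARKER_SUFFIX = ".s3-ocr-complete.json"  # suffix from the proposal above


def ocr_status(s3, bucket, prefix=""):
    """Map each PDF key to True/False depending on whether its marker exists.

    Uses only list_objects_v2 pages, so no object bodies are fetched.
    """
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])

    return {
        key: (key + MARKER_SUFFIX) in keys
        for key in sorted(keys)
        if key.endswith(".pdf")
    }
```

The cost is one listing request per 1,000 keys rather than one GET per document.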

Even better: if we change the design so that those JSON files all live in the s3-ocr/ folder instead, we can do a status check with a single listing of every key starting with that prefix, see:
