chore: Update some stale readme content #449
base: main
* Remove old announcements
* Add a link to the doc with open source limitations
* Recommend the serverless API or marketplace apps as alternatives.

Closes #448
-H 'unstructured-api-key: <YOUR API KEY>' \ | ||
-F 'files=@sample-docs/family-day.eml' \ | ||
| jq -C . | less -R | ||
-F 'files=@sample-docs/winter-sports.epub' |
Consider changing this to use a `.pdf` file, and also add:
-F 'split-pdf-page True' \
-F 'split-pdf-allow-failed True' \
-F 'split-pdf-concurrency-level 15'
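A minimal sketch of the request this comment seems to be asking for. curl's `-F` expects `name=value`, so the sketch assumes the fields are spelled with underscores and an equals sign. The `split_pdf_*` spellings, the `UNSTRUCTURED_API_URL` and `UNSTRUCTURED_API_KEY` variables, and the `partition_pdf_split` helper are assumptions for illustration, not something this PR confirms.

```shell
# Sketch only: the split_pdf_* field spellings are assumed, not confirmed by this PR.
API_URL="${UNSTRUCTURED_API_URL:-https://api.unstructuredapp.io/general/v0/general}"

# partition_pdf_split FILE: POST a PDF with the page-splitting fields set.
partition_pdf_split() {
  curl -X POST "$API_URL" \
    -H 'accept: application/json' \
    -H "unstructured-api-key: ${UNSTRUCTURED_API_KEY:-}" \
    -F "files=@$1" \
    -F 'split_pdf_page=true' \
    -F 'split_pdf_allow_failed=true' \
    -F 'split_pdf_concurrency_level=15'
}
```

It could be called as `partition_pdf_split sample-docs/layout-parser-paper.pdf | jq -C . | less -R`.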
|
||
## Unstructured SDKs | ||
|
||
We also recommend using our official [Python client](https://github.com/Unstructured-IO/unstructured-python-client) or [Typescript/Javascript client](https://github.com/Unstructured-IO/unstructured-js-client) to interact with the API, whether it's local or one of the paid options above. |
Can you use the SDKs with local?
|
||
Four strategies are available for processing PDF/Images files: `hi_res`, `fast`, `ocr_only` and `auto`. `fast` is the default `strategy` and works well for documents that do not have text embedded in images. | ||
|
||
On the other hand, `hi_res` is the better choice for PDFs that may have text within embedded images, or for achieving greater precision of [element types](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements) in the response JSON. Please be aware that, as of writing, `hi_res` requests may take 20 times longer to process compared to the `fast` option. See the example below for making a `hi_res` request. | ||
|
||
``` | ||
curl -X 'POST' \ | ||
'https://api.unstructured.io/general/v0/general' \ | ||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider adding the additional `-F` parameters for PDFs, as listed above.
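Since the hunk cuts off the rest of the example, here is a rough sketch of how the `strategy` field might be passed. The four strategy names and the `fast` default come from the quoted README text; the `partition_with_strategy` helper and the two environment variables are hypothetical names used only for this illustration.

```shell
# Sketch only: helper and variable names are hypothetical.
API_URL="${UNSTRUCTURED_API_URL:-https://api.unstructuredapp.io/general/v0/general}"

# partition_with_strategy FILE [STRATEGY]: POST a file with an explicit strategy.
partition_with_strategy() {
  strategy="${2:-fast}"   # the quoted README text names `fast` as the default
  case "$strategy" in
    hi_res|fast|ocr_only|auto) ;;
    *) echo "unknown strategy: $strategy" >&2; return 2 ;;
  esac
  curl -X POST "$API_URL" \
    -H 'accept: application/json' \
    -H "unstructured-api-key: ${UNSTRUCTURED_API_KEY:-}" \
    -F "files=@$1" \
    -F "strategy=$strategy"
}
```

For instance, `partition_with_strategy sample-docs/layout-parser-paper.pdf hi_res | jq -C . | less -R` would send a `hi_res` request. Rejecting unknown names locally avoids a round trip for a typo.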
|
||
The `hi_res` strategy supports different models, and the default is `detectron2onnx`. You can also specify `hi_res_model_name` parameter to run `hi_res` strategy with the chipper model while using the host API: | ||
|
||
``` | ||
curl -X 'POST' \ | ||
'https://api.unstructured.io/general/v0/general' \ | ||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider adding the additional `-F` parameters for PDFs, as listed above.
|
||
Note: This kwarg will eventually be deprecated. Please use `languages`. | ||
You can also specify what languages to use for OCR with the `ocr_languages` kwarg. See the [Tesseract documentation](https://github.com/tesseract-ocr/tessdata) for a full list of languages and install instructions. OCR is only applied if the text is not already available in the PDF document. | ||
|
||
``` | ||
curl -X 'POST' \ | ||
'https://api.unstructured.io/general/v0/general' \ | ||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider using PDF here? If so, also consider adding the additional `-F` parameters for PDFs, as listed above.
|
||
You can also specify what languages to use for OCR with the `languages` kwarg. See the [Tesseract documentation](https://github.com/tesseract-ocr/tessdata) for a full list of languages and install instructions. OCR is only applied if the text is not already available in the PDF document. | ||
|
||
``` | ||
curl -X 'POST' \ | ||
'https://api.unstructured.io/general/v0/general' \ | ||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider using PDF here? If so, also consider adding the additional -F parameters for PDFs, as listed above.
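A possible shape for the `languages` request, sketched under the assumption that the API accepts one `languages` form field per Tesseract language code. That repeated-field format, the `ocr_with_languages` helper, and the environment variable names are assumptions and are not confirmed by the hunk shown here.

```shell
# Sketch only: repeating the `languages` field once per code is an assumption.
API_URL="${UNSTRUCTURED_API_URL:-https://api.unstructuredapp.io/general/v0/general}"

# ocr_with_languages FILE LANG...: POST a file with one languages field per code.
ocr_with_languages() {
  [ "$#" -ge 1 ] || { echo "usage: ocr_with_languages FILE [LANG...]" >&2; return 2; }
  file="$1"
  shift
  for lang in "$@"; do
    # Append a "-F languages=<code>" pair, then drop the bare code from the front.
    set -- "$@" -F "languages=$lang"
    shift
  done
  curl -X POST "$API_URL" \
    -H 'accept: application/json' \
    -H "unstructured-api-key: ${UNSTRUCTURED_API_KEY:-}" \
    -F "files=@$file" \
    "$@"
}
```

Calling `ocr_with_languages sample-docs/layout-parser-paper.pdf eng kor` would send both `languages=eng` and `languages=kor`.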
|
||
When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the `coordinates` parameter to `true` to add this field to the elements in the response. | ||
|
||
``` | ||
curl -X 'POST' \ | ||
'https://api.unstructured.io/general/v0/general' \ | ||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider adding the additional -F parameters for PDFs, as listed above.
|
||
Currently, we provide support for enabling and disabling table extraction for all file types. Set parameter `skip_infer_table_types` to specify the document types that you want to skip table extraction with. By default, we enable table extraction | ||
for all file types (`skip_infer_table_types=[]`). Again, please note that table extraction only works with `hi_res` strategy. For example, if you want to skip table extraction for images, you can pass a list with matching image file types: | ||
|
||
``` | ||
curl -X 'POST' \ | ||
'https://api.unstructured.io/general/v0/general' \ | ||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider adding the additional -F parameters for PDFs, as listed above.
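For reference, a hedged sketch of a request that skips table extraction for images. The `skip_infer_table_types` name and the `hi_res` requirement come from the quoted README text; passing the list as a JSON-style string, the choice of `jpg` and `png`, and the `partition_skip_image_tables` helper are assumptions made for this illustration.

```shell
# Sketch only: the JSON-style list value is an assumed format.
API_URL="${UNSTRUCTURED_API_URL:-https://api.unstructuredapp.io/general/v0/general}"

# partition_skip_image_tables FILE: hi_res request that skips tables for image types.
partition_skip_image_tables() {
  curl -X POST "$API_URL" \
    -H 'accept: application/json' \
    -H "unstructured-api-key: ${UNSTRUCTURED_API_KEY:-}" \
    -F "files=@$1" \
    -F 'strategy=hi_res' \
    -F 'skip_infer_table_types=["jpg","png"]'
}
```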
|
||
You can send gzipped file and api will un-gzip it. | ||
|
||
``` | ||
curl -X 'POST' \ | ||
'https://api.unstructured.io/general/v0/general' \ | ||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider adding the additional -F parameters for PDFs, as listed above.
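A small sketch of the gzip flow described in the quoted README line: compress a copy of the file, then upload the `.gz`. The `partition_gzipped` helper and the environment variable names are hypothetical, and the sketch assumes no extra form fields are needed, which the hunk does not confirm.

```shell
# Sketch only: assumes the API needs nothing beyond the gzipped file itself.
API_URL="${UNSTRUCTURED_API_URL:-https://api.unstructuredapp.io/general/v0/general}"

# partition_gzipped FILE: gzip a copy of FILE and POST the compressed copy.
partition_gzipped() {
  # Write FILE.gz next to the original and leave the original untouched.
  gzip -c "$1" > "$1.gz" || return 1
  curl -X POST "$API_URL" \
    -H 'accept: application/json' \
    -H "unstructured-api-key: ${UNSTRUCTURED_API_KEY:-}" \
    -F "files=@$1.gz"
}
```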
|
||
For supported filetypes, set the `include_page_breaks` parameter to `true` to include `PageBreak` elements in the output. | ||
|
||
``` | ||
curl -X 'POST' \ | ||
'https://api.unstructured.io/general/v0/general' \ | ||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider adding the additional -F parameters for PDFs, as listed above.
@@ -276,7 +304,7 @@ The `by_title` strategy has the same behaviors except document section boundarie | |||
|
|||
``` | |||
curl -X 'POST' | |||
'https://api.unstructured.io/general/v0/general' \ | |||
'https://api.unstructuredapp.io/general/v0/general' \ |
Consider adding the additional -F parameters for PDFs, as listed above.
</div> | ||
|
||
<h1 align="center"> | ||
<p>Open Source Unstructured API</p> |
FWIW, we've been calling it the "Unstructured open source library" in the docs.
</h3> | ||
|
||
This repo implements a pre-processing pipeline for the following documents. Currently, the pipeline is capable of recognizing the file type and choosing the relevant partition function to process the file. | ||
This repo implements a FastAPI server with the partitioning functionality of the [Unstructured library](https://github.com/Unstructured-IO/unstructured). It has one endpoint that accepts any of the following filetypes, and returns the contents as structured JSON. |
If we don't want to highlight the open source library as much, consider not linking to it from here.
Also, you can target more than one endpoint, right?
|
||
Try our hosted API! It's freely available to use with any of the filetypes listed above. This is the easiest way to get started. If you'd like to host your own version of the API, jump down to the [Developer Quickstart Guide](#developer-quick-start). | ||
You can try it out by running the docker container and sending one of the sample files: |
"docker" -> "Docker"?
Also, would readers already know how to run the Docker container here? If not, consider linking to how to do it?
@@ -77,13 +105,13 @@ The `ocr_only` strategy runs the document through Tesseract for OCR. Currently, | |||
|
|||
For the best of all worlds, `auto` will determine when a page can be extracted using `fast` or `ocr_only` mode, otherwise it will fall back to `hi_res`. | |||
|
|||
#### Hi Res model name | |||
### Hi Res model name | |||
|
|||
The `hi_res` strategy supports different models, and the default is `detectron2onnx`. You can also specify `hi_res_model_name` parameter to run `hi_res` strategy with the chipper model while using the host API: |
Isn't it `layout_v1.1.0` now?
@@ -94,14 +122,14 @@ The `hi_res` strategy supports different models, and the default is `detectron2o | |||
|
|||
We also support models to be used locally, for example, `yolox`. Please refer to the `using-the-api-locally` section for more information on how to use the local API. |
For API, we're recommending `layout_v1.0.0` over `yolox`, aren't we?
|
|||
#### OCR languages | |||
### OCR languages | |||
|
|||
Note: This kwarg will eventually be deprecated. Please use `languages`. |
"This kwarg..." which one?
-H 'accept: application/json' \ | ||
-H 'Content-Type: multipart/form-data' \ | ||
-F 'files=@sample-docs/layout-parser-paper.pdf' \ | ||
-F 'coordinates=true' \ | ||
| jq -C . | less -R | ||
``` | ||
|
||
#### Skip Table Extraction | ||
### Skip Table Extraction |
Consider using "Sentence case" for headings instead of "Title Case," for consistency. For example, in this case, "Skip table extraction."