chore: Update some stale readme content #449

awalker4 · 2024-08-14T16:13:32Z

* Remove old announcements
* Add a link to the doc with open source limitations
* Recommend the serverless API or marketplace apps as production alternatives
* Link to the official SDKs
* Switch out the free API URL wherever we use it. Longer term, most of the parameter descriptions here should just refer back to our docs. For now, let's update these curl examples.

Closes #448

* Remove old announcements
* Add a link to the doc with open source limitations
* Recommend the serverless API or marketplace apps as alternatives.

Closes #448

Paul-Cornell · 2024-08-14T16:50:48Z

README.md

- -H 'unstructured-api-key: <YOUR API KEY>' \
- -F 'files=@sample-docs/family-day.eml' \
- | jq -C . | less -R
+ -F 'files=@sample-docs/winter-sports.epub'


Consider changing this to use a .pdf file, and also add:

-F 'split-pdf-page True' \
-F 'split-pdf-allow-failed True' \
-F 'split-pdf-concurrency-level 15'

Paul-Cornell · 2024-08-14T16:51:55Z

README.md

+
+## Unstructured SDKs
+
+We also recommend using our official [Python client](https://github.com/Unstructured-IO/unstructured-python-client) or [Typescript/Javascript client](https://github.com/Unstructured-IO/unstructured-js-client) to interact with the API, whether it's local or one of the paid options above.


Can you use the SDKs with local?

Paul-Cornell · 2024-08-14T16:52:35Z

README.md


 Four strategies are available for processing PDF/Images files: `hi_res`, `fast`, `ocr_only` and `auto`. `fast` is the default `strategy` and works well for documents that do not have text embedded in images.

 On the other hand, `hi_res` is the better choice for PDFs that may have text within embedded images, or for achieving greater precision of [element types](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements) in the response JSON. Please be aware that, as of writing, `hi_res` requests may take 20 times longer to process compared to the `fast` option. See the example below for making a `hi_res` request.

 ```
 curl -X 'POST' \
- 'https://api.unstructured.io/general/v0/general' \
+ 'https://api.unstructuredapp.io/general/v0/general' \


Consider adding the additional -F parameters for PDFs, as listed above.

Paul-Cornell · 2024-08-14T16:52:42Z

README.md


 The `hi_res` strategy supports different models, and the default is `detectron2onnx`. You can also specify `hi_res_model_name` parameter to run `hi_res` strategy with the chipper model while using the host API:

 ```
 curl -X 'POST' \
- 'https://api.unstructured.io/general/v0/general' \
+ 'https://api.unstructuredapp.io/general/v0/general' \


Consider adding the additional -F parameters for PDFs, as listed above.
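
For reference, a minimal Python sketch of how the `strategy` and `hi_res_model_name` form fields from the README text could be assembled into a curl command. It only builds and prints the command and sends nothing. The helper name `build_curl_args`, the sample file path, and the `name=value` form-field spelling are assumptions for illustration; the URL and header name are copied from the diff.

```python
import shlex

# Endpoint and header name as they appear in the README diff in this PR.
API_URL = "https://api.unstructuredapp.io/general/v0/general"
# The four strategies named in the README text.
STRATEGIES = {"hi_res", "fast", "ocr_only", "auto"}


def build_curl_args(path, api_key, strategy="fast", hi_res_model_name=None):
    """Return a curl argument list for one request (hypothetical helper; nothing is sent)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    args = [
        "curl", "-X", "POST", API_URL,
        "-H", f"unstructured-api-key: {api_key}",
        "-F", f"files=@{path}",
        "-F", f"strategy={strategy}",  # assumed name=value form-field spelling
    ]
    if hi_res_model_name is not None:
        args += ["-F", f"hi_res_model_name={hi_res_model_name}"]
    return args


# Placeholder path and key; substitute real values before running the command.
args = build_curl_args(
    "sample-docs/example.pdf", "<YOUR API KEY>",
    strategy="hi_res", hi_res_model_name="chipper",
)
print(shlex.join(args))
```

Building the command as a list and quoting it with `shlex.join` keeps the API key placeholder and file path safely quoted when the line is pasted into a shell.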

Paul-Cornell · 2024-08-14T16:53:10Z

README.md


 Note: This kwarg will eventually be deprecated. Please use `languages`.
 You can also specify what languages to use for OCR with the `ocr_languages` kwarg. See the [Tesseract documentation](https://github.com/tesseract-ocr/tessdata) for a full list of languages and install instructions. OCR is only applied if the text is not already available in the PDF document.

 ```
 curl -X 'POST' \
- 'https://api.unstructured.io/general/v0/general' \
+ 'https://api.unstructuredapp.io/general/v0/general' \


Consider using a PDF here? If so, also consider adding the additional -F parameters for PDFs, as listed above.

Paul-Cornell · 2024-08-14T16:53:29Z

README.md


 You can also specify what languages to use for OCR with the `languages` kwarg. See the [Tesseract documentation](https://github.com/tesseract-ocr/tessdata) for a full list of languages and install instructions. OCR is only applied if the text is not already available in the PDF document.

 ```
 curl -X 'POST' \
- 'https://api.unstructured.io/general/v0/general' \
+ 'https://api.unstructuredapp.io/general/v0/general' \


Consider using a PDF here? If so, also consider adding the additional -F parameters for PDFs, as listed above.
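
As a side note on the `languages` parameter, here is a small Python sketch of turning a list of Tesseract language codes into curl form arguments. It assumes that several languages are passed by repeating the `languages` field once per code; that convention and the helper name `language_form_fields` are assumptions for illustration, not something this diff confirms.

```python
def language_form_fields(languages):
    """Return curl -F arguments, one `languages` field per code (hypothetical helper)."""
    fields = []
    for code in languages:
        code = code.strip()
        if not code:
            raise ValueError("empty language code")
        # Assumption: the field is repeated once per language.
        fields += ["-F", f"languages={code}"]
    return fields


fields = language_form_fields(["eng", "kor"])
print(" ".join(fields))
```

The returned list can be appended to the rest of a curl argument list as-is.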

Paul-Cornell · 2024-08-14T16:53:39Z

README.md


 When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the `coordinates` parameter to `true` to add this field to the elements in the response.
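
Once a response with coordinates comes back, a bounding box can be derived from the corner points. The following is a minimal Python sketch, assuming each element carries `metadata.coordinates.points` as a list of `[x, y]` pairs. That response shape is an assumption not confirmed by this PR, and the sample data is made up.

```python
import json

# Made-up sample response; the field layout is an assumption for illustration.
sample_response = json.loads("""
[
  {"type": "Title", "text": "Example title",
   "metadata": {"coordinates": {"points":
     [[10.0, 20.0], [10.0, 40.0], [110.0, 40.0], [110.0, 20.0]]}}},
  {"type": "NarrativeText", "text": "No coordinates here", "metadata": {}}
]
""")


def bounding_box(element):
    """Return (min_x, min_y, max_x, max_y), or None if the element has no points."""
    coords = element.get("metadata", {}).get("coordinates")
    if not coords or not coords.get("points"):
        return None
    xs = [point[0] for point in coords["points"]]
    ys = [point[1] for point in coords["points"]]
    return (min(xs), min(ys), max(xs), max(ys))


boxes = [bounding_box(element) for element in sample_response]
print(boxes)
```

Returning `None` for elements without coordinates lets the same loop handle responses from requests that did not set `coordinates`.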

 ```
 curl -X 'POST' \
- 'https://api.unstructured.io/general/v0/general' \
+ 'https://api.unstructuredapp.io/general/v0/general' \