This repository has been archived by the owner on Apr 16, 2024. It is now read-only.
1 parent f376cbe, commit eb8e705. Showing 4 changed files with 88 additions and 236 deletions.
@@ -0,0 +1,81 @@
## Overview
Web scraper drivers scrape text from the web. They are used by [WebLoader](../../reference/griptape/loaders/web_loader.md) to load text from web pages. All web scraper drivers implement the following method:

* `scrape_url()` scrapes text from a website and returns a [TextArtifact](../../reference/griptape/artifacts/text_artifact.md). The format of the scraped text is determined by the driver.

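
The contract described above can be sketched in plain Python. This is only an illustration of the idea, not griptape's actual code: `SketchTextArtifact`, `SketchWebScraperDriver`, `PlainTextSketchDriver`, `MarkdownSketchDriver`, and `load` are hypothetical stand-ins, and no network request is made.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchTextArtifact:
    """Hypothetical stand-in for griptape's TextArtifact: it just wraps a string."""

    value: str


class SketchWebScraperDriver(ABC):
    """Hypothetical stand-in for the interface shared by web scraper drivers."""

    @abstractmethod
    def scrape_url(self, url: str) -> SketchTextArtifact:
        """Return the text found at `url`; the output format is up to the driver."""


class PlainTextSketchDriver(SketchWebScraperDriver):
    def scrape_url(self, url: str) -> SketchTextArtifact:
        # A real driver would fetch and parse the page; this one fakes the content.
        return SketchTextArtifact(f"Example page at {url}")


class MarkdownSketchDriver(SketchWebScraperDriver):
    def scrape_url(self, url: str) -> SketchTextArtifact:
        # Same method name and return type, but a different output format.
        return SketchTextArtifact(f"# Example page\n\n[source]({url})")


def load(driver: SketchWebScraperDriver, url: str) -> str:
    # Callers depend only on scrape_url(), so drivers are interchangeable.
    return driver.scrape_url(url).value
```

Because the caller only relies on `scrape_url()`, swapping one driver for another changes the format of the returned text without changing the calling code.
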
## MarkdownifyWebScraperDriver

The [MarkdownifyWebScraperDriver](../../reference/griptape/drivers/web_scraper/markdownify_web_scraper_driver.md) outputs the scraped text in markdown format. It uses [playwright](https://pypi.org/project/playwright/) to render web pages, including dynamically loaded content, and a combination of [beautifulsoup4](https://pypi.org/project/beautifulsoup4/) and [markdownify](https://pypi.org/project/markdownify/) to convert the rendered page to markdown. It makes a best effort to produce output that is concise yet human readable.

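
To give a rough sense of what a markdown representation of a webpage looks like, the sketch below converts a small HTML fragment using only Python's standard library. It is not how the driver works internally (the driver relies on playwright, beautifulsoup4, and markdownify). `TinyMarkdownConverter` and `html_to_markdown` are hypothetical names, and only headings, paragraphs, and links are handled.

```python
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical, deliberately tiny HTML-to-markdown converter."""

    def __init__(self):
        super().__init__()
        self.parts = []      # output fragments, joined at the end
        self.href = None     # target of the link currently being read
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            # <h2> becomes "## ", and so on.
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth -= 1
        elif tag in ("h1", "h2", "h3", "p"):
            # Block-level elements end with a blank line.
            self.parts.append("\n\n")
        elif tag == "a":
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        # Drop script and style contents; keep all other text.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_markdown(html):
    converter = TinyMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts).strip()


print(html_to_markdown('<h1>Title</h1><p>See <a href="https://example.com">the site</a>.</p>'))
```

A production converter also has to deal with lists, tables, images, nested markup, and whitespace normalization, which is why the driver delegates that work to dedicated libraries.
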
### Prerequisites

1. Ensure that the `griptape` package is installed with the `drivers-web-scraper-markdownify` extra.

1. Run `playwright install` to install browsers used by playwright to render web pages. The `playwright` command should already be installed as a dependency of the `drivers-web-scraper-markdownify` extra. For more details about playwright, see [the playwright docs](https://playwright.dev/python/docs/library).

    !!! info
        If you are using poetry, then you should run `poetry run playwright install` instead of just `playwright install`.

    If you skip this step, you will see the following error when you run your code:
    ```
    playwright._impl._errors.Error: Executable doesn't exist at ...
    ╔════════════════════════════════════════════════════════════╗
    ║ Looks like Playwright was just installed or updated.       ║
    ║ Please run the following command to download new browsers: ║
    ║                                                            ║
    ║     playwright install                                     ║
    ║                                                            ║
    ║ <3 Playwright Team                                         ║
    ╚════════════════════════════════════════════════════════════╝
    ```

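
Before constructing the driver, it can help to confirm that the Python packages named above are importable. The sketch below does this with the standard library only. `missing_modules` and `REQUIRED` are hypothetical names introduced for illustration; `playwright`, `bs4`, and `markdownify` are the import names of the playwright, beautifulsoup4, and markdownify packages. The check cannot tell whether `playwright install` has downloaded the browsers, so a missing browser still surfaces as the error shown above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# Import names for the playwright, beautifulsoup4, and markdownify packages.
REQUIRED = ["playwright", "bs4", "markdownify"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
    print("Install griptape with the drivers-web-scraper-markdownify extra.")
else:
    print("Python dependencies found; run `playwright install` if browsers are missing.")
```
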
### Example Usage

Here is an example of how to use it directly:

```python
from griptape.drivers import MarkdownifyWebScraperDriver

driver = MarkdownifyWebScraperDriver()

driver.scrape_url("https://griptape.ai")
```

Here is an example of how to use it with an agent:

```python
from griptape.drivers import MarkdownifyWebScraperDriver
from griptape.loaders import WebLoader
from griptape.tools import TaskMemoryClient, WebScraper
from griptape.structures import Agent

agent = Agent(
    tools=[
        WebScraper(
            web_loader=WebLoader(
                web_scraper_driver=MarkdownifyWebScraperDriver(timeout=1000)
            ),
            off_prompt=True,
        ),
        TaskMemoryClient(off_prompt=False),
    ],
)
agent.run("List all email addresses on griptape.ai in a flat numbered markdown list.")
```

## TrafilaturaWebScraperDriver

The [TrafilaturaWebScraperDriver](../../reference/griptape/drivers/web_scraper/trafilatura_web_scraper_driver.md) scrapes text from a webpage using the [Trafilatura](https://trafilatura.readthedocs.io) library.

### Example Usage

Here is an example of how to use it directly:

```python
from griptape.drivers import TrafilaturaWebScraperDriver

driver = TrafilaturaWebScraperDriver()

driver.scrape_url("https://griptape.ai")
```