This repository has been archived by the owner on Apr 16, 2024. It is now read-only.

Add web scraper driver docs
dylanholmes committed Mar 25, 2024
1 parent f376cbe commit eb8e705
Showing 4 changed files with 88 additions and 236 deletions.
4 changes: 4 additions & 0 deletions .github/actions/init-environment/action.yml
@@ -34,3 +34,7 @@ runs:
    source .venv/bin/activate
    echo PATH=$PATH >> $GITHUB_ENV
  shell: bash

- name: Install playwright
  run: playwright install --with-deps
  shell: bash
81 changes: 81 additions & 0 deletions docs/griptape-framework/drivers/web-scraper-drivers.md
@@ -0,0 +1,81 @@
## Overview
Web scraper drivers scrape text from web pages. The [WebLoader](../../reference/griptape/loaders/web_loader.md) uses them to load text from the web. All web scraper drivers implement the following method:

* `scrape_url()` scrapes text from a web page and returns a [TextArtifact](../../reference/griptape/artifacts/text_artifact.md). The format of the scraped text is determined by the driver.
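
The contract described above can be sketched with stand-in classes. This is only an illustration under stated assumptions, not griptape's actual source: `TextArtifact` and `BaseWebScraperDriver` below are simplified local stand-ins, and `StaticWebScraperDriver` with its `pages` mapping is a hypothetical driver that serves canned text so the example needs no network access.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class TextArtifact:
    """Simplified stand-in for griptape's TextArtifact (assumption)."""

    value: str


class BaseWebScraperDriver(ABC):
    """Stand-in base class: every driver exposes scrape_url()."""

    @abstractmethod
    def scrape_url(self, url: str) -> TextArtifact:
        ...


@dataclass
class StaticWebScraperDriver(BaseWebScraperDriver):
    """Hypothetical driver that returns canned text instead of fetching."""

    pages: dict = field(default_factory=dict)

    def scrape_url(self, url: str) -> TextArtifact:
        if url not in self.pages:
            raise ValueError(f"no content for {url}")
        # Each driver decides the text format; here it is returned unchanged.
        return TextArtifact(self.pages[url])


driver = StaticWebScraperDriver(pages={"https://example.com": "Hello, world"})
artifact = driver.scrape_url("https://example.com")
print(artifact.value)
```

Because callers depend only on `scrape_url()`, one driver can be swapped for another without changing the code that consumes the returned artifact.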

## MarkdownifyWebScraperDriver

The [MarkdownifyWebScraperDriver](../../reference/griptape/drivers/web_scraper/markdownify_web_scraper_driver.md) outputs the scraped text in markdown format. It uses [playwright](https://pypi.org/project/playwright/) to render web pages, including dynamically loaded content, and a combination of [beautifulsoup4](https://pypi.org/project/beautifulsoup4/) and [markdownify](https://pypi.org/project/markdownify/) to convert the rendered page to markdown. The output is a best-effort representation that aims to be concise yet human-readable.
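
To make the idea of a concise markdown representation concrete, here is a toy converter built only on Python's standard `html.parser`. It is not how the driver works internally (the driver relies on playwright, beautifulsoup4, and markdownify). The names `ToyMarkdownConverter` and `html_to_markdown` are hypothetical, and only headings, links, and paragraphs are handled.

```python
from html.parser import HTMLParser


class ToyMarkdownConverter(HTMLParser):
    """Hypothetical, deliberately tiny HTML-to-markdown converter."""

    SKIPPED = {"script", "style"}  # non-content elements to drop

    def __init__(self):
        super().__init__()
        self.parts = []      # markdown fragments collected so far
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.href = None     # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ("h1", "h2", "h3", "p"):
            self.parts.append("\n\n")
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_markdown(html: str) -> str:
    converter = ToyMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts).strip()


page = "<h1>Title</h1><script>track()</script><p>See <a href='/docs'>the docs</a>.</p>"
print(html_to_markdown(page))
```

The script body is dropped, while the heading and link survive as `# Title` and `[the docs](/docs)`. The real driver applies the same general idea to a fully rendered page.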

### Prerequisites

1. Ensure that the `griptape` package is installed with the `drivers-web-scraper-markdownify` extra.

1. Run `playwright install` to install the browsers that playwright uses to render web pages. The `playwright` command should already be installed as a dependency of the `drivers-web-scraper-markdownify` extra. For more details about playwright, see [the playwright docs](https://playwright.dev/python/docs/library).

!!! info
    If you are using poetry, run `poetry run playwright install` instead of `playwright install`.

If you skip this step, you will see the following error when you run your code:
```
playwright._impl._errors.Error: Executable doesn't exist at ...
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝
```
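
If you want to turn that failure into a clearer message in your own code, one option is to inspect the exception text. The sketch below rests on assumptions: `is_missing_browser_error` and `scrape_with_hint` are hypothetical helpers, and matching on the phrase `Executable doesn't exist` only works while playwright keeps the wording shown above.

```python
from typing import Callable


def is_missing_browser_error(exc: Exception) -> bool:
    """Heuristic: does the message look like the missing-browser error?"""
    return "Executable doesn't exist" in str(exc)


def scrape_with_hint(scrape: Callable[[str], object], url: str) -> object:
    """Call scrape(url); re-raise missing-browser failures with a hint."""
    try:
        return scrape(url)
    except Exception as exc:
        if is_missing_browser_error(exc):
            raise RuntimeError(
                "Playwright browsers are missing; run `playwright install`."
            ) from exc
        raise  # unrelated errors propagate unchanged


# Stand-in for driver.scrape_url that fails the way the log above does.
def fake_scrape(url: str) -> object:
    raise Exception("Executable doesn't exist at ...")


try:
    scrape_with_hint(fake_scrape, "https://griptape.ai")
except RuntimeError as err:
    print(err)
```

In real code you would pass `driver.scrape_url` in place of `fake_scrape`.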

### Example Usage

Here is an example of how to use it directly:

```python
from griptape.drivers import MarkdownifyWebScraperDriver

driver = MarkdownifyWebScraperDriver()

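# scrape_url() returns a TextArtifact; its `value` holds the markdown text.
artifact = driver.scrape_url("https://griptape.ai")
print(artifact.value)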
```

Here is an example of how to use it with an agent:

```python
from griptape.drivers import MarkdownifyWebScraperDriver
from griptape.loaders import WebLoader
from griptape.tools import TaskMemoryClient, WebScraper
from griptape.structures import Agent

agent = Agent(
    tools=[
        WebScraper(
            web_loader=WebLoader(
                web_scraper_driver=MarkdownifyWebScraperDriver(timeout=1000)
            ),
            off_prompt=True,
        ),
        TaskMemoryClient(off_prompt=False),
    ],
)
agent.run("List all email addresses on griptape.ai in a flat numbered markdown list.")
```

## TrafilaturaWebScraperDriver

The [TrafilaturaWebScraperDriver](../../reference/griptape/drivers/web_scraper/trafilatura_web_scraper_driver.md) scrapes text from a webpage using the [Trafilatura](https://trafilatura.readthedocs.io) library.

### Example Usage

Here is an example of how to use it directly:

```python
from griptape.drivers import TrafilaturaWebScraperDriver

driver = TrafilaturaWebScraperDriver()

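# scrape_url() returns a TextArtifact; its `value` holds the scraped text.
artifact = driver.scrape_url("https://griptape.ai")
print(artifact.value)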
```
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -101,6 +101,7 @@ nav:
      - Image Generation Drivers: "griptape-framework/drivers/image-generation-drivers.md"
      - SQL Drivers: "griptape-framework/drivers/sql-drivers.md"
      - Image Query Drivers: "griptape-framework/drivers/image-query-drivers.md"
      - Web Scraper Drivers: "griptape-framework/drivers/web-scraper-drivers.md"
    - Data:
      - Overview: "griptape-framework/data/index.md"
      - Artifacts: "griptape-framework/data/artifacts.md"