Add web scraper driver docs

griptape-ai · Mar 25, 2024 · eb8e705
1 parent f376cbe
Showing 4 changed files with 88 additions and 236 deletions.
diff --git a/.github/actions/init-environment/action.yml b/.github/actions/init-environment/action.yml
@@ -34,3 +34,7 @@ runs:
           source .venv/bin/activate
           echo PATH=$PATH >> $GITHUB_ENV
         shell: bash
+
+      - name: Install playwright
+        run: playwright install --with-deps
+        shell: bash
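
Note on the step above: the CI action runs `playwright install --with-deps`, while the docs added in this commit tell local users to run `playwright install` (or `poetry run playwright install` under poetry). A minimal sketch of a pre-flight helper that assembles the matching command and checks whether the `playwright` CLI is on the `PATH`. The function names `playwright_install_command` and `playwright_cli_available` are hypothetical and not part of griptape or playwright:

```python
import shutil
from typing import List


def playwright_install_command(use_poetry: bool = False, with_deps: bool = False) -> List[str]:
    """Build the browser-install command as an argument list.

    Hypothetical helper: it only assembles the commands quoted in this commit,
    it does not run them.
    """
    cmd = ["playwright", "install"]
    if with_deps:
        # Flag used by the CI action to also install OS-level dependencies.
        cmd.append("--with-deps")
    if use_poetry:
        # Under poetry the CLI lives inside the project's virtualenv.
        cmd = ["poetry", "run", *cmd]
    return cmd


def playwright_cli_available() -> bool:
    """Return True if a `playwright` executable is found on PATH."""
    return shutil.which("playwright") is not None
```

The returned list can be passed to `subprocess.run(..., check=True)` if the caller wants to execute it. Note that finding the CLI on the `PATH` does not prove the browsers have been downloaded.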
diff --git a/docs/griptape-framework/drivers/web-scraper-drivers.md b/docs/griptape-framework/drivers/web-scraper-drivers.md
@@ -0,0 +1,81 @@
+## Overview
+Web scraper drivers scrape text from web pages. [WebLoader](../../reference/griptape/loaders/web_loader.md) uses them to load text from the web. All web scraper drivers implement the following method:
+
+* `scrape_url()` scrapes text from a website and returns a [TextArtifact](../../reference/griptape/artifacts/text_artifact.md). The format of the scraped text is determined by the driver.
+
+## MarkdownifyWebScraperDriver
+
+The [MarkdownifyWebScraperDriver](../../reference/griptape/drivers/web_scraper/markdownify_web_scraper_driver.md) outputs the scraped text in markdown format. It uses [playwright](https://pypi.org/project/playwright/) to render web pages along with dynamically loaded content, and a combination of [beautifulsoup4](https://pypi.org/project/beautifulsoup4/) and [markdownify](https://pypi.org/project/markdownify/) to convert the rendered page to markdown. It makes a best effort to keep the output concise yet human-readable.
+
+### Prerequisites
+
+1. Ensure that the `griptape` package is installed with the `drivers-web-scraper-markdownify` extra.
+
+1. Run `playwright install` to install browsers used by playwright to
+render web pages. The `playwright` command should already be installed as a dependency of the `drivers-web-scraper-markdownify` extra. For more details about playwright, see [the playwright docs](https://playwright.dev/python/docs/library).
+
+    !!! info
+        If you are using poetry, then you should run `poetry run playwright install` instead of just `playwright install`.
+
+        If you skip this step, you will see the following error when you run your code:
+        ```
+        playwright._impl._errors.Error: Executable doesn't exist at ...
+        ╔════════════════════════════════════════════════════════════╗
+        ║ Looks like Playwright was just installed or updated.       ║
+        ║ Please run the following command to download new browsers: ║
+        ║                                                            ║
+        ║     playwright install                                     ║
+        ║                                                            ║
+        ║ <3 Playwright Team                                         ║
+        ╚════════════════════════════════════════════════════════════╝
+        ```
+
+### Example Usage
+
+Here is an example of how to use it directly:
+
+```python
+from griptape.drivers import MarkdownifyWebScraperDriver
+
+driver = MarkdownifyWebScraperDriver()
+
+driver.scrape_url("https://griptape.ai")
+```
+
+Here is an example of how to use it with an agent:
+
+```python
+from griptape.drivers import MarkdownifyWebScraperDriver
+from griptape.loaders import WebLoader
+from griptape.tools import TaskMemoryClient, WebScraper
+from griptape.structures import Agent
+
+agent = Agent(
+    tools=[
+        WebScraper(
+            web_loader=WebLoader(
+                web_scraper_driver=MarkdownifyWebScraperDriver(timeout=1000)
+            ),
+            off_prompt=True,
+        ),
+        TaskMemoryClient(off_prompt=False),
+    ],
+)
+agent.run("List all email addresses on griptape.ai in a flat numbered markdown list.")
+```
+
+## TrafilaturaWebScraperDriver
+
+The [TrafilaturaWebScraperDriver](../../reference/griptape/drivers/web_scraper/trafilatura_web_scraper_driver.md) scrapes text from a webpage using the [Trafilatura](https://trafilatura.readthedocs.io) library.
+
+### Example Usage
+
+Here is an example of how to use it directly:
+
+```python
+from griptape.drivers import TrafilaturaWebScraperDriver
+
+driver = TrafilaturaWebScraperDriver()
+
+driver.scrape_url("https://griptape.ai")
+```
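
Note on the docs above: both drivers expose the same `scrape_url()` method and return a text artifact, so calling code can stay driver-agnostic. A minimal sketch of that idea, assuming only what the docs state. To stay runnable without griptape or network access, `TextArtifact` here is a local stand-in for griptape's class (with a `value` field holding the text), and `WebScraperDriver`, `StaticScraperDriver`, and `scrape_all` are hypothetical names introduced for illustration:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Protocol


@dataclass
class TextArtifact:
    """Local stand-in for griptape's TextArtifact (assumption: text lives in `value`)."""
    value: str


class WebScraperDriver(Protocol):
    """Hypothetical structural type: anything with a matching scrape_url()."""

    def scrape_url(self, url: str) -> TextArtifact: ...


class StaticScraperDriver:
    """Hypothetical driver that serves canned pages, useful in tests."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def scrape_url(self, url: str) -> TextArtifact:
        if url not in self.pages:
            raise ValueError(f"no canned page for {url}")
        return TextArtifact(self.pages[url])


def scrape_all(driver: WebScraperDriver, urls: Iterable[str]) -> Dict[str, str]:
    """Scrape each URL with the given driver and map URL -> scraped text."""
    return {url: driver.scrape_url(url).value for url in urls}


driver = StaticScraperDriver({"https://griptape.ai": "# Griptape"})
print(scrape_all(driver, ["https://griptape.ai"]))
```

In real code, an instance of `MarkdownifyWebScraperDriver` or `TrafilaturaWebScraperDriver` would take the place of the canned driver, with the output format depending on which driver is chosen.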
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -101,6 +101,7 @@ nav:
           - Image Generation Drivers: "griptape-framework/drivers/image-generation-drivers.md"
           - SQL Drivers: "griptape-framework/drivers/sql-drivers.md"
           - Image Query Drivers: "griptape-framework/drivers/image-query-drivers.md"
+          - Web Scraper Drivers: "griptape-framework/drivers/web-scraper-drivers.md"
       - Data:
           - Overview: "griptape-framework/data/index.md"
           - Artifacts: "griptape-framework/data/artifacts.md"