ArangoRDF Overhaul: 0.1.0 (#15)

* new: test suite & test data
* update: repo config
* new: arango_rdf overhaul checkpoint
* temp: base ontology files location TBD
* new: `flake8` & `mypy` workflows
* fix: black, flake, mypy
* cleanup
* temp: disable black worflow
* fix: add flake & mypy dependency
* fix: add `rich` dependency
* temp: disable `mypy` workflow getting inconsistent `mypy` results between local environment & Github Actions environment
* enable: black, mypy
* cleanup: `arango_rdf` formatting fixes, mypy fixes, docstring updates, general code cleanup
* black: test_main
* update: setup files
* update: test_pgt_case_3_2 addresses all **list_conversion** parameter cases
* update: tests
* misc: pragma no cover
* fix: test assertions
* update: test_rpt_basic_cases
* cleanup: main
* new: `rich` Live Group progress bars, `batch_size` parameter, code cleanup
* update: `rich` trackers in utils
* new: `RDFLists` typing
* new: ignore E266 flake8
* misc: line breaks
* update: `process_rpt_term`, pragma no cover
* new: case 7 prototype
* update 6.trig
* cleanup utils
* cleanup
* variable renaming, cleanup
* cleanup: test data
* rework: test suite
* remove: examples/data
* remove: arango_rdf/ontologies
* new: arango_rdf/meta
* checkpoint: arango_rdf
* fix: isort
* fix: compare_graphs
* temp fix: mypy
* new: fraud detection & imdb tests
* checkpoint: main.py
* fix: isort
* fix: isort (again)
* new: meta files switching to `trig` format
* checkpoint: tests
* checkpoint: arango_rdf working on adb mapping functionality
* checkpoint: tests
* checkpoint: arango_rdf
* cleanup: tests
* checkpoint: arango_rdf
* update: test cases
* cleanup: arango_rdf
* fix: rpt case 5
* cleanup: tests
* new: cityhash dependency
* cleanup & docstrings: arango_rdf flake8 will fail
* fix: flake8 autopep8 & yapf did not work, manual fix was required
* fix: pgt case 6
* new: __build_subclass_tree() and __identify_best_class()
* update: Tree.show()
* cleanup main
* new: dc.trig & xsd.trig starter files only adding the nodes that are referenced by the other ontologies (OWL, RDF, RDFS) for now
* update: tests
* cleanup: arango_rdf new `__pgt_add_to_adb_mapping` helper method, add restriction to property type relationship creation if contextualize_graph = True
* fix: pgt case 2_4
* more cleanup: arango_rdf
* new: load RDF Predicates regardless of contextualize_graph value (PGT only)
* update: test_adb_native_graph_to_rdf
* attempt fix: missing coverage on L922 coveralls seems to think this line is not covered by tests...
* Update README.md
* update docstrings
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* fix: flake8
* Update README.md
* new: notebook overhaul baseline
* fix: process_val_as_string
* remove: unused func
* fix: p_already_has_dr
* new: __get_literal_val
* update: __get_literal_val
* fix: subgraph names
* cp: adb_key_uri
* cleanup: arango_rdf
* update: meta trig files
* cleanup: arango_rdf
* update: tests
* more cleanup
* fix: flake8
* new: ArangoRDFController
* fix: isort
* new: use_async (rdf to arangodb)
* cleanup
* update test params
* update: test case 7
* cleanup: insert_adb_docs
* update: tests
* cleanup
* new: ArangoRDF.ipynb output file
* revert: d2277fa
* new: game of thrones dump
* update: tests
* cp: arango_rdf
* update notebook
* new: cases 8-15 in notebook
* new: rdf-star support for rpt
* Revert "new: rdf-star support for rpt" This reverts commit 2a0ae04.
* checkpoint rdf-star support prototyping,
* cleanup: adb to rdf
* new: rdf_statement_blacklist
* discard "List" collection for pgt
* new: __get_adb_edge_key
* cleanup
* checkpoint
* cleanup
* new: rdf star cases (8 to 15)
* new: individualize RPT tests
* Update ArangoRDF.ipynb
* cleanup
* new: hash adb edge ids
* update: rdf-star support workaround
* new: test cases 8-15 (pgt)
* update notebook
* cleanup
* actions: use ArangoDB 3.11
* fix notebook
* cleanup
* Update setup.py
* new: design doc template used: https://github.com/arangodb/documents/blob/master/DesignDocuments/DesignDocumentTemplate.md
* new: simplify_reified_triples flag
* new: keyify_literals (rpt) minor cleanup
* rework: batch_size (adb to rdf)
* use batch_size in tests (adb to rdf & rdf to adb)
* new: adb_key URI test case
* cleanup based on feedback
* fix: mypy
* update build workflow
* update release workflow
* cleanup, todo comments
* swap python 3.7 for 3.12
* cleanup tests (case 1 & 6)
* cleanup
* migrate to `pyproject.toml`
* fix lint
* fix mypy
* flake8 extend ignore trying to workaround 3.12 builds: https://github.com/ArangoDB-Community/ArangoRDF/actions/runs/6856708733/job/18644393745?pr=15
ArangoDB-Community · Dec 4, 2023 · 3ba3d0e
1 parent 3ac4b7e
commit 3ba3d0e
Showing 127 changed files with 10,615 additions and 1,631 deletions.
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
@@ -1,9 +1,8 @@
 name: build
 on:
  workflow_dispatch:
- push:
- branches: [ main ]
  pull_request:
+ push:
  branches: [ main ]
 env:
  PACKAGE_DIR: arango_rdf
@@ -13,7 +12,7 @@ jobs:
  runs-on: ubuntu-latest
  strategy:
  matrix:
- python: ["3.7", "3.8", "3.9"]
+ python: ["3.8", "3.9", "3.10", "3.11", "3.12"]
  name: Python ${{ matrix.python }}
  steps:
  - uses: actions/checkout@v2
@@ -22,17 +21,21 @@ jobs:
  with:
  python-version: ${{ matrix.python }}
  - name: Set up ArangoDB Instance via Docker
- run: docker create --name adb -p 8529:8529 -e ARANGO_ROOT_PASSWORD= arangodb/arangodb:3.9.1
+ run: docker create --name adb -p 8529:8529 -e ARANGO_ROOT_PASSWORD= arangodb/arangodb
  - name: Start ArangoDB Instance
  run: docker start adb
  - name: Setup pip
- run: python -m pip install --upgrade pip setuptools wheel
+ run: pip install --upgrade pip setuptools wheel
  - name: Install packages
  run: pip install .[dev]
  - name: Run black
  run: black --check --verbose --diff --color ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
+ - name: Run flake8
+ run: flake8 ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
  - name: Run isort
  run: isort --check --profile=black ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
+ - name: Run mypy
+ run: mypy ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
  - name: Run pytest
  run: pytest --cov=${{env.PACKAGE_DIR}} --cov-report xml --cov-report term-missing -v --color=yes --no-cov-on-fail --code-highlight=yes
  - name: Publish to coveralls.io

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -3,72 +3,34 @@ on:
  workflow_dispatch:
  release:
  types: [published]
-env:
- PACKAGE_DIR: arango_rdf
- TESTS_DIR: tests
 jobs:
- build:
- runs-on: ubuntu-latest
- strategy:
- matrix:
- python: ["3.7", "3.8", "3.9"]
- name: Python ${{ matrix.python }}
- steps:
- - uses: actions/checkout@v2
- - name: Setup Python ${{ matrix.python }}
- uses: actions/setup-python@v2
- with:
- python-version: ${{ matrix.python }}
- - name: Set up ArangoDB Instance via Docker
- run: docker create --name adb -p 8529:8529 -e ARANGO_ROOT_PASSWORD= arangodb/arangodb:3.9.1
- - name: Start ArangoDB Instance
- run: docker start adb
- - name: Setup pip
- run: python -m pip install --upgrade pip setuptools wheel
- - name: Install packages
- run: pip install .[dev]
- - name: Run black
- run: black --check --verbose --diff --color ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
- - name: Run isort
- run: isort --check --profile=black ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
- - name: Run pytest
- run: pytest --cov=${{env.PACKAGE_DIR}} --cov-report xml --cov-report term-missing -v --color=yes --no-cov-on-fail --code-highlight=yes
- - name: Publish to coveralls.io
- if: matrix.python == '3.8'
- env:
- GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- run: coveralls --service=github
-
  release:
- needs: build
  runs-on: ubuntu-latest
  name: Release package
  steps:
- - uses: actions/checkout@v2
+ - uses: actions/checkout@v4
 
  - name: Fetch complete history for all tags and branches
  run: git fetch --prune --unshallow
 
- - name: Setup python
- uses: actions/setup-python@v2
+ - name: Setup Python
+ uses: actions/setup-python@v4
  with:
- python-version: "3.8"
+ python-version: "3.10"
 
  - name: Install release packages
  run: pip install setuptools wheel twine setuptools-scm[toml]
 
- - name: Install dependencies
- run: pip install .[dev]
-
  - name: Build distribution
  run: python setup.py sdist bdist_wheel
 
- - name: Publish to PyPI Test
+ - name: Publish to Test PyPi
  env:
  TWINE_USERNAME: __token__
  TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD_TEST }}
  run: twine upload --repository testpypi dist/* #--skip-existing
- - name: Publish to PyPI
+
+ - name: Publish to PyPi
  env:
  TWINE_USERNAME: __token__
  TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
@@ -79,7 +41,7 @@ jobs:
  runs-on: ubuntu-latest
  name: Update Changelog
  steps:
- - uses: actions/checkout@v2
+ - uses: actions/checkout@v4
  with:
  fetch-depth: 0
 
@@ -91,10 +53,10 @@ jobs:
  env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 
- - name: Setup python
- uses: actions/setup-python@v2
+ - name: Setup Python
+ uses: actions/setup-python@v4
  with:
- python-version: "3.8"
+ python-version: "3.10"
 
  - name: Install release packages
  run: pip install wheel gitchangelog pystache
@@ -106,12 +68,12 @@ jobs:
  run: gitchangelog ${{env.VERSION}} > CHANGELOG.md
 
  - name: Make commit for auto-generated changelog
- uses: EndBug/add-and-commit@v7
+ uses: EndBug/add-and-commit@v9
  env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  with:
  add: "CHANGELOG.md"
- branch: actions/changelog
+ new_branch: actions/changelog
  message: "!gitchangelog"
 
  - name: Create pull request for the auto generated changelog
@@ -124,4 +86,4 @@ jobs:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 
  - name: Alert developer of open PR
- run: echo "Changelog $PR_URL is ready to be merged by developer."
+ run: echo "Changelog $PR_URL is ready to be merged by developer."
diff --git a/README.md b/README.md
@@ -1,7 +1,4 @@
-# DEVELOPMENT VERSION - WIP - EXPECT BREAKING CHANGES
-___
-
-# Arango-RDF
+# ArangoRDF
 
 [![build](https://github.com/ArangoDB-Community/ArangoRDF/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/ArangoDB-Community/ArangoRDF/actions/workflows/build.yml)
 [![CodeQL](https://github.com/ArangoDB-Community/ArangoRDF/actions/workflows/analyze.yml/badge.svg?branch=main)](https://github.com/ArangoDB-Community/ArangoRDF/actions/workflows/analyze.yml)
@@ -18,7 +15,7 @@ ___
 <a href="https://www.arangodb.com/" rel="arangodb.com"><img src="https://raw.githubusercontent.com/ArangoDB-Community/ArangoRDF/main/examples/assets/adb_logo.png" width=10%/>
 <a href="https://www.w3.org/RDF/" rel="w3.org/RDF"><img src="https://raw.githubusercontent.com/ArangoDB-Community/ArangoRDF/main/examples/assets/rdf_logo.png" width=7% /></a>
 
-Import/Export RDF graphs with ArangoDB
+Convert RDF Graphs to ArangoDB, and vice-versa.
 
 ## About RDF
 
@@ -47,58 +44,66 @@ pip install git+https://github.com/ArangoDB-Community/ArangoRDF
 Run the full version with Google Colab: <a href="https://colab.research.google.com/github/ArangoDB-Community/ArangoRDF/blob/main/examples/ArangoRDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
 
 ```py
+from rdflib import Graph
 from arango import ArangoClient
 from arango_rdf import ArangoRDF
 
-db = ArangoClient(hosts="http://localhost:8529").db(
- "rdf", username="root", password="openSesame"
-)
+db = ArangoClient(hosts="http://localhost:8529").db("_system", username="root", password="")
+
+adbrdf = ArangoRDF(db)
 
-# Clean up existing data and collections
-if db.has_graph("default_graph"):
- db.delete_graph("default_graph", drop_collections=True, ignore_missing=True)
+g = Graph()
+g.parse("https://raw.githubusercontent.com/stardog-union/stardog-tutorials/master/music/beatles.ttl")
 
-# Initializes default_graph and sets RDF graph identifier (ArangoDB sub_graph)
-# Optional: sub_graph (stores graph name as the 'graph' attribute on all edges in Statement collection)
-# Optional: default_graph (name of ArangoDB Named Graph, defaults to 'default_graph',
-# is root graph that contains all collections/relations)
-adb_rdf = ArangoRDF(db, sub_graph="http://data.sfgov.org/ontology") 
-config = {"normalize_literals": False} # default: False
+# RDF to ArangoDB
+###################################################################################
 
-# RDF Import
-adb_rdf.init_rdf_collections(bnode="Blank")
+# 1.1: RDF-Topology Preserving Transformation (RPT)
+adbrdf.rdf_to_arangodb_by_rpt("Beatles", g, overwrite_graph=True)
 
-# Start with importing the ontology
-adb_graph = adb_rdf.import_rdf("./examples/data/airport-ontology.owl", format="xml", config=config, save_config=True)
+# 1.2: Property Graph Transformation (PGT) 
+adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, overwrite_graph=True)
 
-# Next, let's import the actual graph data
-adb_graph = adb_rdf.import_rdf(f"./examples/data/sfo-aircraft-partial.ttl", format="ttl", config=config, save_config=True)
+g = adbrdf.load_meta_ontology(g)
 
+# 1.3: RPT w/ Graph Contextualization
+adbrdf.rdf_to_arangodb_by_rpt("Beatles", g, contextualize_graph=True, overwrite_graph=True)
 
-# RDF Export
-# WARNING:
-# Exports ALL collections of the database,
-# currently does not account for default_graph or sub_graph
-# Results may vary, minifying may occur
-rdf_graph = adb_rdf.export_rdf(f"./examples/data/rdfExport.xml", format="xml")
+# 1.4: PGT w/ Graph Contextualization
+adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, contextualize_graph=True, overwrite_graph=True)
 
-# Drop graph and ALL documents and collections to test import from exported data
-if db.has_graph("default_graph"):
- db.delete_graph("default_graph", drop_collections=True, ignore_missing=True)
+# 1.5: PGT w/ ArangoDB Document-to-Collection Mapping Exposed
+adb_mapping = adbrdf.build_adb_mapping_for_pgt(g)
+print(adb_mapping.serialize())
+adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, adb_mapping, contextualize_graph=True, overwrite_graph=True)
 
-# Re-initialize our RDF Graph
-# Initializes default_graph and sets RDF graph identifier (ArangoDB sub_graph)
-adb_rdf = ArangoRDF(db, sub_graph="http://data.sfgov.org/ontology")
+# ArangoDB to RDF
+###################################################################################
 
-adb_rdf.init_rdf_collections(bnode="Blank")
+# Start from scratch!
+g = Graph()
+g.parse("https://raw.githubusercontent.com/stardog-union/stardog-tutorials/master/music/beatles.ttl")
+adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, overwrite_graph=True)
 
-config = adb_rdf.get_config_by_latest() # gets the last config saved
-# config = adb_rdf.get_config_by_key_value('graph', 'music')
-# config = adb_rdf.get_config_by_key_value('AnyKeySuppliedInConfig', 'SomeValue')
+# 2.1: Via Graph Name
+g2, adb_mapping_2 = adbrdf.arangodb_graph_to_rdf("Beatles", Graph())
 
-# Re-import Exported data
-adb_graph = adb_rdf.import_rdf(f"./examples/data/rdfExport.xml", format="xml", config=config)
+# 2.2: Via Collection Names
+g3, adb_mapping_3 = adbrdf.arangodb_collections_to_rdf(
+ "Beatles",
+ Graph(),
+ v_cols={"Album", "Band", "Class", "Property", "SoloArtist", "Song"},
+ e_cols={"artist", "member", "track", "type", "writer"},
+)
+
+print(len(g2), len(adb_mapping_2))
+print(len(g3), len(adb_mapping_3))
 
+print('--------------------')
+print(g2.serialize())
+print('--------------------')
+print(adb_mapping_2.serialize())
+print('--------------------')
 ```
 
 ## Development & Testing
@@ -119,3 +124,75 @@ def pytest_addoption(parser):
  parser.addoption("--password", action="store", default="")
 ```
 
+## Additional Info: RDF to ArangoDB
+
+RDF-to-ArangoDB functionality has been implemented using concepts described in the paper *[Transforming RDF-star to Property Graphs: A Preliminary Analysis of Transformation Approaches](https://arxiv.org/abs/2210.05781)*.
+
+In other words, `ArangoRDF` offers 2 RDF-to-ArangoDB transformation methods:
+1. RDF-topology Preserving Transformation (RPT): `ArangoRDF.rdf_to_arangodb_by_rpt()`
+2. Property Graph Transformation (PGT): `ArangoRDF.rdf_to_arangodb_by_pgt()`
+
+RPT preserves the RDF Graph structure by transforming each RDF Statement into an ArangoDB Edge.
+
+PGT, on the other hand, ensures that Datatype Property Statements are mapped to ArangoDB Document Properties.
+
+```ttl
+@prefix ex: <http://example.org/> .
+@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+ex:book ex:publish_date "1963-03-22"^^xsd:date .
+ex:book ex:pages "100"^^xsd:integer .
+ex:book ex:cover 20 .
+ex:book ex:index 55 .
+```
+
+| RPT | PGT |
+|:-------------------------:|:-------------------------:|
+| ![image](https://user-images.githubusercontent.com/43019056/232347662-ab48ebfb-e215-4aff-af28-a5915414a8fd.png) | ![image](https://user-images.githubusercontent.com/43019056/232347681-c899ef09-53c7-44de-861e-6a98d448b473.png) |
+
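The RPT/PGT contrast in this README hunk can be sketched in plain Python. This is only an illustration of the idea, not ArangoRDF's implementation: the `Literal` stand-in class, the `predicate` key, and the use of raw prefixed names instead of ArangoDB document IDs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal (not rdflib's class)."""
    value: object


# The four statements from the Turtle example, as (subject, predicate, object).
TRIPLES = [
    ("ex:book", "ex:publish_date", Literal("1963-03-22")),
    ("ex:book", "ex:pages", Literal(100)),
    ("ex:book", "ex:cover", Literal(20)),
    ("ex:book", "ex:index", Literal(55)),
]


def rpt_sketch(triples):
    """RPT idea: every statement, literal-valued or not, becomes one edge."""
    return [{"_from": s, "_to": o, "predicate": p} for s, p, o in triples]


def pgt_sketch(triples):
    """PGT idea: literal-valued statements become document properties;
    only statements between resources become edges."""
    docs, edges = {}, []
    for s, p, o in triples:
        doc = docs.setdefault(s, {})
        if isinstance(o, Literal):
            doc[p] = o.value
        else:
            docs.setdefault(o, {})
            edges.append({"_from": s, "_to": o, "predicate": p})
    return docs, edges


rpt_edges = rpt_sketch(TRIPLES)
pgt_docs, pgt_edges = pgt_sketch(TRIPLES)
print(len(rpt_edges), "RPT edges")
print(pgt_docs, pgt_edges)
```

Under these assumptions, RPT produces one edge per statement for the book example, while PGT produces a single `ex:book` document carrying all four values and no edges.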
+--------------------
+### RPT
+
+
+The `ArangoRDF.rdf_to_arangodb_by_rpt` method will store the RDF Resources of your RDF Graph under the following ArangoDB Collections:
+
+ - {graph_name}_URIRef: The Document collection for `rdflib.term.URIRef` resources.
+ - {graph_name}_BNode: The Document collection for `rdflib.term.BNode` resources.
+ - {graph_name}_Literal: The Document collection for `rdflib.term.Literal` resources.
+ - {graph_name}_Statement: The Edge collection for all triples/quads.
+
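The RPT collection naming listed above amounts to suffixing the graph name with the term type. A minimal sketch, assuming local stand-in classes named after the `rdflib.term` types (the helper names `rpt_collection` and `rpt_statement_collection` are hypothetical and not part of ArangoRDF):

```python
from dataclasses import dataclass


# Local stand-ins named after rdflib.term.URIRef / BNode / Literal so the
# sketch runs without rdflib installed; they are not the real classes.
@dataclass(frozen=True)
class URIRef:
    value: str


@dataclass(frozen=True)
class BNode:
    value: str


@dataclass(frozen=True)
class Literal:
    value: object


def rpt_collection(graph_name, term):
    """Document collection for a term: '{graph_name}_{term type name}'."""
    return f"{graph_name}_{type(term).__name__}"


def rpt_statement_collection(graph_name):
    """Edge collection that holds every triple/quad of the graph."""
    return f"{graph_name}_Statement"


print(rpt_collection("Beatles", URIRef("http://example.org/book")))
print(rpt_collection("Beatles", BNode("b0")))
print(rpt_collection("Beatles", Literal(100)))
print(rpt_statement_collection("Beatles"))
```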
+--------------------
+### PGT
+
+In contrast to RPT, the `ArangoRDF.rdf_to_arangodb_by_pgt` method relies on the nature of the RDF Resource/Statement to determine which ArangoDB Collection it belongs to. This is referred to as the **ArangoDB Collection Mapping Process**. This process relies on 2 fundamental URIs:
+
+1) `<http://www.arangodb.com/collection>` (adb:collection)
+ - Any RDF Statement of the form `<http://example.com/Bob> <adb:collection> "Person"` will map the Subject to the ArangoDB "Person" document collection.
+
+2) `<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>` (rdf:type)
+ - This strategy is divided into 3 cases:
+
+ 1. If an RDF Resource only has one `rdf:type` statement,
+ then the local name of the RDF Object is used as the ArangoDB
+ Document Collection name. For example,
+ `<http://example.com/Bob> <rdf:type> <http://example.com/Person>`
+ would create a JSON Document for `<http://example.com/Bob>`,
+ and place it under the `Person` Document Collection.
+ NOTE: The RDF Object will also have its own JSON Document
+ created, and will be placed under the "Class"
+ Document Collection.
+
+ 2. If an RDF Resource has multiple `rdf:type` statements,
+ with some (or all) of the RDF Objects of those statements
+ belonging in an `rdfs:subClassOf` Taxonomy, then the
+ local name of the "most specific" Class within the Taxonomy is
+ used (i.e. the Class with the greatest depth). If there is a
+ tie between 2+ Classes, then the URIs are alphabetically
+ sorted & the first one is picked.
+
+ 3. If an RDF Resource has multiple `rdf:type` statements, with none
+ of the RDF Objects of those statements belonging in an
+ `rdfs:subClassOf` Taxonomy, then the URIs are
+ alphabetically sorted & the first one is picked. The local
+ name of the selected URI will be designated as the Document
+ collection for that Resource.
+--------------------
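The three `rdf:type` cases described in the README hunk above can be sketched as a small selection function. This is a simplified illustration, not the code behind `__identify_best_class()`: it assumes the `rdfs:subClassOf` taxonomy is given as a single-parent `child -> parent` dictionary, and the names `local_name`, `depth`, and `pick_collection` are hypothetical.

```python
import re
from typing import Dict, List


def local_name(uri: str) -> str:
    """Last path or fragment segment of a URI, e.g. '.../Person' -> 'Person'."""
    return re.split(r"[#/]", uri.rstrip("#/"))[-1]


def depth(cls: str, parent_of: Dict[str, str]) -> int:
    """Number of rdfs:subClassOf hops from cls up to its taxonomy root."""
    hops, seen = 0, set()
    while cls in parent_of and cls not in seen:  # 'seen' guards against cycles
        seen.add(cls)
        cls = parent_of[cls]
        hops += 1
    return hops


def pick_collection(types: List[str], parent_of: Dict[str, str]) -> str:
    """Choose a collection name from a resource's rdf:type objects."""
    # Case 1: a single rdf:type statement -> its local name.
    if len(types) == 1:
        return local_name(types[0])

    parents = set(parent_of.values())
    in_taxonomy = [t for t in types if t in parent_of or t in parents]

    if in_taxonomy:
        # Case 2: keep the deepest ("most specific") classes of the taxonomy.
        deepest = max(depth(t, parent_of) for t in in_taxonomy)
        candidates = [t for t in in_taxonomy if depth(t, parent_of) == deepest]
    else:
        # Case 3: no taxonomy information, so every type is a candidate.
        candidates = list(types)

    # Ties (and case 3) are resolved by sorting the URIs and taking the first.
    return local_name(sorted(candidates)[0])


EX = "http://example.com/"
taxonomy = {EX + "Singer": EX + "Artist", EX + "Artist": EX + "Person"}
print(pick_collection([EX + "Person", EX + "Singer", EX + "Artist"], taxonomy))
```

A dictionary keeps the sketch short; a class with several superclasses would need a graph structure and a longest-path depth instead.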