ArangoRDF Overhaul: 0.1.0 (#15)
* new: test suite & test data

* update: repo config

* new: arango_rdf overhaul checkpoint

* temp: base ontology files

location TBD

* new: `flake8` & `mypy` workflows

* fix: black, flake, mypy

* cleanup

* temp: disable black workflow

* fix: add flake & mypy dependency

* fix: add `rich` dependency

* temp: disable `mypy` workflow

getting inconsistent `mypy` results between local environment & GitHub Actions environment

* enable: black, mypy

* cleanup: `arango_rdf`

formatting fixes, mypy fixes, docstring updates, general code cleanup

* black: test_main

* update: setup files

* update: test_pgt_case_3_2

addresses all **list_conversion** parameter cases

* update: tests

* misc: pragma no cover

* fix: test assertions

* update: test_rpt_basic_cases

* cleanup: main

* new: `rich` Live Group progress bars, `batch_size` parameter, code cleanup

* update: `rich` trackers in utils

* new: `RDFLists` typing

* new: ignore E266 flake8

* misc: line breaks

* update: `process_rpt_term`, pragma no cover

* new: case 7 prototype

* update 6.trig

* cleanup utils

* cleanup

* variable renaming, cleanup

* cleanup: test data

* rework: test suite

* remove: examples/data

* remove: arango_rdf/ontologies

* new: arango_rdf/meta

* checkpoint: arango_rdf

* fix: isort

* fix: compare_graphs

* temp fix: mypy

* new: fraud detection & imdb tests

* checkpoint: main.py

* fix: isort

* fix: isort (again)

* new: meta files

switching to `trig` format

* checkpoint: tests

* checkpoint: arango_rdf

working on adb mapping functionality

* checkpoint: tests

* checkpoint: arango_rdf

* cleanup: tests

* checkpoint: arango_rdf

* update: test cases

* cleanup: arango_rdf

* fix: rpt case 5

* cleanup: tests

* new: cityhash dependency

* cleanup & docstrings: arango_rdf

flake8 will fail

* fix: flake8

autopep8 & yapf did not work, manual fix was required

* fix: pgt case 6

* new: __build_subclass_tree() and __identify_best_class()

* update: Tree.show()

* cleanup main

* new: dc.trig & xsd.trig starter files

only adding the nodes that are referenced by the other ontologies (OWL, RDF, RDFS) for now

* update: tests

* cleanup: arango_rdf

new `__pgt_add_to_adb_mapping` helper method, add restriction to property type relationship creation if contextualize_graph = True

* fix: pgt case 2_4

* more cleanup: arango_rdf

* new: load RDF Predicates regardless of contextualize_graph value (PGT only)

* update: test_adb_native_graph_to_rdf

* attempt fix: missing coverage on L922

coveralls seems to think this line is not covered by tests...

* Update README.md

* update docstrings

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* fix: flake8

* Update README.md

* new: notebook overhaul baseline

* fix: process_val_as_string

* remove: unused func

* fix: p_already_has_dr

* new: __get_literal_val

* update: __get_literal_val

* fix: subgraph names

* cp: adb_key_uri

* cleanup: arango_rdf

* update: meta trig files

* cleanup: arango_rdf

* update: tests

* more cleanup

* fix: flake8

* new: ArangoRDFController

* fix: isort

* new: use_async (rdf to arangodb)

* cleanup

* update test params

* update: test case 7

* cleanup: insert_adb_docs

* update: tests

* cleanup

* new: ArangoRDF.ipynb output file

* revert: d2277fa

* new: game of thrones dump

* update: tests

* cp: arango_rdf

* update notebook

* new: cases 8-15 in notebook

* new: rdf-star support for rpt

* Revert "new: rdf-star support for rpt"

This reverts commit 2a0ae04.

* checkpoint

rdf-star support prototyping

* cleanup: adb to rdf

* new: rdf_statement_blacklist

* discard "List" collection for pgt

* new: __get_adb_edge_key

* cleanup

* checkpoint

* cleanup

* new: rdf star cases (8 to 15)

* new: individualize RPT tests

* Update ArangoRDF.ipynb

* cleanup

* new: hash adb edge ids

* update: rdf-star support workaround

* new: test cases 8-15 (pgt)

* update notebook

* cleanup

* actions: use ArangoDB 3.11

* fix notebook

* cleanup

* Update setup.py

* new: design doc

template used: https://github.com/arangodb/documents/blob/master/DesignDocuments/DesignDocumentTemplate.md

* new: simplify_reified_triples flag

* new: keyify_literals (rpt)

minor cleanup

* rework: batch_size (adb to rdf)

* use batch_size in tests

(adb to rdf & rdf to adb)

* new: adb_key URI test case

* cleanup based on feedback

* fix: mypy

* update build workflow

* update release workflow

* cleanup, todo comments

* swap python 3.7 for 3.12

* cleanup tests (case 1 & 6)

* cleanup

* migrate to `pyproject.toml`

* fix lint

* fix mypy

* flake8 extend ignore

trying to workaround 3.12 builds: https://github.com/ArangoDB-Community/ArangoRDF/actions/runs/6856708733/job/18644393745?pr=15
aMahanna authored Dec 4, 2023
1 parent 3ac4b7e commit 3ba3d0e
Showing 127 changed files with 10,615 additions and 1,631 deletions.
13 changes: 8 additions & 5 deletions .github/workflows/build.yml
@@ -1,9 +1,8 @@
name: build
on:
workflow_dispatch:
push:
branches: [ main ]
pull_request:
push:
branches: [ main ]
env:
PACKAGE_DIR: arango_rdf
@@ -13,7 +12,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python: ["3.7", "3.8", "3.9"]
python: ["3.8", "3.9", "3.10", "3.11", "3.12"]
name: Python ${{ matrix.python }}
steps:
- uses: actions/checkout@v2
@@ -22,17 +21,21 @@ jobs:
with:
python-version: ${{ matrix.python }}
- name: Set up ArangoDB Instance via Docker
run: docker create --name adb -p 8529:8529 -e ARANGO_ROOT_PASSWORD= arangodb/arangodb:3.9.1
run: docker create --name adb -p 8529:8529 -e ARANGO_ROOT_PASSWORD= arangodb/arangodb
- name: Start ArangoDB Instance
run: docker start adb
- name: Setup pip
run: python -m pip install --upgrade pip setuptools wheel
run: pip install --upgrade pip setuptools wheel
- name: Install packages
run: pip install .[dev]
- name: Run black
run: black --check --verbose --diff --color ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
- name: Run flake8
run: flake8 ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
- name: Run isort
run: isort --check --profile=black ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
- name: Run mypy
run: mypy ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
- name: Run pytest
run: pytest --cov=${{env.PACKAGE_DIR}} --cov-report xml --cov-report term-missing -v --color=yes --no-cov-on-fail --code-highlight=yes
- name: Publish to coveralls.io
66 changes: 14 additions & 52 deletions .github/workflows/release.yml
@@ -3,72 +3,34 @@ on:
workflow_dispatch:
release:
types: [published]
env:
PACKAGE_DIR: arango_rdf
TESTS_DIR: tests
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python: ["3.7", "3.8", "3.9"]
name: Python ${{ matrix.python }}
steps:
- uses: actions/checkout@v2
- name: Setup Python ${{ matrix.python }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python }}
- name: Set up ArangoDB Instance via Docker
run: docker create --name adb -p 8529:8529 -e ARANGO_ROOT_PASSWORD= arangodb/arangodb:3.9.1
- name: Start ArangoDB Instance
run: docker start adb
- name: Setup pip
run: python -m pip install --upgrade pip setuptools wheel
- name: Install packages
run: pip install .[dev]
- name: Run black
run: black --check --verbose --diff --color ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
- name: Run isort
run: isort --check --profile=black ${{env.PACKAGE_DIR}} ${{env.TESTS_DIR}}
- name: Run pytest
run: pytest --cov=${{env.PACKAGE_DIR}} --cov-report xml --cov-report term-missing -v --color=yes --no-cov-on-fail --code-highlight=yes
- name: Publish to coveralls.io
if: matrix.python == '3.8'
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: coveralls --service=github

release:
needs: build
runs-on: ubuntu-latest
name: Release package
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4

- name: Fetch complete history for all tags and branches
run: git fetch --prune --unshallow

- name: Setup python
uses: actions/setup-python@v2
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.10"

- name: Install release packages
run: pip install setuptools wheel twine setuptools-scm[toml]

- name: Install dependencies
run: pip install .[dev]

- name: Build distribution
run: python setup.py sdist bdist_wheel

- name: Publish to PyPI Test
- name: Publish to Test PyPi
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD_TEST }}
run: twine upload --repository testpypi dist/* #--skip-existing
- name: Publish to PyPI

- name: Publish to PyPi
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
@@ -79,7 +41,7 @@ jobs:
runs-on: ubuntu-latest
name: Update Changelog
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
with:
fetch-depth: 0

@@ -91,10 +53,10 @@ jobs:
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

- name: Setup python
uses: actions/setup-python@v2
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.10"

- name: Install release packages
run: pip install wheel gitchangelog pystache
@@ -106,12 +68,12 @@ jobs:
run: gitchangelog ${{env.VERSION}} > CHANGELOG.md

- name: Make commit for auto-generated changelog
uses: EndBug/add-and-commit@v7
uses: EndBug/add-and-commit@v9
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
add: "CHANGELOG.md"
branch: actions/changelog
new_branch: actions/changelog
message: "!gitchangelog"

- name: Create pull request for the auto generated changelog
@@ -124,4 +86,4 @@ jobs:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

- name: Alert developer of open PR
run: echo "Changelog $PR_URL is ready to be merged by developer."
run: echo "Changelog $PR_URL is ready to be merged by developer."
159 changes: 118 additions & 41 deletions README.md
@@ -1,7 +1,4 @@
# DEVELOPMENT VERSION - WIP - EXPECT BREAKING CHANGES
___

# Arango-RDF
# ArangoRDF

[![build](https://github.com/ArangoDB-Community/ArangoRDF/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/ArangoDB-Community/ArangoRDF/actions/workflows/build.yml)
[![CodeQL](https://github.com/ArangoDB-Community/ArangoRDF/actions/workflows/analyze.yml/badge.svg?branch=main)](https://github.com/ArangoDB-Community/ArangoRDF/actions/workflows/analyze.yml)
@@ -18,7 +15,7 @@ ___
<a href="https://www.arangodb.com/" rel="arangodb.com"><img src="https://raw.githubusercontent.com/ArangoDB-Community/ArangoRDF/main/examples/assets/adb_logo.png" width=10%/>
<a href="https://www.w3.org/RDF/" rel="w3.org/RDF"><img src="https://raw.githubusercontent.com/ArangoDB-Community/ArangoRDF/main/examples/assets/rdf_logo.png" width=7% /></a>

Import/Export RDF graphs with ArangoDB
Convert RDF Graphs to ArangoDB, and vice-versa.

## About RDF

@@ -47,58 +44,66 @@ pip install git+https://github.com/ArangoDB-Community/ArangoRDF
Run the full version with Google Colab: <a href="https://colab.research.google.com/github/ArangoDB-Community/ArangoRDF/blob/main/examples/ArangoRDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```py
from rdflib import Graph
from arango import ArangoClient
from arango_rdf import ArangoRDF

db = ArangoClient(hosts="http://localhost:8529").db(
"rdf", username="root", password="openSesame"
)
db = ArangoClient(hosts="http://localhost:8529").db("_system", username="root", password="")

adbrdf = ArangoRDF(db)

# Clean up existing data and collections
if db.has_graph("default_graph"):
db.delete_graph("default_graph", drop_collections=True, ignore_missing=True)
g = Graph()
g.parse("https://raw.githubusercontent.com/stardog-union/stardog-tutorials/master/music/beatles.ttl")

# Initializes default_graph and sets RDF graph identifier (ArangoDB sub_graph)
# Optional: sub_graph (stores graph name as the 'graph' attribute on all edges in Statement collection)
# Optional: default_graph (name of ArangoDB Named Graph, defaults to 'default_graph',
# is root graph that contains all collections/relations)
adb_rdf = ArangoRDF(db, sub_graph="http://data.sfgov.org/ontology")
config = {"normalize_literals": False} # default: False
# RDF to ArangoDB
###################################################################################

# RDF Import
adb_rdf.init_rdf_collections(bnode="Blank")
# 1.1: RDF-Topology Preserving Transformation (RPT)
adbrdf.rdf_to_arangodb_by_rpt("Beatles", g, overwrite_graph=True)

# Start with importing the ontology
adb_graph = adb_rdf.import_rdf("./examples/data/airport-ontology.owl", format="xml", config=config, save_config=True)
# 1.2: Property Graph Transformation (PGT)
adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, overwrite_graph=True)

# Next, let's import the actual graph data
adb_graph = adb_rdf.import_rdf(f"./examples/data/sfo-aircraft-partial.ttl", format="ttl", config=config, save_config=True)
g = adbrdf.load_meta_ontology(g)

# 1.3: RPT w/ Graph Contextualization
adbrdf.rdf_to_arangodb_by_rpt("Beatles", g, contextualize_graph=True, overwrite_graph=True)

# RDF Export
# WARNING:
# Exports ALL collections of the database,
# currently does not account for default_graph or sub_graph
# Results may vary, minifying may occur
rdf_graph = adb_rdf.export_rdf(f"./examples/data/rdfExport.xml", format="xml")
# 1.4: PGT w/ Graph Contextualization
adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, contextualize_graph=True, overwrite_graph=True)

# Drop graph and ALL documents and collections to test import from exported data
if db.has_graph("default_graph"):
db.delete_graph("default_graph", drop_collections=True, ignore_missing=True)
# 1.5: PGT w/ ArangoDB Document-to-Collection Mapping Exposed
adb_mapping = adbrdf.build_adb_mapping_for_pgt(g)
print(adb_mapping.serialize())
adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, adb_mapping, contextualize_graph=True, overwrite_graph=True)

# Re-initialize our RDF Graph
# Initializes default_graph and sets RDF graph identifier (ArangoDB sub_graph)
adb_rdf = ArangoRDF(db, sub_graph="http://data.sfgov.org/ontology")
# ArangoDB to RDF
###################################################################################

adb_rdf.init_rdf_collections(bnode="Blank")
# Start from scratch!
g = Graph()
g.parse("https://raw.githubusercontent.com/stardog-union/stardog-tutorials/master/music/beatles.ttl")
adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, overwrite_graph=True)

config = adb_rdf.get_config_by_latest() # gets the last config saved
# config = adb_rdf.get_config_by_key_value('graph', 'music')
# config = adb_rdf.get_config_by_key_value('AnyKeySuppliedInConfig', 'SomeValue')
# 2.1: Via Graph Name
g2, adb_mapping_2 = adbrdf.arangodb_graph_to_rdf("Beatles", Graph())

# Re-import Exported data
adb_graph = adb_rdf.import_rdf(f"./examples/data/rdfExport.xml", format="xml", config=config)
# 2.2: Via Collection Names
g3, adb_mapping_3 = adbrdf.arangodb_collections_to_rdf(
"Beatles",
Graph(),
v_cols={"Album", "Band", "Class", "Property", "SoloArtist", "Song"},
e_cols={"artist", "member", "track", "type", "writer"},
)

print(len(g2), len(adb_mapping_2))
print(len(g3), len(adb_mapping_3))

print('--------------------')
print(g2.serialize())
print('--------------------')
print(adb_mapping_2.serialize())
print('--------------------')
```

## Development & Testing
@@ -119,3 +124,75 @@ def pytest_addoption(parser):
parser.addoption("--password", action="store", default="")
```

## Additional Info: RDF to ArangoDB

RDF-to-ArangoDB functionality has been implemented using concepts described in the paper *[Transforming RDF-star to Property Graphs: A Preliminary Analysis of Transformation Approaches](https://arxiv.org/abs/2210.05781)*.

In other words, `ArangoRDF` offers 2 RDF-to-ArangoDB transformation methods:
1. RDF-topology Preserving Transformation (RPT): `ArangoRDF.rdf_to_arangodb_by_rpt()`
2. Property Graph Transformation (PGT): `ArangoRDF.rdf_to_arangodb_by_pgt()`

RPT preserves the RDF Graph structure by transforming each RDF Statement into an ArangoDB Edge.

PGT, on the other hand, maps Datatype Property Statements to ArangoDB Document Properties instead of separate Edges.

```ttl
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:book ex:publish_date "1963-03-22"^^xsd:date .
ex:book ex:pages "100"^^xsd:integer .
ex:book ex:cover 20 .
ex:book ex:index 55 .
```

| RPT | PGT |
|:-------------------------:|:-------------------------:|
| ![image](https://user-images.githubusercontent.com/43019056/232347662-ab48ebfb-e215-4aff-af28-a5915414a8fd.png) | ![image](https://user-images.githubusercontent.com/43019056/232347681-c899ef09-53c7-44de-861e-6a98d448b473.png) |
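
To make the difference concrete, here is a minimal, library-free sketch of the two mappings applied to the four statements above. It does not call ArangoRDF. The tuple layout, the helper names (`local_name`, `rpt_sketch`, `pgt_sketch`), and the use of a predicate's local name as the property key are assumptions made purely for illustration.

```python
# Hypothetical illustration only: plain Python, not the ArangoRDF implementation.
# Each statement is (subject, predicate, object, object_is_literal).
TRIPLES = [
    ("ex:book", "ex:publish_date", "1963-03-22", True),
    ("ex:book", "ex:pages", 100, True),
    ("ex:book", "ex:cover", 20, True),
    ("ex:book", "ex:index", 55, True),
]


def local_name(term):
    """Return the part of a prefixed name after the last ':' (assumed naming rule)."""
    return term.rsplit(":", 1)[-1]


def rpt_sketch(triples):
    """RPT-style: every statement becomes an edge, so each literal is its own vertex."""
    vertices, edges = set(), []
    for s, p, o, _is_literal in triples:
        vertices.update([s, o])
        edges.append({"from": s, "label": p, "to": o})
    return vertices, edges


def pgt_sketch(triples):
    """PGT-style: literal-valued statements become properties of the subject's document."""
    documents, edges = {}, []
    for s, p, o, is_literal in triples:
        doc = documents.setdefault(s, {})
        if is_literal:
            doc[local_name(p)] = o
        else:
            # Non-literal objects get their own document and stay connected by an edge.
            documents.setdefault(o, {})
            edges.append({"from": s, "label": p, "to": o})
    return documents, edges


rpt_vertices, rpt_edges = rpt_sketch(TRIPLES)
pgt_documents, pgt_edges = pgt_sketch(TRIPLES)
print(len(rpt_vertices), len(rpt_edges))
print(pgt_documents, pgt_edges)
```

Under these assumptions, the RPT-style pass produces one book vertex linked by four edges to four literal vertices, while the PGT-style pass produces a single document carrying four properties and no edges.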

--------------------
### RPT


The `ArangoRDF.rdf_to_arangodb_by_rpt` method will store the RDF Resources of your RDF Graph under the following ArangoDB Collections:

- {graph_name}_URIRef: The Document collection for `rdflib.term.URIRef` resources.
- {graph_name}_BNode: The Document collection for `rdflib.term.BNode` resources.
- {graph_name}_Literal: The Document collection for `rdflib.term.Literal` resources.
- {graph_name}_Statement: The Edge collection for all triples/quads.
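
As a small sketch of the naming pattern listed above, the helper below derives the four collection names for a given graph name. `rpt_collection_names` is a hypothetical function written for this illustration, not part of the ArangoRDF API.

```python
# Hypothetical helper: derives the RPT collection names from the pattern listed above.
RPT_SUFFIXES = ("URIRef", "BNode", "Literal", "Statement")


def rpt_collection_names(graph_name):
    """Map each RPT suffix to its '{graph_name}_{suffix}' collection name."""
    return {suffix: f"{graph_name}_{suffix}" for suffix in RPT_SUFFIXES}


names = rpt_collection_names("Beatles")
for suffix, collection in names.items():
    print(suffix, "->", collection)
```

For the `Beatles` graph used in the README example, the first three names would be Document collections and the `Statement` name would be the Edge collection.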

--------------------
### PGT

In contrast to RPT, the `ArangoRDF.rdf_to_arangodb_by_pgt` method relies on the nature of the RDF Resource/Statement to determine which ArangoDB Collection it belongs to. This is referred to as the **ArangoDB Collection Mapping Process**. This process relies on 2 fundamental URIs:

1) `<http://www.arangodb.com/collection>` (adb:collection)
- Any RDF Statement of the form `<http://example.com/Bob> <adb:collection> "Person"` will map the Subject to the ArangoDB "Person" document collection.

2) `<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>` (rdf:type)
- This strategy is divided into 3 cases:

1. If an RDF Resource only has one `rdf:type` statement,
then the local name of the RDF Object is used as the ArangoDB
Document Collection name. For example,
`<http://example.com/Bob> <rdf:type> <http://example.com/Person>`
would create a JSON Document for `<http://example.com/Bob>`,
and place it under the `Person` Document Collection.
NOTE: The RDF Object will also have its own JSON Document
created, and will be placed under the "Class"
Document Collection.

2. If an RDF Resource has multiple `rdf:type` statements,
with some (or all) of the RDF Objects of those statements
belonging in an `rdfs:subClassOf` Taxonomy, then the
local name of the "most specific" Class within the Taxonomy is
used (i.e., the Class with the greatest depth). If there is a
tie between 2+ Classes, then the URIs are alphabetically
sorted & the first one is picked.

3. If an RDF Resource has multiple `rdf:type` statements, with none
of the RDF Objects of those statements belonging in an
`rdfs:subClassOf` Taxonomy, then the URIs are
alphabetically sorted & the first one is picked. The local
name of the selected URI will be designated as the Document
collection for that Resource.
--------------------
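
The selection rules of the ArangoDB Collection Mapping Process can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the library's code: `pick_collection`, `class_depth`, and `local_name` are hypothetical names, the `rdfs:subClassOf` taxonomy is simplified to a single-parent `child -> parent` dictionary, and an explicit `adb:collection` value is assumed to take precedence over `rdf:type`.

```python
from typing import Dict, List, Optional


def local_name(uri: str) -> str:
    """Return the part of a URI after the last '#' or '/'."""
    return uri.replace("#", "/").rstrip("/").rsplit("/", 1)[-1]


def class_depth(cls: str, parent_of: Dict[str, str]) -> int:
    """Count rdfs:subClassOf hops from cls up to its root (single-parent assumption)."""
    depth, seen = 0, {cls}
    while cls in parent_of:
        cls = parent_of[cls]
        if cls in seen:  # guard against cyclic taxonomies
            break
        seen.add(cls)
        depth += 1
    return depth


def pick_collection(
    types: List[str],
    parent_of: Dict[str, str],
    adb_collection: Optional[str] = None,
) -> str:
    """Choose a collection name for a resource from its rdf:type objects."""
    if adb_collection is not None:
        # Assumption: an explicit adb:collection statement overrides rdf:type.
        return adb_collection
    if not types:
        raise ValueError("resource has no rdf:type statement")
    if len(types) == 1:
        return local_name(types[0])  # case 1: a single rdf:type

    known = set(parent_of) | set(parent_of.values())
    in_taxonomy = [t for t in types if t in known]
    if in_taxonomy:
        # Case 2: deepest class wins; ties are broken alphabetically by URI.
        best = max(class_depth(t, parent_of) for t in in_taxonomy)
        candidates = sorted(t for t in in_taxonomy if class_depth(t, parent_of) == best)
    else:
        # Case 3: no taxonomy, so the alphabetically first URI wins.
        candidates = sorted(types)
    return local_name(candidates[0])


EX = "http://example.com/"
taxonomy = {EX + "Singer": EX + "Artist", EX + "Artist": EX + "Person"}
print(pick_collection([EX + "Person", EX + "Singer"], taxonomy))
```

In the usage line, `Singer` sits two levels below `Person` in the assumed taxonomy, so it is the "most specific" class and its local name is selected.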