Vectograph is an open-source software library for automatically creating graph-structured data from given tabular data.
Let X be an m by n matrix representing the input tabular data. The graph-structured data is created by the following steps:
- Apply the QCUT algorithm to each column that has at least min_unique_val_per_column unique values.
- Consider
  - the i-th row as the i-th concise bounded description (CBD) of the i-th event.
  - the j-th column as the j-th relation/predicate/edge.
- A triple is modeled as event_i -> relation_j -> X_ij.
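The row/column modeling above can be sketched in a few lines of plain Python. This is only an illustration, not Vectograph's implementation: the helper name rows_to_triples is hypothetical, and the Event_ and Feature_Category_ prefixes are taken from the example output shown in this README.

```python
from typing import Iterator, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def rows_to_triples(X: Sequence[Sequence[object]]) -> Iterator[Triple]:
    """Hypothetical helper: yield one (event_i, relation_j, X_ij) triple per cell."""
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            yield (f"Event_{i}", f"Feature_Category_{j}", str(value))


# A 2 by 3 toy table whose cells are assumed to be discretized already.
X = [
    ["0_quantile_4", "1_quantile_0", "2_quantile_2"],
    ["0_quantile_1", "1_quantile_3", "2_quantile_2"],
]

triples: List[Triple] = list(rows_to_triples(X))
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

An m by n table yields m * n triples, and all triples that share the subject Event_i together form the i-th CBD.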
Assume that the first row of fetch_california_housing is
[ 8.3252 41. 6.98412698 1.02380952 322. 2.55555556 37.88 -122.23 ]
Applying the QCUT algorithm with the default parameters min_unique_val_per_column=6 and num_quantile=5 generates the 0-th CBD
<Event_0> <Feature_Category_0> <0_quantile_4> .
<Event_0> <Feature_Category_1> <1_quantile_4> .
<Event_0> <Feature_Category_2> <2_quantile_4> .
<Event_0> <Feature_Category_3> <3_quantile_1> .
<Event_0> <Feature_Category_4> <4_quantile_0> .
<Event_0> <Feature_Category_5> <5_quantile_1> .
<Event_0> <Feature_Category_6> <6_quantile_4> .
<Event_0> <Feature_Category_7> <7_quantile_0> .
which consists of n triples. <Feature_Category_0> represents the 0-th relation, i.e., the 0-th column, whereas <0_quantile_4> represents a tail entity, i.e., the 4-th bin of the 0-th column of the tabular data. After the data conversion, we store the values covered by each bin. For instance, running examples/sklearn_example.py generates Feature_Category_0_Mapping.csv, which indicates that 0_quantile_4 corresponds to a bin that covers all values greater than or equal to 5.10972.
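To make the binning step more concrete, here is a minimal sketch of quantile-based discretization using only the Python standard library (statistics.quantiles requires Python 3.8 or newer). It is not the QCUT implementation: the helper quantile_bins is hypothetical, the cut points come from the "inclusive" method of statistics.quantiles, and bins are assumed to be right-closed, so the boundaries may differ from those that Vectograph stores in its mapping files.

```python
from bisect import bisect_left
from statistics import quantiles  # available from Python 3.8
from typing import List, Tuple


def quantile_bins(
    column: List[float],
    col_index: int,
    min_unique_val_per_column: int = 6,
    num_quantile: int = 5,
) -> Tuple[List[str], List[float]]:
    """Hypothetical helper: label each value with its quantile bin.

    Returns the labels and the interior cut points. Columns with fewer
    unique values than the threshold are returned as plain strings.
    """
    if len(set(column)) < min_unique_val_per_column:
        return [str(v) for v in column], []
    # num_quantile bins are separated by num_quantile - 1 cut points.
    cuts = quantiles(column, n=num_quantile, method="inclusive")
    # Assumption: bins are right-closed, so a value equal to a cut point
    # falls into the lower bin.
    labels = [f"{col_index}_quantile_{bisect_left(cuts, v)}" for v in column]
    return labels, cuts


values = [float(v) for v in range(1, 11)]  # 1.0, 2.0, ..., 10.0
labels, cuts = quantile_bins(values, col_index=0)
for value, label in zip(values, labels):
    print(value, label)
print("cut points:", cuts)
```

With ten evenly spaced values and five quantiles, each bin receives two values. The returned cut points play the role of the per-column mapping described above.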
git clone https://github.com/dice-group/Vectograph.git
conda create -n temp python=3.6 # Or make sure that you have Python >= 3.6.
conda activate temp
pip install -e .
python -c "import vectograph"
python -m pytest tests
from vectograph.transformers import GraphGenerator
from vectograph.quantizer import QCUT
import pandas as pd
from sklearn import datasets
X, y = datasets.fetch_california_housing(return_X_y=True)
X_transformed = QCUT(min_unique_val_per_column=6, num_quantile=5).transform(pd.DataFrame(X))
# Add prefix
X_transformed.index = 'Event_' + X_transformed.index.astype(str)
kg = GraphGenerator().transform(X_transformed)
for s, p, o in kg:
    print(s, p, o)
Create a toy dataset via sklearn. Available datasets: boston, iris, diabetes, digits, wine, and breast_cancer.
python create_toy_data.py --toy_dataset_name "boston"
# Discretize each column having at least 12 unique values into 10 quantiles, otherwise do nothing
python main.py --tabularpath "boston.csv" --kg_name "boston.nt" --num_quantile=10 --min_unique_val_per_column=12
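Once a knowledge graph file has been written, it can be read back with a few lines of Python. The sketch below assumes that every line has the form `<subject> <predicate> <object> .` shown in the CBD example of this README and that terms contain no whitespace; whether main.py writes exactly this layout is an assumption, and parse_triples is a hypothetical helper, not part of Vectograph.

```python
from typing import Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def parse_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Hypothetical helper: parse lines of the form '<s> <p> <o> .'."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.rstrip(".").split()
        if len(parts) != 3:
            raise ValueError(f"Expected three terms, got: {line!r}")
        s, p, o = (term.strip("<>") for term in parts)
        yield (s, p, o)


# Sample lines copied from the example output in this README.
sample = [
    "<Event_0> <Feature_Category_0> <0_quantile_4> .",
    "<Event_0> <Feature_Category_1> <1_quantile_4> .",
]
parsed: List[Triple] = list(parse_triples(sample))
print(parsed)
```

To read a real file, pass an open file handle instead of the sample list, for example `with open("boston.nt") as f: parsed = list(parse_triples(f))`.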
Scripting Vectograph & Knowledge Graph Embeddings at Scale
From tabular data to knowledge graph embeddings
# (1) Clone the repositories.
git clone https://github.com/dice-group/DAIKIRI-Embedding.git
git clone https://github.com/dice-group/vectograph.git
# (3) Create a virtual environment and install the dependencies of the DAIKIRI-Embedding framework.
conda env create -f DAIKIRI-Embedding/environment.yml
conda activate daikiri
# (4) Install dependencies of the vectograph framework.
pip install -e vectograph/.
# (5) Create a knowledge graph by using an example dataset from sklearn.datasets (here, wine).
python vectograph/create_toy_data.py --toy_dataset_name "wine"
python vectograph/main.py --tabularpath "wine.csv" --kg_name "train.txt" --num_quantile=10 --min_unique_val_per_column=12
# (6) Generate Embeddings
python DAIKIRI-Embedding/main.py --path_dataset_folder '.' --model 'ConEx' > conex_emb.log
# (7) The log file contains all relevant information
cat conex_emb.log
# Result: a folder named after the current time is created that contains
# info.log, ConEx_entity_embeddings.csv, ConEx_relation_embeddings.csv, etc.
If you like this framework and want to cite it in your work, feel free to use the following BibTeX entry:
@inproceedings{demir2021convolutional,
title={Convolutional Complex Knowledge Graph Embeddings},
author={Caglar Demir and Axel-Cyrille Ngonga Ngomo},
booktitle={Eighteenth Extended Semantic Web Conference - Research Track},
year={2021},
url={https://openreview.net/forum?id=6T45-4TFqaX}}
For any further questions, please contact: caglar.demir@upb.de