###Description
The Semantic Publishing Instance Matching Benchmark (SPIMBENCH) is a benchmark for assessing Instance Matching (IM) techniques for RDF data with an associated schema. SPIMBENCH provides: (i) a set of test cases based on transformations that distinguish different types of matching entities, (ii) a scalable data generator, (iii) a gold standard documenting the matches that IM systems should find, and (iv) evaluation metrics.
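
As an illustration of how a gold standard can be used to score an IM system, the sketch below computes precision, recall and F-measure over sets of matched pairs. The choice of these three metrics, the pair representation and all names (`score_matches`, the `ex:` URIs) are assumptions made for illustration; they are not taken from the SPIMBENCH code base.

```python
def score_matches(found, gold):
    """Return (precision, recall, f_measure) for a set of found matches.

    Each match is an unordered pair of instance URIs, so (a, b) and (b, a)
    are treated as the same match.
    """
    found_pairs = {frozenset(pair) for pair in found}
    gold_pairs = {frozenset(pair) for pair in gold}
    true_positives = len(found_pairs & gold_pairs)

    precision = true_positives / len(found_pairs) if found_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical gold standard and system output.
gold = [("ex:a1", "ex:b1"), ("ex:a2", "ex:b2"), ("ex:a3", "ex:b3")]
found = [("ex:b1", "ex:a1"), ("ex:a2", "ex:b2"), ("ex:a4", "ex:b9")]
precision, recall, f_measure = score_matches(found, gold)
```

In this example two of the three reported pairs appear in the gold standard, and two of the three gold pairs are found.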
###Build
The Apache Ant build tool is required. Use one of the following targets :

    # to build a standard version of the benchmark, compliant with the SPARQL 1.1 standard
    ant build-base-querymix-standard
    # to build a standard version of the benchmark, compliant with the SPARQL 1.1 standard, with extended query mix
    ant build-full-querymix-standard
    # to build a version of the benchmark customized for the Virtuoso database
    ant build-base-querymix-virtuoso
    # to build a version of the benchmark customized for the Virtuoso database, with extended query mix
    ant build-full-querymix-virtuoso

The result of the build process is saved to the distribution folder (dist/) :
- semantic_publishing_benchmark-*.jar
- semantic_publishing_benchmark_reference_knowledge_data.zip
- definitions.properties
- test.properties
- readme.txt
###Install
Required dependencies for RESCAL :
- NumPy >= 1.3
- SciPy >= 0.7
- WordNet (also required for lexical transformations)
Required configuration files :
- test.properties - contains configuration parameters for the benchmark driver
- definitions.properties - contains values of pre-allocated parameters used by the benchmark. Not to be modified by the regular benchmark user
Extract the following from semantic_publishing_benchmark_reference_knowledge_data.zip :
- data/ - folder containing required reference knowledge (ontologies and data) and query templates
Optionally, extract the additional reference datasets (see project ldbc_semanticpub_bm_additional_datasets) :
- Files of type .ttl, saved to the data/datasets folder
All items should be saved in the same location as the benchmark jar file.
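
A quick way to confirm that the layout described above is in place is sketched below. It only checks for the items named in this section; the function name `missing_items` and the constants are hypothetical and not part of the benchmark.

```python
import glob
import os

# Items the text above says must sit next to the benchmark jar.
REQUIRED_FILES = ["test.properties", "definitions.properties"]
REQUIRED_DIRS = ["data"]
JAR_PATTERN = "semantic_publishing_benchmark-*.jar"


def missing_items(base_dir):
    """Return the names of required items that are absent from base_dir."""
    missing = []
    if not glob.glob(os.path.join(base_dir, JAR_PATTERN)):
        missing.append(JAR_PATTERN)
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(base_dir, name)):
            missing.append(name)
    for name in REQUIRED_DIRS:
        if not os.path.isdir(os.path.join(base_dir, name)):
            missing.append(name + "/")
    return missing
```

Calling `missing_items(".")` from the installation folder returns an empty list when everything is present.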
###Configure
RDF Repository configuration :
- Use RDFS rule-set
- Enable context indexing (if available)
- Enable geo-spatial indexing (if available)

Benchmark driver configuration. All configuration parameters are stored in the properties file (test.properties) :
- ontologiesPath - path to ontologies from reference knowledge, e.g. ./data/ontologies
- referenceDatasetsPath - path to data from reference knowledge, e.g. ./data/datasets
- creativeWorksPath - path to generated data, e.g. ./data/generated
- queriesPath - path to query templates, e.g. ./data/sparql
- definitionsPath - path to definitions.properties file, e.g. ./definitions.properties
- endpointURL - URL of SPARQL endpoint provided by the RDF database, e.g. http://localhost:8080/openrdf-sesame/repositories/ldbc
- endpointUpdateURL - URL of SPARQL endpoint for update operations, e.g. http://localhost:8080/openrdf-sesame/repositories/ldbc1/statements
- wordnetPath - WordNet path e.g. C:/Program Files/WordNet/2.1/dict/
- ***rescalSampling - files that are going to be used for the sampling phase
- datasetSize - size of generated data (triples). Data-generator uses this parameter
- allowSizeAdjustmentsOnDataModels - allows the data generator to adjust the amount of correlations, clusterings and randomly generated models (Creative Works) in relation to the 'datasetSize', thus keeping a ratio of 1/3 for each in generated data. Default value is true
- generatedTriplesPerFile - number of triples per generated file. Used to split the data generation into a number of files
- queryTimeoutSeconds - query timeout in seconds
- systemQueryTimeoutSeconds - system queries timeout, default value 1h
- warmupPeriodSeconds - warmup period in seconds
- benchmarkRunPeriodSeconds - benchmark period in seconds
- generateCreativeWorksFormat - serialization format for generated data. Available options : TriG, TriX, N-Triples, N-Quads, N3, RDF/XML, RDF/JSON, Turtle. Use exact names.
- aggregationAgents - number of aggregation agents that will execute a mix of aggregation queries simultaneously
- editorialAgents - number of editorial agents that will execute a mix of update operations simultaneously
- dataGeneratorWorkers - number of worker threads used by the data generator to produce data
- generatorRandomSeed - sets the random seed for the data generator (default value is 0), e.g. when several benchmark drivers are started in separate processes to generate data. To be used with the creativeWorkNextId parameter
- creativeWorkNextId - sets the next ID for the data generator of Creative Works. When running the benchmark driver in separate processes, use it to guarantee that the IDs of generated Creative Works will not overlap, e.g. for generating a 50M dataset, the expected number of Creative Works is ~2.5M and the next ID should start at that value
- creativeWorksInfo - name of a file that will be saved in creativeWorksPath and will contain system info about the generated dataset, e.g. interesting entities, etc.
- querySubstitutionParameters - number of substitution parameters that will be generated for each query
- benchmarkByQueryRuns - sets the number of aggregate queries that the benchmark phase will execute. If the value is greater than zero, the parameter 'benchmarkRunPeriodSeconds' is ignored, e.g. if set to 100, the benchmark will measure the time to execute 100 aggregate operations
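
To make the relationship between some of these parameters concrete, here is a minimal sketch that reads a properties excerpt and derives two values from it. The sample values, the helper names, the assumption that each file is filled up to generatedTriplesPerFile, and the assumption that the first process starts at ID 0 are all illustrative; only the parameter names and the ~2.5M Creative Works per 50M triples estimate come from the list above.

```python
import math

# Hypothetical excerpt of test.properties; the values are illustrative only.
SAMPLE_PROPERTIES = """
# paths taken from the examples above
ontologiesPath=./data/ontologies
referenceDatasetsPath=./data/datasets
creativeWorksPath=./data/generated
datasetSize=50000000
generatedTriplesPerFile=10000000
generateCreativeWorksFormat=N-Quads
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' / '!' comments.

    This covers only the basic subset of the Java properties format
    (no line continuations, no escapes, no ':' separators).
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def expected_file_count(props):
    """Estimate how many files the generator writes for the given sizes."""
    total = int(props["datasetSize"])
    per_file = int(props["generatedTriplesPerFile"])
    return math.ceil(total / per_file)


def next_id_offsets(processes, works_per_process):
    """Candidate creativeWorkNextId value for each of several driver processes."""
    return [i * works_per_process for i in range(processes)]


props = parse_properties(SAMPLE_PROPERTIES)
file_count = expected_file_count(props)  # 50M triples at 10M per file
# The list above estimates ~2.5M Creative Works for a 50M-triple dataset.
offsets = next_id_offsets(processes=3, works_per_process=2_500_000)
```

Each process would then use its own offset as creativeWorkNextId, typically together with a distinct generatorRandomSeed.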

Benchmark Phases (test.properties). One, some or all phases can be enabled, and they run in the sequence listed below. Running the first three phases is mandatory; the fourth one (loadCreativeWorks) can optionally be enabled for the case when generated data will not be loaded manually into the database.
- loadOntologies - populate the RDF database with required ontologies (from reference knowledge). It can be done manually by uploading all .ttl files located at : /data/ontologies
- adjustRefDatasetsSizes - optional phase. If reference dataset files with the extension '.adjustablettl' exist, a new .ttl file is created for each of them, with its size adjusted to the selected size of the data to be generated (parameter 'datasetSize' in the test.properties file).
- loadReferenceDatasets - populate the RDF database with required reference data (from reference knowledge). It can be done manually by uploading all .ttl files located at : /data/datasets
- generateCreativeWorks - generate the data used for benchmarking. Data is saved to files of defined size (generatedTriplesPerFile) and total number of triples (datasetSize). Requires phases : loadOntologies, loadDatasets.
- loadCreativeWorks - load generated data from the previous phase into the RDF database. Optional phase; loading has been verified with the N-Quads serialization format
- generateQuerySubstitutionParameters - Controls generation of query substitution parameters which later can be used during the warmup and benchmark phases. For each query a substitution parameters file is created and saved into 'creativeWorksPath' location. If no files are found at that location, queries executed during warmup and benchmark phases will use randomly generated parameters. Requires phases : loadOntologies, loadDatasets, generateCreativeWorks, loadCreativeWorks.
- validateQueryResults - validate correctness of results for editorial and aggregate operations against a validation dataset. Requires phases : loadOntologies, loadDatasets.
- warmUp - runs the aggregation agents for warmupPeriodSeconds seconds, results are not collected. Requires phases : loadOntologies, loadDatasets, generateCreativeWorks (optional), loadCreativeWorks (optional).
- runBenchmark - runs the benchmark for benchmarkRunPeriodSeconds seconds, results are collected. Editorial and aggregation agents are run simultaneously. Requires phases : loadOntologies, loadDatasets, generateCreativeWorks (optional), loadCreativeWorks (optional).
- runBenchmarkOnlineReplicationAndBackup - the benchmark measures performance while a backup process is in progress. Verifies that certain conditions are met, such as the milestone points at which the backup has been started. Requires additional implementation of the provided shell script files (/data/enterprise/scripts) so that they use the vendor-specific backup command. Requires phases : loadOntologies, loadDatasets, generateCreativeWorks (optional), loadCreativeWorks (optional). Also requires making a full backup prior to running the benchmark, to serve as a later restore point.
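
The "Requires phases" notes above can be read as a small dependency table. The sketch below encodes the mandatory ones and reports which prerequisites are unmet for a given selection of phases. It assumes that "loadDatasets" in those notes refers to the loadReferenceDatasets phase; the function name and data structures are hypothetical and not part of the benchmark driver.

```python
# Phase order as listed above.
PHASE_ORDER = [
    "loadOntologies",
    "adjustRefDatasetsSizes",
    "loadReferenceDatasets",
    "generateCreativeWorks",
    "loadCreativeWorks",
    "generateQuerySubstitutionParameters",
    "validateQueryResults",
    "warmUp",
    "runBenchmark",
    "runBenchmarkOnlineReplicationAndBackup",
]

# Mandatory prerequisites from the "Requires phases" notes above.
# Assumption: "loadDatasets" in those notes means loadReferenceDatasets.
BASE = ["loadOntologies", "loadReferenceDatasets"]
REQUIRES = {
    "generateCreativeWorks": BASE,
    "generateQuerySubstitutionParameters": BASE
    + ["generateCreativeWorks", "loadCreativeWorks"],
    "validateQueryResults": BASE,
    "warmUp": BASE,
    "runBenchmark": BASE,
    "runBenchmarkOnlineReplicationAndBackup": BASE,
}


def unmet_requirements(completed, enabled):
    """Map each enabled phase to the prerequisites that are neither already
    completed nor enabled earlier in the run order."""
    done = set(completed)
    problems = {}
    for phase in PHASE_ORDER:
        if phase not in enabled:
            continue
        missing = [p for p in REQUIRES.get(phase, []) if p not in done]
        if missing:
            problems[phase] = missing
        done.add(phase)
    return problems
```

For example, enabling only loadOntologies and warmUp on an empty repository reports loadReferenceDatasets as missing for warmUp, while enabling runBenchmark alone is accepted once both loading phases were completed in an earlier run.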

Conformance Validation Phase. To be run independently on a new repository (using the OWL2-RL rule-set). The loadOntologies phase is required before running it. No data generation or loading is required.
- checkConformance - runs tests for conformance to the OWL2-RL rules. Requires phase : loadOntologies.
###Run

    java -jar semantic_publishing_benchmark-*.jar test.properties

Note: an appropriate value for the Java maximum heap size may be required, e.g. -Xmx8G
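
When the driver is launched from a script rather than a shell, the `*` in the jar name is not expanded automatically. A minimal sketch of assembling the same command is shown below; the function name `build_run_command` and its defaults are hypothetical.

```python
import glob
import os


def build_run_command(base_dir, heap="8G", properties="test.properties"):
    """Return the argument list for launching the benchmark driver.

    The jar name contains a version, so it is resolved with a glob here.
    The returned jar name is relative to base_dir, which is expected to be
    the working directory of the launched process.
    """
    pattern = os.path.join(base_dir, "semantic_publishing_benchmark-*.jar")
    jars = sorted(glob.glob(pattern))
    if len(jars) != 1:
        raise RuntimeError(f"expected exactly one benchmark jar, found {len(jars)}")
    return ["java", f"-Xmx{heap}", "-jar", os.path.basename(jars[0]), properties]
```

The resulting list can be passed to `subprocess.run(cmd, cwd=base_dir, check=True)`, so that relative paths in test.properties such as ./data/ontologies resolve against the installation folder.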
###Results
Results of the benchmark are saved to three types of log files :
- brief - brief log of executed queries, saved in semantic_publishing_benchmark_queries_brief.log
- detailed - detailed log of executed queries with results, saved in semantic_publishing_benchmark_queries_detailed.log
- summary - editorial and aggregate operations rate, saved in semantic_publishing_benchmark_results.log