Step-by-step: RDF4J with Scalability-10K Workload

By Theofilos Ioannidis (tioannid [at] di [dot] uoa [dot] gr)


In this section, we provide instructions for running an experiment with the RDF4J module against the smallest version of the Scalability workload, which is part of the Geographica 2 [1] benchmark. For this example, we used the first installation method of the GeoRDFBench Framework, with pre-built binaries at the /data/GeoRDFBench path.

Scalability Workload

GeoRDFBench Framework comes prebundled with JSON specifications of all Geographica 2 benchmark components, such as datasets, querysets, the execution model, etc. Apart from the detailed representation of benchmark components, where each one is independent of the others, GeoRDFBench offers a more compact representation: a single workload JSON specification file, which combines the specifications of the dataset, queryset and execution. The Scalability workload features a queryset of 3 GeoSPARQL queries against a selection of 6 datasets of increasing size: 10K, 100K, 1M, 10M, 100M and 500M triples. The Scalability 10K workload compact specification is located in:

/data/GeoRDFBench$ tree -L 1 json_defs/workloads/
    json_defs/workloads/
    ├── censusmacrogeocodingWLoriginal.json
    ├── querysets
    ├── rwmacrocomputestatisticsWLoriginal.json
    ├── rwmacromapsearchWLoriginal.json
    ├── rwmacrorapidmappingWLoriginal.json
    ├── rwmacroreversegeocodingWLoriginal.json
    ├── rwmicroWLoriginal.json
    ├── scalabilityFunc100K_WLoriginal.json
    ├── scalabilityFunc100M_WLoriginal.json
    ├── scalabilityFunc10K_WLoriginal.json
    ├── scalabilityFunc10M_WLoriginal.json
    ├── scalabilityFunc1M_WLoriginal.json
    ├── scalabilityFunc500M_WLoriginal.json
    ├── scalabilityPred100K_WLoriginal.json
    ├── scalabilityPred100M_WLoriginal.json
    ├── scalabilityPred10K_WLoriginal.json
    ├── scalabilityPred10M_WLoriginal.json
    ├── scalabilityPred1M_WLoriginal.json
    ├── scalabilityPred500M_WLoriginal.json
    └── syntheticWLoriginal.json

Below (see Figure 1) we can see how path properties from the host and workload specifications are combined internally in order to point to the actual N-Triples file /data/Geographica2_Datasets/Scalability/10K/scalability_10K.nt:

Fig. 1 - Dataset File Actual Path

The above method allows the same workload specification to run against a different host, as long as the dataset file(s) are located in the new host's "sourceFileDir".
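To illustrate, the effective dataset path is the host's "sourceFileDir" joined with the dataset's relative path. A minimal sketch in bash (the variable names below are ours, not the framework's):

    # Sketch of how the effective dataset path is derived; variable names are illustrative.
    SOURCE_FILE_DIR="/data/Geographica2_Datasets"          # host property "sourceFileDir"
    RELATIVE_DATASET="Scalability/10K/scalability_10K.nt"  # workload/dataset relative path
    DATASET_FILE="${SOURCE_FILE_DIR}/${RELATIVE_DATASET}"
    echo "${DATASET_FILE}"  # /data/Geographica2_Datasets/Scalability/10K/scalability_10K.nt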

Scalability Dataset Source Files

/data$ tree -L 1 Geographica2_Datasets/
Geographica2_Datasets/
├── Census
├── PregenSynthetic
├── RealWorldWorkload
├── Scalability
│   ├── 10K
│   │   └── scalability_10K.nt
│   ├── scalability500MRefDS.nt.gz
│   └── scalabilityDSGen.sh
└── SyntheticWorkload

The Scalability dataset comprises the compressed 500M-triple reference dataset scalability500MRefDS.nt.gz and an accompanying script, scalabilityDSGen.sh, which helps generate all uncompressed scalability datasets: 10K, 100K, 1M, 10M, 100M, 500M. In the previous listing the 10K uncompressed dataset has already been generated and can be used in our tests. The generation of the uncompressed datasets is handled automatically by the repository generation scripts of the RDF modules. The first time any system creates a repository for the Scalability 10K workload or dataset, the repository creation script will invoke the scalabilityDSGen.sh script to extract the Scalability/10K/scalability_10K.nt dataset file. All subsequent executions of repository creation scripts by any other system will skip this extraction step. In any case, manual use is also simple. Assuming that 10K/scalability_10K.nt did not exist, but we had the reference dataset and script in place, we could issue the following:

/data/Geographica2_Datasets/Scalability$ mkdir 10K; ./scalabilityDSGen.sh scalability500MRefDS.nt.gz 10K > 10K/scalability_10K.nt

The reason we use a compressed reference file and a script to extract chunks of this dataset is twofold: (i) it is network friendly, as the compressed file is much easier to download, and (ii) it is storage friendly, as uncompressing data is delayed until it is actually needed. The uncompressed reference 500M-triple dataset is approximately 95 GB, while the compressed version is approximately 7 GB.

Scalability Execution Specification

The execution specification of a workload describes how the experiment should be conducted. It specifies how many times a query should be executed and whether it should run with cold or warm caches. It specifies a timeout for each query execution, a delay for clearing caches between executions and a total timeout for the entire experiment. It defines a policy for the action to take when a query times out. The user can also define the aggregate function to use for custom statistics. It can also be used to print the ground queryset instead of running the experiment. The contents of the Scalability execution specification are listed below, annotated with explanatory comments:

/data/GeoRDFBench$ cat json_defs/executionspecs/scalabilityESoriginal.json 
        {
          "classname" : "gr.uoa.di.rdf.Geographica3.runtime.executionspecs.impl.SimpleES",
          "execTypeReps" : {                                        # each query is executed 7 times = 3 + (1+3)
            "COLD" : 3,                                             # 3 recorded times with cold caches
            "WARM" : 3                                              # 1 unrecorded with cold caches and 3 recorded times with warm caches
          },
          "maxDurationSecsPerQueryRep" : 86400,                     # max exec time for a query is 24 hours
          "maxDurationSecs" : 604800,                               # max experiment exec time is 7*24 hours
          "action" : "RUN",                                         # run the experiment
          "avgFunc" : "QUERY_MEDIAN",                               # use median instead of arithmetic mean
          "onColdFailure" : "SKIP_REMAINING_ALL_QUERY_EXECUTIONS",  # if a cold execution times out, skip all other executions for this query
          "clearCacheDelaymSecs" : 5000                             # wait for 5 secs between query executions, when caches need to be cleared

The max query timeout is high because most systems tested in [1] cannot complete the queries against the largest 500M-triple dataset. The cache-clearing delay needs to be several seconds, since some RDF stores need to restart their application and/or database server components to achieve "clear caches". Each query gets executed 7 times in total (if no timeout occurs!), in the following manner: (i) COLD RUNS: the query executes and is recorded 3 consecutive times, clearing caches before each execution; (ii) WARM RUNS: the query executes 1 unrecorded time, clearing caches before this execution, followed by 3 consecutive recorded executions without clearing caches; (iii) if a query times out during the COLD runs, it is blocked from executing any remaining COLD or WARM runs.
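For clarity, a minimal pseudocode sketch of this execution model follows; the helpers clear_caches, run_recorded and run_unrecorded are illustrative, not actual framework functions:

    # Pseudocode sketch of the SimpleES execution model; helper names are illustrative.
    for query in "${QUERIES[@]}"; do
        timed_out=false
        # COLD runs: 3 recorded executions, clearing caches before each one
        for i in 1 2 3; do
            clear_caches; sleep 5                  # clearCacheDelaymSecs = 5000
            run_recorded "$query" || { timed_out=true; break; }
        done
        # onColdFailure = SKIP_REMAINING_ALL_QUERY_EXECUTIONS
        "$timed_out" && continue
        # WARM runs: 1 unrecorded execution with cold caches, then 3 recorded warm runs
        clear_caches; sleep 5
        run_unrecorded "$query"
        for i in 1 2 3; do
            run_recorded "$query"
        done
    done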

RDF4J

Load Scalability Dataset

Prepare the environment

In order to reduce the number of arguments required for running a system's repository creation script, it is convenient, but not obligatory, to source the following provided bash script and review how the environment is set up:

tioannid@NUC8i7BEH:/data$ cd GeoRDFBench/scripts/
tioannid@NUC8i7BEH:/data/GeoRDFBench/scripts$ source prepareRunEnvironment.sh nuc8i7beh RDF4JSUT "CreateRepo_Scalability10K_RDF4J"
abort: no repository found in '/data/GeoRDFBench/scripts' (.hg not found)
Running script with syntax: source prepareRunEnvironment.sh NUC8I7BEH RDF4JSUT CreateRepo_Scalability10K_RDF4J
tioannid@NUC8i7BEH:/data/GeoRDFBench/scripts$ ./printRunEnvironment.sh 
All SUTs
--------
Environment = NUC8I7BEH
GeographicaScriptsDir = /data/GeoRDFBench/scripts
DatasetBaseDir = /data/Geographica2_Datasets
QuerysetBaseDir = /data/Geographica2_Datasets/QuerySets
ResultsBaseDir = /data
ResultsDirName = #_2023-05-08_RDF4JSUT_CreateRepo_Scalability10K_RDF4J
ActiveSUT = RDF4JSUT
ExperimentResultDir = /data/RDF4JSUT/#_2023-05-08_RDF4JSUT_CreateRepo_Scalability10K_RDF4J
ExperimentDesc = #_2023-05-08_RDF4JSUT_CreateRepo_Scalability10K_RDF4J
CompletionReportDaemonIP = 192.168.1.44
CompletionReportDaemonPort = 3333
ScalabilityGenScriptName = /data/Geographica2_Datasets/Scalability/scalabilityDSGen.sh
ScalabilityGzipRefDSName = /data/Geographica2_Datasets/Scalability/scalability500MRefDS.nt.gz
SystemMemorySizeInGB = 32 GBs
JVM_Xmx = -Xmx24g
...
RDF4J SUT
---------
RDF4JRepoBaseDir = /data/RDF4J_3.7.7_Repos/server
EnableLuceneSail = false
RDF4JLuceneReposPrefix = 
Version = 3.7.7
...

Sourcing the preparation script for repository generation has the added benefit that the user can use the environment variables in subsequent actions without the need to remember various long paths, as we see in the following sections.

Verify whether the Scalability 10K dataset exists (optional)
tioannid@NUC8i7BEH:/data/GeoRDFBench/scripts$ ls -lsa $ScalabilityGenScriptName 
4 -rwxr-xr-x 1 tioannid tioannid 1406 Μαΐ   7 19:42 /data/Geographica2_Datasets/Scalability/scalabilityDSGen.sh
tioannid@NUC8i7BEH:/data/GeoRDFBench/scripts$ ls -lsa $ScalabilityGzipRefDSName 
7745592 -rwxrwxrwx 1 tioannid tioannid 7931478121 Νοε  20  2020 /data/Geographica2_Datasets/Scalability/scalability500MRefDS.nt.gz
tioannid@NUC8i7BEH:/data/GeoRDFBench/scripts$ ls -lsa /data/Geographica2_Datasets/Scalability
total 7745604
      4 drwxrwxr-x  2 tioannid tioannid       4096 Μαΐ   8 20:34 .
      4 drwxrwxr-x 11 tioannid tioannid       4096 Φεβ  19 18:56 ..
7745592 -rwxrwxrwx  1 tioannid tioannid 7931478121 Νοε  20  2020 scalability500MRefDS.nt.gz
      4 -rwxr-xr-x  1 tioannid tioannid       1406 Μαΐ   7 19:42 scalabilityDSGen.sh

We can see that the compressed Scalability reference dataset and script are in place and that the 10K dataset has not been extracted yet. If the files have not been downloaded yet, the user can download both of them from the links provided on this page, in section Scalability Dataset Source Files.
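For example (hypothetical commands; substitute <link-to> with the actual download links from this page):

    # Hypothetical download; replace <link-to> with the actual links from this page
    wget -P /data/Geographica2_Datasets/Scalability \
        <link-to>/scalability500MRefDS.nt.gz \
        <link-to>/scalabilityDSGen.sh
    chmod +x /data/Geographica2_Datasets/Scalability/scalabilityDSGen.sh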

Verify if there are any RDF4J repositories present
tioannid@NUC8i7BEH:/data/GeoRDFBench/scripts$ ls -lsa $RDF4JRepoBaseDir/repositories/
total 8
4 drwxrwxr-x 2 tioannid tioannid 4096 Μαΐ   8 20:31 .
4 drwxrwxr-x 3 tioannid tioannid 4096 Μαΐ   6 21:24 ..

We can see that there are no existing RDF4J repositories yet under the RDF4J base repo directory.

Create the Scalability-10K RDF4J repository

Since we used the preparation script earlier, we only need to specify one argument to the RDF4J wrapper repository generation script createAllRDF4JRepos.sh: "false", which specifies whether we would like to overwrite the repository in case it already exists. We also redirect standard output and standard error to a log file.

tioannid@NUC8i7BEH:/data/GeoRDFBench/scripts$ cd ../RDF4JSUT/scripts/CreateRepos/
tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/CreateRepos$ ./createAllRDF4JRepos.sh false 2>&1 | tee -a logCreateRepo_Scal10K_RDF4J.log
All of the following environment variables (Geographica/scripts/prepareRunEnvironment.sh) are defined: 
	{DatasetBaseDir, RDF4JRepoBaseDir, JVM_Xmx, EnableLuceneSail, CompletionReportDaemonIP, CompletionReportDaemonPort, ScalabilityGenScriptName, ScalabilityGzipRefDSName}
Running script with syntax: 
	./createAllRDF4JRepos.sh false /data/Geographica2_Datasets /data/RDF4J_3.7.7_Repos/server -Xmx24g false 192.168.1.44 3333 /data/Geographica2_Datasets/Scalability/scalabilityDSGen.sh /data/Geographica2_Datasets/Scalability/scalability500MRefDS.nt.gz
Script start time: Δευ 08 Μαΐ 2023 08:48:06 μμ EEST
Checking/Creating scalability 10K dataset ...
	Scalability 10K dataset does not exist
	/data/Geographica2_Datasets/Scalability/10K directory created
	Scalability Generator script successfully extracted scalability 10K dataset to "/data/Geographica2_Datasets/Scalability/10K/scalability_10K.nt"
Generating scalability 10K repository ...
	./createRDF4JRepo.sh /data/RDF4J_3.7.7_Repos/server scalability_10K false "spoc,posc" N-TRIPLES /data/Geographica2_Datasets/Scalability/10K -Xmx24g false "http://www.opengis.net/ont/geosparql#asWKT" 192.168.1.44 3333
CREATE_REPO_ARGS = createman "/data/RDF4J_3.7.7_Repos/server" "scalability_10K" "FALSE" "false" "spoc,posc" "http://www.opengis.net/ont/geosparql#asWKT" true
LOAD_REPO_ARGS = dirloadman "/data/RDF4J_3.7.7_Repos/server" "scalability_10K" "N-TRIPLES" "/data/Geographica2_Datasets/Scalability/10K" true
log4j:ERROR Could not connect to remote log4j server at [localhost]. We will try again later.
0    [main] INFO  RDF4JSystem  - Creating NativeStore base sail with spoc,posc indexes
257  [main] INFO  RepoUtil  - RDF4J created with manager repo "/data/RDF4J_3.7.7_Repos/server/repositories/scalability_10K" in 99 msecs
log4j:ERROR Could not connect to remote log4j server at [localhost]. We will try again later.
1    [main] INFO  RDF4JSystem  - Loading file scalability_10K.nt ...
709  [main] INFO  RDF4JSystem  - Finished loading file scalability_10K.nt in 708 msecs
710  [main] INFO  RepoUtil  - RDF4J loaded with manager all files from "/data/Geographica2_Datasets/Scalability/10K" to repo "/data/RDF4J_3.7.7_Repos/server/repositories/scalability_10K" in 709 msecs
RDF4J repository "/data/RDF4J_3.7.7_Repos/server/repositories/scalability_10K" has size: 4MB
Script end time: Δευ 08 Μαΐ 2023 08:48:08 μμ EEST

The RDF4J wrapper repository generation script first checks that either all required script parameters have been passed, or, alternatively, that all required environment variables have been set by the preparation script, and informs the user about the actual script call that will be executed. Later on, once a user feels comfortable with it, this script call can be used to run the RDF4J wrapper repository generation script with the desired ad-hoc arguments, without having to first source the preparation script. Afterwards, for any dataset, the script checks whether the dataset files are present; specifically for Scalability datasets, the wrapper repository generation script will also create the dataset by extracting it from the reference dataset. In the listing above, we can see that since /data/Geographica2_Datasets/Scalability/10K/scalability_10K.nt did not exist, the scalability generator script was conveniently called and the required dataset file extracted. After that, the main repository creation process begins, informing the user about the actual arguments passed on to the RDF4J core repository generation script createRDF4JRepo.sh, which creates the repository and loads the data into it. Finally, the repository location, name, size and loading times are reported.
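For instance, mirroring the syntax echoed in the log above, the equivalent ad-hoc call would be:

    ./createAllRDF4JRepos.sh false /data/Geographica2_Datasets \
        /data/RDF4J_3.7.7_Repos/server -Xmx24g false \
        192.168.1.44 3333 \
        /data/Geographica2_Datasets/Scalability/scalabilityDSGen.sh \
        /data/Geographica2_Datasets/Scalability/scalability500MRefDS.nt.gz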

Verify that the Scalability-10K dataset has been extracted
tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/CreateRepos$ ls -lsa /data/Geographica2_Datasets/Scalability/10K
total 3720
   4 drwxrwxr-x 2 tioannid tioannid    4096 Μαΐ   8 20:48 .
   4 drwxrwxr-x 3 tioannid tioannid    4096 Μαΐ   8 20:48 ..
3712 -rw-rw-r-- 1 tioannid tioannid 3798364 Μαΐ   8 20:48 scalability_10K.nt
Verify that the Scalability-10K RDF4J repository has been created
tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/CreateRepos$ ls -lsa $RDF4JRepoBaseDir/repositories/
total 12
4 drwxrwxr-x 3 tioannid tioannid 4096 Μαΐ   8 20:48 .
4 drwxrwxr-x 3 tioannid tioannid 4096 Μαΐ   6 21:24 ..
4 drwxrwxr-x 2 tioannid tioannid 4096 Μαΐ   8 20:48 scalability_10K
tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/CreateRepos$ ls $RDF4JRepoBaseDir/repositories/scalability_10K/
config.ttl  contexts.dat  namespaces.dat  nativerdf.ver  triples-posc.alloc  triples-posc.dat  triples.prop  triples-spoc.alloc  triples-spoc.dat  txn-status  
values.dat  values.hash  values.id
Verify that the repository creation log has been generated
tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/CreateRepos$ ls -lsa *.log
4 -rw-rw-r-- 1 tioannid tioannid 2384 Μαΐ   8 20:48 logCreateRepo_Scal10K_RDF4J.log
Run the Scalability-10K Workload Experiment with RDF4J

The RDF4J workload experiment execution script runWLTestsForRDF4JSUT.sh is completely independent of the environment prepared earlier for the repository creation.

tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/CreateRepos$ cd ../RunTests3/
tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/RunTests3$ ./runWLTestsForRDF4JSUT.sh \
    -Xmx24g \
    -rbd RDF4J_3.7.7_Repos/server \
    -expdesc 2023-08-05_RDF4JSUT_RunWL_Scal10K \
    -wl /data/GeoRDFBench/json_defs/workloads/scalabilityFunc10K_WLoriginal.json \
    -h /data/GeoRDFBench/json_defs/hosts/nuc8i7behHOSToriginal.json \
    -rs /data/GeoRDFBench/json_defs/reportspecs/simplereportspec_original.json \
    -rpsr /data/GeoRDFBench/json_defs/reportsources/nuc8i7behHOSToriginal.json 2>&1 | tee -a RunWLRDF4JExp_Scal10K.log
...(long listing, link below provides the RunWLRDF4JExp_Scal10K.log)...
Verify that the Scalability-10K Workload Experiment log has been generated

Experiment run logs may be quite long; therefore, the user can click the link below to view the details of the queryset execution.

tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/RunTests3$ ls -lsa *.log
72 -rw-rw-r-- 1 tioannid tioannid 69292 Μαΐ   8 21:00 RunWLRDF4JExp_Scal10K.log
Verify that the Scalability-10K Workload Experiment Results have been generated in the Default Location

Experiment results are stored by default in the file system.

tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/RunTests3$ ls -lsa /data/Results_Store/RDF4JSUT/
total 48
4 drwxrwxr-x 12 tioannid tioannid 4096 Μαΐ   8 21:00 .
4 drwxrwxr-x  6 tioannid tioannid 4096 Μαρ  19 22:39 ..
4 drwxrwxr-x  3 tioannid tioannid 4096 Μαΐ   8 21:00 2023-08-05_RDF4JSUT_RunWL_Scal10K
tioannid@NUC8i7BEH:/data/GeoRDFBench/RDF4JSUT/scripts/RunTests3$ tree  /data/Results_Store/RDF4JSUT/2023-08-05_RDF4JSUT_RunWL_Scal10K/
/data/Results_Store/RDF4JSUT/2023-08-05_RDF4JSUT_RunWL_Scal10K/
└── Scalability
    └── 10K
        └── RDF4JSUT-ExperimentWorkload
            ├── 00-SC1_Geometries_Intersects_GivenPolygon-cold
            ├── 00-SC1_Geometries_Intersects_GivenPolygon-cold-long
            ├── 00-SC1_Geometries_Intersects_GivenPolygon-warm
            ├── 00-SC1_Geometries_Intersects_GivenPolygon-warm-long
            ├── 01-SC2_Intensive_Geometries_Intersect_Geometries-cold
            ├── 01-SC2_Intensive_Geometries_Intersect_Geometries-cold-long
            ├── 01-SC2_Intensive_Geometries_Intersect_Geometries-warm
            ├── 01-SC2_Intensive_Geometries_Intersect_Geometries-warm-long
            ├── 02-SC3_Relaxed_Geometries_Intersect_Geometries-cold
            ├── 02-SC3_Relaxed_Geometries_Intersect_Geometries-cold-long
            ├── 02-SC3_Relaxed_Geometries_Intersect_Geometries-warm
            └── 02-SC3_Relaxed_Geometries_Intersect_Geometries-warm-long

3 directories, 12 files

For each query and execution type (cold, warm) there are two files, a short and a long version. The long version has 4 columns (noOfResults, evaluationTime, scanTime, totalTime) and one row for each execution iteration performed. The short version has 2 columns (noOfResults, totalTime) and a single row, which represents the average or median totalTime of the execution iterations from the long version. All times are in nanoseconds.
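As an illustration, the median totalTime could be recomputed from a long-version file with a sketch like the following (assuming whitespace-separated columns and no header row):

    # Sketch: recompute the median totalTime (4th column) of a "-long" results file.
    sort -n -k4 00-SC1_Geometries_Intersects_GivenPolygon-cold-long | awk '
        { v[NR] = $4 }
        END {
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'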

Verify that the Scalability-10K Workload Experiment Results have been generated in the PostgreSQL database

Experiment results are also stored in a custom location, a PostgreSQL database. The details of this report sink can be seen above, in the link of section Run the Scalability-10K Workload Experiment with RDF4J. The details of each experiment are recorded, with a unique ID, in a row of the EXPERIMENTS table. In a similar manner, the details of each query execution iteration are recorded in the QUERYEXECUTIONS table. A set of views provides aggregation of the totalTime and calculates the average and median totalTime for each query and execution type (warm, cold). All times are in milliseconds.
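For example, the aggregate data shown below could be retrieved with a query along the following lines; this is only a sketch, as the database name (geordfbench) and view name (qe_aggregate_view) are assumptions, while the column names come from the export below:

    # Sketch only: database name "geordfbench" and view name "qe_aggregate_view" are assumptions.
    psql -d geordfbench -c "SELECT experiment_id, query_no, cache_type,
                                   no_iterations, mean, median
                            FROM   qe_aggregate_view
                            WHERE  experiment_id = 51;"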

(exports from PgAdmin v4)

=== EXPERIMENT ENTRY ========
"id"	"instime"	"exectime"	"description"	"host"	"os"	"sut"	"queryset"	"dataset"	"executionspec"	"reportspec"	"type"
51	"2023-05-08 20:58:17.894+03"	"2023-05-08 20:58:17.883+03"	"2023-08-05_RDF4JSUT_RunWL_Scal10K"	"SimpleHost{ NUC8i7BEH, 192.168.1.44, 32GB, GenericLinuxOS{ Ubuntu-jammy } }"	"GenericLinuxOS{ Ubuntu-jammy }"	"RDF4JSUT"	"scalabilityFunc"	"scalability_10K"	"SimpleES{ COLD=3, WARM=3, action=RUN, maxduration=604800 secs, repmaxduration=86400 secs, func=QUERY_MEDIAN }"	"SimpleReportSpec"	"ScalabilityFunc"

=== AGGREGATE DATA  ===========
"experiment_id"	"query_no"	"cache_type"	"no_iterations"	"mean"	"median"
51	0	"COLD"	3	0.304	0.184
51	0	"WARM"	3	0.116	0.104
51	1	"COLD"	3	0.135	0.115
51	1	"WARM"	3	0.085	0.084
51	2	"COLD"	3	0.108	0.100
51	2	"WARM"	3	0.082	0.082

In the figures below we can see the actual snapshots from the PgAdmin v4 interface.

Fig. 2 - Experiment Entry
Fig. 3 - Experiment Aggregate Data

1. T. Ioannidis, G. Garbis, K. Kyzirakos, K. Bereta, M. Koubarakis. Evaluating Geospatial RDF stores Using the Benchmark Geographica 2. (2019) [pdf]