brTPF: Bindings-Restricted Triple Pattern Fragments

Olaf Hartig (Linköping University)
Carlos Buil-Aranda (Universidad Técnica Federico Santa María)

This page provides all digital artifacts related to our paper at the 15th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 2016). All content on this page is licensed under the Creative Commons Attribution-ShareAlike 3.0 License.

Table of contents:
- Abstract
- Document
- Software
- Data and Queries
- Experimental Comparison to a Virtuoso-based SPARQL Endpoint

Abstract

The Triple Pattern Fragment (TPF) interface is a recent proposal for reducing server load in Web-based approaches to executing SPARQL queries over public RDF datasets. The price for less overloaded servers is a higher client-side load as well as a dramatic increase in network load (in terms of both the number of HTTP requests and data transfer). In this paper, we propose a slightly extended interface that allows clients to attach intermediate results to triple pattern requests. The response to such a request is expected to contain triples from the underlying dataset that not only match the given triple pattern (as in the case of TPF), but that are also guaranteed to contribute to a join with the given intermediate result. Our hypothesis is that a distributed query execution using this extended interface can reduce the network load (in comparison to pure TPF-based query execution) without significantly increasing the server load. Our main contribution in this paper is twofold: we empirically verify the hypothesis and provide an extensive experimental comparison of our proposal and TPF.

Document

Extended Preprint (PDF, 18 pages; in contrast to the proceedings version of the paper, this extended preprint contains Appendices A and B, which present additional experimental results)

Software

For our experiments we used a server component and a client component, each of which supports both the TPF approach and the brTPF approach.

Combined TPF/brTPF Server

As a basis for the server component we used the Java servlet implementation of the TPF interface and extended it to also support the brTPF interface and, optionally, record caching-related statistics (see the classes CacheStatsTPFServlet and BoundedCacheStatsTPFServlet for the latter).

To start the server in standalone mode, just edit the config.json file to point to the HDT file of the dataset (there is an example of such a configuration file in the source code package) and execute the following command:

java -server -Xmx4g -jar ldf-server.jar config.json
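
For orientation, such a configuration file essentially names the dataset and points to its HDT file. The sketch below follows the structure used by the underlying Java TPF server; the exact field names may differ slightly, so the example file shipped with the source code package is authoritative, and the dataset name, description, and file path below are placeholders:

# write a minimal configuration pointing to the dataset's HDT file
cat > config.json <<'EOF'
{
  "title": "brTPF experiment server",
  "datasources": {
    "watdiv": {
      "type": "HdtDatasource",
      "description": "WatDiv 10M dataset (HDT back-end)",
      "settings": { "file": "/path/to/watdiv.10M.hdt" }
    }
  }
}
EOF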

Combined TPF/brTPF Client

For the client component we used a Node.js-based TPF client and added a brTPF-based query execution algorithm to it.

To run a sequence of SPARQL queries using the TPF-based query execution algorithm, copy the files with these queries (one query per file) into a new directory, say mytestqueries, and execute the following command:

./eval.sh ldf-client-eval config.json mytestqueries

Similarly, if you want to use the brTPF-based query execution algorithm, execute the following command (the --maxNumberOfMappings parameter in the eval.sh script sets the maxM/R value used for the query executions):

./eval.sh brTPF-client-eval config.json mytestqueries
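
For example, assuming the query files are initially collected in a directory called queries/ (the directory name and the .sparql file extension are placeholders), a complete run of both algorithms over the same workload could look as follows:

# set up the query directory (one SPARQL query per file)
mkdir mytestqueries
cp queries/*.sparql mytestqueries/

# TPF-based query execution
./eval.sh ldf-client-eval config.json mytestqueries

# brTPF-based query execution
./eval.sh brTPF-client-eval config.json mytestqueries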

During their execution, these commands write CSV files called eval_TPF.csv and eval_brTPF.csv, respectively. These files contain one line per executed query, with the following columns:

i) name of the file with the query,
ii) number of triple patterns,
iii) execution time in milliseconds until the first solution of the query result was available,
iv) number of HTTP requests issued until the first solution of the query result was available,
v) overall query execution time (in ms),
vi) overall number of HTTP requests issued during the query execution,
vii) overall number of triples received during the query execution,
viii) "TIMEOUT" marker if the query execution timed out,
ix) timeout threshold in minutes (if the query execution timed out).

The command for the brTPF client writes an additional CSV file, eval2_brTPF.csv, which provides statistics about the number of mappings attached to the brTPF requests sent during the query executions: the first column gives the name of the file with the query, the second column the number of requests without a mapping (i.e., ordinary TPF requests), the third column the number of requests with exactly one mapping, the fourth column the number of requests with exactly two mappings, and so on.
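
To illustrate how these files can be post-processed, the following sketch computes, for the queries that did not time out, the number of such queries, their average overall execution time (column v), and the total number of HTTP requests (column vi); it assumes comma-separated values without a header row:

awk -F',' '$8 != "TIMEOUT" { time += $5; req += $6; n++ }
           END { if (n > 0) printf "queries: %d, avg. execution time: %.1f ms, HTTP requests: %d\n", n, time/n, req }' eval_brTPF.csv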

Data and Queries

For the experiments we used an RDF-HDT representation of the 10M-triple WatDiv dataset as provided on the WatDiv project page. That is, we downloaded the original watdiv.10M.tar.bz2 file (56MB) and converted it into the following HDT file:
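
The conversion itself is not part of our software package. As a rough sketch, it can be done with the rdf2hdt tool of the hdt-cpp distribution, assuming the archive unpacks to an N-Triples file watdiv.10M.nt (the exact tool options depend on the hdt-cpp version used):

# unpack the WatDiv N-Triples file and convert it to HDT
tar -xjf watdiv.10M.tar.bz2
rdf2hdt watdiv.10M.nt watdiv.10M.hdt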

To obtain WatDiv queries we downloaded the WatDiv stress test query workload (3MB) from the Web page of the WatDiv paper. This workload consists of 12,400 queries. We used these queries as follows:

Experimental Comparison to a Virtuoso-based SPARQL Endpoint

We tried to conduct a comparison of brTPF and a Virtuoso-based SPARQL endpoint in our experimental setup. To this end, we installed the latest version of the Open Source edition of Virtuoso (v.7.2.4) on the machine that we used as the server in our multi-machine experiments (recall from the paper that this machine has an Intel Core i7 processor with 4 cores at 2.6 GHz and 8 GB of main memory), and we loaded the WatDiv dataset (see above) into this Virtuoso endpoint. For the clients we developed a simple Python script that submits the sequence of WatDiv queries of the respective client via the SPARQL endpoint REST API, one query after another (recall from the paper that each of the up to 64 clients uses a distinct set of 193 queries that are always executed as a sequence in the same order); a curl-based sketch of such a driver is given below.

When running the Virtuoso-based experiment with 4 clients (no cache), everything went smoothly and Virtuoso achieved a throughput of 217 q/hr. With 8 clients, we saw 33 query executions hitting our 5-minute timeout (over the course of one hour), and the throughput without these timeouts was 1420 q/hr, which is roughly twice the throughput achieved by brTPF and roughly 4x what the TPF approach achieved in this experiment! However, when we increased the load to 16 concurrent clients, Virtuoso managed to answer 294 queries before crashing, around 2 minutes after the start of this experiment. Before the crash, Virtuoso wrote out log messages such as the following: "System is under high load. Adding cluster nodes or using more replicated copies may needed." We provide the Virtuoso configuration file and the Virtuoso log files from this test here:
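
The Python script itself is not part of this release; purely as an illustration of what such a driver does, the following curl-based shell sketch submits the query files of one client to the endpoint, one after another (the endpoint URL assumes Virtuoso's default port, and the directory and file names are placeholders):

#!/bin/sh
# submit this client's queries sequentially to the SPARQL endpoint
ENDPOINT="http://server:8890/sparql"      # Virtuoso's default SPARQL endpoint URL (assumed)
for q in client-queries/*.sparql; do      # placeholder directory with this client's query files
  curl -s -G "$ENDPOINT" \
       --data-urlencode "query=$(cat "$q")" \
       -H "Accept: application/sparql-results+json" \
       -o /dev/null -w "%{http_code} $q\n" \
       --max-time 300                     # 5-minute timeout per query, as in our setup
done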

We emphasize that for this test we increased the Virtuoso result size limit to 1M (see the parameter ResultSetMaxRows in the configuration file; an excerpt is shown at the end of this section) because, for the sake of a fair comparison, we want Virtuoso to return complete query results (a significant number of WatDiv queries have results with more than 100K solutions). However, out of curiosity we repeated the 16-clients run with the ResultSetMaxRows parameter set to 100K. With this restricted configuration, Virtuoso does not have any trouble serving the 16 concurrent clients, but we observe a high number of queries for which Virtuoso returns a result of exactly 100K solutions; hence, all of these are incomplete query results! From this test we therefore conclude that Virtuoso crashed when trying to correctly serve our 16 clients (with complete query results) because of the difficulty of the high-cardinality WatDiv queries. This observation also explains why the BSBM-based evaluation in the original TPF paper did not reveal this issue (as shown in the WatDiv paper, BSBM queries have much smaller result sizes).
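
For reference, this limit is configured in the [SPARQL] section of virtuoso.ini; the 1M setting used for the main test corresponds to the following excerpt (the complete configuration file is provided above):

[SPARQL]
ResultSetMaxRows = 1000000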