Reducing the Network Load of Triple Pattern Fragments by Supporting Bind Joins

Olaf Hartig, Linköping University
Carlos Buil-Aranda, Universidad Técnica Federico Santa María

This page provides all digital artifacts related to our poster paper at the 15th International Semantic Web Conference (ISWC 2016). All content on this page is licensed under the Creative Commons Attribution-Share Alike 3.0 License.

Abstract

The recently proposed Triple Pattern Fragment (TPF) interface aims at increasing the availability of Web-queryable RDF datasets by trading off an increased client-side query processing effort for a significant reduction of server load. However, an additional aspect of this trade-off is a very high network load. To mitigate this drawback, we propose to extend the interface by allowing clients to augment TPF requests with a VALUES clause as introduced in SPARQL 1.1. In an ongoing research project we study the trade-offs of such an extended TPF interface and compare it to the pure TPF interface. With a poster at the conference we aim to present initial results of this research. In particular, we would like to present a series of experiments showing that a distributed, bind-join-based query execution using this extended interface can reduce the network load drastically (in terms of both the number of HTTP requests and the amount of data transferred).
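For illustration, such an extension allows a client to restrict a triple pattern request to the intermediate solutions it has already obtained during a bind join. The following sketch shows the kind of VALUES-augmented pattern such a request carries (the prefixes dbo: and dbr: and the specific IRIs are hypothetical and used only for illustration; the exact request syntax of the extended interface is described in the paper):

?movie dbo:director ?director .
VALUES ?director { dbr:Stanley_Kubrick dbr:Ridley_Scott }

Instead of returning all matches of the triple pattern, the server then returns only the matching triples that are compatible with one of the given bindings, which is exactly what a bind join needs.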

Document

Paper (PDF, 4 pages)

Additional Results

In addition to the DBpedia/FEASIBLE experiments presented in the extended abstract, we also performed the same type of experiments based on the Waterloo SPARQL Diversity Test Suite (WatDiv). That is, using the same setup as described in Section 2 of the extended abstract, we ran a sequence of 145 WatDiv queries over the 10M-triples WatDiv dataset. From the measurements obtained in these runs, we generated the same type of charts as shown in the extended abstract for the FEASIBLE-based measurements. The following document provides these charts.

These additional charts show that the WatDiv-based measurements corroborate all experimental results that the extended abstract reports based on DBpedia and the FEASIBLE queries.

Software

For our experiments we used a server component and a client component, each of which supports both the TPF approach and the brTPF (bindings-restricted TPF) approach.

Combined TPF/brTPF Server

As a basis for the server component we used the Java servlet implementation of the TPF interface and extended it to also support the brTPF interface.

To start the server in standalone mode, edit the config.json file so that it points to the HDT file of the dataset (the source code package contains an example of such a configuration file) and execute the following command:

java -server -Xmx4g -jar ldf-server.jar config.json
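For illustration, a minimal configuration along these lines might look as follows. This is only a sketch assuming the HDT-backed datasource type of the TPF server implementation; the exact field names may differ, so the example file in the source code package is authoritative.

{
  "title": "My brTPF server",
  "datasources": {
    "dbpedia": {
      "title": "DBpedia 3.5.1",
      "type": "HdtDatasource",
      "settings": { "file": "path/to/dataset.hdt" }
    }
  }
}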

Combined TPF/brTPF Client

For the client component we used a Node.js-based TPF client and added a brTPF-based query execution algorithm to it.

To run a sequence of SPARQL queries using the TPF-based query execution algorithm, copy the files with these queries (one query per file) into a new directory, say mytestqueries, and execute the following command:

./eval.sh ldf-client-eval config.json mytestqueries

Similarly, if you want to use the brTPF-based query execution algorithm, execute the following command (the --maxNumberOfMappings parameter in the eval.sh script sets the maxM/R value for the query executions):

./eval.sh brTPF-client-eval config.json mytestqueries

During their execution, these commands write to CSV files called eval_TPF.csv and eval_brTPF.csv, respectively. These files contain one line per executed query, with the following columns:

i) name of the file with the query,
ii) number of triple patterns,
iii) execution time in milliseconds until the first solution of the query result was available,
iv) number of HTTP requests issued until the first solution of the query result was available,
v) overall query execution time (in ms),
vi) overall number of HTTP requests issued during the query execution,
vii) overall number of triples received during the query execution,
viii) "TIMEOUT" marker if the query execution timed out,
ix) timeout threshold in minutes (if the query execution timed out).

The command for the brTPF client writes an additional CSV file, eval2_brTPF.csv, which provides statistics on the number of mappings attached to the brTPF requests sent during the query executions: the first column provides the name of the file with the query, the second column indicates the number of requests without a mapping (i.e., ordinary TPF requests), the third column the number of requests with exactly one mapping, the fourth column the number of requests with exactly two mappings, etc.
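As a purely hypothetical illustration (these are not actual measurements), a line of eval_TPF.csv for an execution that completed without timing out might look as follows, assuming the two timeout-related columns are simply left empty in that case:

query042.sparql,3,250,4,1830,57,3410,,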

Data and Queries

DBpedia and FEASIBLE Queries

For the experiments reported in the paper we used an RDF-HDT representation of the DBpedia 3.5.1 dataset. That is, we downloaded the dataset and converted it into the following HDT file:

To obtain the queries we used the FEASIBLE tool to generate a set of 1000 BGP queries. We filtered this set by removing all queries whose BGP consists of a single triple pattern only (such queries would be executed in exactly the same way by the TPF-based and the brTPF-based query execution algorithms). From the remaining queries we selected uniformly at random the following set of 100 queries:

WatDiv

For the experiments we used an RDF-HDT representation of the 10M-triples WatDiv dataset as provided on the WatDiv project page. That is, we downloaded the original watdiv.10M.tar.bz2 file (56MB) and converted it into the following HDT file:

To obtain WatDiv queries we downloaded the WatDiv stress test query workload (3MB) from the Web page of the WatDiv paper. Given this workload, we selected uniformly at random the following set of 145 queries: