Jena relational database interface - performance notes

Dave Reynolds, 5/12/01

The jena/rdb module provides an implementation of the jena model interface which stores the RDF statement information in a relational database. The implementation can support a variety of database table layouts and can customize the SQL code to cope with the vagaries of different database implementations.

As part of the development we checked many different design decisions on modest scale test data to understand the tradeoffs involved. We've recorded the results of the tests here. A few of these tests will be worth rerunning on more serious data sizes to check scaling effects.

Testing regime

The tests described below all used a small fragment of the open directory project data (dmoz.org). This required a little rewriting to make it legal RDF (indeed legal XML) - wrapping literals as CDATA sections, expanding reserved characters like '&' etc. The timed tests were:

  load       - adding statements to the store during a bulk load
  queryTree  - link-following queries down a subtree from a root resource
  queryRev   - reverse queries from an object back to the subjects that reference it
  queryInt   - range queries over integer-valued literals
  queryEq    - exact-match queries on literal values
  querySW    - string pattern ("starts with") matches on literal values

These are not exhaustive. There are other query patterns such as those involving unknown predicates that we could also check. However, they are representative of common operations and give a guide to performance tradeoffs.

All tests were carried out on workstation-class machines, not servers. These were all PCs in the 700MHz - 900MHz range, running Win2K, Win98SE or Linux (Red Hat 6.2). All cross-comparisons within the tables below are on identical machines, but the same setup could not be used for all tests (e.g. we could not get Interbase working reliably on Linux, so all Interbase tests were on Windows whereas the other databases were Linux only). Thus inter-table comparison is not valid. We stuck to running tests on the same machine as the database to avoid having to correct for network effects.

Since we were running scores of tests on workstation-class machines, large scale tests were infeasible. Instead we used tiny fragments of the dmoz data ranging from 20k to 200k statements (around 50-80MB databases). Thus the figures obtained are representative of small scale applications where persistence, rather than high volume, is critical - more serious scaling tests would need at least an order of magnitude more data. However, where comparisons between the 20k/200k datapoints were made no major differential scaling effects came up, so even these small tests have some validity.

Summary of observations

The figures

All figures in the tables below are averages in ms over enough runs to make the timing accurate, and are all "warm" figures: each query was run several times before measurement started, then measured in a loop.

Attribute tables

An alternative layout type to the generic statement table is the attribute table approach, in which a separate table is generated for each predicate which simply maps subject to object values. This was suggested for XML storage in Florescu & Kossmann, "A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database", INRIA, 1999. An extension of this used in the ICS-FORTH RDFSuite is to use subtabling to directly represent an RDFS property hierarchy within the database.
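The difference between the two layouts can be sketched with sqlite3. All table, predicate and column names here are illustrative rather than jena's actual schema, and the real driver stores resource ids rather than URI strings:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Generic layout: one universal statement table holds every triple.
con.execute("CREATE TABLE rdf_statements (subject TEXT, predicate TEXT, object TEXT)")
# Attribute layout: a two-column table per cached predicate, subject -> object.
con.execute("CREATE TABLE attr_dc_title (subject TEXT, object TEXT)")

CACHED = {"dc:title": "attr_dc_title"}  # predicates cached as attribute tables
triples = [
    ("dmoz:Top/Arts", "dc:title", "Arts"),
    ("dmoz:Top/Arts", "dmoz:narrow", "dmoz:Top/Arts/Music"),
]
for s, p, o in triples:
    if p in CACHED:   # cached predicate goes to its own attribute table
        con.execute("INSERT INTO %s VALUES (?, ?)" % CACHED[p], (s, o))
    else:             # everything else falls back to the generic table
        con.execute("INSERT INTO rdf_statements VALUES (?, ?, ?)", (s, p, o))

# The same kind of lookup, answered from each layout:
title = con.execute("SELECT object FROM attr_dc_title WHERE subject = ?",
                    ("dmoz:Top/Arts",)).fetchone()[0]
narrow = con.execute("SELECT object FROM rdf_statements "
                     "WHERE subject = ? AND predicate = ?",
                     ("dmoz:Top/Arts", "dmoz:narrow")).fetchone()[0]
print(title, narrow)
```

Note that the attribute-table query drops the predicate constraint entirely - the predicate is implicit in the table name, which is where any benefit over the generic layout would come from.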

We implemented a simplified driver for this layout which allows the user to specify a set of predicates which should be cached in this way and uses the generic statement table for all other predicates. Comparing this to the generic layout on an Interbase database gave the following results on small test data (20k statements), with all relevant attributes set to be cached as separate tables:

Layout               load  queryTree  queryRev  queryInt  queryEq  querySW
Generic/Interbase    11.1   1.7        5.4      22.0       5.7      63.3
Attribute/Interbase  15.0   1.8        6.4      29.3       5.7      63.0

This suggests no benefit to this type of partitioning given the indexing capabilities of Interbase. There are several caveats here. First, we should rerun this test on much larger databases before drawing any general conclusions from it. Second, our tests are simple triple-level queries. In the ICS-FORTH work they show how the table inheritance properties of postgresql can be exploited to support RDF Schema queries involving sub-property hierarchies - something that could not easily be done using a universal statement table approach.

Content hashes

One of the major portability problems with databases is the wide variation in methods for generating unique row identifiers, something we need to do rather a lot. One obvious alternative to using database-generated numbers for entries is to use a probabilistically-unique content hash algorithm (such as MD5) to generate the IDs. Such hashes have the disadvantage that they are larger (which can lead to less efficient indexing by the database) and take more cycles to compute than a simple transactional counter. On the other hand they are completely database independent and globally stable - this means that we could take data from two different databases and merge them by outer joins on each table. What are the performance tradeoffs like? Testing on Interbase we found:

Layout                 load  queryTree  queryRev  queryInt  queryEq  querySW
GenericProc/Interbase  13.5   1.2        2.8      31.1       3.0     243.2
Hash/Interbase         31.8   1.3        2.9      35.5       3.1     261.8

The only significant difference here is in load times - this is partly because the Generic form can use stored procedures, partly because IDs form a large part of the data transfer, with hash IDs 4x bigger than ints, and partly due to the computational cost of the hash algorithm on a mid-range PC.
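The content-hash scheme itself is easy to sketch in Python, with hashlib standing in for the Java-side MD5 code (the URI here is illustrative):

```python
import hashlib

def content_id(node: str) -> bytes:
    # A probabilistically-unique 128-bit ID computed from the node content
    # itself, so no database round trip or counter is needed to allocate it.
    return hashlib.md5(node.encode("utf-8")).digest()

uri = "http://dmoz.org/rdf/Top/Arts"
id_a = content_id(uri)  # computed while loading one database...
id_b = content_id(uri)  # ...and independently while loading another
# Globally stable: both stores agree on the ID, so their tables could be
# merged by joining on it. The price: 16 bytes per ID versus a 4-byte int,
# which is the 4x transfer and indexing overhead noted above.
print(id_a == id_b, len(id_a))
```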

Storage manager comparison

Earlier studies such as "The Design and Performance Evaluation of Alternative XML Storage Strategies" (Feng Tian et al, U. Wisconsin-Madison, paper #151) suggest that raw storage manager systems will outperform full database systems for semi-structured storage such as XML (and by implication RDF). Jena already supports the Sleepycat Berkeley DB storage manager, so we compared it on a Win2K platform against Interbase, and against Mysql and Postgresql on Linux (Red Hat 6.2). The figures were:

Layout              load  queryTree  queryRev  queryInt  queryEq  querySW
Generic/Interbase   16.0   1.2        3.7      25.5       3.1      79.8
Generic/Postgresql  10.2   1.4        7.6       3.8       7.3      14.0
Generic/Mysql        8.3   0.97       4.6       0.96      4.2       1.1
Berkeley DB          3.5   0.13       0.64      3.1       0.37     12.1

Thus we see a factor of between 5 and 10 difference in favor of BDB with the exception of the range queries (int range and string pattern match) where some databases do outperform BDB, at least using jena's default choice of index schemes.

This comparison is grossly unfair because, even though we turned logging on in BDB, the jena BDB module does not support full transactions whereas that is a key feature of using the database machinery. Nevertheless we can conclude that if simple persistence, rather than full transaction support, is your goal and if your queries are primarily link following then use the BDB module. If your queries involve partial literal matching then Mysql looks like a good bet.
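The reason a storage manager does so well on link following is that a key-ordered store can answer such queries with a short range scan over adjacent keys. A toy illustration, in which an in-memory sorted list stands in for a B-tree cursor (this is not the BDB API, and the composite-key format is made up for the example):

```python
import bisect

# Composite keys "subject\x00predicate\x00object" kept in sorted order,
# as a B-tree storage manager would keep them on disk.
SEP = "\x00"
keys = sorted(SEP.join(t) for t in [
    ("Top/Arts", "narrow", "Top/Arts/Music"),
    ("Top/Arts", "narrow", "Top/Arts/Movies"),
    ("Top/Arts", "title", "Arts"),
    ("Top/Science", "title", "Science"),
])

def scan(prefix):
    # Seek to the first key >= prefix, then walk forward while it matches:
    # a link-following query costs one seek plus a few adjacent reads.
    i = bisect.bisect_left(keys, prefix)
    out = []
    while i < len(keys) and keys[i].startswith(prefix):
        out.append(tuple(keys[i].split(SEP)))
        i += 1
    return out

# All statements linking out of Top/Arts via the "narrow" predicate:
children = scan("Top/Arts" + SEP + "narrow" + SEP)
print(children)
```

A full SQL engine has to reach the same rows through its parser, planner and transaction machinery, which is roughly where the factor of 5-10 goes.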

Indexing options

The primary lookups in our generic layout involve the large rdf_statements table. What indexing schemes are the right ones to use? A common pattern is to use a joint index on subject/predicate ids along with an index on objects, or another joint index on predicate/object. Why not just throw everything in? We tested a few representative index schemes using Postgresql on 200k statement sets, with the following results:

Index scheme       load  queryTree  queryRev  queryInt  queryEq  querySW
(SP) O              9.0   1.5        7.6       8.1       6.8      93.9
(SP) O-            10.1   1.7        8.7      17.4       8.3     114.4
S P O-             17.2   1.9       34.9       4.2     152.8      96.7
S P O- (SP) (PO-)  18.1   1.5        9.4       3.0       7.8      88.7

Here O- indicates indexing on the raw "object" field rather than jointly on "object" and "object_isliteral".

These seem to suggest that the joint (SP) index is indeed superior to separate S P indices. Indexing every-which-way has some load cost (2x worse than the default scheme) and relatively little advantage. These choices are, however, quite database dependent. For example, the long queryEq time in the S P O- case may be due to a pathological query plan (see below) rather than a real effect of the indexing.
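The effect of the joint index is easy to reproduce in miniature with sqlite3 (the schema is simplified, and Postgresql's planner will of course differ in detail):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE rdf_statements
               (subject INTEGER, predicate INTEGER,
                object TEXT, object_isliteral INTEGER)""")
# The default (SP) O scheme from the table above: a joint index on
# subject+predicate plus a plain index on the object field.
con.execute("CREATE INDEX sp_idx ON rdf_statements (subject, predicate)")
con.execute("CREATE INDEX o_idx ON rdf_statements (object)")
con.executemany("INSERT INTO rdf_statements VALUES (?, ?, ?, ?)",
                [(s, p, "lit%d" % (s * 10 + p), 1)
                 for s in range(100) for p in range(5)])

# The commonest lookup constrains both subject and predicate; the joint
# index satisfies it in a single index probe.
plan = con.execute("EXPLAIN QUERY PLAN SELECT object FROM rdf_statements "
                   "WHERE subject = 7 AND predicate = 3").fetchall()
print(plan)  # the plan names sp_idx
row = con.execute("SELECT object FROM rdf_statements "
                  "WHERE subject = 7 AND predicate = 3").fetchone()
print(row)
```

With separate S and P indices the planner can use only one of them and must filter the other constraint row by row, which is consistent with the slower S P O- row above.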

Multiple models and query planners

Some of the layouts offer the option to tag each statement with a context id (a "model" in jena's terminology, not to be confused with model-based semantics in logician-speak). This allows a single database to support multiple jena models. The cost is clearly that each of the tables is then padded with statements which are irrelevant to a query within a given model, slowing queries down proportionately. This cost is very application dependent since it depends on the overlap of resource URIs and predicates between the co-stored models.

However, in the early stages of development we did a quick check to ensure that if you are using a multi-model scheme but are only actually storing one model the overhead should be negligible. On Interbase this was true "out of the box". On postgresql we found major variations in timing. It turns out that the list operations (which select from the rdf_statements table and then join the results against the rdf_resources table to look up the subject and predicate URIs) are hard for postgresql to plan. In particular, if we include an index on the model_id field then postgresql will try to use it, even though it has no selectivity in the case of a single-model database, and the query reduces to a brute-force iteration over the whole database. Removing these joins and doing them manually at the jdbc end overcomes these problems. The figures are instructive. These are figures for postgresql, on Linux, using the MMGeneric layout for just a 20k statement data fragment - hardly challenging:

Scheme                        load  queryTree  queryRev  queryInt  queryEq  querySW
full join, model_id indexed    9.9   3.3       963.0     31.6     643.0    123.0
full join, no model_id index  10.1   3.3        56.3     19.8      81.4    110.0
manual join                   10.1   3.6         6.1     20.7       6.9    112.0

This particular case has, as indicated, been resolved for the released code. However, when porting rdf stores to other databases similar query plan issues may arise, since the all-in-one-table approach stresses the query planner, and some manual tuning may be required.
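The manual-join workaround amounts to splitting one query into an id-level select plus client-side URI resolution. A sqlite3 sketch (schema and URIs heavily simplified from the real layout):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rdf_resources (id INTEGER PRIMARY KEY, uri TEXT)")
con.execute("CREATE TABLE rdf_statements "
            "(model_id INTEGER, subject INTEGER, predicate INTEGER, object INTEGER)")
uris = ["http://ex/a", "http://ex/knows", "http://ex/b"]
con.executemany("INSERT INTO rdf_resources VALUES (?, ?)", enumerate(uris))
con.execute("INSERT INTO rdf_statements VALUES (1, 0, 1, 2)")

# Full SQL join: one statement, but we are at the mercy of the planner.
full = con.execute("""SELECT s.uri, p.uri FROM rdf_statements st
                      JOIN rdf_resources s ON s.id = st.subject
                      JOIN rdf_resources p ON p.id = st.predicate
                      WHERE st.model_id = 1""").fetchall()

# Manual join: select the ids, then resolve URIs at the client end,
# sidestepping any pathological plan for the join.
rows = con.execute("SELECT subject, predicate FROM rdf_statements "
                   "WHERE model_id = 1").fetchall()
lookup = dict(con.execute("SELECT id, uri FROM rdf_resources"))
manual = [(lookup[s], lookup[p]) for s, p in rows]
print(full, manual)
```

In practice the resource lookup would be a cached map at the jdbc end rather than a second full select, but the shape of the workaround is the same.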

Stored procedures

Adding an element (namespace, resource, statement) to the database requires three operations: checking whether it is already there; if not, allocating an id (which may be a database operation using generators, an implicit operation using auto-increment columns, or a Java-side operation using content-hashing); and finally adding the element itself. Doing all of these at once in a single stored procedure could save time during loads.

Layout                 load 1  load 2  load 3  load 4
GenericProc/Interbase   9.1    11.3    17.3    15.3
Generic/Interbase       6.6     8.6    13.8    12.4

The four "load" figures are averages over loading successive discrete subsets (5-7k statements each) of the dmoz test data. Presumably these early loads will show bigger advantages for stored procedures than when the database is large and the times are dominated by the underlying search and insert operations. Whilst these do show some performance gains (around 25%) the difference is not large compared to the cost of writing and debugging stored procedures.
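The three-step add sequence that a stored procedure would bundle into one database call looks like this at the SQL level, sketched with sqlite3 and an auto-increment id column (table and function names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rdf_resources "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, uri TEXT UNIQUE)")

def add_resource(con, uri):
    # Step 1: check whether the element is already there.
    row = con.execute("SELECT id FROM rdf_resources WHERE uri = ?",
                      (uri,)).fetchone()
    if row:
        return row[0]
    # Steps 2 and 3: allocate an id (implicitly, via the auto-increment
    # column) and insert the element. Each step is a client/server round
    # trip here; a stored procedure would do all three server-side.
    cur = con.execute("INSERT INTO rdf_resources (uri) VALUES (?)", (uri,))
    return cur.lastrowid

a = add_resource(con, "http://dmoz.org/Top/Arts")
b = add_resource(con, "http://dmoz.org/Top/Arts")  # re-add finds the first id
print(a, b)
```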

jdbc prepared statement caching

The supplied drivers run the SQL operations by creating jdbc prepared statements. If we cache these prepared statements and reuse them the next time the same operation (but with different parameters) is needed there may be some savings - reduced SQL parsing costs at least. There is some risk in carrying prepared statements across transactions with some jdbc drivers, so we are currently conservative and flush the statement cache at transaction boundaries to avoid these problems.
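The caching pattern is a map from SQL text to prepared statement, cleared at commit. A Python sketch of its shape, with sqlite3 cursors standing in for jdbc PreparedStatement objects (the class and its names are illustrative, not the actual driver code):

```python
import sqlite3

class StatementCache:
    """Cache statements keyed by their SQL text; flush conservatively at
    transaction boundaries to stay safe across all driver implementations."""

    def __init__(self, con):
        self.con, self.cache = con, {}

    def execute(self, sql, params=()):
        cur = self.cache.get(sql)
        if cur is None:                 # prepare once per distinct SQL string
            cur = self.cache[sql] = self.con.cursor()
        return cur.execute(sql, params)

    def commit(self):
        self.con.commit()
        self.cache.clear()              # conservative flush at txn boundary

con = sqlite3.connect(":memory:")
sc = StatementCache(con)
sc.execute("CREATE TABLE t (n INTEGER)")
for n in range(3):                      # same SQL reused with new parameters
    sc.execute("INSERT INTO t VALUES (?)", (n,))
sc.commit()
total = sc.execute("SELECT sum(n) FROM t").fetchone()[0]
print(total)
```

The savings come from drivers that genuinely hold a server-side prepared statement; a driver that re-parses each statement anyway (as postgresql's did) gains nothing, which matches the figures below.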

This caching is turned on by default. For some jdbc drivers (e.g. postgresql) this has no measurable effect since each statement is treated afresh by the driver. For some (e.g. the Interbase Interclient driver 1.5.1) substantial savings are possible:

Layout                           load  queryTree  queryRev  queryInt  queryEq  querySW
GenericProc/Interbase/Cache on   11.4   1.7        4.1      22.0      4.4      62.9
GenericProc/Interbase/Cache off  23.5   4.5       12.4      25.6     12.7     67.0