Jena relational database interface - introduction

Dave Reynolds, 5/12/01

The jena/rdb module provides an implementation of the jena model interface which stores the RDF statement information in a relational database. The implementation can support a variety of database table layouts and can customize the SQL code to cope with the vagaries of different database implementations.

Contents

Getting started - creating and accessing database instances
Multiple models per database
Constraints
Database layouts
Notes

Getting started - creating and accessing database instances

Database-backed RDF models are instances of the class jena.rdb.ModelRDB. As well as implementing the full jena.model.Model interface, ModelRDB provides static methods to create, extend and reopen database instances.

First consider the situation where we have an available database but as yet it has no RDF models stored in it and we want to format it for holding RDF statements. In that case we would use:




    DBConnection dbcon = new DBConnection(DATABASE_URI, user, password);
    ModelRDB model = ModelRDB.create(dbcon, LAYOUT_STYLE, DATABASE_TYPE);

The DBConnection class provides different methods for specifying the underlying database. In particular it can be specified, as in the example above, as a jdbc uri (e.g. jdbc:interbase:\\localhost:\databases\test.gdb) along with any required user name and password. Alternatively, the database connection can be opened using the standard jdbc calls and the resulting jdbc Connection object can be wrapped up as a DBConnection and passed on to ModelRDB.create.

The ModelRDB.create call takes two arguments in addition to the database connection itself. Firstly, the LAYOUT_STYLE is a string defining the type of database table structure to be used. Typical values for this include:
Generic - General layout; all statements are stored in a single table. Resources and literals are indexed using integer ids generated by database sequence generators.
GenericProc - Variant on the generic layout that uses stored procedures for all model updates; this can have a 30-50% performance advantage in some cases.
MMGeneric - Similar layout to "Generic" but can support more than one jena model in a single database.
Hash - Similar layout to "Generic" but uses MD5 hashes to generate the ids for resources and literals. This avoids relying on the database generators and is more portable, for a small performance hit.
MMHash - Similar layout to "Hash" but can support more than one jena model in a single database.

The second argument, DATABASE_TYPE, is a string defining the type of the database. Whilst jdbc offers good database independence, most SQL code remains database-dependent: for example, sequence generators, stored procedures and limitations on table indexes all vary across databases. The jena RDB modules cope with this by allowing implementors to customize the SQL code to suit the database server to be used. If using a portable layout such as "Generic" or "Hash" then a DATABASE_TYPE of "Generic" may work; otherwise use a specific database name here. The distribution includes configuration files for "Interbase", "Mysql", "Postgresql" and "Oracle". Others can be created. The matrix of currently supported layouts is:

Database     Layouts
Postgresql   Generic, MMGeneric, Hash, MMHash
Mysql        Generic, MMGeneric, Hash, MMHash
Interbase    Generic, MMGeneric, Hash, MMHash
             [Implementations of GenericProc, MMGenericProc are also provided but not supported and are subject to code-rot.]
Oracle       MMGeneric

The call to ModelRDB.create will create the appropriate database tables and record within the database a note of the layout chosen. This means that a previously created database can be reopened using:




    DBConnection dbcon = new DBConnection(DATABASE_URI, user, password);
    ModelRDB model = ModelRDB.open(dbcon);

Note that no layout or database information is needed this time - it is retrieved from the pre-formatted database.

Multiple models per database

Some database formats only support one jena model per database. Other layouts can support multiple models with a single database - these have slightly lower performance but can be more convenient. Thus if dbConnection is a connection to an already formatted database whose layout supports multiple models then the call:




    ModelRDB model = ModelRDB.createModel(dbConnection, modelName);

will create an additional model within the same database. The modelName can be used to reopen the same model in the future using:



    ModelRDB model = ModelRDB.open(dbConnection, modelName);

and



    Iterator it = ModelRDB.listModels(dbConnection);

will list the names of all the models stored in the database.

Constraints

The ModelRDB interface supports all the standard jena facilities for navigating the model. This allows us to, for example, find all statements with a given pattern of subject, property and object values. If we wish to perform partial matching on object literal values (e.g. finding all statements whose literal object value starts with "foo" or is an integer in the range [2,8), say) then we have to use the Selector mechanism. Unfortunately in this case all candidate statements with matching subject and property values will be retrieved and then filtered by the supplied Selector.test() code.
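To make the cost of this approach concrete, here is a minimal sketch in plain Java. It does not use the jena Selector API: the Statement class, the sample data and the inRange and filter helpers are all hypothetical. The point is only that every candidate with the right subject and property is visited in application code before the literal test is applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for selector-style filtering; not a jena class.
class SelectorSketch {
    // Simplified statement whose object is always a literal string.
    static final class Statement {
        final String subject, property, object;
        Statement(String subject, String property, String object) {
            this.subject = subject;
            this.property = property;
            this.object = object;
        }
    }

    // True if the literal parses as an integer in the half-open range [lo, hi).
    static boolean inRange(String literal, int lo, int hi) {
        try {
            int v = Integer.parseInt(literal.trim());
            return v >= lo && v < hi;
        } catch (NumberFormatException e) {
            return false; // not an integer literal
        }
    }

    // Every candidate is visited; the range test runs once per candidate.
    static List<Statement> filter(List<Statement> candidates, int lo, int hi) {
        List<Statement> result = new ArrayList<>();
        for (Statement s : candidates) {
            if (inRange(s.object, lo, hi)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Statement> candidates = new ArrayList<>();
        for (String lit : new String[] { "1", "2", "7", "8", "bar" }) {
            candidates.add(new Statement("foo", "prop", lit));
        }
        List<Statement> kept = filter(candidates, 2, 8);
        System.out.println(kept.size() + " of " + candidates.size() + " candidates kept");
    }
}
```

With a large number of candidates, the work done by filter grows with the candidate count rather than with the number of matches.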

The RDB package allows us to use the underlying database implementation by providing an alternative mechanism for listing statements - that of constraints. For example,




    IConstraints constraints = modelrdb.createConstraints();
    constraints.addSubjectConstraint(foo)
               .addPropertyConstraint(prop);
    Iterator statements = modelrdb.listStatements(constraints);

will return an iterator over all statements in the model with subject foo and property prop. More interestingly the code:



    IConstraints constraints = modelrdb.createConstraints();
    constraints.addSubjectConstraint(foo)
               .addPropertyConstraint(prop)
               .addStringConstraint("NOT LIKE", "%bar%");
    Iterator statements = modelrdb.listStatements(constraints);

will list just that subset of the above statements whose object value is a literal string which does not contain the substring "bar". The first argument of the addStringConstraint call can be any standard SQL string match operation. Note that this is a potential source of porting problems across different databases - most databases support "LIKE" but some don't use the ANSI SQL pattern characters (e.g. '%') and some have other operators ("CONTAINS", "STARTSWITH", "REGEXP").
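For readers less familiar with the ANSI pattern characters, the following plain-Java sketch shows what a pattern such as "%bar%" means: '%' matches any run of characters and '_' matches exactly one. The likeToRegex and like helpers are hypothetical illustrations and are not part of the RDB package. The real matching is done by the database, whose rules on case sensitivity and escaping may differ.

```java
import java.util.regex.Pattern;

// Hypothetical emulation of ANSI SQL LIKE matching; for illustration only.
class LikeSketch {
    // Translates a LIKE pattern into an equivalent Java regular expression.
    static String likeToRegex(String like) {
        StringBuilder sb = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                sb.append(".*");          // any run of characters
            } else if (c == '_') {
                sb.append('.');           // exactly one character
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return sb.toString();
    }

    // True if the whole value matches the LIKE pattern.
    static boolean like(String value, String pattern) {
        return Pattern.compile(likeToRegex(pattern), Pattern.DOTALL)
                      .matcher(value)
                      .matches();
    }

    public static void main(String[] args) {
        for (String lit : new String[] { "foobar", "bar", "baz" }) {
            // NOT LIKE '%bar%' keeps only literals without the substring "bar".
            System.out.println(lit + " NOT LIKE %bar% : " + !like(lit, "%bar%"));
        }
    }
}
```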

As well as string matching there is some experimental support for integer-valued literals. When and if jena is extended to support true typed literals a fuller match constraint mechanism might be possible. In the meantime, to support the common case of integer literals we note any literal in the database which could be interpreted as an integer. In this way we can support code such as:



    IConstraints constraints = modelrdb.createConstraints();
    constraints.addSubjectConstraint(foo)
               .addIntConstraint("<=", 42)
               .addIntConstraint(">", 4);
    Iterator statements = modelrdb.listStatements(constraints);

Note that in all these cases the constraints object can be reused which may avoid the overhead of generating and parsing the required SQL code (depending on the nature of the jdbc driver in use).
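One way to picture why reuse can help is to treat a constraints object as a builder for a parameterized query whose SQL text is generated once and then kept. The sketch below is purely illustrative: the ConstraintSketch class, its method names and the SQL it emits are assumptions, not the code the RDB package actually generates. The table and column names are borrowed from the layout tables in the "Database layouts" section, and only literal-valued objects are handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical constraint builder; not the IConstraints implementation.
class ConstraintSketch {
    private static final List<String> INT_OPS =
            Arrays.asList("<", "<=", ">", ">=", "=", "<>");

    private final List<String> clauses = new ArrayList<>();
    private final List<Object> params = new ArrayList<>();
    private String cachedSql; // generated lazily, reused until a constraint is added

    ConstraintSketch addSubject(Object subjectId) {
        return add("s.subject = ?", subjectId);
    }

    ConstraintSketch addInt(String op, int value) {
        if (!INT_OPS.contains(op)) {
            throw new IllegalArgumentException("unsupported operator: " + op);
        }
        // Only literals flagged as integers can take part in the comparison.
        return add("l.int_ok = 1 AND l.int_literal " + op + " ?", value);
    }

    private ConstraintSketch add(String clause, Object param) {
        clauses.add(clause);
        params.add(param);
        cachedSql = null; // the SQL text must be regenerated
        return this;
    }

    // Returns the SQL text, building it only when it is not already cached.
    String sql() {
        if (cachedSql == null) {
            StringBuilder sb = new StringBuilder(
                "SELECT s.subject, s.predicate, s.object"
                + " FROM RDF_STATEMENTS s JOIN RDF_LITERALS l ON s.object = l.id"
                + " WHERE s.object_isliteral = 1");
            for (String clause : clauses) {
                sb.append(" AND ").append(clause);
            }
            cachedSql = sb.toString();
        }
        return cachedSql;
    }

    List<Object> parameters() {
        return params;
    }

    public static void main(String[] args) {
        ConstraintSketch c = new ConstraintSketch()
                .addSubject(17)       // 17 is a made-up resource id
                .addInt("<=", 42)
                .addInt(">", 4);
        System.out.println(c.sql());
        System.out.println(c.parameters());
    }
}
```

In this sketch a second call to sql() returns the cached string, so the generation cost is paid once per constraints object. Whether the database also avoids re-parsing depends on the jdbc driver.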

Database layouts

One of the aims of the RDB package was to support experimentation with different database layouts. Some of this experimentation was done during the package development (see performance notes) but the main supported layouts included in this release are small variants on the standard triple table schemas. Viz:

RDF_STATEMENTS
Column name        Type       Comments
subject            id-ref
predicate          id-ref
object             id-ref
object_isliteral   smallint   flags whether "object" is in literal or resource table
model              id-ref     only used in multiple-model variants
isreified          smallint   not used at present

RDF_LITERALS
Column name   Type       Comments
id            id-ref
language      varchar    xml:lang value if available
literal_idx   varchar    the literal itself or the largest subset of that which is indexable by the database
literal       blob       the full literal value if the literal won't fit in literal_idx
int_ok        smallint   flag to indicate that a parse of the literal into an integer is available
int_literal   int        the integer value of the literal, only valid if int_ok=1
well_formed   smallint   preserve jena flag that the literal is well-formed xml
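As an illustration of how the literal_idx, literal, int_ok and int_literal columns relate to one another, the following sketch derives them from a literal string. The LiteralRow class and the MAX_INDEXED length of 250 characters are assumptions made for the example: the real limit depends on the database, and the package's own rule for recognizing integers may differ.

```java
// Hypothetical model of one RDF_LITERALS row; not part of the RDB package.
class LiteralRow {
    static final int MAX_INDEXED = 250; // assumed index limit, database dependent

    final String literalIdx; // whole literal, or its indexable prefix
    final String literal;    // full value, set only when it does not fit in literalIdx
    final int intOk;         // 1 if intLiteral holds a valid parse, else 0
    final int intLiteral;    // parsed integer value, meaningful only if intOk == 1

    LiteralRow(String value) {
        boolean fits = value.length() <= MAX_INDEXED;
        literalIdx = fits ? value : value.substring(0, MAX_INDEXED);
        literal = fits ? null : value;

        int parsed = 0;
        int ok = 0;
        try {
            parsed = Integer.parseInt(value.trim());
            ok = 1;
        } catch (NumberFormatException e) {
            // Not an integer: leave intOk at 0 so intLiteral is ignored.
        }
        intOk = ok;
        intLiteral = parsed;
    }

    public static void main(String[] args) {
        for (String v : new String[] { "42", "foo" }) {
            LiteralRow row = new LiteralRow(v);
            System.out.println(v + " -> int_ok=" + row.intOk + ", int_literal=" + row.intLiteral);
        }
    }
}
```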

RDF_RESOURCES
Column name   Type      Comments
id            id-ref
namespace     id-ref    pointer to namespace table
localname     varchar

RDF_NAMESPACES
Column name   Type      Comments
id            id-ref
uri           varchar

RDF_MODELS
Column name   Type      Comments
id            id-ref
name          varchar   Used when reopening a persistent model in a database that supports more than one model.

RDF_LAYOUT_INFO - name/value pairs which define the layout properties
Column name   Type      Comments
name          varchar
val           varchar

The id-ref type used above is typically mapped either to an int or to a char string. For some schemes we allocate integer ids for the statements, resources etc. by using database sequence generators or auto-increment columns, in which case all id-refs are ints. An alternative approach is to use a unique content hash, such as MD5 or SHA-1, to generate a globally unique ID which can be used across databases. Depending on the database jdbc driver we either store this hash-id in a CHAR(16) or base64-encode it as a string into a CHAR(24) value. For more details see the porting notes.
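The two column widths follow from the size of an MD5 digest: it is 16 bytes long, and base64 encoding of 16 bytes yields 24 characters including padding. The sketch below checks this using only the standard Java library. The HashIdSketch class is hypothetical, and hashing the UTF-8 bytes of the whole string is an assumption; this document does not specify exactly which content the package hashes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper showing the sizes of MD5-based ids; not part of the RDB package.
class HashIdSketch {
    // Raw 16-byte MD5 digest of the content (assumed to be hashed as UTF-8).
    static byte[] md5(String content) {
        try {
            return MessageDigest.getInstance("MD5")
                                .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Base64 text form of the digest, suitable for a CHAR(24) column.
    static String base64Id(String content) {
        return Base64.getEncoder().encodeToString(md5(content));
    }

    public static void main(String[] args) {
        String uri = "http://example.org/resource"; // made-up sample content
        System.out.println("raw bytes: " + md5(uri).length);
        System.out.println("base64 id: " + base64Id(uri)
                + " (" + base64Id(uri).length() + " chars)");
    }
}
```

Because the id depends only on the content, the same resource or literal maps to the same id in every database, which is what makes the hash layouts independent of database generators.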

The layouts currently defined are:

Layout          Supports multiple-models?   Uses hash ids?   Comments
Generic         no                          no               See above for details
MMGeneric       yes                         no
GenericProc     no                          no               Variant on generic that uses stored procedures for updates
MMGenericProc   yes                         no               Variant on generic that uses stored procedures for updates
Hash            no                          yes
MMHash          yes                         yes

The supplied configuration files for Interbase support all six variants; those for Mysql and Postgresql do not support the two "proc" variants but support all the others. Here "support" means that the layout passes all jena regression tests.

Notes