Jena relational database interface - notes on porting

Dave Reynolds, 5/12/01

The jena/rdb module provides an implementation of the jena model interface which stores the RDF statement information in a relational database. The implementation can support a variety of database table layouts and can customize the SQL code to cope with the vagaries of different database implementations.

These notes provide some hints on how the driver/layout combinations are supported and how to build new ones.

Database layouts

One of the aims of the RDB package was to support experimentation with different database layouts. Some of this experimentation was done during the package development (see performance notes) but the main supported layouts included in this release are small variants on the standard triple table schemas. Viz:

RDF_STATEMENTS
Column name
Type
Comments
subject id-ref  
predicate id-ref  
object id-ref  
object_isliteral smallint flags whether "object" is in literal or resource table
model id-ref only used in multiple-model variants
isreified smallint not used at present

RDF_LITERALS
Column name
Type
Comments
id id-ref  
language varchar xml:lang value if available
literal_idx varchar the literal itself or the largest subset of that which is indexable by the database
literal blob the full literal value if the literal won't fit in literal_idx
int_ok smallint flag to indicate that an parse of the literal into an integer is available
int_literal int the integer value of the literal, only valid if int_OK=1
well_formed smallint preserve jena flag that the literal is well-formed xml

RDF_RESOURCES
Column name
Type
Comments
id id-ref  
namespace id-ref pointer to namespace table
localname varchar  

RDF_NAMESPACES
Column name
Type
Comments
id id-ref  
uri varchar  

RDF_MODELS
Column name
Type
Comments
id id-ref  
name varchar Used when reopening a persistent model in a database that supports more than one model.

RDF_LAYOUT_INFO - name/value pairs which define the layout properties
Column name
Type
Comments
name varchar  
val varchar  

The id-ref type used above is typically either mapped to an int or a char string. For some schemes we allocated integer id's for the statements, resources etc by using database sequence generators or using auto-increment columns in which case all id-refs are ints. An alternative approach is to use a unique content hash, such as MD-5 or SHA-1, to generate a globally unique ID which can be used across databases. Depending on the database jdbc driver we can either store this hash-id in a CHAR(16) or we base64-encode it as a string into a CHAR(24) value.

Porting to alternative databases or formats

One design goal for the jena rdb package was to support multiple databases. Out of the box it includes some support for Mysql, Postgresql and Interbase but other databases should be easy or at least possible to support. We achieve this by several tricks:

Porting issues

Most of the porting issues come down to tweaking the SQL text files. Here are some things to watch out for. In later sections we'll go more into the mechanics.

ID type
As noted in the layouts section we need to support different implementations of the cross-table ids - generated integers, raw hash values or base64 encoded hash values. To cater for this the SQL files can use a macro ${id} which the driver will expand to an appropriate SQL type as specified in a configuration file. At the Java end we provide a generic interface class to wrap up the Java objects corresponding to the id type (see IDBID) and provide instances of that for Integers and hash Strings.
Alias syntax
When specifying self-joins you need to supply an alias to the name the different logical table instances. Databases differ on whether the "AS" keyword is mandatory, allowed or forbidden in specifying aliases.
Column sizes
We need to build several indexes and joint indexes on columns. For varchar columns databases vary wildly on how big a column can be indexed. This may lead the limitations on the maximum size of some elements (e.g. uri's and qnames) that can be stored in a given database. For the particular case of literals this limitation was not acceptable so we provide the machinery to overflow big literals into a blob and use a truncated literal (with an appended CRC hash) for the index-able field.
Stored procedures
Adding an element (namespace, resource, statement) to the database requires three operations - checking if it is already there, if not then allocating an id (which may be a database operation using generators, an implicit operation using auto-increment columns or a Java-side operation using content-hashing) then finally adding the element itself. Doing all of these at once in a single stored procedure can save around 20% in load times. Not huge but possibly worth it. We provide an example of this approach for Interbase/Generic. To provide this support for a new database type you need to provide the SQL to define, use and delete the procedures and to set the insertByProcedure flag in the driver file or, better, its config file.
Generators
Databases differ in whether and how they support sequence generators. If your database does have some form of generator then simply add the appropriate SQL code for the initDBgenerators, cleanDBgenerators and the various alloc* operations. If your database uses auto-increment columns to achieve this effect then just label the id columns in the table definitions accordingly. If neither is true then you can only support the Hash layout which allows the Java side to reliably allocate unique identifiers.
Select complexity
Some of the retrieval operations require up to three tables to be joined together (typically inner and self joins). In some cases this can most easily be expressed using nested selects but not all databases (e.g. Mysql) support that. The existing select code avoids nesting for these reasons. Then some databases seem to generate more efficient plans if the join type is made explicit rather being left implicit. Finally there are pathological cases where Postgresql in particular generates disastrous query plans. The problem cases are the list* operations which by default both search the statement table for qualifying statements and lookup the subject and predicate uri's (corresponding to the selected statements) from the resources table to reduce the number of calls. Most databases seem to be fine with this but Postgresql can generate horrible query plans in that case. To handle this the Java code can dynamically cope with the list* operations returning just the core statement table values or the additional dereferenced rdf_resources information, so driver file can chose the simpler route if appropriate. See the postgresql sample SQL code in the release.
Prepared statements
The supplied drivers run the SQL operations by creating jdbc prepared statements. In some cases (e.g. Interbase) reusing those prepared statements can yield a 2x or even 3x performance improvement in other cases (e.g. Postgresql) there is no measurable benefit. It all depends on the jdbc driver. There is some risk with carrying prepared statements across transactions with some jdbc drivers. We are currently conservative and flush the statement cache at transaction boundaries. Thus a typical default is to go ahead and cache prepared statements, then if there is any problem with transaction boundaries or any other jdbc driver issue with this then just switch it off - use the cachePreparedStatements property in the config file.

With these porting issues in mind we'll now look more closely at how the RDB actually finds the appropriate driver files and the file formats involved.

Drivers and driver location

When a ModelRDB is created the RDB code locates the appropriate Java driver file for the given database/layout combination and then uses that to implement all store access operations. See IRDBDriver for the full interface required for this driver file.

The connection object DBConnection has the responsibility for finding an appropriate driver class, see the getDriver/2 method. It goes through the following steps:

  1. Attempt to load a class called com.hp.hpl.mesa.rdf.jena.rdb.Driver<Database><Layout>. Thus if an appropriately named class is available on the classpath then this will be used and all the mechanisms below get bypassed. A couple of special cases are supported this way in the released code - see rdb.DriverInterbaseHash for example.
  2. If a specific driver class file cannot be found then it attempts to load a layout-specific textual configuration file which will be used to customize one of the built-in generic drivers. These configuration files are found in:
        etc/<Database>_<Layout>.config
    Currently these files are also assumed to be somewhere on the jvm class path (they are loaded using the classloader even though they are resource files). The release jar file contains an initial set of config files in the etc/ directory but this set is also duplicated at the top level of the release code. It might be better to define a JENA_HOME property and go directly to this top level etc/ directory - to be discussed.
    These layout-specific configuration files define which existing Java driver class is to be used as the base class and provide property values to customize the driver behaviour - in particular to specify the location of a file of SQL commands.
    See below for more information on the configuration files.
  3. If both these routes fail then we look for a general purpose configuration file called:
        etc/<Database>.config
    These provide configuration parameters suited to a given database (and thus SQL dialect). There are no-layout specific tweaks. These generic configuration files still need to specify any database-specific SQL command sources but the Java driver class defaults to the rdb.DriverGeneric<Layout>.

The most common situation is that the built-in Java driver classes are sufficient but some tweaking is needed for a given database. In that case we use branch (3) above and just provide a minimal database-wide configuration file. See the Mysql.config and Postgresql.config files in the distribution. This generic config file then points to a file of SQL commands which override any of the default commands which are not suitable for the given database.

Configuration file formats

The config files mentioned above conform to the java.utils.Properties.load definition (comments lines start with #, parameters specified one per line in the form "param = value"). The Template.config file gives a full list of all the properties that the built-in drivers currently use (see listing below). In practice normally only the first few of these need setting.

# The name of the database being configured

database = Foobase





# The name of the jdbc driver class to be used

jdbcDriver = org.foobase.Driver





# The file of SQL definitions to use

# Usually *_all.sql for most layouts and then specific variants for 

# specific layouts.

sqlFile = etc/foobase_all.sql





# Set to "true" if the driver should cache jdbc PreparedStatements for reuse.

# Some drivers (postgresql?) do nothing special with PreparedStatements anyway so

# this has no  great performance benefit. Can potentially cause conflicts

# with transactions for some divers but can also have 2-3x performance benefit.

cachePreparedStatements = true



# The maximum size of string literals that can be indexed.

# Literals longer than this will be stored in a blob with just the head of the 

# literal indexed.

maxLiteral = 32000



###################################################################

# Options below here should only be set for specific layout configs.



# The name of the layout implemented

# layout = all



# The fully qualified name of the Java Driver class to be extended

# baseClass = com.hp.hpl.mesa.rdf.jena.rdb.DriverGenericGeneric



###################################################################

# Options below here should only be set for specific layout configs

# and even then usually the values set by the base Java class is correct and

# should not be changed unless you are sure.



# The SQL type used for identifiers. This is typically "INTEGER" for normal layouts

# and either CHAR(16) or CHAR(24) for hash layouts. Normally the value inherited from

# the generic driver is correct.

# idType = INTEGER



# Set to "true" if the Java driver should allocate IDs explicitly.

# Typically this is only false when insert procedures have been defined.

# Or you are using auto-inc column types.

# allocateID = true



# Set to "true" if the Java driver should explicitly check for duplicates

# before performing inserts. Typically this is only false if a full insert

# procedure has been defined.

# checkDuplicates = true



# Set to "true" if the driver should use insertFooProc instead of insertFoo to

# insert objects of type Foo (Namespace, Resource, Statement or Literal).

# The insertFooProc operations are usually procedure calls that might do the

# duplication check and ID allocation in line. See "allocateID" and "checkDuplicates".

# insertByProcedure = false

SQL file formats

All of the Java driver files in the distribution load in their SQL commands from text files to allow customization. Like the *.config files described above the SQL files are currently expected to be in files of the form:
      etc/*.sql
somewhere on the classpath.

Each file defines the SQL for a set of named "operations" used by the Java driver files.

All of the supplied Java driver files load two SQL files, one default one for the type of layout and one database specific customization of that. Any operation defined in the more specific SQL file simply replaces the one defined in the generic version.

The syntax of the SQL driver files is crude. Comment lines start with #. Each SQL operation is defined by a block of SQL code. The first line of the block gives the operation name, all subsequent lines are the defining SQL code until a blank line which signals end of the block. Thus

# comment at start of line

operationName1

SQL code line 1

...

SQL code last line

 

# comment describing next operation

operationName2

SQL code

...

The SQL code is typically a single SQL statement but some operations, specifically database initialization and cleanup may require a variable number of statements. To cater for this terminate each statement in those groups with the string ";;". Note that a single ";" is not used because these compound statements are often stored procedure definitions which tend to have embedded";" line terminators.

The full list of operations needed by the current drivers is shown below. The example code is the easiest guide to the calling format for these operations.


operations block/single statement meaning

initDBtables
cleanDBtables

block Create/delete the initial SQL tables for this layout
initDBgenerators
cleanDBgenerators
block Create/delete any sequences or generators needed/possible in this database type
initDBprocedures
cleanDBprocedures
block Create/delete any stored procedures needed (normally only applies to Proc layout types)
getNamespace
getResource
getLiteral
single Return a namespace/resource/literal given its database ID
getNamespaceID
getResourceID
getLiteralID
single Return the database ID of a namespace/resource/literal given its string representation.
allocateNamespaceID
allocateResourceID
allocateLiteralID
single Allocate a new ID for storing a namespace/resource/literal. Not relevant if DB uses stored procedures for this or uses auto-inc index columns.
insertNamespace
insertResource
insertLiteral
insertLiteralInt
insertStatement
single Insert a new namespaces/resource/literal/literal-with-int-translation/statement into the database.

insertNamespaceProc
insertResourceProc
insertLiteralProc
insertLiteralIntProc
insertStatementProc

single Variant on the above which are expected to be references to stored procedures which may check for duplicates and do inline allocation of IDs.
listSubjects single list all resources which are subjects of some statement
listNamespaces single list the namespaces of all properties used in some statement
deleteStatement single delete a statement
listModels
getModelID
allocateModelID
insertModel
single Support for more than one model in a dB
List all model names, lookup a model ID from its name, allocate a new model ID, record a new model name/ID pair.
list listS listP listO listSP listPO listSO listSPO single List all statements with the given combinations of subject, predicate and object values.

Suggested porting steps

  1. Try Generic/Generic it might just work!
  2. If that fails then try to get just the Generic layout working for your database. The advantages of the other layouts is not that great (see performance notes).
    To do this create a "mydatabase.config" file from the Template.config and replace the database name and the name of the SQL file appropriately.
    To generate the SQL file work through the generic_generic.sql using your databases interactive tools, and the prompt list of portability issues above, and figure out what changes are needed.
    Copy just those operations that need changing from generic_generic.sql to your mydatabase_all.sql file and edit those to be conformant.
    Set the other config flags to mirror these changes (specifically, the maximum length of an indexable literal, whether to use stored procedures for inserts and whether the driver needs to explicitly call the allocateId operations).
  3. Run the regression tests (see below).

Regression testing

The jena RDB package does provide support for regression testing. This isn't integrated into the mega jena regression test since you need to have appropriate databases up and running to run the tests.

The primary interface to the RDF regression tests is TestRDB which both exports a junit suite that you can use in larger tests and provides a main entry point so you can interactively (or via scripts) run tests on specific database/layout combinations. In order to use the tests you need to specify appropriate database information using the following Java properties:

jena.rdb.test.dbbbase The base jdbc uri of the database
jena.rdb.test.user The user name for logging in to the database.
jena.rdb.test.password The password for that user.
jena.rdb.test.dbfull The full jdbc uri of the database to use, this overrides the use of testrdf/jr* described below and is needed for databases like Oracle whose jdbc uri cannot be used as a base for a set of related database names.

The RDB specific regression tests expect a database called "testrdf" to exist (this string is appended to the jena.rdb.test.dbbase value). This database will be reused for all the internal tests and should be left clean.

TestRDB will then run the standard jena regression tests on models built using the database. If the layout mode is supposed to support more that one model in a database it only needs a single database called "jr" (jena regression) and the four jena regression models will all be stored in that. However, if the layout only supports a single model per database then four databases are needed (jr1 through jr4).

If the jena.rdb.test.dbfull parameter has been set then given database name is used for both the RDB core tests and the full jena regression test (multi-model layout). This workaround enables us to test Oracle configurations where the database uri does not end with the local database name but does not support regression testing of single model layouts.