Dave Reynolds, 5/12/01
The jena/rdb module provides an implementation of the jena model
interface which stores the RDF statement information in a relational database.
The implementation can support a variety of database table layouts and can customize
the SQL code to cope with the vagaries of different database implementations.
These notes provide some hints on how the driver/layout combinations are supported and how to build new ones.
One of the aims of the RDB package was to support experimentation with different database layouts. Some of this experimentation was done during the package development (see performance notes) but the main supported layouts included in this release are small variants on the standard triple table schemas. Viz:
RDF_STATEMENTS | ||
Column name
|
Type
|
Comments
|
subject | id-ref | |
predicate | id-ref | |
object | id-ref | |
object_isliteral | smallint | flags whether "object" is in literal or resource table |
model | id-ref | only used in multiple-model variants |
isreified | smallint | not used at present |
RDF_LITERALS | ||
Column name
|
Type
|
Comments
|
id | id-ref | |
language | varchar | xml:lang value if available |
literal_idx | varchar | the literal itself or the largest subset of that which is indexable by the database |
literal | blob | the full literal value if the literal won't fit in literal_idx |
int_ok | smallint | flag to indicate that an parse of the literal into an integer is available |
int_literal | int | the integer value of the literal, only valid if int_OK=1 |
well_formed | smallint | preserve jena flag that the literal is well-formed xml |
RDF_RESOURCES | ||
Column name
|
Type
|
Comments
|
id | id-ref | |
namespace | id-ref | pointer to namespace table |
localname | varchar |
RDF_NAMESPACES | ||
Column name
|
Type
|
Comments
|
id | id-ref | |
uri | varchar |
RDF_MODELS | ||
Column name
|
Type
|
Comments
|
id | id-ref | |
name | varchar | Used when reopening a persistent model in a database that supports more than one model. |
RDF_LAYOUT_INFO - name/value pairs which define the layout properties | ||
Column name
|
Type
|
Comments
|
name | varchar | |
val | varchar |
The id-ref
type used above is typically either mapped to an int
or a char
string. For some schemes we allocated integer id's for
the statements, resources etc by using database sequence generators or using
auto-increment columns in which case all id-refs are int
s. An alternative
approach is to use a unique content hash, such as MD-5 or SHA-1, to generate
a globally unique ID which can be used across databases. Depending on the database
jdbc driver we can either store this hash-id in a CHAR(16)
or we
base64-encode it as a string into a CHAR(24)
value.
One design goal for the jena rdb package was to support multiple databases. Out of the box it includes some support for Mysql, Postgresql and Interbase but other databases should be easy or at least possible to support. We achieve this by several tricks:
ModelRDB
and
StoreRDB
remain constant but the when the model is opened a suitable
implementation of the interface IRDBDriver
is found and used.
Arbitrary adaptations can be made at this level but typically we use inheritance
to reuse the code in DriverGenericGeneric
or one of its existing
specializations.Most of the porting issues come down to tweaking the SQL text files. Here are some things to watch out for. In later sections we'll go more into the mechanics.
IDBID
) and provide
instances of that for Integers and hash Strings.insertByProcedure
flag in the driver file or, better, its config file.initDBgenerators
, cleanDBgenerators
and the various alloc*
operations. If your database uses auto-increment
columns to achieve this effect then just label the id columns in the table
definitions accordingly. If neither is true then you can only support the
Hash layout which allows the Java side to reliably allocate unique identifiers.list*
operations which by default both search the statement table
for qualifying statements and lookup the subject and predicate uri's (corresponding
to the selected statements) from the resources table to reduce the number
of calls. Most databases seem to be fine with this but Postgresql can generate
horrible query plans in that case. To handle this the Java code can dynamically
cope with the list*
operations returning just the core statement
table values or the additional dereferenced rdf_resources information, so
driver file can chose the simpler route if appropriate. See the postgresql
sample SQL code in the release.cachePreparedStatements
property in the config file.With these porting issues in mind we'll now look more closely at how the RDB actually finds the appropriate driver files and the file formats involved.
When a ModelRDB is created the RDB code locates the appropriate Java driver
file for the given database/layout combination and then uses that to implement
all store access operations. See IRDBDriver
for the full interface
required for this driver file.
The connection object DBConnection
has the responsibility for
finding an appropriate driver class, see the getDriver
/2 method.
It goes through the following steps:
com.hp.hpl.mesa.rdf.jena.rdb.Driver<Database><Layout>
.
Thus if an appropriately named class is available on the classpath then this
will be used and all the mechanisms below get bypassed. A couple of special
cases are supported this way in the released code - see rdb.DriverInterbaseHash
for example.etc/<Database>_<Layout>.config
Currently these files are also assumed to be somewhere on the jvm class
path (they are loaded using the classloader even though they are resource
files). The release jar file contains an initial set of config files in the
etc/ directory but this set is also duplicated at the top level of the release
code. It might be better to define a JENA_HOME property and go directly to
this top level etc/ directory - to be discussed.etc/<Database>.config
These provide configuration parameters suited to a given database (and
thus SQL dialect). There are no-layout specific tweaks. These generic configuration
files still need to specify any database-specific SQL command sources but
the Java driver class defaults to the rdb.DriverGeneric<Layout>
.The most common situation is that the built-in Java driver classes are sufficient
but some tweaking is needed for a given database. In that case we use branch
(3) above and just provide a minimal database-wide configuration file. See the
Mysql.config
and Postgresql.config
files in the distribution.
This generic config file then points to a file of SQL commands which override
any of the default commands which are not suitable for the given database.
The config files mentioned above conform to the java.utils.Properties.load
definition (comments lines start with #, parameters specified one per line in
the form "param = value"). The Template.config file gives a full list
of all the properties that the built-in drivers currently use (see listing below).
In practice normally only the first few of these need setting.
# The name of the database being configured database = Foobase # The name of the jdbc driver class to be used jdbcDriver = org.foobase.Driver # The file of SQL definitions to use # Usually *_all.sql for most layouts and then specific variants for # specific layouts. sqlFile = etc/foobase_all.sql # Set to "true" if the driver should cache jdbc PreparedStatements for reuse. # Some drivers (postgresql?) do nothing special with PreparedStatements anyway so # this has no great performance benefit. Can potentially cause conflicts # with transactions for some divers but can also have 2-3x performance benefit. cachePreparedStatements = true # The maximum size of string literals that can be indexed. # Literals longer than this will be stored in a blob with just the head of the # literal indexed. maxLiteral = 32000 ################################################################### # Options below here should only be set for specific layout configs. # The name of the layout implemented # layout = all # The fully qualified name of the Java Driver class to be extended # baseClass = com.hp.hpl.mesa.rdf.jena.rdb.DriverGenericGeneric ################################################################### # Options below here should only be set for specific layout configs # and even then usually the values set by the base Java class is correct and # should not be changed unless you are sure. # The SQL type used for identifiers. This is typically "INTEGER" for normal layouts # and either CHAR(16) or CHAR(24) for hash layouts. Normally the value inherited from # the generic driver is correct. # idType = INTEGER # Set to "true" if the Java driver should allocate IDs explicitly. # Typically this is only false when insert procedures have been defined. # Or you are using auto-inc column types. # allocateID = true # Set to "true" if the Java driver should explicitly check for duplicates # before performing inserts. Typically this is only false if a full insert # procedure has been defined. # checkDuplicates = true # Set to "true" if the driver should use insertFooProc instead of insertFoo to # insert objects of type Foo (Namespace, Resource, Statement or Literal). # The insertFooProc operations are usually procedure calls that might do the # duplication check and ID allocation in line. See "allocateID" and "checkDuplicates". # insertByProcedure = false
All of the Java driver files in the distribution load in their SQL commands
from text files to allow customization. Like the *.config files described above
the SQL files are currently expected to be in files of the form:
etc/*.sql
somewhere on the classpath.
Each file defines the SQL for a set of named "operations" used by the Java driver files.
All of the supplied Java driver files load two SQL files, one default one for the type of layout and one database specific customization of that. Any operation defined in the more specific SQL file simply replaces the one defined in the generic version.
The syntax of the SQL driver files is crude. Comment lines start with #. Each SQL operation is defined by a block of SQL code. The first line of the block gives the operation name, all subsequent lines are the defining SQL code until a blank line which signals end of the block. Thus
# comment at start of line operationName1 SQL code line 1 ... SQL code last line # comment describing next operation operationName2 SQL code ...
The SQL code is typically a single SQL statement but some operations, specifically database initialization and cleanup may require a variable number of statements. To cater for this terminate each statement in those groups with the string ";;". Note that a single ";" is not used because these compound statements are often stored procedure definitions which tend to have embedded";" line terminators.
The full list of operations needed by the current drivers is shown below. The example code is the easiest guide to the calling format for these operations.
operations | block/single statement | meaning |
|
block | Create/delete the initial SQL tables for this layout |
initDBgenerators |
block | Create/delete any sequences or generators needed/possible in this database type |
initDBprocedures |
block | Create/delete any stored procedures needed (normally only applies to Proc layout types) |
getNamespace |
single | Return a namespace/resource/literal given its database ID |
getNamespaceID |
single | Return the database ID of a namespace/resource/literal given its string representation. |
allocateNamespaceID |
single | Allocate a new ID for storing a namespace/resource/literal. Not relevant if DB uses stored procedures for this or uses auto-inc index columns. |
insertNamespace |
single | Insert a new namespaces/resource/literal/literal-with-int-translation/statement into the database. |
|
single | Variant on the above which are expected to be references to stored procedures which may check for duplicates and do inline allocation of IDs. |
listSubjects |
single | list all resources which are subjects of some statement |
listNamespaces |
single | list the namespaces of all properties used in some statement |
deleteStatement |
single | delete a statement |
listModels |
single | Support for more than one model in a dB List all model names, lookup a model ID from its name, allocate a new model ID, record a new model name/ID pair. |
list listS listP listO listSP listPO listSO listSPO |
single | List all statements with the given combinations of subject, predicate and object values. |
mydatabase.config
" file from the
Template.config
and replace the database name and the name of
the SQL file appropriately.generic_generic.sql
to your mydatabase_all.sql
file and edit those to be conformant.The jena RDB package does provide support for regression testing. This isn't integrated into the mega jena regression test since you need to have appropriate databases up and running to run the tests.
The primary interface to the RDF regression tests is TestRDB
which
both exports a junit suite that you can use in larger tests and provides a main
entry point so you can interactively (or via scripts) run tests on specific
database/layout combinations. In order to use the tests you need to specify
appropriate database information using the following Java properties:
jena.rdb.test.dbbbase |
The base jdbc uri of the database |
jena.rdb.test.user |
The user name for logging in to the database. |
jena.rdb.test.password |
The password for that user. |
jena.rdb.test.dbfull |
The full jdbc uri of the database to use, this overrides the use of testrdf/jr* described below and is needed for databases like Oracle whose jdbc uri cannot be used as a base for a set of related database names. |
The RDB specific regression tests expect a database called "testrdf
"
to exist (this string is appended to the jena.rdb.test.dbbase value). This database
will be reused for all the internal tests and should be left clean.
TestRDB
will then run the standard jena regression tests on models
built using the database. If the layout mode is supposed to support more that
one model in a database it only needs a single database called "jr
"
(jena regression) and the four jena regression models will all be stored in
that. However, if the layout only supports a single model per database then
four databases are needed (jr1
through jr4
).
If the jena.rdb.test.dbfull parameter has been set then given database name is used for both the RDB core tests and the full jena regression test (multi-model layout). This workaround enables us to test Oracle configurations where the database uri does not end with the local database name but does not support regression testing of single model layouts.