Storing RDF in a relational database

This page summarizes some current approaches to storing RDF in a relational database. THIS IS A REQUEST FOR COMMENTS, so please do contribute your ideas (www-rdf-interest@w3.org)! To goal is to come up with the best way of storing RDF in a relational database, or identify a set of solutions that are suitable for particular needs.

Motivation

We need to be able to persistently store and manipulate (large amounts of) RDF data. One of the alternatives to do that is to use the relational database technology. A major advantage of this approach is that is provides a scalable off-the-shelf solution.

Criteria

A non-exhaustive list of the criteria to be considered for the database schema design (in no particular order): The proposals shown below satisfy the above criteria in different ways. The maintainer of this page is particularly interested in addressing the scalability issue. Please submit other criteria reflecting your needs!

Publications

The following paper discusses a vertical scheme for storing and querying sparse relational tables, which is similar in spirit to some of the proposals below:

R. Agrawal, A. Somani, and Y. Xu: Storage and Querying of E-Commerce Data, Proc. VLDB 2001, Roma, Italy, available as http://www.vldb.org/conf/2001/P149.pdf

Database schemas for storing RDF

(most recently contributed first)

Explicit models

Contributed by: Brian McBride
Date: May 11, 2000
Brief summary: This representation treats models explicitly and makes use of views
Database schema (Oracle) and description by the author:
                sql = "CREATE TABLE RDFRESOURCE"
                        + "("
                        +   "Id INTEGER not null primary key,"
                        +   "NS INTEGER not null,"
                        +   "RoName varchar(255)"
                        + ")";

The resource table holds all resources.  Id is an internal
identifier field.  NS is a pointer to an entry in the namespace
table giving the namespace for this resource.  RoName should be
called 'localname' and is the local name component of the Qname.


                sql = "CREATE TABLE RDFNameSpace"
                        + "("
                        +   "Id INTEGER not null primary key,"
                        +   "NsName varchar(255)"
                        + ")";

The namespace table.
                               
                sql = "CREATE TABLE RDFLiteral"
                        + "("
                        +   "Id INTEGER not null primary key,"
                        +   "VAL varchar (4000)"
                        + ")";

Table of literals.  4000 character limit is enough for current purposes.
                
                
                sql = "CREATE TABLE RDFStatement"
                        + "("
                        +   "Id INTEGER not null primary key,"
                        +   "Subject INTEGER not null,"
                        +   "Predicate INTEGER not null,"
                        +   "ObjResource INTEGER not null,"
                  +   "ObjLiteral INTEGER not null,"
                        +   "Res CHAR(1) not null"
                        + ")";

Statement table.  Originally had a single object field which
would have the ID of either an object or a literal.  A
complicated JOIN expression is used to enumerate a list of
statements and was behaving unexpectedly.  My inexperience
with SQL probably.  This works and feels more 'correct'.  Res is
a flag to say whether the object is a resource or literal.
                
                sql = "CREATE TABLE RDFModel"
                        + "("
                        +   "ModelId INTEGER not null,"
                        +   "Statement INTEGER not null,"
                        +   "Asserted CHAR(1) not null,"
                        +   "Reified CHAR(1) not null,"
                        +   "primary key(ModelId, Statement)"
                        + ")";

The database can hold multiple models.  This table keeps a list of 
the statements in each model.  Originally this table was combined with
the statement table but that didn't work when it came to
implementing set operations. 

Asserted is a flag which says this statement is asserted in this model.
Reified is a flag which says this statement is reified in this model.
The latter is a hook for future implementation.  Reification isn't 
implemented so this approach is untested.

Each model is a resource and has an entry in the resource table.
The ModelId field is the Id of that resource.  Thus
statements can be made about the model.  There is a class for models
and it is possible to list the schemas that need to be loaded when
validating a model.
                
                sql = "CREATE OR REPLACE VIEW RootModel"
                     +  " AS SELECT UNIQUE Id, Subject, Predicate,
ObjResource, ObjLiteral, Res, Asserted, Reified"
                     +    " FROM RDFModel, RDFStatement"
                     +    " WHERE RDFModel.Statement = RDFStatement.Id";

This creates a view of an artificial model which contains all the statements
in the database, whatever 'actual' model they are in.

Views are heavily used.  Every model is a view which is a query onto either
another model view or the RootModel view.  So every time any operation in
the Stanford API which creates a model is called, e.g. any query, a new view
is created in the database.  This must lead to some bizarre queries being
fed to the database query engine, and I'm relying on its query optimiser to
sort those out.  There is also a problem with stale views being left around
in the database should an application crash.

                sql = "CREATE TABLE RDFKEYS"
                        + "("
                        +   "TableName char(10) not null primary key,"
                        +   "Key INTEGER not null"
                        + ")";
                sql = "INSERT INTO RDFKEYS (TableName, Key)
VALUES('Resource', 0)";
                sql = "INSERT INTO RDFKEYS (TableName, Key)
VALUES('NameSpace', 0)";

A key generation table.  Probably could use a sequencer, but they seem to be
a bit database specific and I started out at least with the intent not to
be. Only really need one generator though.

There was a question about schemas.  Right now, schemas are held as models
and can be fed into the schema validator.  No attempt has been made to use a
schema to define a more specific database structure.

Original posting:http://lists.w3.org/Archives/Public/www-rdf-interest/2000May/0094.html

Specs loyal

Contributed by: Jonas Liljegren
Brief summary: An attempt to provide a compact way of implementing every detail in the RDF model and schema specs.
Database schema (Postgres): CREATE TABLE statements
(
    id                   int,
    pred                 int,
    subj                 int,
    obj                  int,
    fact                 int    -- 0 if only a reified statement
);

-- If a statement referes to a nonexistent resource, it is taken to be
-- an internal statement with the name local#nr, there local is the
-- locale namespace and nr is the id number. But if there is a
-- statement with the same name, it is taken to be a reference to the
-- reified statement. 

CREATE TABLE resources
(
    id                   int,
    uri                  text,
    value                text,
    isprefix             int,
    lang                 text
);

CREATE TABLE prefix
(
    from                 int,
    to                   int
);

CREATE SEQUENCE internal_id;

Description: by the author

Hashed with origin

Contributed by: Sergey Melnik
Brief summary: In this approach the main table triples stores only the CRC64 hash values of the resources and literals. Moreover, every triple has an associated hash value of the URI where it came from (model). Resource URIs are stored in the table resources, literals are stored in literals. Resources themselves are split into namespaces and names, the former being stored as a CRC64 hashes. This scalability of this solution has been evaluated against the Open Directory data using MySQL. The table models is optional in the sense that the metadata about models could be stored in triples themselves.
Database schema (MySQL): CREATE TABLE triples (
  model bigint(20) DEFAULT '0' NOT NULL,
  subject bigint(20) DEFAULT '0' NOT NULL,
  predicate bigint(20) DEFAULT '0' NOT NULL,
  object bigint(20) DEFAULT '0' NOT NULL,
  objtype tinyint(3) unsigned DEFAULT '0' NOT NULL,
  KEY idx_subject_predicate (subject,predicate),
  KEY idx_model (model),
  KEY idx_object_predicate (object,predicate)
);

CREATE TABLE resources (
  hash bigint(20) DEFAULT '0' NOT NULL,
  ns bigint(20) DEFAULT '0' NOT NULL,
  name varchar(254) DEFAULT '' NOT NULL,
  PRIMARY KEY (hash)
);

CREATE TABLE namespaces (
  hash bigint(20) DEFAULT '0' NOT NULL,
  value varchar(254) DEFAULT '' NOT NULL,
  PRIMARY KEY (hash)
);

CREATE TABLE literals (
  hash bigint(20) DEFAULT '0' NOT NULL,
  value longtext NOT NULL,
  PRIMARY KEY (hash)
);

CREATE TABLE models (
  id bigint(20) DEFAULT '0' NOT NULL,
  uri varchar(254) DEFAULT '' NOT NULL,
  size int(11),
  PRIMARY KEY (id)
);

Usage: A query selecting all triples with the given (precomputed) subject, predicate and object looks like this:

SELECT m.uri, t.objtype, concat(n1.value,r1.name) as subj,
       concat(n2.value,r2.name) as pred, concat(n3.value,r3.name), l.value
FROM triples t, models m, resources r1, resources r2, namespaces n1, namespaces n2
     LEFT JOIN literals l ON t.object=l.hash
     LEFT JOIN resources r3 ON t.object=r3.hash
     LEFT JOIN namespaces n3 ON r3.ns=n3.hash
WHERE t.subject=r1.hash AND r1.ns=n1.hash AND t.predicate=r2.hash AND
      r2.ns=n2.hash AND m.id=t.model AND
      t.subject=XXX AND t.predicate=YYY AND t.object==ZZZ

To-be-named

To be contributed by: Eric Prudhomeux
Brief summary: "Clever mappings into SQL. Algernon-based query language that is rewritten into SQL query or into in-memory
graph queries."
Database schema:

The naive approach

Contributed by: Eric Miller
Brief summary: This is probably the simplest way of dealing with RDF one could think of. Naive doesn't mean bad. Many applications that need a simple RDF store could do with this solution.
Database schema: CREATE TABLE triple (
  property varchar(255),
  resource varchar(255),
  value blob,
  hint char(1)
);

If you have an RDBMS solution for your RDF data, please contribute!



Sergey Melnik, May 11, 2000
Last modified: Dec 3, 2001