Storing RDF in a relational database

This page summarizes some current approaches to storing RDF in a relational database. THIS IS A REQUEST FOR COMMENTS, so please do contribute your ideas (www-rdf-interest@w3.org)! To goal is to come up with the best way of storing RDF in a relational database, or identify a set of solutions that are suitable for particular needs.

Motivation

We need to be able to persistently store and manipulate (large amounts of) RDF data. One of the alternatives to do that is to use the relational database technology. A major advantage of this approach is that is provides a scalable off-the-shelf solution.

Criteria

A non-exhaustive list of the criteria to be considered for the database schema design (in no particular order):

Scalability: can we store and query 1B+ triples?
Querying: what kind of queries are supported? can they be easily formulated and processed?
Efficiency: how expensive are the queries? what are the costs of delivering the result?
Optimization: how can we handle reification?
Organization: what is the overhead associated with storing the data? can we distinguishably mix RDF models still being able to determine where the triples come from?

The proposals shown below satisfy the above criteria in different ways. The maintainer of this page is particularly interested in addressing the scalability issue. Please submit other criteria reflecting your needs!

Publications

The following paper discusses a vertical scheme for storing and querying sparse relational tables, which is similar in spirit to some of the proposals below:

R. Agrawal, A. Somani, and Y. Xu: Storage and Querying of E-Commerce Data, Proc. VLDB 2001, Roma, Italy, available as http://www.vldb.org/conf/2001/P149.pdf

Database schemas for storing RDF

(most recently contributed first)

Explicit models

Contributed by:	Brian McBride
Date:	May 11, 2000
Brief summary:	This representation treats models explicitly and makes use of views
Database schema (Oracle) and description by the author:	sql = "CREATE TABLE RDFRESOURCE" + "(" + "Id INTEGER not null primary key," + "NS INTEGER not null," + "RoName varchar(255)" + ")"; The resource table holds all resources. Id is an internal identifier field. NS is a pointer to an entry in the namespace table giving the namespace for this resource. RoName should be called 'localname' and is the local name component of the Qname. sql = "CREATE TABLE RDFNameSpace" + "(" + "Id INTEGER not null primary key," + "NsName varchar(255)" + ")"; The namespace table. sql = "CREATE TABLE RDFLiteral" + "(" + "Id INTEGER not null primary key," + "VAL varchar (4000)" + ")"; Table of literals. 4000 character limit is enough for current purposes. sql = "CREATE TABLE RDFStatement" + "(" + "Id INTEGER not null primary key," + "Subject INTEGER not null," + "Predicate INTEGER not null," + "ObjResource INTEGER not null," + "ObjLiteral INTEGER not null," + "Res CHAR(1) not null" + ")"; Statement table. Originally had a single object field which would have the ID of either an object or a literal. A complicated JOIN expression is used to enumerate a list of statements and was behaving unexpectedly. My inexperience with SQL probably. This works and feels more 'correct'. Res is a flag to say whether the object is a resource or literal. sql = "CREATE TABLE RDFModel" + "(" + "ModelId INTEGER not null," + "Statement INTEGER not null," + "Asserted CHAR(1) not null," + "Reified CHAR(1) not null," + "primary key(ModelId, Statement)" + ")"; The database can hold multiple models. This table keeps a list of the statements in each model. Originally this table was combined with the statement table but that didn't work when it came to implementing set operations. Asserted is a flag which says this statement is asserted in this model. Reified is a flag which says this statement is reified in this model. The latter is a hook for future implementation. Reification isn't implemented so this approach is untested. Each model is a resource and has an entry in the resource table. The ModelId field is the Id of that resource. Thus statements can be made about the model. There is a class for models and it is possible to list the schemas that need to be loaded when validating a model. sql = "CREATE OR REPLACE VIEW RootModel" + " AS SELECT UNIQUE Id, Subject, Predicate, ObjResource, ObjLiteral, Res, Asserted, Reified" + " FROM RDFModel, RDFStatement" + " WHERE RDFModel.Statement = RDFStatement.Id"; This creates a view of an artificial model which contains all the statements in the database, whatever 'actual' model they are in. Views are heavily used. Every model is a view which is a query onto either another model view or the RootModel view. So every time any operation in the Stanford API which creates a model is called, e.g. any query, a new view is created in the database. This must lead to some bizarre queries being fed to the database query engine, and I'm relying on its query optimiser to sort those out. There is also a problem with stale views being left around in the database should an application crash. sql = "CREATE TABLE RDFKEYS" + "(" + "TableName char(10) not null primary key," + "Key INTEGER not null" + ")"; sql = "INSERT INTO RDFKEYS (TableName, Key) VALUES('Resource', 0)"; sql = "INSERT INTO RDFKEYS (TableName, Key) VALUES('NameSpace', 0)"; A key generation table. Probably could use a sequencer, but they seem to be a bit database specific and I started out at least with the intent not to be. Only really need one generator though. There was a question about schemas. Right now, schemas are held as models and can be fed into the schema validator. No attempt has been made to use a schema to define a more specific database structure.
Original posting:	http://lists.w3.org/Archives/Public/www-rdf-interest/2000May/0094.html

Specs loyal

Contributed by:	Jonas Liljegren
Brief summary:	An attempt to provide a compact way of implementing every detail in the RDF model and schema specs.
Database schema (Postgres):	`CREATE TABLE statements` `(` `id int,` `pred int,` `subj int,` `obj int,` `fact int -- 0 if only a reified statement` `);` `-- If a statement referes to a nonexistent resource, it is taken to be` `-- an internal statement with the name local#nr, there local is the` `-- locale namespace and nr is the id number. But if there is a` `-- statement with the same name, it is taken to be a reference to the` `-- reified statement.` `CREATE TABLE resources` `(` `id int,` `uri text,` `value text,` `isprefix int,` `lang text` `);` `CREATE TABLE prefix` `(` `from int,` `to int` `);` `CREATE SEQUENCE internal_id;`
Description:	by the author

Hashed with origin

Contributed by:	Sergey Melnik
Brief summary:	In this approach the main table `triples` stores only the CRC64 hash values of the resources and literals. Moreover, every triple has an associated hash value of the URI where it came from (`model`). Resource URIs are stored in the table `resources`, literals are stored in `literals`. Resources themselves are split into namespaces and names, the former being stored as a CRC64 hashes. This scalability of this solution has been evaluated against the Open Directory data using MySQL. The table `models` is optional in the sense that the metadata about models could be stored in `triples` themselves.
Database schema (MySQL):	`CREATE TABLE triples (` `model bigint(20) DEFAULT '0' NOT NULL,` `subject bigint(20) DEFAULT '0' NOT NULL,` `predicate bigint(20) DEFAULT '0' NOT NULL,` `object bigint(20) DEFAULT '0' NOT NULL,` `objtype tinyint(3) unsigned DEFAULT '0' NOT NULL,` `KEY idx_subject_predicate (subject,predicate),` `KEY idx_model (model),` `KEY idx_object_predicate (object,predicate)` `);` `CREATE TABLE resources (` `hash bigint(20) DEFAULT '0' NOT NULL,` `ns bigint(20) DEFAULT '0' NOT NULL,` `name varchar(254) DEFAULT '' NOT NULL,` `PRIMARY KEY (hash)` `);` `CREATE TABLE namespaces (` `hash bigint(20) DEFAULT '0' NOT NULL,` `value varchar(254) DEFAULT '' NOT NULL,` `PRIMARY KEY (hash)` `);` `CREATE TABLE literals (` `hash bigint(20) DEFAULT '0' NOT NULL,` `value longtext NOT NULL,` `PRIMARY KEY (hash)` `);` `CREATE TABLE models (` `id bigint(20) DEFAULT '0' NOT NULL,` `uri varchar(254) DEFAULT '' NOT NULL,` `size int(11),` `PRIMARY KEY (id)` `);`
Usage:	A query selecting all triples with the given (precomputed) subject, predicate and object looks like this: `SELECT m.uri, t.objtype, concat(n1.value,r1.name) as subj,` `concat(n2.value,r2.name) as pred, concat(n3.value,r3.name), l.value` `FROM triples t, models m, resources r1, resources r2, namespaces n1, namespaces n2` `LEFT JOIN literals l ON t.object=l.hash` `LEFT JOIN resources r3 ON t.object=r3.hash` `LEFT JOIN namespaces n3 ON r3.ns=n3.hash` `WHERE t.subject=r1.hash AND r1.ns=n1.hash AND t.predicate=r2.hash AND` `r2.ns=n2.hash AND m.id=t.model AND` `t.subject=XXX AND t.predicate=YYY AND t.object==ZZZ`

To-be-named

To be contributed by:	Eric Prudhomeux
Brief summary:	"Clever mappings into SQL. Algernon-based query language that is rewritten into SQL query or into in-memory graph queries."
Database schema:

The naive approach

Contributed by:	Eric Miller
Brief summary:	This is probably the simplest way of dealing with RDF one could think of. Naive doesn't mean bad. Many applications that need a simple RDF store could do with this solution.
Database schema:	`CREATE TABLE triple (` `property varchar(255),` `resource varchar(255),` `value blob,` `hint char(1)` `);`

If you have an RDBMS solution for your RDF data, please contribute!

Sergey Melnik, May 11, 2000
Last modified: Dec 3, 2001