Description of the database implementation

By: Jonas Liljegren

In context of: storing RDF in a relational database
Current status here

Data model

The data model is designed to save as much space as possible and to implement every detail in the RDF and schema specification. I am
planning to solve the speed problem with intelligent use och caching and maby even second order databases, generated from the core database.

In a big RDF database, there will be lots of indirect resources. These resources doesn't have to be stored as separate records in the resource table. If that is the case, they are taken to be local recources.

Every statement is in it self a resource. The statement is represented as a resource in the way used with reified statements. I am saving a lot of space (in fact: infinit space) in not storing the model of the statement. I am letting every statement have a URI based on the statement ID and the local URI. (I now understand that I must also store the full URI base for the statements imported from foreign bases.)

An effect of using the statement table for representing both reified statements and normal statements, is that I must differ between statemnets that only is reefied, and not "facts". That is the role of the FACT field.

Everything has an URI. Statements has an URI. My conclusion is that even the internal literals can be thought of as resources. The RDF spec makes a very clear distinction between literals and resources. But it makes for much more symetry if you could store the URI content in the literal. And you don't have to have an extra field in the STATEMENT table to tell if the object is an literal or resource. And more: both types can have a language tagging. In the database; the object is taken to be a literal if and only if the VALUE field is defined. Discussion here:

http://paranormal.o.se/perl/proj/rdf/schema_editor/letters/literals.txt

I was of course very temted to breake out the URI to a separate table for namespaces. It would save much space. But I can't see how to do it, if I want to confirm to the AboutEachPrefix:

http://paranormal.o.se/perl/proj/rdf/schema_editor/letters/namespaces.txt

http://paranormal.o.se/perl/proj/rdf/schema_editor/letters/aboutEachPrefix.txt

aboutEachPrefix (AEP) resources is marked in the RESOURCES table. The PREFIX table is (going to be) used as a chache for mapping from a resouce to a AEP. If I want to know all attributes of a resource, I would have to both examine the STATEMENTS table for SUBJECTS with that id. I would also have to get a list of resource IDs from the PREFIX table - one for AEP that includes the resource - and check the STATEMENTS table for each of the collected IDs. The PREFIX table would have to be updated for every added resource, by getting all AEPs and for each of them, see if the added resource has that prefix. A total reconstruction can be made by, for every AEP, do a SQL search, utilizing the SQL "like 'namespace%'" construct. That type of search is a MUST for solving the AEP. And THAT is not possible if a separate namespace table is used INSTEAD of a full URI. It would be nice if I wouldn't have to do this. But You can't just know what part of an URI is suppoesed to be it's namespace.

Since both the STATEMENTS table and the RESOURCES table both represent resources, every ID is unique. There is no statement that has the same ID as a record in the resource table. They use the same sequence.

Caching

The RDF Schema specification describes some objects and properties. These are represented in my "RDF Schema editor", in the same way as all other data; with triples. But the properties for those objects will not change. They can therefore be chached indefinitly.

The main content of the database will represent the changing and growing knowledge of the community. What is to be considered 'fact' will therefore change with time. Objects will get new properties, etc. This means that those data can't be cached indefinitly.

The schema editor will often request the same info many times in a single request. (Especially in a sorting function. ;-) Caching within a request could therefore be done.

The best fing would be to tie the cache to namespaces. Every namespace would have a change date and a cahce. The date would be checked at the start of every request and the whole cache would be flushed if updated. This handling of the cache would have to solve the same problem as that of AEP.

You don't HAVE to have the latest version of the data. Even if data changes, the cahe could be kept some time depending of your tolerance of getting a little old data. If a person adds new data, he would certainly want to see his addition take effect. So for this person, the chache would be flushed for the namespace involved. I also think that the user would like to flush the cache on will.

I have a couple of hundred objects in my database. The most complex page is a autogenerated form of every property you can add to an object, and all the resources you can chose from (the range), for each property. You can add a bag or multipple values for a property, or a combination of both. (The goal is to allow EVERYTHING that RDF allows.) Example:

http://paranormal.o.se/perl/proj/rdf/schema_editor/devel/latest/rdf.cgi?state=edit_form&uri=%3AName

I am using mod_perl (and Apache::Registry) for caching the compiled classes and utilize a constant link to the postgreSQL database (using perl DBI).

The above page would take above 30 seconds to generate. A large part of this time was used to sort the values of the drop-down boxes. But every list involves recursive functions to get all the resources for a range.

By caching some of the method return values, I could cut the time to under 10 seconds. But this cache is generated from scaratch for every single request. And it is still ineffective to sort resources on properties that has to be looked up from either the database or the cache. (That should be optimized.)

Of course, I haven't yet constructed any indices. But if I had them, maby I wouldn't give as much thought to the caching and other speed gains. I could also use limited char fields. But how do I know how long an URI is? Very old browses has an URL limit of 255 chars. But that is not enough. The PostgreSQL "text" datatype can be of any length.

The page above represents an extreme circumstance. A normal use of the RDF database would be to extract some connections for a specific object, not enumerate every single object and property multipple times. This would be a more normal (dynamic) form:

http://paranormal.o.se/perl/proj/rdf/schema_editor/devel/latest/rdf.cgi?state=edit_form&uri=%3A8