[OTDev] OWL-DL performance/scalability problems

Christoph Helma helma at in-silico.ch
Wed Sep 8 13:49:31 CEST 2010


Excerpts from Egon Willighagen's message of Mon Sep 06 18:29:44 +0200 2010:
> On Mon, Sep 6, 2010 at 6:27 PM, Christoph Helma <helma at in-silico.ch> wrote:
> >> Otherwise, I am still not sure I understand where the exact bottleneck
> >> is... this exercise seems to indicate it is the volume of the RDF/XML
> >> serialization...
> >
> > I have the impression that the bottleneck is the insertion of statements
> > into the RDF graph, not serialization or the volume of data. I use
> > the volume of data only as indicator for the size of the RDF graph.
> >
> > BTW: Who knows the (theoretical) complexity of inserting statements into a
> > RDF graph?
> 
> Indeed. There might be indexing ongoing in the background...
> 
> >> What generates the data and how
> >> to do create the RDF? Would it be possible to skip RedLand and any
> >> other RDF library at all?
> >
> > In principle yes, but I would hate to reinvent the wheel and write
> > RDF/XML "by hand".
> 
> Fair, but a simple XML library would get you very far, and would allow
> a streaming approach...

Writing our own serializer (N-Triples for the moment, because it was
easiest to implement for testing) did the trick, reducing e.g. processing
time from ~20 minutes to ~10 seconds (but only with the fast in-place
string concatenation operator ("<<"); with the slow one ("+="), processing
took even longer than with Redland ;-)).
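A minimal sketch of what such a hand-rolled N-Triples serializer might look like (this is illustrative, not the actual OpenTox implementation; the `to_ntriples` name and the example URIs are made up). The crucial detail is `String#<<`, which appends in place, versus `+=`, which allocates a fresh string for every statement and turns serialization into a quadratic operation:

```ruby
# Illustrative hand-rolled N-Triples serializer (not the actual OpenTox code).
# String#<< appends to the existing buffer in place; += would create a new
# string on every iteration, which is what made processing slower than Redland.
def to_ntriples(statements)
  out = ""
  statements.each do |subject, predicate, object|
    out << "<#{subject}> <#{predicate}> \"#{object}\" .\n"
  end
  out
end

triples = [
  ["http://example.org/compound/1", "http://example.org/feature/logP", "2.3"],
  ["http://example.org/compound/2", "http://example.org/feature/logP", "0.7"]
]
puts to_ntriples(triples)
```

Skipping the RDF library entirely like this also allows a streaming approach: each statement can be written out as soon as it is generated, with no in-memory graph at all.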

I would like to see whether other implementations (e.g. Jena) perform much
more efficiently than Redland. Maybe we can run some benchmarks in Rhodes.

We have also observed that processing time depends to a large extent on
the depth of the RDF tree, not only on the size of the dataset. And
OWL-DL (especially with tuples) of course produces larger and deeper trees
than "plain" RDF.
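To make the size/depth difference concrete, here is a rough sketch comparing a single measurement encoded as one plain triple versus a reified tuple with an intermediate node (the `ex:` vocabulary and helper names below are invented for illustration, not the actual OpenTox/OWL-DL schema):

```ruby
# Illustration only: vocabulary and structure are hypothetical, not the
# real OpenTox schema. A plain encoding stores one value as one triple;
# a tuple-style (reified) encoding needs an intermediate node and several
# triples per value, so the graph grows larger and one level deeper.

def plain_triples(compound, feature, value)
  [[compound, feature, value]]
end

def tuple_triples(compound, feature, value, node)
  [
    [compound, "ex:values", node],          # compound -> intermediate tuple node
    [node, "rdf:type", "ex:FeatureValue"],  # type of the intermediate node
    [node, "ex:feature", feature],          # which feature was measured
    [node, "ex:value", value]               # the measured value itself
  ]
end

puts plain_triples("ex:c1", "ex:logP", "2.3").size   # 1 triple, depth 1
puts tuple_triples("ex:c1", "ex:logP", "2.3", "_:t1").size  # 4 triples, depth 2
```

Multiplied over every data point in a large dataset, this factor-of-four blow-up (plus the extra nesting level) is consistent with the observation that tree depth, not just dataset size, drives processing time.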

If the complexity of datasets really has such a dramatic impact on
processing time and resource consumption, I am not very optimistic that we
can process complex biological data with OWL-DL. Consider e.g. HTS gene
expression datasets linked with experimental conditions, data analysis
procedures, phenotype or other -omics measurements, pathway information,
... This would result in very large, deeply linked datasets.

I would be very happy if one of the computer scientists could have a look
at the theoretical properties and scalability of the data structures that
are used to represent OWL-DL (not OWL-DL as a knowledge representation
language). If there are theoretical limitations, we would have to think
about alternatives (not sure what they could be ...)

Christoph
