[OTDev] OWL-DL performance/scalability problems

Mon Sep 6 10:33:05 CEST 2010

Christoph,

Are there options in Redland to set up prefixes in RDF?
It would look like
...
 xmlns:dataset="http://webservices.in-silico.ch/dataset"
...
<ot:Dataset rdf:about="dataset/112">

instead of "http://webservices.in-silico.ch/dataset/112"
everywhere. Prefixes can be defined for all objects.
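
As a rough illustration of what abbreviated URIs could save, the sketch below
uses Python's standard urllib.parse.urljoin to resolve a relative reference
such as dataset/112 against a base URI. In RDF/XML the base for rdf:about
values is normally declared with xml:base, while xmlns prefixes abbreviate
element and attribute names. The base URI and the count of 580 references are
assumptions for illustration only; nothing here is a Redland API call.

```python
from urllib.parse import urljoin

# Assumed base URI (the value an xml:base declaration would carry).
BASE = "http://webservices.in-silico.ch/"


def resolve(reference: str, base: str = BASE) -> str:
    """Resolve a relative URI reference against the base URI."""
    return urljoin(base, reference)


def bytes_saved(n_references: int, base: str = BASE) -> int:
    """Bytes saved when n_references absolute URIs are written relative to base."""
    return n_references * len(base.encode("utf-8"))


if __name__ == "__main__":
    # The relative form names the same resource as the absolute URI.
    print(resolve("dataset/112"))
    # Hypothetical count: one URI reference per compound in a 580-compound dataset.
    print(bytes_saved(580))
```

The saving is linear in the number of URI references, so it trims the file
size by a constant factor but would not by itself change how the
representation scales.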

Nina

On Mon, Sep 6, 2010 at 11:24 AM, Christoph Helma <helma at in-silico.ch> wrote:

> Dear all,
>
> Excerpts from Nina Jeliazkova's message of Fri Sep 03 16:22:32 +0200 2010:
> > In Jena one can set options for the triple memory storage - e.g.
> >
> > ModelFactory.createOntologyModel( OntModelSpec)  ,
> >
> http://jena.sourceforge.net/javadoc/com/hp/hpl/jena/ontology/OntModelSpec.html
> >
> > These differ in memory efficiency and reasoning capabilities.   Perhaps
> > Redland has something similar to use?
>
> AFAIK Redland does not have such options (apart from choosing a triple
> store) - reasoning is done by a separate library (Rasqal).
>
> > The dataset size you are reporting seems rather small; on the other hand
> > in-memory storage has limits in any representation, just the boundaries
> > are different.
>
> We can switch to another triple store, but that does not solve the
> general scalability problem.
>
> Contrary to my initial assumptions, I do not think that the Redland
> libraries are the cause of our problems. Based on my measurements I am
> pretty convinced that our OWL-DL representation does not scale well,
> especially when it comes to complex features that require tuples
> (computing times seem to correspond to the resulting file sizes).
>
> > >    I have the impression that our OWL-DL does not scale well, especially
> > > for tuples. Here are some measured figures for the rdf/yaml ratio (maybe
> > > one of the computer scientists can have a closer look):
> > >
> > >      small dataset (85 compounds), 1 feature/compound:        6.5
> > >      medium dataset (580 compounds), 1 feature/compound:      7.4
> > >      small dataset (85 compounds), 56 features as tuples:    32
> > >      medium dataset (580 compounds), 55 features as tuples: 170
> > >
>
> I have also tried to switch to another library (RDF.rb), which did not
> resolve the problem.
>
> So we are either making a mistake in our (IST/ALU) OWL-DL implementation
> (any help is greatly appreciated - maybe the redundant representation of
> features is the culprit) or our OpenTox OWL in general does not scale well
> for larger datasets (especially with complex features).
>
> If you want to have a look:
> http://webservices.in-silico.ch/dataset/112
> (cached to save you the timeouts), Accept:application/x-yaml or
> http://webservices.in-silico.ch/dataset/112.yaml gives you our internal
> representation.
>
> Thanks!
> Christoph
>
> > Christoph,
> >
> >
> > Nina
> >
> > On Fri, Sep 3, 2010 at 4:37 PM, Christoph Helma <helma at in-silico.ch> wrote:
> >
> > > Dear all,
> > >
> > > I have been investigating several problems that we had with creating and
> > > serving OWL-DL representations:
> > >
> > > - slow response
> > > - gateway timeouts
> > > - memory allocation problems
> > >
> > > All of these problems depend of course on the size and complexity of the
> > > datasets. Most problematic are datasets with tuples; here we run into
> > > trouble even for medium-sized datasets (several hundred compounds)
> > > with fewer than 100 features. It took e.g. 20 minutes to create
> > > http://webservices.in-silico.ch/dataset/112.rdf. If datasets grow
> > > larger, we may run into memory allocation problems. All of this can be
> > > quite annoying, because
> > >
> > > - long running processes eat CPU time, slowing down other processes
> > > - tasks may time out before processes have finished
> > > - users expect a response but do not get one
> > > - users get impatient and restart processes, which slows down the system
> > >  even more
> > > - memory allocation failures may crash the dataset service
> > > - ...
> > >
> > > What is probably _not_ responsible:
> > >
> > >  RDF/XML representation: Same problem for turtle, json, triples
> > >  Iteration over our internal data structures: Takes only 0.3% of the
> > > total processing time
> > >  Redland libraries: I have tried another library (not many choices in
> > > Ruby); it takes 5 times longer than with Redland.
> > >
> > > What _could_ be responsible:
> > >
> > >  Wrong/inefficient OWL-DL representation: Can one of the OWL experts
> > > please have a look at e.g. http://webservices.in-silico.ch/dataset/112.rdf?
> > >
> > >  OpenTox OWL-DL/Triple representation:
> > >
> > >  Symptoms:
> > >    Our internal representation (
> > > http://webservices.in-silico.ch/dataset/112.yaml) needs 90K (still
> > > keeping redundant information for efficient searches), OWL-DL as RDF/XML
> > > needs 15M (which is still 6.1M in Turtle) for the same information.
> > >    I have the impression that our OWL-DL does not scale well, especially
> > > for tuples. Here are some measured figures for the rdf/yaml ratio (maybe
> > > one of the computer scientists can have a closer look):
> > >
> > >      small dataset (85 compounds), 1 feature/compound:        6.5
> > >      medium dataset (580 compounds), 1 feature/compound:      7.4
> > >      small dataset (85 compounds), 56 features as tuples:    32
> > >      medium dataset (580 compounds), 55 features as tuples: 170
> > >
> > > Possible solutions:
> > >
> > >  Curing symptoms:
> > >
> > >    Lazy generation/caching of OWL-DL representations: Implemented, but you
> > > might still get timeouts at the first request or have to wait a long time
> > > for OWL-DL to finish; does not solve memory allocation problems
> > >    Use a persistent store instead of memory store: might solve memory
> > > allocation problems, but will slow down things even further
> > >    Get more/faster hardware
> > >
> > >  Curing the cause (I am at a loss here, please help):
> > >
> > >    Tell us what goes wrong in our OWL-DLs
> > >    Improve scalability of OpenTox OWL-DL, especially with respect to
> > > tuples (I definitely need a method to represent "complex" features)
> > >
> > > IMHO it does not make much sense to proceed with further developments
> > > until we have resolved this substantial issue. I am looking forward to
> > > hearing your ideas!
> > >
> > > Best regards,
> > > Christoph
> > >
> > > PS: Martin mentioned that he has also experienced performance problems
> > > in accessing the parsed OWL-DL data structure (parsing the file seems to
> > > be ok) - also for external (i.e. non IST/ALU) datasets. I have always
> > > blamed the Redland libraries, but maybe this is a related issue.
> > > _______________________________________________
> > > Development mailing list
> > > Development at opentox.org
> > > http://www.opentox.org/mailman/listinfo/development
> > >
> >
>
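
The size figures quoted in this thread can be cross-checked with a little
arithmetic. The sketch below uses only numbers reported above (90K YAML, 15M
RDF/XML, 6.1M Turtle, and the rdf/yaml ratios 32 and 170 for the tuple
datasets). It assumes that K and M are decimal units and that the YAML size
grows roughly linearly with the number of compounds; neither is stated in the
original messages.

```python
# Sizes reported for dataset 112; decimal units are an assumption.
YAML_BYTES = 90e3
RDFXML_BYTES = 15e6
TURTLE_BYTES = 6.1e6

# Size of each RDF serialization relative to the internal YAML representation.
rdfxml_ratio = RDFXML_BYTES / YAML_BYTES
turtle_ratio = TURTLE_BYTES / YAML_BYTES

# Reported rdf/yaml ratios for the two datasets with features as tuples.
SMALL_COMPOUNDS, SMALL_RATIO = 85, 32
MEDIUM_COMPOUNDS, MEDIUM_RATIO = 580, 170

# How much the dataset grew, and how much the rdf/yaml ratio grew with it.
compound_growth = MEDIUM_COMPOUNDS / SMALL_COMPOUNDS
ratio_growth = MEDIUM_RATIO / SMALL_RATIO

# If YAML size scaled linearly with compounds (assumption), the RDF size
# would have grown by the product of the two factors.
rdf_growth_if_yaml_linear = compound_growth * ratio_growth

if __name__ == "__main__":
    print(f"RDF/XML to YAML ratio: {rdfxml_ratio:.1f}")
    print(f"Turtle to YAML ratio:  {turtle_ratio:.1f}")
    print(f"compound growth:       {compound_growth:.2f}x")
    print(f"ratio growth:          {ratio_growth:.2f}x")
    print(f"implied RDF growth:    {rdf_growth_if_yaml_linear:.1f}x")
```

The RDF/XML to YAML ratio computed this way is close to the 170 reported for
the medium tuple dataset. A ratio that itself grows with the number of
compounds would be consistent with RDF output growing faster than linearly for
tuple features, although two data points cannot establish that.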