[OTDev] Experiments with RDF

Nina Jeliazkova jeliazkova.nina at gmail.com
Fri Oct 1 16:36:44 CEST 2010


On Fri, Oct 1, 2010 at 5:12 PM, chung <chvng at mail.ntua.gr> wrote:

> On Fri, 2010-10-01 at 13:40 +0300, Nina Jeliazkova wrote:
>
> > Hi Pantelis,
> >
> > On Fri, Oct 1, 2010 at 1:20 PM, chung <chvng at mail.ntua.gr> wrote:
> >
> > > Hi Nina,
> > >  Thanks for your input!
> > >
> > >
> > > On Fri, 2010-10-01 at 08:32 +0300, Nina Jeliazkova wrote:
> > >
> > > > Hi Pantelis, All,
> > > >
> > > >
> > > > On Thu, Sep 30, 2010 at 9:33 PM, chung <chvng at mail.ntua.gr> wrote:
> > > >
> > > > > Hi all,
> > > > >  During a round (rectangular, in fact) table discussion in Rhodes we
> > > > > questioned the efficiency of web services based on RDF and in
> > > > > particular its OWL-DL variant. I gathered some statistics using
> > > > > ToxOtis while experimenting with downloading and parsing datasets.
> > > > > We've also tested the performance of ToxOtis in converting dataset
> > > > > objects into Weka objects (weka.core.Instances); the latter are
> > > > > useful to users of Weka. These are preliminary results and we must
> > > > > not jump to conclusions, but we can start a discussion around some
> > > > > performance issues. Java developers may use ToxOtis as a kind of
> > > > > client-side profiler for their services. Find attached a draft
> > > > > report that attempts to correlate the size of a dataset with the
> > > > > computational effort needed for its parsing.
> > > > >
> > > >
> > > >
> > > > Would it be possible to run further experiments - in particular:
> > > >
> > > > - Split the reported time into the time necessary to download the RDF
> > > > representation from the server, and the time necessary to parse and
> > > > load the RDF as a Jena object.  The reason for asking is that these
> > > > two parts can be optimized by different approaches (minimizing file
> > > > size by prefixing or compression for the former, and exploiting
> > > > different Jena storage models for the latter).
> > >
> > >
> > > Yes, that was in my plans.
> > >
> >
> > OK, will be looking for the results.
> >
> >
> > > >
> > > > - Report time to parse RDF into different in-memory Jena models (ones
> > > > from
> > > > http://jena.sourceforge.net/javadoc/com/hp/hpl/jena/ontology/OntModelSpec.html
> > > > (not sure which is being used in the tests now)
> > >
> > >
> > > That would be also an interesting experiment.
> > >
> > > >
> > > > - Report timings using a slightly different approach to convert to
> > > > Weka instances, namely: retrieve the URIs of the compounds first and
> > > > then retrieve the features for each compound in subsequent calls.
> > > >
> > >
> > >
> > > Well, the time needed to convert the dataset object into Instances is
> > > relatively small. Do you think this needs to be optimized further? We
> > > can run the experiment, however.
> > >
> > >
> > In my experience, Weka instances work fine up to a certain limit
> > (depending on the memory available to the JVM), but it is not possible to
> > work with Weka for large datasets.  Moreover, I assume that in the current
> > test setup the data is at least duplicated (once in the RDF model and once
> > as Weka instances), which makes the memory consumption worse. Hence the
> > suggestion to load the RDF in small chunks (e.g. per compound), create the
> > Weka instances and immediately discard the RDF objects.
> >
>
>
> Sounds good, but the core object we work on is
> 'org.opentox.toxotis.core.Dataset'. The Weka Instances object comes as the
> result of the conversion of that object, so what we need to keep in memory
> is this object. After the parsing, we invoke the method 'close()' on the
> OntModel object. I don't know exactly what it does, as the documentation
> provided by Jena is a bit fuzzy:
>
> "Close the Model and free up resources held.
> Not all implementations of Model require this method to be called. But
> some do, so in general its best to call it when done with the object,
> rather than leave it to the finalizer. "
>
>
I think it closes the database connection for persistent storage, but
perhaps does nothing for the in-memory storage.
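
In any case it should be safe to treat the model like any other releasable
resource. A minimal pattern, just a sketch (the dataset URL below is only an
example):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class ParseAndClose {
    public static void main(String[] args) {
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);
        try {
            // download + parse in one go
            model.read("http://apps.ideaconsult.net:8080/ambit2/dataset/10");
            // ... build the in-house Dataset object from the model here ...
        } finally {
            // releases DB/file handles for persistent backends; cheap (possibly
            // a no-op) for purely in-memory models, so calling it is always safe
            model.close();
        }
    }
}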

> But I guess it releases all resources related to the model.
>
> What we do now is that we download the RDF, use Jena to parse it into an
> OntModel and then parse the OntModel into an in-house object which is
> easier to access and modify and provides methods for its conversion into
> a weka.core.Instances object etc. Do you suggest that the remote dataset
> should be downloaded in chunks or that as soon as it is downloaded we
> should parse it in chunks?
>
>
Here, just for comparison with other (current) approaches - for ToxPredict
I've found this to be the fastest way to get things done.
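
Roughly what I mean, with plain Jena and the Weka 3.6 API rather than the
ToxOtis classes (a sketch only: listCompoundUris() is a hypothetical helper,
how the compound URIs are obtained from the dataset service is not shown, and
the feature-value lookup is left as a placeholder):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

import java.util.List;

public class ChunkedConversion {

    // Hypothetical helper: fetch the compound URIs of a dataset (e.g. from a
    // text/uri-list representation); service-specific, not shown here.
    static List<String> listCompoundUris(String datasetUri) {
        throw new UnsupportedOperationException("service-specific");
    }

    static Instances toInstances(String datasetUri, List<String> featureUris) {
        FastVector attrs = new FastVector();
        for (String f : featureUris) {
            attrs.addElement(new Attribute(f));        // one numeric attribute per feature
        }
        Instances data = new Instances(datasetUri, attrs, 0);

        for (String compoundUri : listCompoundUris(datasetUri)) {
            // small, short-lived model holding the RDF of a single compound only
            OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);
            m.read(compoundUri);                        // GET + parse one compound

            Instance row = new Instance(attrs.size());  // all values missing initially
            for (int i = 0; i < featureUris.size(); i++) {
                // look up the value of featureUris.get(i) for this compound in m,
                // following the OpenTox data model; omitted here for brevity
                double value = Double.NaN;              // placeholder
                row.setValue((Attribute) attrs.elementAt(i), value);
            }
            data.add(row);

            m.close();                                  // discard the RDF immediately
        }
        return data;
    }
}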


>
> >
> > >
> > > > - Report timings when using Jena persistent storages
> > > > (http://openjena.org/TDB/, http://openjena.org/SDB/) instead of the
> > > > in-memory one.
> > >
> > >
> > > I don't think that persistence will outperform the in-memory storage in
> > > terms of computational time, but it will probably allocate less memory.
> > >
> >
> > As memory consumption is the bottleneck, it surely will; persistent
> > storages are usually optimized to do so.  There are existing benchmarks
> > showing that in-memory Jena performs worst.
> >
>
>
> We don't have any memory leaks :-)


Good for you :)

Seriously, it's not about leaks, but about duplicating the same information
and holding all of it in memory - it's a finite resource even without leaks :)
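
On that note, the OntModelSpec comparison I mentioned earlier should be easy
to script; a rough sketch (the dataset URL is just an example, and the
Runtime-based heap figures are only approximate):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class OntModelSpecBenchmark {
    public static void main(String[] args) {
        String uri = "http://apps.ideaconsult.net:8080/ambit2/dataset/10";
        OntModelSpec[] specs = { OntModelSpec.OWL_MEM,               // no reasoner
                                 OntModelSpec.OWL_DL_MEM,            // no reasoner, OWL-DL profile
                                 OntModelSpec.OWL_MEM_MICRO_RULE_INF // lightweight rule reasoner
                               };
        String[] labels = { "OWL_MEM", "OWL_DL_MEM", "OWL_MEM_MICRO_RULE_INF" };

        for (int i = 0; i < specs.length; i++) {
            long heapBefore = usedHeap();
            long start = System.currentTimeMillis();
            OntModel m = ModelFactory.createOntologyModel(specs[i]);
            m.read(uri);                                 // download + parse with this spec
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(labels[i] + ": " + elapsed + " ms, "
                    + m.size() + " statements, ~"
                    + (usedHeap() - heapBefore) / 1024 + " kB extra heap");
            m.close();
        }
    }

    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();                                         // rough estimate only
        return rt.totalMemory() - rt.freeMemory();
    }
}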


> The attached image shows the resource
> allocation while downloading and processing
> http://apps.ideaconsult.net:8080/ambit2/dataset/10 (1000 compounds, 60
> features). The black arrow indicates the time instant when the procedure
> started (the purple dots also mark the start and end timestamps), the red
> dots on the second figure indicate the memory allocation, and the third
> figure visualizes the three parts of the procedure:
>
> 1. The request is sent to the remote server; waiting for the response.
> 2. The dataset is downloaded and parsed into an OntModel at the same time.
> 3. The OntModel object is further processed.
>
>
> >
> > > Apart from that, I don't think that such persistence is needed on the
> > > client side, since the data are persistent on the server.
> > >
> >
> > IMHO in-memory Jena models will simply not work for datasets larger than
> > a few thousand entries, especially if the code runs as a server
> > application (e.g. ToxPredict) and should support multiple simultaneous
> > users.
>
>
> But then it will also be impossible to train a model based on the whole
> domain of data (like MLR or SVM); it will still be possible, though, to
> "train" local models like kNN. I mean that the memory allocation problem in
> that case is more than a matter of parsing! Moreover, there should be a
> blocking queue for the threads that cannot run simultaneously, and this of
> course depends on the characteristics of the server.
>

That's why I suggest testing the behaviour with a persistent Jena model:
one uses it via the same API as the in-memory one, but the persistent
storage handles all the rest. BTW, Jena provides methods to lock the model.
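
Roughly what I have in mind - a sketch only, assuming the TDB-style API with
the usual OntModel layered on top, so the parsing/conversion code would stay
the same (the storage directory and the dataset URL are just examples):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.shared.Lock;
import com.hp.hpl.jena.tdb.TDBFactory;

public class PersistentDatasetModel {
    public static void main(String[] args) {
        // triples go to disk instead of the heap; the directory is reusable
        Model base = TDBFactory.createModel("/tmp/toxotis-tdb");

        // same OntModel API as in the in-memory case, just backed by TDB
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM, base);

        model.enterCriticalSection(Lock.WRITE);   // Jena's built-in model locking
        try {
            model.read("http://apps.ideaconsult.net:8080/ambit2/dataset/10");
        } finally {
            model.leaveCriticalSection();
        }

        model.enterCriticalSection(Lock.READ);    // concurrent readers are fine
        try {
            System.out.println(model.size() + " statements in the persistent model");
        } finally {
            model.leaveCriticalSection();
        }

        model.close();                            // releases the TDB storage
    }
}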


>
> >
> > Besides time, could you also record memory-related stats?
>
> Yes, that's a good idea! Is it OK if I just use the JVM's own monitoring
> to record the memory allocation?
>
> I mean the following methods:
>
> Runtime.getRuntime().freeMemory();
> Runtime.getRuntime().maxMemory();
> Runtime.getRuntime().totalMemory();
> ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
>
> and for CPU usage:
>
> ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
>

Let's try.
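
A sketch of what such a measurement could look like, with the download and
the Jena parsing timed separately by buffering the representation first
(plain java.net plus Jena; the dataset URL is just an example):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.lang.management.ManagementFactory;
import java.net.HttpURLConnection;
import java.net.URL;

public class DownloadVsParse {
    public static void main(String[] args) throws Exception {
        String uri = "http://apps.ideaconsult.net:8080/ambit2/dataset/10";

        // phase 1: download only, into a byte[] buffer
        long t0 = System.currentTimeMillis();
        HttpURLConnection con = (HttpURLConnection) new URL(uri).openConnection();
        con.setRequestProperty("Accept", "application/rdf+xml");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        InputStream in = con.getInputStream();
        byte[] chunk = new byte[8192];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        in.close();
        long t1 = System.currentTimeMillis();

        // phase 2: parse the buffered RDF into an in-memory OntModel
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);
        model.read(new ByteArrayInputStream(buf.toByteArray()), uri);
        long t2 = System.currentTimeMillis();

        long heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();

        System.out.println("download: " + (t1 - t0) + " ms (" + buf.size() + " bytes)");
        System.out.println("parse:    " + (t2 - t1) + " ms, " + model.size() + " statements");
        System.out.println("heap used: " + heap / (1024 * 1024) + " MB, load avg: " + load);

        model.close();
    }
}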


Nina


>
> Cheers,
> Pantelis
>
> >
> >
> > >
> > > >
> > > > If we find an optimal setting after these experiments, the next step
> > > > would be trying to work with datasets comparable in size to the raw
> > > > malaria data.  Ideally, it would be nice to compare with RDF libraries
> > > > other than Jena, but this may require more effort.
> > > >
> > >
> > >  We might find such a comparison online; otherwise we could run some
> > > tests!
> > >
> >
> > Yes indeed.
> >
> > Nina
> >
> >
> > >
> > >
> > > > Best regards,
> > > > Nina
> > > >
> > > >
> > > >
> > > > >
> > > > > Best regards,
> > > > > Pantelis S.
> > > > >