[OTDev] Experiments with RDF

chung chvng at mail.ntua.gr
Fri Oct 1 16:12:24 CEST 2010


On Fri, 2010-10-01 at 13:40 +0300, Nina Jeliazkova wrote:

> Hi Pantelis,
> 
> On Fri, Oct 1, 2010 at 1:20 PM, chung <chvng at mail.ntua.gr> wrote:
> 
> > Hi Nina,
> >  Thanks for your input!
> >
> >
> > On Fri, 2010-10-01 at 08:32 +0300, Nina Jeliazkova wrote:
> >
> > > Hi Pantelis, All,
> > >
> > >
> > > On Thu, Sep 30, 2010 at 9:33 PM, chung <chvng at mail.ntua.gr> wrote:
> > >
> > > > Hi all,
> > > >  During a round (rectangle, in fact) table discussion in Rhodes we
> > > > questioned the efficiency of web services based on RDF and in
> > > > particular its OWL-DL variant. I gathered some statistics using
> > > > ToxOtis while experimenting with downloading and parsing datasets.
> > > > We also tested the performance of ToxOtis in converting dataset
> > > > objects into weka objects (weka.core.Instances); the latter are
> > > > useful to users of Weka. These are preliminary results and we must
> > > > not jump to conclusions, but we can start a discussion around some
> > > > performance issues. Java developers may use ToxOtis as a kind of
> > > > client-side profiler for their services. Find attached a draft
> > > > report that attempts to correlate the size of a dataset with the
> > > > computational effort needed to parse it.
> > > >
> > >
> > >
> > > Would it be possible to run further experiments - in particular:
> > >
> > > - Split the reported time into the time needed to download the RDF
> > > representation from the server and the time needed to parse and load
> > > the RDF as a Jena object.  The reason for asking is that these two
> > > parts can be optimized by different approaches (minimizing file size by
> > > prefixing or compression for the former, and exploiting different Jena
> > > storage models for the latter).
> >
> >
> > Yes, that was in my plans.
> >
> 
> OK, will be looking for the results.
> 
> 
> > >
> > > - Report the time to parse RDF into the different in-memory Jena models
> > > (the ones from
> > > http://jena.sourceforge.net/javadoc/com/hp/hpl/jena/ontology/OntModelSpec.html;
> > > not sure which one is being used in the tests now)
> >
> >
> > That would be also an interesting experiment.
> >
> > >
> > > - Report timings using a slightly different approach to convert to weka
> > > instances, namely retrieve the URIs of the compounds first and then
> > > retrieve the features for each compound in subsequent calls.
> > >
> >
> >
> > Well, the time needed to convert the dataset object into Instances is
> > relatively small. Do you think this needs to be optimized further? We
> > can run the experiment, however.
> >
> >
> In my experience, Weka instances work fine up to a certain limit (depending
> on the memory available to the JVM), but it is not possible to work with
> Weka for large datasets.  Moreover, I assume that in the current test setup
> the data is at least duplicated (once in the RDF model and once as Weka
> instances), which makes the memory consumption worse.  Hence the suggestion
> to load the RDF in small chunks (e.g. per compound), create weka instances
> and immediately discard the RDF instances.
> 

 
Sounds good, but the core object on which we work is
'org.opentox.toxotis.core.Dataset'. The weka Instances object comes as the
result of converting that object, so what we need to keep in memory is the
Dataset itself. After the parsing, we invoke the method 'close()' on the
OntModel object. I don't know exactly what it does, as the documentation
provided by Jena is a bit fuzzy:

"Close the Model and free up resources held. 
Not all implementations of Model require this method to be called. But
some do, so in general its best to call it when done with the object,
rather than leave it to the finalizer. "

But I guess it releases all resources related to the model.

What we do now is download the RDF, use Jena to parse it into an
OntModel, and then parse the OntModel into an in-house object that is
easier to access and modify and provides methods for its conversion into
a weka.core.Instances object, etc. Do you suggest that the remote dataset
should be downloaded in chunks, or that as soon as it is downloaded we
should parse it in chunks?
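
In code, what we do now looks roughly like this (just a sketch: the
ToxOtis-specific steps are only indicated in comments and are not the real
API, and OWL_DL_MEM is an assumption, I would have to check which spec we
actually use):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class CurrentFlow {
    public static void main(String[] args) throws Exception {
        // 1. Ask the remote service for the RDF representation of the dataset.
        URLConnection con = new URL(
            "http://apps.ideaconsult.net:8080/ambit2/dataset/10").openConnection();
        con.setRequestProperty("Accept", "application/rdf+xml");

        // 2. Download and parse into an in-memory OntModel; Jena consumes the
        //    stream as it arrives, so downloading and parsing overlap.
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);
        InputStream in = con.getInputStream();
        try {
            model.read(in, null);
        } finally {
            in.close();
        }

        // 3. ToxOtis then walks the OntModel and builds the in-house Dataset,
        //    which in turn knows how to produce weka.core.Instances.
        //    Placeholder calls, for illustration only:
        //    Dataset ds = ...parse the OntModel...;
        //    Instances data = ds.getInstances();

        // Release whatever resources the model holds once we are done.
        model.close();
    }
}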


> 
> >
> > > - Report timings when using Jena persistent storages instead of the
> > > in-memory one (http://openjena.org/TDB/, http://openjena.org/SDB/)
> >
> >
> > I don't think that persistence will outperform the in-memory storage in
> > terms of computational time, but it will probably allocate less memory.
> >
> 
> As the memory consumption is the bottleneck, it surely will do; persistent
> storages are usually optimized to do so.  There are existing benchmarks
> showing in-memory Jena performs worst.
> 


We don't have any memory leaks :-) The attached image shows the resource
allocation while downloading and processing
http://apps.ideaconsult.net:8080/ambit2/dataset/10 (1000 compounds, 60
features). The black arrow indicates the time instant when the procedure
started (the purple dots also mark the start and end timestamps), the red
dots on the second figure indicate the memory allocation, and the third
figure visualizes the three parts of the procedure:

1. The request is sent to the remote server and we wait for the response.
2. The dataset is downloaded and parsed into an OntModel at the same time.
3. The OntModel object is further processed.
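
For the earlier point about splitting the download time from the parsing
time, one option would be to buffer the response first, so that the two
phases can be measured separately. A rough sketch (again, OWL_DL_MEM and
the dataset URI are just examples):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class PhaseTimings {
    public static void main(String[] args) throws Exception {
        URLConnection con = new URL(
            "http://apps.ideaconsult.net:8080/ambit2/dataset/10").openConnection();
        con.setRequestProperty("Accept", "application/rdf+xml");

        // Phases 1-2: download the RDF into memory so that its duration (and
        // size) can be measured independently of the parsing.
        long t0 = System.nanoTime();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        InputStream in = con.getInputStream();
        byte[] chunk = new byte[8192];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buffer.write(chunk, 0, n);
        }
        in.close();
        long tDownload = System.nanoTime() - t0;

        // Parsing only: load the buffered RDF into an in-memory OntModel.
        t0 = System.nanoTime();
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);
        model.read(new ByteArrayInputStream(buffer.toByteArray()), null);
        long tParse = System.nanoTime() - t0;

        System.out.printf("download: %.2f s (%d bytes), parse: %.2f s%n",
            tDownload / 1e9, buffer.size(), tParse / 1e9);
        model.close();
    }
}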


> 
> > Apart from that, I don't think that such persistence is needed on the
> > client side, since the data are persistent on the server.
> >
> 
> IMHO in-memory Jena models will simply not work for datasets of more than a
> few thousand entries, especially if the code runs as a server application
> (e.g. ToxPredict) and has to support multiple simultaneous users.


But then it will also be impossible to train a model on the whole domain
of data (like MLR or SVM); it will still be possible to "train" local
models like kNN, though. What I mean is that in that case the memory
allocation problem is more than a matter of parsing! Moreover, there
should be a blocking queue for the threads that cannot run simultaneously,
and this of course depends on the characteristics of the server.
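
For the blocking queue, something along these lines is what I have in mind
(only a sketch; the pool size of 2 is an arbitrary example and would depend
on the memory and CPUs of the server):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TrainingQueue {

    // At most two training tasks run at the same time; any further
    // submissions wait in the queue until a slot is free.
    private final ExecutorService pool = new ThreadPoolExecutor(
        2, 2, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());

    public void submit(Runnable trainingTask) {
        pool.submit(trainingTask);
    }

    public void shutdown() {
        pool.shutdown();
    }
}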

> 
> Besides time, could you record also memory related stats?

Yes, that's a good idea! Is it OK if I just use the JVM's own monitoring
to record the memory allocation?

I mean the following methods:

Runtime.getRuntime().freeMemory();
Runtime.getRuntime().maxMemory();
Runtime.getRuntime().totalMemory();
ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();

and for CPU usage:

ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
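
Something like the following background thread is what I have in mind,
wrapping exactly those calls (the one-second sampling interval is an
arbitrary choice):

import java.lang.management.ManagementFactory;

public class ResourceMonitor implements Runnable {

    private volatile boolean running = true;

    public void run() {
        Runtime rt = Runtime.getRuntime();
        while (running) {
            // Heap usage from the MXBean; free/total/max from the Runtime.
            long used = ManagementFactory.getMemoryMXBean()
                .getHeapMemoryUsage().getUsed();
            double load = ManagementFactory.getOperatingSystemMXBean()
                .getSystemLoadAverage();
            System.out.printf("heap used %d, total %d, max %d, free %d, load %.2f%n",
                used, rt.totalMemory(), rt.maxMemory(), rt.freeMemory(), load);
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    public void stop() {
        running = false;
    }
}

It could be started with new Thread(new ResourceMonitor()).start() just
before the request is sent and stopped once the Instances object is ready.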

Cheers,
Pantelis

> 
> 
> >
> > >
> > > If we find an optimal setting after these experiments, the next step
> > > would be to try working with datasets comparable in size to the raw
> > > malaria data.  Ideally, it would be nice to compare with RDF libraries
> > > other than Jena, but this may require more effort.
> > >
> >
> >  We might find such a comparison online; otherwise we could run some
> > tests!
> >
> 
> Yes indeed.
> 
> Nina
> 
> 
> >
> >
> > > Best regards,
> > > Nina
> > >
> > >
> > >
> > > >
> > > > Best regards,
> > > > Pantelis S.
> > > >


-------------- next part --------------
A non-text attachment was scrubbed...
Name: ResourceAllocation.jpeg
Type: image/jpeg
Size: 296691 bytes
Desc: not available
URL: <http://lists.opentox.org/pipermail/development/attachments/20101001/e1173c4f/attachment.jpeg>

