[OTDev] Experiments with RDF
Nina Jeliazkova jeliazkova.nina at gmail.com
Fri Oct 1 16:36:44 CEST 2010
- Previous message: [OTDev] Experiments with RDF
- Next message: [OTDev] Experiments with RDF
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Fri, Oct 1, 2010 at 5:12 PM, chung <chvng at mail.ntua.gr> wrote:

> On Fri, 2010-10-01 at 13:40 +0300, Nina Jeliazkova wrote:
> > Hi Pantelis,
> >
> > On Fri, Oct 1, 2010 at 1:20 PM, chung <chvng at mail.ntua.gr> wrote:
> > > Hi Nina,
> > > Thanks for your input!
> > >
> > > On Fri, 2010-10-01 at 08:32 +0300, Nina Jeliazkova wrote:
> > > > Hi Pantelis, All,
> > > >
> > > > On Thu, Sep 30, 2010 at 9:33 PM, chung <chvng at mail.ntua.gr> wrote:
> > > > > Hi all,
> > > > > During a round (rectangular, in fact) table discussion in Rhodes we
> > > > > questioned the efficiency of web services based on RDF, and in
> > > > > particular on its OWL-DL variant. I gathered some statistics using
> > > > > ToxOtis while experimenting with downloading and parsing datasets.
> > > > > We have also tested the performance of ToxOtis in converting dataset
> > > > > objects into Weka objects (weka.core.Instances); the latter are
> > > > > useful to users of Weka. These are preliminary results and we must
> > > > > not jump to conclusions, but we can start a discussion around some
> > > > > performance issues. Java developers may use ToxOtis as a kind of
> > > > > client-side profiler for their services. Find attached a draft
> > > > > report that attempts to correlate the size of a dataset with the
> > > > > computational effort needed for its parsing.
> > > >
> > > > Would it be possible to run further experiments - in particular:
> > > >
> > > > - Split the reported time into the time necessary to download the RDF
> > > > representation from the server and the time necessary to parse and
> > > > load the RDF as a Jena object. The reason for asking is that these two
> > > > parts can be optimized by different approaches (minimizing the file
> > > > size by prefixing or compression for the former, and exploiting
> > > > different Jena storage models for the latter).
> > >
> > > Yes, that was in my plans.
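The download/parse split suggested above could be measured with a simple two-phase timer along these lines. This is only a sketch: the two Runnable bodies are placeholders, not actual ToxOtis or Jena calls (in a real test the first would fetch the dataset bytes over HTTP and the second would feed them to the RDF parser).

```java
// Sketch: time the HTTP download separately from the RDF parse, so the two
// can be optimized independently. PhaseTimer and the phase bodies are
// hypothetical placeholders, not part of ToxOtis.
public class PhaseTimer {

    /** Run one phase and return its wall-clock duration in milliseconds. */
    static long timeMillis(Runnable phase) {
        long t0 = System.nanoTime();
        phase.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        // Placeholder: read the RDF bytes from the dataset URI.
        long download = timeMillis(() -> { /* fetch bytes over HTTP */ });
        // Placeholder: parse the downloaded bytes into a Jena model.
        long parse = timeMillis(() -> { /* e.g. OntModel.read(...) */ });
        System.out.println("download=" + download + "ms parse=" + parse + "ms");
    }
}
```

Reporting the two numbers separately would show directly whether compression (download) or a different storage model (parse) is the more promising optimization.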
> > OK, will be looking for the results.
> >
> > > > - Report the time to parse RDF into the different in-memory Jena
> > > > models (the ones from
> > > > http://jena.sourceforge.net/javadoc/com/hp/hpl/jena/ontology/OntModelSpec.html
> > > > - not sure which one is being used in the tests now).
> > >
> > > That would also be an interesting experiment.
> > >
> > > > - Report timings using a slightly different approach to convert to
> > > > Weka instances, namely: retrieve the URIs of the compounds first, and
> > > > then retrieve the features for each compound in subsequent calls.
> > >
> > > Well, the time needed to convert the dataset object into Instances is
> > > relatively small. Do you think this needs to be optimized further? We
> > > can do the experiment, however.
> >
> > In my experience, Weka instances work fine up to a certain limit
> > (depending on the memory available to the JVM), but it is not possible to
> > work with Weka for large datasets. Moreover, I assume that in the current
> > test setup the data is at least duplicated (once in the RDF model and
> > once as Weka instances), which makes the memory consumption worse. Hence
> > the suggestion to load the RDF in small chunks (e.g. per compound),
> > create Weka instances, and immediately discard the RDF instances.
>
> Sounds good, but the core object on which we work is
> 'org.opentox.toxotis.core.Dataset'. Instances (of Weka) come as the
> result of the conversion of that object, so what we need to keep in
> memory is this object. After the parsing, we invoke the method 'close()'
> on the OntModel object. I don't know exactly what it does, as the
> documentation provided by Jena is a bit fuzzy:
>
> "Close the Model and free up resources held.
> Not all implementations of Model require this method to be called. But
> some do, so in general it's best to call it when done with the object,
> rather than leave it to the finalizer."

I think it closes the database connection for persistent storage, but it perhaps does nothing for the in-memory storage. I guess, though, that it releases all resources related to the model.

> What we do now is: we download the RDF, use Jena to parse it into an
> OntModel, and then parse the OntModel into an in-house object which is
> easier to access and modify, and which provides methods for its
> conversion into a weka.core.Instances object, etc. Do you suggest that
> the remote dataset should be downloaded in chunks, or that as soon as it
> is downloaded we should parse it in chunks?

Here, just for comparison with other (current) approaches - I've found that for ToxPredict it's the fastest way to get things done.

> > > > - Report timings when using Jena persistent storages instead of the
> > > > in-memory one (http://openjena.org/TDB/, http://openjena.org/SDB/).
> > >
> > > I don't think that persistence will outperform the in-memory storage
> > > in terms of computational time, but it will probably allocate less
> > > memory.
> >
> > As the memory consumption is the bottleneck, it surely will; persistent
> > storages are usually optimized for exactly that. There are existing
> > benchmarks showing that in-memory Jena performs worst.
>
> We don't have any memory leaks :-)

Good for you :) Seriously, it's not about leaks, but about duplicating the same information and holding it all in memory - memory is a finite resource even without leaks :)

> The attached image shows the resource allocation while downloading and
> processing http://apps.ideaconsult.net:8080/ambit2/dataset/10
> (1000 compounds, 60 features).
> The black arrow indicates the time instant when the procedure started
> (the purple dots also indicate the start and end timestamps), the red
> dots on the second figure indicate the memory allocation, and the third
> figure visualizes the three parts of the procedure:
>
> 1. The request is sent to the remote server; waiting for the response.
> 2. The dataset is downloaded and parsed into an OntModel at the same time.
> 3. The OntModel object is further processed.
>
> > > Apart from that, I don't think that such persistence is needed on the
> > > client side, since the data are persistent on the server.
> >
> > IMHO, in-memory Jena models will simply not work for datasets of more
> > than a few thousand entries, especially if the code runs as a server
> > application (e.g. ToxPredict) and should support multiple simultaneous
> > users.
>
> But then it will also be impossible to train a model based on the whole
> domain of data (like MLR or SVM); it will, though, be possible to "train"
> local models like kNN. I mean, the memory allocation problem in that case
> is more than a matter of parsing! Moreover, there should be a blocking
> queue for the threads that cannot run simultaneously, and this of course
> depends on the characteristics of the server.

That's why the suggestion to test the behaviour with a persistent Jena model: one uses it via the same API as the in-memory one, but the persistent storage handles all the rest. BTW, Jena provides methods to lock the model.

> > Besides time, could you record also memory-related stats?
>
> Yes, that's a good idea! Is it OK if I use just the monitor of the JVM
> to record the memory allocation? I mean the following methods:
>
> Runtime.getRuntime().freeMemory();
> Runtime.getRuntime().maxMemory();
> Runtime.getRuntime().totalMemory();
> ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
>
> and for CPU usage:
>
> ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();

Let's try.
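For reference, the JVM calls listed above can be wired into a small sampler using only the standard library; the class name and output format below are arbitrary choices, not ToxOtis code.

```java
import java.lang.management.ManagementFactory;

// Minimal sampler around the Runtime/JMX calls quoted above.
public class JvmStats {

    /** Currently used heap, in bytes, via the memory MXBean. */
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean()
                .getHeapMemoryUsage().getUsed();
    }

    /** One-minute system load average; may be negative on platforms
     *  where it is unavailable (e.g. Windows). */
    static double systemLoad() {
        return ManagementFactory.getOperatingSystemMXBean()
                .getSystemLoadAverage();
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("free="     + rt.freeMemory());
        System.out.println("max="      + rt.maxMemory());
        System.out.println("total="    + rt.totalMemory());
        System.out.println("heapUsed=" + usedHeapBytes());
        System.out.println("load="     + systemLoad());
    }
}
```

Sampling these values on a timer (e.g. once per second) during the download/parse/convert phases would give the memory curve alongside the timing data; note that `freeMemory()` is relative to the currently committed heap (`totalMemory()`), not to the `-Xmx` maximum.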
Nina

> Cheers,
> Pantelis
>
> > > > If we find an optimal setting after these experiments, the next
> > > > step would be trying to work with datasets comparable in size to
> > > > the raw malaria data. Ideally, it would be nice to compare with RDF
> > > > libraries other than Jena, but this may require more effort.
> > >
> > > We might find such a comparison online; otherwise we could run some
> > > tests!
> >
> > Yes indeed.
> >
> > Nina
> >
> > > > Best regards,
> > > > Nina
> > > >
> > > > > Best regards,
> > > > > Pantelis S.
>
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
More information about the Development mailing list