[OTDev] RDF in OpenTox

chung chvng at mail.ntua.gr
Fri May 27 23:48:34 CEST 2011


Hi Egon,
    Thanks a lot for your comments.


On Fri, 2011-05-27 at 19:27 +0200, Egon Willighagen wrote: 

> Dear Pantelis,
> 
> On Fri, May 27, 2011 at 5:50 PM, chung <chvng at mail.ntua.gr> wrote:
> > Some criticism on RDF from the experience we've gained in OpenTox :
> > http://is.gd/qLJG3h . The article is not complete yet and will be
> > enriched with more facts and diagrams.
> 
> Please do, because right now you left out so much detail on what you
> are in fact doing. I do appreciate your frustration, and the
> difference is unacceptable.
> 
> I have these questions:
> 
> * RDF is not a format, while ARFF is for file format? you mix RDF and
> RDF/XML as if they are the same thing; why?


I use RDF to refer to RDF/XML for brevity. I should make it clear in the
text. 


> * what RDF file format have you used? RDF/XML, as you later refer to?


RDF/XML solely. I have some other measurements showing that RDF/XML
performs better than any other RDF variant (using Jena). 


> * are you using reasoning, and if so why? moreover, you should not
> compare a reasoning environment with a non-reasoning one (of course,
> you'd see differences)


No inference engines are involved.


> * what information is specified in the ARFF header?


Just the URIs of the features in the order they appear in the body of
the document. 
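
For concreteness, this is roughly what such a header looks like (the relation name and feature URIs below are made up for illustration, not copied from the actual dataset):

```
@RELATION dataset-585036

@ATTRIBUTE "http://host.example/feature/1" NUMERIC
@ATTRIBUTE "http://host.example/feature/2" NUMERIC

@DATA
0.5,1.2
```

So the header carries only the column-to-feature mapping; everything else is plain comma-separated values.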


> * why aren't you using a vector annotation in RDF?
> * how large is the file, and what are you doing to use 2GB of heap space?


2.79GB to be precise :) That measurement is taken using a Java profiler
on the following piece of code:

        VRI dataset = Services.ideaconsult().augment("dataset", "585036");
        DatasetSpider dss = new DatasetSpider(dataset);
        dss.parse();

OK, it's not very clear because it's ToxOtis code, but the only thing it
does is download an RDF document and parse it into a Jena OntModel.


> * how large is your data set?


Approximately 2600 compounds, 177 features. 
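
To put the heap measurement in perspective: stored as a dense matrix of doubles, a dataset of that size is only a few megabytes. A back-of-envelope sketch (assuming 8 bytes per double and ignoring object overhead):

```java
public class MatrixFootprint {
    public static void main(String[] args) {
        long compounds = 2600, features = 177;
        long bytes = compounds * features * 8L; // 8 bytes per double
        System.out.printf("%.2f MB%n", bytes / 1e6); // prints 3.68 MB
    }
}
```

That is roughly three orders of magnitude below the 2.79 GB of heap observed while parsing, which is exactly the gap being discussed.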


> * what does your code look like?
> 


It's layered (lots of dependencies), so I can't really present it as a
standalone piece of code. It is optimized, however, meaning that with
Jena and the current OpenTox framework it is the best we can do. I would
be very interested if someone has different results with Jena, or
experience with a better library.



> A fair comment would be take ARFF takes a short cut: it imposes
> additional structure on the data, something you identify in your
> report. RDF does not do that by itself. A vector environment does.
> That does not mean that such is not possible with RDF. Have you
> consider what options there are to introduce this vector restriction
> into the computational framework, forced to use RDF? Do you believe it
> is impossible to achieve that with RDF? Would you see it impossible to
> define an ontology to capture vector notation, allowing you to specify
> what each column in that vector represents?
> 


Were it machine-readable, I would suggest Greek as the proper
language/representation for serializing datasets. At the other extreme
we have ARFF: it carries no more than an algorithm needs. RDF lies
somewhere in the middle.

Personally I like RDF, but we should use it wisely and for the purposes
it was designed for. It is supposed to be a framework for **metadata
modeling**, not data modeling. A dataset has both parts: data +
metadata. I would suggest something like ARFF for the actual data.

As far as I know from other partners in the project, everything ends up
as ARFF or matrices. People neither exploit the flexibility of RDF nor
use it for reasoning. OK, at some point there might be a need to do
something very elaborate, and then RDF will prove invaluable. But I'm
more in favor of the scalable solution of an ARFF-like representation.

Another point I want to make is that RDF is memory-consuming by design.
Let me put it figuratively once again: if you read a book page after
page, and you forget everything once you turn to the next page, there is
no way you can say anything about the book as a whole. In the same way,
a triple near the end of an RDF document may refer to a node introduced
at the beginning, so the parser has to keep the whole graph in memory.
We either need an improvement of RDF in this direction or we have to
adopt an alternative scheme. And I don't think we can impose improvised
restrictions.
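
The contrast with line-oriented formats makes the point: with a triple-per-line dump an aggregate can be computed in a streaming fashion, forgetting each line once it is read. A minimal sketch using only the standard library (the `ot:value` predicate and the input layout are invented for illustration, and a real N-Triples stream would of course need proper parsing rather than substring matching):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class StreamingCount {

    // Count value triples one line ("page") at a time, keeping only a
    // running total; each line is discarded as soon as it is read.
    public static long countValues(BufferedReader in) throws IOException {
        long n = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains("<ot:value>")) n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        String dump =
              "<c/1> <rdf:type> <ot:Compound> .\n"
            + "<c/1> <ot:value> \"0.5\" .\n"
            + "<c/2> <ot:value> \"1.2\" .\n";
        System.out.println(
            countValues(new BufferedReader(new StringReader(dump))));
        // prints 2
    }
}
```

Here memory stays constant no matter how large the input is; an OntModel, by contrast, holds every triple until the whole document has been parsed.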


> Now, given that you do see that option too, you would probably end up
> with a ontology looking very much like the ARFF specification, but the
> in RDF.
> 
> In short, based on your report I really cannot judge of RDF is the
> problem, because your results do not make such conclusion possible.
> Instead, I rather think that you are running into a highly confounded
> analysis where it is not possible to assign the slowness to any
> factor. I think you are comparing two widely different data models,
> one optimized for computation (ARFF) and one not (your current RDF/XML
> file). Would that perhaps be the significant factor in the difference
> in speed?
> 
> I am looking forward to a more detailed report on the various involved
> factors that determine the speed here,
> 


I'll present a more detailed version of the report with more evidence.
Till then, any comments are welcome.

Best regards,
Pantelis


> Egon
> 





