[OTDev] Dataset RDF

Thu Dec 3 13:49:02 CET 2009

Dear Christoph, All,

Christoph Helma wrote:
> Dear Nina, all,
>
> My main point was not the representation of multiple features/compound
> (my example was too simplified), but how to
>
> - indicate that a collection of triples (i.e. graph) belongs to a certain dataset
>   
Add a property (a predicate) relating the dataset to the
compound-feature-value triple (or its equivalent, the data entry).
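
A minimal sketch of that extra predicate, with triples held as plain
Python tuples. The namespace URI, the example dataset URI and the use of
ot:dataEntry as the predicate name are assumptions made for illustration
only, not the agreed OpenTox vocabulary.

```python
# Triples as plain (subject, predicate, object) tuples; no RDF library needed.
OT = "http://example.org/opentox#"        # hypothetical namespace

dataset = "http://myservice/dataset/1"    # example dataset URI
entries = ["_:entry1", "_:entry2"]        # blank-node labels for data entries

triples = []
for entry in entries:
    # One additional predicate ties each data entry to its dataset.
    triples.append((dataset, OT + "dataEntry", entry))


def entries_of(graph, dataset_uri):
    """Return all data entries linked to the given dataset."""
    return [o for s, p, o in graph
            if s == dataset_uri and p == OT + "dataEntry"]
```

With this in place, entries_of(triples, dataset) returns the two entry
nodes, and any other dataset URI returns an empty list.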
> - represent metadata about a dataset
>   
If all metadata can be represented as e.g. DC properties, this is as
simple as adding DC.title, DC.creator, etc. properties to the dataset
object.
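
For illustration, a small sketch of attaching Dublin Core properties to
a dataset node, again with triples as tuples. The dataset URI and the
creator value are made-up example values; the title is borrowed from the
metadata example quoted further down.

```python
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace


def dataset_metadata(dataset_uri, title, creator):
    """Return the metadata triples for one dataset."""
    return [
        (dataset_uri, DC + "title", title),
        (dataset_uri, DC + "creator", creator),
    ]


ds = "http://myservice/dataset/1"          # example dataset URI
meta = dataset_metadata(ds, "Multi Cell Call prediction from lazar",
                        "example creator")  # creator is a placeholder value
```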
> Maybe some of my confusion arises also from the fact that
>
> - I have to insert triples (and create anonymous nodes) "by hand" with
>   Redland (AFAIK there is no automated mechanism to create more complex
> 	statements - but the documentation is very sketchy)
>   
Same for other languages. I have put some examples up over the last few
days at
http://opentox.org/data/documents/development/RDF%20files/JavaOnly/JenaExamples
and these should be more or less similar for all languages.

> - I have problems translating the syntactic sugar of your examples into
>   bare-bones triples
>   
Well, this is a good point; I can add examples in N-Triples format.
Personally, I switch to "triple" mode in Protege to examine the triples.
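
As a rough idea of what bare-bones triples look like, here is a sketch
that prints tuples in N-Triples style. It handles only URIs, blank nodes
and plain string literals (no datatypes or language tags), and the ot
namespace URI is a placeholder.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "http://purl.org/dc/elements/1.1/"
OT = "http://example.org/opentox#"   # hypothetical namespace


class Literal(str):
    """Marks a string as an RDF literal rather than a URI."""


def nt_term(term):
    """Format one term as a literal, a blank node or a URI reference."""
    if isinstance(term, Literal):
        escaped = (term.replace("\\", "\\\\").replace('"', '\\"')
                       .replace("\n", "\\n").replace("\r", "\\r"))
        return '"' + escaped + '"'
    if term.startswith("_:"):
        return term
    return "<" + term + ">"


def to_ntriples(triples):
    """Serialize (s, p, o) tuples, one statement per line."""
    return "\n".join(" ".join(nt_term(t) for t in triple) + " ."
                     for triple in triples)


triples = [
    ("_:c1", RDF_TYPE, OT + "compound"),
    ("_:c1", DC + "identifier", Literal("http://myservice/compound/1")),
]
print(to_ntriples(triples))
```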

> I have e.g. a feature generation service, that creates a dataset with
> features and sends it to the dataset service. If I understand your
> example and Redland correctly I would have to do the following steps to
> create the proposed structure:
>
> 	- create an anonymous node for each compound
> 	- assert that the compound is a ot:compound
> 	- set the identifier URI for the compound
>
> 	- create an anonymous node for each feature
> 	- assert that the feature is a ot:feature
> 	- set the identifier URI for the feature
> 	- set the title of the feature
> 	- set the source of the feature
>
> 	- create an anonymous node for each feature value
> 	- assert that the value is a ot:FeatureValue
> 	- define the ot:feature of the value
> 	- assert the literal value
>
> 	- create an anonymous node for each data entry
> 	- assert that the data entry is a ot:dataEntry
> 	- assert that the compound is a ot:compound
> 	- assert for each feature value that it is a ot:values
>
> 	- create an anonymous node for the dataset
> 	- assert that the dataset is a dataset
> 	- set the identifier URI for the dataset (this has to be rewritten by
> 		the dataset service!)
> 	- insert all data entry nodes
>   
More or less, yes (you might use anonymous nodes or named ones).
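
The steps listed above could be sketched roughly as follows, using plain
tuples instead of a real RDF library. The class and property names
ot:compound, ot:feature, ot:FeatureValue, ot:dataEntry and ot:values are
taken from the list; the namespace URI, ot:Dataset, ot:value, the use of
ot:dataEntry as the dataset-to-entry predicate and the choice of
dc:identifier, dc:title and dc:source are assumptions.

```python
import itertools

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "http://purl.org/dc/elements/1.1/"
OT = "http://example.org/opentox#"   # hypothetical namespace

_ids = itertools.count(1)


def bnode():
    """Return a fresh blank-node label such as '_:b1'."""
    return "_:b%d" % next(_ids)


def build_dataset(dataset_uri, feature, rows):
    """feature: dict with uri, title, source; rows: (compound_uri, value)."""
    g = []
    f = bnode()                      # one node for the feature
    g += [(f, RDF_TYPE, OT + "feature"),
          (f, DC + "identifier", feature["uri"]),
          (f, DC + "title", feature["title"]),
          (f, DC + "source", feature["source"])]
    ds = bnode()                     # one node for the dataset
    g += [(ds, RDF_TYPE, OT + "Dataset"),
          (ds, DC + "identifier", dataset_uri)]
    for compound_uri, value in rows:
        c, v, e = bnode(), bnode(), bnode()
        g += [(c, RDF_TYPE, OT + "compound"),
              (c, DC + "identifier", compound_uri),
              (v, RDF_TYPE, OT + "FeatureValue"),
              (v, OT + "feature", f),
              (v, OT + "value", value),
              (e, RDF_TYPE, OT + "dataEntry"),
              (e, OT + "compound", c),
              (e, OT + "values", v),
              (ds, OT + "dataEntry", e)]   # insert the entry into the dataset
    return g


graph = build_dataset(
    "http://myservice/dataset/1",
    {"uri": "http://myservice/feature/1", "title": "MultiCellCall",
     "source": "dsstox"},
    [("http://myservice/compound/1", True),
     ("http://myservice/compound/2", False)],
)
```

Once wrapped in a helper like this, the per-compound work is a single
call; the verbosity lives in one place.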
> All in all this is quite a lengthy and complicated procedure for a rather
> simple task (I hope I finally have got the idea while writing this down).
>   
Well, yes, but the flexibility of triples comes with verbosity.
> I am proposing two things to reduce the complexity:
>   

> The most straightforward solution to handle sets of graphs (i.e.
> multiple datasets) is to use named graphs, context, quadruples (you name
> it - the concepts are more or less the same). Most RDF
> libraries/datastores support this, but it is not straightforward to
> express these concepts in RDF/XML. Instead of using a workaround that
> complicates things, I would suggest to let the dataset service handle
>   
It is the recommended way to create data models with triples; one could
model far more complicated things with simple predicate logic...
> datasets (see my previous post). The beneficial side effect is that
> we can simplify the RDF model to a large extent. The first dataset
>   
Thus we simplify the syntax at the expense of losing essential
functionality, which was the original reason to use RDF.

Regarding the quads, IMHO they complicate the setup, because we can't
use the most popular serialization formats and not all libraries
support contexts.
And we have a rather simple data structure (a set with some structured
entries within), which needs just one additional predicate to be modeled
without involving named graphs.

In fact I've tried a couple of times to simplify the current proposal in
Protege, but without success. This is just a non-binary relationship,
which can't be modeled with a single predicate. One can try using
rdfs:Container for the dataset, instead of a predicate relating dataset
and data entry, but this results in moving into the OWL Full language,
where automatic reasoning is much harder than in OWL DL.
Advice from experts would be highly appreciated.
> example can be eg. rewritten without any loss of information as
>
> # multiple features/compound, simple features
> 	<http://myservice/compound/{id1}> dsstox:MultiCellCall "true"^^xsd:boolean .
> 	<http://myservice/compound/{id1}> lazar:MultiCellCallPredicted "true"^^xsd:boolean .
>
> (assuming that dsstox:MultiCellCall, lazar:MultiCellCallPredicted
> provides the feature definitions).
>   
This is what I have been trying to say for a while: the assumption is
wrong. One can't mix predicates and objects. Once you have used
dsstox:MultiCellCall in the place of a predicate (property), it can't be
considered a resource anymore; you can't have statements such as
dsstox:MultiCellCall owl:sameAs something, dsstox:MultiCellCall
dc:title "something", or dsstox:MultiCellCall ot:units "something".
You can't relate this feature to Models, Validation objects, etc.
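
To illustrate the alternative, here is a small sketch in which the
feature is a node of its own, so that statements can be made about it
and feature values point to it as an object. The ot namespace URI, the
feature URI, the owl:sameAs target and ot:value are placeholders;
ot:units and ot:feature are the names used in this thread.

```python
DC = "http://purl.org/dc/elements/1.1/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
OT = "http://example.org/opentox#"                     # hypothetical namespace
feature = "http://example.org/feature/MultiCellCall"   # hypothetical URI

graph = [
    # Statements about the feature itself.
    (feature, DC + "title", "MultiCellCall"),
    (feature, OT + "units", "none"),
    (feature, OWL_SAME_AS, "http://example.org/other#MultiCellCall"),
    # A feature value that refers to the feature as an object.
    ("_:v1", OT + "feature", feature),
    ("_:v1", OT + "value", True),
]


def describe_values(g):
    """Join each feature value with the title of its feature."""
    titles = {s: o for s, p, o in g if p == DC + "title"}
    features = {s: o for s, p, o in g if p == OT + "feature"}
    return [(titles[features[s]], o) for s, p, o in g if p == OT + "value"]
```

Here describe_values(graph) pairs the stored value with the feature
title, which is the kind of query that depends on the feature being
addressable as a resource.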

If we go in this direction, we simply abandon the power of RDF/OWL
(querying, reasoning) for features/datasets and treat it as a pure
serialization format, not much different from ARFF or MS Excel. We
could just as well have stayed with XML and not lost a couple of months
educating ourselves.

If that is fine with the other partners, OK. Implementation-wise there
is no problem for ambit; I am not changing the internal structures
anyway, just adding more code to generate different serializations. But
we would lose a lot of nice querying options, the ability to link to
external ontologies, etc.

> It can be retrieved by asking for GET /dataset/{id}. The corresponding
> meta-information from GET /dataset/{id}/metadata would be
>
> 	dc:identifier "http://myservice/dataset/{id}"^^xsd:string ;
> 	dc:title "Multi Cell Call prediction from lazar"^^xsd:string ;
>
>
> The expression of more complex features is also straightforward:
>
> # multiple features/compound, more complex features
> 	<http://myservice/compound/{id1}>
> 		fminer:BBRC [
> 			fminer:smarts "NN" ;
> 			fminer:p_value "0.97" ;
> 			fminer:effect "activating"
> 		] ;
> 		fminer:BBRC [
> 			fminer:smarts "CO" ;
> 			fminer:p_value "0.95" ;
> 			fminer:effect "deactivating"
> 		] .
>
> # in explicit notation with anonymous nodes
> 	<http://myservice/compound/{id1}> fminer:BBRC _:feature1 .
> 	_:feature1 fminer:smarts "NN" .
> 	_:feature1 fminer:p_value "0.97" .
> 	_:feature1 fminer:effect "activating" .
> 	<http://myservice/compound/{id1}> fminer:BBRC _:feature2 .
>
>   
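
The translation from the bracketed form to explicit triples with
anonymous nodes, as in the quoted example, can be sketched mechanically.
The helper name expand and the dict representation are made up for
illustration; the prefixed names are kept as plain strings and are not
resolved against any namespace.

```python
def expand(subject, predicate, features, prefix="feature"):
    """Turn a list of property dicts into explicit blank-node triples."""
    triples = []
    for n, props in enumerate(features, start=1):
        node = "_:%s%d" % (prefix, n)           # e.g. _:feature1
        triples.append((subject, predicate, node))
        for p, v in props.items():
            triples.append((node, p, v))
    return triples


features = [
    {"fminer:smarts": "NN", "fminer:p_value": "0.97",
     "fminer:effect": "activating"},
    {"fminer:smarts": "CO", "fminer:p_value": "0.95",
     "fminer:effect": "deactivating"},
]
triples = expand("<http://myservice/compound/{id1}>", "fminer:BBRC", features)
for t in triples:
    print(" ".join(t) + " .")
```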
I would prefer it if you could define a data model for your proposal in
RDFS or OWL, with the ability to link features to other ontologies. This
would help us avoid a lot of misunderstanding.

I think it would be best to leave the final decision (at least until
the February deadline) to the other partners.

Best regards,
Nina
> Best regards,
> Christoph