[OTDev] Dataset RDF

Thu Dec 3 13:04:40 CET 2009

Dear Nina, all,

My main point was not the representation of multiple features per compound
(my example was too simplified), but how to

- indicate that a collection of triples (i.e. graph) belongs to a certain dataset
- represent metadata about a dataset

Maybe some of my confusion also arises from the fact that

- I have to insert triples (and create anonymous nodes) "by hand" with
  Redland (AFAIK there is no automated mechanism to create more complex
  statements, but the documentation is very sketchy)

- I have trouble translating the syntactic sugar of your examples into
  bare-bones triples

I have, for example, a feature generation service that creates a dataset
with features and sends it to the dataset service. If I understand your
example and Redland correctly, I would have to do the following steps to
create the proposed structure:

	- create an anonymous node for each compound
	- assert that the compound is a ot:compound
	- set the identifier URI for the compound

	- create an anonymous node for each feature
	- assert that the feature is a ot:feature
	- set the identifier URI for the feature
	- set the title of the feature
	- set the source of the feature

	- create an anonymous node for each feature value
	- assert that the value is a ot:FeatureValue
	- define the ot:feature of the value
	- assert the literal value

	- create an anonymous node for each data entry
	- assert that the data entry is a ot:dataEntry
	- link the compound to the data entry via ot:compound
	- link each feature value to the data entry via ot:values

	- create an anonymous node for the dataset
	- assert that the dataset is a dataset
	- set the identifier URI for the dataset (this has to be rewritten by
		the dataset service!)
	- insert all data entry nodes
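
To make the amount of bookkeeping concrete, here is a rough sketch of the
steps above in plain Python (not Redland), with triples kept as simple
tuples. The namespace URI for ot:, the use of dc:identifier, dc:title and
dc:source on compounds and features, and the names ot:value, ot:Dataset and
the dataset-to-entry link ot:dataEntry are assumptions made only for this
illustration; they are not taken from the examples.

```python
from itertools import count

# Assumed namespace URIs; the ot: one is a placeholder.
OT = "http://example.org/opentox#"
DC = "http://purl.org/dc/elements/1.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

_ids = count(1)


def bnode():
    """Return a fresh anonymous node label such as '_:b1'."""
    return f"_:b{next(_ids)}"


def build_dataset(dataset_uri, compound_uri, features):
    """Build the triples for a dataset holding one compound.

    features is a list of (feature_uri, title, source, value) tuples.
    """
    triples = []

    def add(s, p, o):
        triples.append((s, p, o))

    # compound node
    compound = bnode()
    add(compound, RDF_TYPE, OT + "compound")
    add(compound, DC + "identifier", compound_uri)

    # data entry node, linked to the compound
    entry = bnode()
    add(entry, RDF_TYPE, OT + "dataEntry")
    add(entry, OT + "compound", compound)

    for feature_uri, title, source, value in features:
        # feature node
        feature = bnode()
        add(feature, RDF_TYPE, OT + "feature")
        add(feature, DC + "identifier", feature_uri)
        add(feature, DC + "title", title)
        add(feature, DC + "source", source)  # assumed predicate

        # feature value node, attached to the data entry
        fv = bnode()
        add(fv, RDF_TYPE, OT + "FeatureValue")
        add(fv, OT + "feature", feature)
        add(fv, OT + "value", value)  # assumed predicate
        add(entry, OT + "values", fv)

    # dataset node
    dataset = bnode()
    add(dataset, RDF_TYPE, OT + "Dataset")  # assumed class name
    add(dataset, DC + "identifier", dataset_uri)
    add(dataset, OT + "dataEntry", entry)  # assumed predicate
    return triples


triples = build_dataset(
    "http://myservice/dataset/1",
    "http://myservice/compound/1",
    [
        ("http://myservice/feature/1", "MultiCellCall", "dsstox", True),
        ("http://myservice/feature/2", "MultiCellCallPredicted", "lazar", True),
    ],
)
print(len(triples), "triples for one compound with two feature values")
```

The concrete ids in the URIs are made up. Even in this toy form, two feature
values for a single compound need seven anonymous nodes.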

All in all, this is quite a lengthy and complicated procedure for a rather
simple task (I hope I have finally got the idea while writing this down).
I am proposing two things to reduce the complexity:

The most straightforward solution to handle sets of graphs (i.e.
multiple datasets) is to use named graphs, contexts, quadruples (you name
it, the concepts are more or less the same). Most RDF
libraries/datastores support this, but it is not straightforward to
express these concepts in RDF/XML. Instead of using a workaround that
complicates things, I would suggest letting the dataset service handle
datasets (see my previous post). The beneficial side effect is that
we can simplify the RDF model to a large extent. The first dataset
example can, e.g., be rewritten without any loss of information as

# multiple features/compound, simple features
	<http://myservice/compound/{id1}> dsstox:MultiCellCall "true"^^xsd:boolean .
	<http://myservice/compound/{id1}> lazar:MultiCellCallPredicted "true"^^xsd:boolean .

(assuming that dsstox:MultiCellCall and lazar:MultiCellCallPredicted
provide the feature definitions).

It can be retrieved by asking for GET /dataset/{id}. The corresponding
meta-information from GET /dataset/{id}/metadata would be

	dc:identifier "http://myservice/dataset/{id}"^^xsd:string ;
	dc:title "Multi Cell Call prediction from lazar"^^xsd:string .
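
As a minimal sketch of this idea (not an actual dataset service
implementation), the service could keep one graph per dataset URI, which is
equivalent to storing quadruples, and keep the metadata separately. The class
DatasetStore and its method names are hypothetical, and the concrete ids in
the URIs are made up.

```python
from collections import defaultdict


class DatasetStore:
    """Toy in-memory store: one graph (a set of triples) per dataset URI."""

    def __init__(self):
        self.graphs = defaultdict(set)     # dataset URI -> data triples
        self.metadata = defaultdict(dict)  # dataset URI -> {predicate: value}

    def add(self, dataset_uri, s, p, o):
        # Filing (s, p, o) under dataset_uri is the same information
        # as the quadruple (s, p, o, dataset_uri).
        self.graphs[dataset_uri].add((s, p, o))

    def get_dataset(self, dataset_uri):
        """What GET /dataset/{id} would return: only the data triples."""
        return sorted(self.graphs[dataset_uri])

    def get_metadata(self, dataset_uri):
        """What GET /dataset/{id}/metadata would return."""
        return dict(self.metadata[dataset_uri])


store = DatasetStore()
ds = "http://myservice/dataset/1"
cpd = "http://myservice/compound/1"

store.add(ds, cpd, "dsstox:MultiCellCall", True)
store.add(ds, cpd, "lazar:MultiCellCallPredicted", True)
store.metadata[ds]["dc:identifier"] = ds
store.metadata[ds]["dc:title"] = "Multi Cell Call prediction from lazar"

for triple in store.get_dataset(ds):
    print(triple)
print(store.get_metadata(ds))
```

The data triples themselves carry no dataset information at all; dataset
membership lives only in the key under which the service files them.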

The expression of more complex features is also straightforward:

# multiple features/compound, more complex features
	<http://myservice/compound/{id1}>
		fminer:BBRC [
			fminer:smarts "NN" ;
			fminer:p_value "0.97" ;
			fminer:effect "activating"
		] ;
		fminer:BBRC [
			fminer:smarts "CO" ;
			fminer:p_value "0.95" ;
			fminer:effect "deactivating"
		] .

# in explicit notation with anonymous nodes
	<http://myservice/compound/{id1}> fminer:BBRC _:feature1 .
	_:feature1 fminer:smarts "NN" .
	_:feature1 fminer:p_value "0.97" .
	_:feature1 fminer:effect "activating" .
	<http://myservice/compound/{id1}> fminer:BBRC _:feature2 .
	_:feature2 fminer:smarts "CO" .
	_:feature2 fminer:p_value "0.95" .
	_:feature2 fminer:effect "deactivating" .
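
The expansion from the nested notation to explicit triples is mechanical, so
it could be automated instead of being done by hand. The following is only a
sketch under my own assumptions: the helper flatten and the nested-dict input
format are hypothetical, and all plain values are treated as string literals.

```python
from itertools import count


def flatten(subject, properties, counter=None):
    """Expand nested property dicts into flat (s, p, o) triples.

    properties maps a predicate to a list of objects. An object that is a
    dict stands for an anonymous node (the [ ... ] notation); any other
    object is emitted as a quoted literal.
    """
    if counter is None:
        counter = count(1)
    triples = []
    for predicate, objects in properties.items():
        for obj in objects:
            if isinstance(obj, dict):
                node = f"_:feature{next(counter)}"
                triples.append((subject, predicate, node))
                triples.extend(flatten(node, obj, counter))
            else:
                triples.append((subject, predicate, f'"{obj}"'))
    return triples


compound = "<http://myservice/compound/{id1}>"
bbrc = {
    "fminer:BBRC": [
        {
            "fminer:smarts": ["NN"],
            "fminer:p_value": ["0.97"],
            "fminer:effect": ["activating"],
        },
        {
            "fminer:smarts": ["CO"],
            "fminer:p_value": ["0.95"],
            "fminer:effect": ["deactivating"],
        },
    ]
}

triples = flatten(compound, bbrc)
for s, p, o in triples:
    print(s, p, o, ".")
```

Each nested block becomes one anonymous node plus one triple per property,
so the two BBRC features above expand to eight triples.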

Best regards,
Christoph