[OTDev] Datasets with Features for multi entity relationships ?

Thu Nov 25 14:17:46 CET 2010

Surajit,

Excerpts from surajit ray's message of Wed Nov 24 05:11:57 +0100 2010:
> 
> For a large dataset, the number of substructures mined by a given
> algorithm may be large (in the range of thousands). Now according to this
> representation - a substructure which occurs in 80% of the compounds
> will have to be associated with 80% of the dataset - vastly increasing
> the size of the dataset representation. Iterating over all the
> substructures may yield a dataset of gigantic proportions.

This type of representation (which we use internally) has served us well
for our datasets, which can contain 10-100 thousand substructures for a
few thousand compounds. I also do not think that the representation is
redundant:
	- each compound is represented once
	- each substructure is represented once
	- each association between compound and substructure is represented once
Please correct me if I am missing something obvious.
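
To illustrate the point, here is a minimal Python sketch of such a sparse
representation. All names (compounds, features, data_entries, the two
helper functions) are made up for this example and are not part of the
OpenTox API. It only shows that storing positive associations once each
needs fewer entries than a full compound x substructure matrix:

```python
# Sparse dataset sketch. All names are hypothetical, not the OpenTox API.
compounds = ["compound1", "compound2", "compound3"]

# Each substructure is described exactly once.
features = {
    "substructure1": {"smarts": "smarts1"},
    "substructure2": {"smarts": "smarts2"},
}

# Only positive occurrences are stored, one entry per (compound, feature) pair.
data_entries = {
    "compound1": {"substructure1": True, "substructure2": True},
    "compound2": {"substructure1": True},
    "compound3": {},
}


def count_associations(entries):
    """Number of stored compound/feature associations."""
    return sum(len(values) for values in entries.values())


def dense_size(compound_list, feature_map):
    """Number of cells a full compound x feature matrix would need."""
    return len(compound_list) * len(feature_map)


print(count_associations(data_entries), "stored associations")
print(dense_size(compounds, features), "cells in a dense matrix")
```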

It can be problematic to serialize such datasets to OWL-DL (our
benchmarks showed that building the RDF graph is the main bottleneck),
but this is more an RDF/OWL problem than a problem with the basic dataset
structure. Skipping the RDF libraries (and thus the construction of RDF
graphs) and serializing directly to strings has (so far) led to
impressive performance gains.
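
A rough sketch of what "serializing directly to strings" could look like,
here in Python and writing N-Triples lines without building a graph
first. This is not our actual serializer: the namespace
http://example.org/ot# and the predicates compound, hasFeature and smarts
are placeholders, and the flat triples below are a simplification rather
than valid OpenTox OWL-DL.

```python
# Direct string serialization sketch (N-Triples), no RDF library involved.
# The namespace and predicate names are placeholders, not the OpenTox ontology.
OT = "http://example.org/ot#"


def escape_literal(text):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def triple(subject, predicate, obj, literal=False):
    """Format one N-Triples statement as a plain string."""
    if literal:
        formatted = '"%s"' % escape_literal(obj)
    else:
        formatted = "<%s>" % obj
    return "<%s> <%s> %s ." % (subject, predicate, formatted)


def serialize(dataset_uri, data_entries, features):
    """data_entries: compound URI -> list of feature URIs that occur in it.
    features: feature URI -> SMARTS string."""
    lines = []
    for compound_uri in sorted(data_entries):
        lines.append(triple(dataset_uri, OT + "compound", compound_uri))
        for feature_uri in sorted(data_entries[compound_uri]):
            lines.append(triple(compound_uri, OT + "hasFeature", feature_uri))
    for feature_uri in sorted(features):
        lines.append(triple(feature_uri, OT + "smarts",
                            features[feature_uri], literal=True))
    return "\n".join(lines) + "\n"
```

Each statement is produced by string formatting alone, so the cost grows
with the number of associations and no intermediate graph is kept in
memory.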

> For our use case we do not really need this as we are anyway
> fingerprinting each compound with the occurrence of the substructures
> mined. Furthermore the present representation cannot be called a
> fingerprint (of the compounds) with respect to the substructures as we
> would then have to fit in the "FALSE" occurrences as well (the
> features which do not occur would have to be mentioned with a value
> false). Therefore this representation is not serving the fingerprint
> functionality as well, without additional processing.

Adding "false" occurrences would not violate the current API (but would
add redundant information). Keep in mind that the dataset representation
is mainly for exchanging datasets between services - internally you can
use any data structure that is efficient for your purposes (we also do
that in our services). So if you need fingerprints internally, extract
them from the dataset.
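
A minimal Python sketch of such an extraction, assuming the dataset has
already been parsed into plain dictionaries (the function name and the
argument layout are made up for this example). Absent features are filled
in as False on the client side, so the exchanged dataset does not need to
carry them:

```python
def fingerprints(compounds, feature_names, data_entries):
    """Build one dense boolean vector per compound from sparse data entries.

    compounds: list of compound identifiers
    feature_names: iterable of feature identifiers
    data_entries: compound -> {feature: True} for features that occur
    Returns the feature order used and a compound -> list of bools mapping.
    """
    ordered = sorted(feature_names)  # fixed bit order for all compounds
    result = {}
    for compound in compounds:
        present = data_entries.get(compound, {})
        # Features missing from the data entry become explicit False bits.
        result[compound] = [bool(present.get(name, False)) for name in ordered]
    return ordered, result


order, bits = fingerprints(
    ["compound1", "compound2"],
    ["substructure1", "substructure2"],
    {"compound1": {"substructure1": True}},
)
print(order)
print(bits)
```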

> I still suggest having a FeatureSet/SubstructureSet type object within
> the API to make it convenient to club features without compound
> representations.

I prefer to keep the API as generic as possible and not to introduce
ad-hoc objects (or optimizations) for special purposes - otherwise it
will be difficult to maintain services in the long term. Why don't you
use ontologies for grouping features?
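
One possible reading of this suggestion, sketched in Python: every
feature carries one or more ontology class annotations, and a "feature
set" is simply the result of selecting by class. The class names
ex:Substructure, ex:MCSS and ex:MeasuredProperty and the "types" key are
invented for this example and do not come from an existing ontology.

```python
# Grouping features by ontology class instead of a dedicated FeatureSet object.
# Class names and the "types" key are hypothetical.
features = {
    "feature1": {"smarts": "smarts1", "types": ["ex:Substructure"]},
    "feature2": {"smarts": "smarts2", "types": ["ex:Substructure", "ex:MCSS"]},
    "feature3": {"types": ["ex:MeasuredProperty"]},
}


def features_of_type(feature_map, type_name):
    """Return the names of all features annotated with the given class."""
    return sorted(
        name
        for name, metadata in feature_map.items()
        if type_name in metadata.get("types", [])
    )


print(features_of_type(features, "ex:Substructure"))
```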

> >> Also I have a question about mutually common relationships like MCSS.
> >> MCSS is common to both compounds (being compared). So in your
> >> representation would it be necessary to represent the relationship
> >> twice ? That is once for each compound - or can it be represented just
> >> once and be associated with both compounds ?

You can of course put arbitrary data into the features representation, like:

mcss_feature:
	ot:compounds:
		- compound1
		- compound2
	ot:smarts: c1cccc1(CC)
	ot:hasSource: your_mcss_service_uri

But as a client I would expect to find the association between compounds
and features in the data_entries.

> Does this imply that the dataset will be locked? Without locking the
> dataset onto the two compounds (whose MCSS is being represented) -
> this representation will not work as it is not showing the three way
> relationship. MCSS can have a value of a smarts string and "occur" in
> a compound. But MCSS has to have a third entry - which is the second
> compound being compared to. The above representation can "imply" this
> relationship if the Dataset is locked on the two compounds. Which
> essentially brings us back to the original premise of assigning such
> "relationship" features to locked datasets.

What do you mean by locked? You can of course represent multiple MCSSs in a single dataset:

compounds:
	- compound1
	- compound2
	- compound3

data_entries:
	- compound1:
		mcss1: true
		mcss2: true
	- compound2:
		mcss1: true
		mcss3: true
	- compound3:
		mcss2: true
		mcss3: true

features:
	mcss1:
		ot:smarts: smarts1
	mcss2:
		ot:smarts: smarts2
	mcss3:
		ot:smarts: smarts3
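
For illustration, a short Python sketch (the function name is made up)
that inverts the data_entries of the example above. It shows that the
compounds sharing each MCSS can be recovered from the data entries alone,
without restricting the dataset to two compounds:

```python
# Same data_entries as in the example above, as plain Python dictionaries.
data_entries = {
    "compound1": {"mcss1": True, "mcss2": True},
    "compound2": {"mcss1": True, "mcss3": True},
    "compound3": {"mcss2": True, "mcss3": True},
}


def compounds_per_feature(entries):
    """Invert compound -> features into feature -> sorted list of compounds."""
    result = {}
    for compound, values in entries.items():
        for feature, value in values.items():
            if value:  # only positive occurrences count
                result.setdefault(feature, []).append(compound)
    return {feature: sorted(members) for feature, members in result.items()}


for feature, members in sorted(compounds_per_feature(data_entries).items()):
    print(feature, members)
```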

Best regards,
Christoph