[OTDev] Datasets with Features for multi entity relationships ?

Christoph Helma helma at in-silico.ch
Thu Nov 25 17:13:01 CET 2010


Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:

> > This type of representation (we are using it internally) has served well
> > for our datasets, which may also contain several (10-100) thousand
> > substructures for a few thousand compounds. I also do not think that
> > the representation is redundant:
> >        - each compound is represented once
> >        - each substructure is represented once
> >        - each association between compound and substructure is represented once
> > Please correct me, if I am missing something obvious.
> 
> According to this representation, each dataEntry for a compound will
> have to include all substructure features that were found in it.
> Therefore each dataEntry may have 1000-10000 feature/featureValue
> pairs. For 500 dataEntries that means on average
> 500*5000 (assuming 5000 substructures) = 2,500,000 feature/featureValue
> pairs - that's 2.5 million!

In our case it is a lot less (I am not completely sure about your feature
types), because only a very small subset of all features occurs in any
single compound. If, say, only 50 of the 5000 substructures match a given
compound, 500 compounds yield 25,000 associations, not 2.5 million.

> versus just having a featureset with
> 5000 feature entries. You can imagine the difference in cost of
> bandwidth, computation, etc.

I am not sure if I understand you correctly, but where do you want to store
the relationships between features and compounds? If there really are 2.5
million associations, you have to assert them somewhere. And having features
without compounds seems rather useless to me.
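
To illustrate (just a sketch with made-up identifiers, not actual OpenTox
API code): a dataset only has to assert the substructures that actually
occur in each compound, so the number of stored associations matches the
number of real occurrences.

# Sparse per-compound dataEntries (hypothetical compound/feature IDs).
dataset = {
    "compound1": {"mcss3": True, "mcss7": True},
    "compound2": {"mcss3": True},
    # ... one dataEntry per compound, only "true" occurrences asserted
}

# Asserted associations = sum of per-compound feature counts, which stays
# far below compounds * features when occurrences are sparse.
n_associations = sum(len(entry) for entry in dataset.values())  # 3 here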

> >
> > Adding "false" occurrences would not violate the current API (but would
> > add redundant information). Keep in mind that the dataset representation
> > is mainly for exchanging datasets between services - internally you can
> > use any data structure that is efficient for your purposes (we also do
> > that in our services). So if you need fingerprints internally, extract
> > them from the dataset.
> 
> Internalizing an intermediate step completely serves the purpose but
> leads to less flexible design paradigms. If we internalize the
> workflow from substructure extraction to fingerprinting - we will lose
> the ability to provide the data to a third party server for an
> independent workflow. Of course the reasoning could be "who needs
> it?" - well, you never know!

I am very interested in exchanging "fingerprints" with other services,
but that can already be done with the current API. I see fingerprints as
sets of features that are present in a compound (with set operations to
calculate similarities), and find it fairly straightforward to
parse/serialize them to/from datasets.
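
For example (a sketch only, with invented identifiers rather than an actual
client implementation), the per-compound feature sets extracted from the
dataEntries can be compared directly with set operations:

# Fingerprints as the sets of features present in each compound.
fingerprints = {
    "compound1": {"mcss1", "mcss3", "mcss7"},
    "compound2": {"mcss3", "mcss7"},
}

def tanimoto(a, b):
    # |intersection| / |union| of two feature sets
    union = a | b
    return len(a & b) / len(union) if union else 0.0

similarity = tanimoto(fingerprints["compound1"],
                      fingerprints["compound2"])  # 2/3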

> 
> >> I still suggest having a FeatureSet/SubstructureSet type object within
> >> the API to make it convenient to club features without compound
> >> representations.
> >
> > I prefer to keep the API as generic as possible and not to introduce
> > ad-hoc objects (or optimizations) for special purposes - otherwise it
> > will be difficult to maintain services in the long term. Why don't you
> > use ontologies for grouping features?
> 
> Grouping features using ontologies clubs the features, not the
> feature values.

But you cannot have feature values without relating features to
compounds. If you use the representation I proposed, feature values are
"true" anyway.

> So how do we know mcss3 occurring in compound X is with respect to
> which compound? As you said, we can have arbitrary fields in the feature
> definitions (for MCSS) - but that would be outside the API definitions.

features:
	mcss3:
		ot:compounds:
			- compound2
			- compound3
		ot:smarts: smarts3

In my understanding, you can add any annotation you want to a feature.
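
For instance, such an annotated feature could be read back along these lines
(a sketch only, assuming PyYAML and the made-up identifiers from the example
above; note that YAML itself wants spaces, not tabs, for indentation):

import yaml

doc = """
features:
  mcss3:
    ot:compounds:
      - compound2
      - compound3
    ot:smarts: smarts3
"""

feature = yaml.safe_load(doc)["features"]["mcss3"]
# which compounds the MCSS feature mcss3 was derived from
print(feature["ot:compounds"])  # ['compound2', 'compound3']
print(feature["ot:smarts"])     # smarts3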

Best regards,
Christoph


