[OTDev] Datasets with Features for multi entity relationships ?

Mon Nov 29 09:35:41 CET 2010

Dear Christoph, Surajit, All,

This discussion is very useful.

While trying to understand both points of view, I have implemented the
MCSS algorithm as an AMBIT service (thanks to the CDK SMSD package):

https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss

It can be applied to a dataset and generates a model whose predicted
features (MCSS substructures in this case) are available via
ot:predictedVariables
(example: https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted ).
The features use the current API without any change (although making
ot:Substructure a subclass of ot:Feature would make this clearer).
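
For anyone who wants to try it, here is a minimal Python sketch of the call. It assumes the usual OpenTox convention of POSTing a form-encoded `dataset_uri` parameter to the algorithm URI; the dataset URI is a made-up placeholder, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ALGORITHM_URI = "https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss"


def build_mcss_request(dataset_uri: str) -> Request:
    """Prepare (but do not send) a POST that applies MCSS to a dataset."""
    # Assumption: the service accepts a form-encoded `dataset_uri` parameter.
    body = urlencode({"dataset_uri": dataset_uri}).encode("ascii")
    return Request(
        ALGORITHM_URI,
        data=body,
        headers={"Accept": "text/uri-list"},
        method="POST",
    )


# Hypothetical dataset URI, for illustration only.
request = build_mcss_request("https://example.org/dataset/1")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Passing the request to `urllib.request.urlopen` would actually submit it. As far as I understand the API, the response should then be the URI of a task or of the resulting model.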

All the MCSS substructures can be used by any learning algorithm, as they
are standard ot:Features.

More details and a proposal are here (start from the *Substructure API
proposal* heading):

http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist

Best regards,
Nina

P.S. Please note that the /mcss algorithm might be slow for large datasets.
There are several performance improvements that we will be applying, but
these will not change the API.

On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch> wrote:

> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
>
> > > This type of representation (we are using it internally) has served well
> > > for our datasets, which might also contain several (10-100) thousand
> > > substructures for a few thousand compounds. I also do not think that
> > > the representation is redundant:
> > >        - each compound is represented once
> > >        - each substructure is represented once
> > >        - each association between compound and substructure is represented once
> > > Please correct me if I am missing something obvious.
> >
> > According to this representation each dataEntry for a compound will
> > have to have all substructure features that were found in it.
> > Therefore each dataEntry may have 1000-10000 feature/featureValue
> > pairs. For 500 dataEntries that means on average
> > 500*5000 (assuming 5000 substructures) = 2,500,000 feature/featureValue
> > pairs - that's 2.5 million!
>
> In our case it is a lot less (not completely sure about your feature
> types), because only a very small subset of features occurs in a single
> compound.
>
> > versus just having a featureset with
> > 5000 feature entries. You can imagine the difference in cost of
> > bandwidth, computation etc.
>
> I am not sure if I understand you correctly, but where do you want to
> store the relationships between features and compounds? If there are really
> 2.5 million associations you have to assert them somewhere. And having
> features without compounds seems to be quite useless to me.
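
To make the numbers in this exchange concrete, here is a small Python sketch comparing the two cases. The figure of 10 substructures per compound is an invented illustration of the sparse case, not a measurement from either service.

```python
from typing import Iterable


def dense_pairs(n_compounds: int, n_substructures: int) -> int:
    """Pairs needed if every compound lists a value for every substructure."""
    return n_compounds * n_substructures


def sparse_pairs(features_per_compound: Iterable[int]) -> int:
    """Pairs needed if only substructures present in a compound are listed."""
    return sum(features_per_compound)


n_compounds, n_substructures = 500, 5000
dense = dense_pairs(n_compounds, n_substructures)
# Assumption for illustration: each compound contains 10 substructures.
sparse = sparse_pairs([10] * n_compounds)
print(dense, sparse)
```

Either way the associations have to be stated somewhere; the sparse form only avoids writing out the absent ones.
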
>
> > >
> > > Adding "false" occurrences would not violate the current API (but would
> > > add redundant information). Keep in mind that the dataset representation
> > > is mainly for exchanging datasets between services - internally you can
> > > use any data structure that is efficient for your purposes (we also do
> > > that in our services). So if you need fingerprints internally, extract
> > > them from the dataset.
> >
> > Internalizing an intermediate step completely serves the purpose but
> > leads to less flexible design paradigms. If we internalize the
> > workflow from substructure extraction to fingerprinting - we will lose
> > the ability to provide the data to a third party server for an
> > independent workflow. Of course the reasoning could be "who needs
> > it?" - well, you never know!
>
> I am very interested in exchanging "fingerprints" with other services,
> but that can be done already with the current API. I see fingerprints as
> sets of features that are present in a compound (also using set
> operations to calculate similarities), and find it fairly
> straightforward to parse/serialize them to/from datasets.
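
As an illustration of this view, a fingerprint can be held as a plain set of feature identifiers and compared with set operations. The sketch below uses the Tanimoto (Jaccard) coefficient as one common choice; the feature names are made up, and nothing here is taken from an actual service implementation.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two feature sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical fingerprints: the substructure features present in each compound.
compound1 = {"mcss1", "mcss2", "mcss3"}
compound2 = {"mcss2", "mcss3", "mcss4"}
similarity = tanimoto(compound1, compound2)
print(similarity)
```

Only the features that are present need to be stored; absent features never appear in either set.
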
>
> >
> > >> I still suggest having a FeatureSet/SubstructureSet type object within
> > >> the API to make it convenient to club features without compound
> > >> representations.
> > >
> > > I prefer to keep the API as generic as possible and not to introduce
> > > ad-hoc objects (or optimizations) for special purposes - otherwise it
> > > will be difficult to maintain services in the long term. Why don't you
> > > use ontologies for grouping features?
> >
> > Grouping features using ontologies clubs the features, not the
> > feature values.
>
> But you cannot have feature values without relating features to
> compounds. If you use the representation I proposed, feature values are
> "true" anyway.
>
> > So how do we know which compound mcss3, occurring in compound X, is
> > with respect to? As you said, we can have arbitrary fields in the feature
> > definitions (for MCSS) - but that would be outside the API definitions.
>
> features:
>        mcss3:
>                ot:compounds:
>                        - compound2
>                        - compound3
>                ot:smarts: smarts3
>
> In my understanding you can add any annotation you want to a feature.
>
>
Yes, you can, but if this is not an agreed annotation, no other service
will understand it.
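
For what it is worth, once both sides agree on such an annotation, turning it into per-compound fingerprints is a simple inversion. The Python sketch below mirrors the example above. The compound-list annotation (spelled `ot:compounds` here) is only the proposed one, not part of the agreed API, and the extra `mcss1` entry is invented to make the example less trivial.

```python
from collections import defaultdict

# Features annotated with the compounds they occur in (hypothetical data).
features = {
    "mcss1": {"ot:compounds": ["compound1", "compound2"], "ot:smarts": "smarts1"},
    "mcss3": {"ot:compounds": ["compound2", "compound3"], "ot:smarts": "smarts3"},
}


def fingerprints(feature_table: dict) -> dict:
    """Invert feature -> compounds into compound -> set of features."""
    result = defaultdict(set)
    for feature, annotation in feature_table.items():
        # A feature without the annotation contributes to no compound.
        for compound in annotation.get("ot:compounds", []):
            result[compound].add(feature)
    return dict(result)


by_compound = fingerprints(features)
print(by_compound)
```

A service that does not know the annotation would see the features but none of these associations.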

Best regards,
Nina

>  Best regards,
> Christoph