[OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms

Nina Jeliazkova jeliazkova.nina at gmail.com
Tue Nov 30 10:22:53 CET 2010


Hi Surajit,

On 30 November 2010 11:00, surajit ray <mr.surajit.ray at gmail.com> wrote:

> Hi,
>
>
> On 30 November 2010 14:15, Nina Jeliazkova <jeliazkova.nina at gmail.com>
> wrote:
> > Hi,
> >
> >
> > On 30 November 2010 10:29, surajit ray <mr.surajit.ray at gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Another method that does not break the API and still captures feature
> >> sets is to create a dataset with one compound (maybe a C or CC) and
> >> assign all the substructure features to it (with values of true or
> >> false). In the dc:source of the dataset we can mention the dataset from
> >> which it was derived, and in the description we can describe it as a
> >> dataset storing MCSS features from that dataset (or whatever the
> >> relationship with the mother dataset is).
> >>
> >> I think this would be a simpler method than creating a new Model just
> >> for storing substructures.
> >
> >
> > It might seem simpler, but it is definitely less consistent, as it
> > implies a different meaning of datasets, properties and their
> > relationships. There will be no explicit relationship to the
> > algorithm/model doing the processing, which makes MCSS a special case
> > and breaks the OpenTox API, where algorithms and models are the
> > procedures that process data, and this is explicitly stored in the
> > generated data objects.
>
> The relationship is defined in the dc:description of the "featureset". It
> is explicit. Secondly, a reference to the algorithm which generated
> this can also be stored in the description.
>


dc:description is an annotation property and does not define any
relationship between classes.

Besides, this breaks the OpenTox API, as it differs from the way other
relationships are defined.


>
> >
> > With the current scheme, it is easy to handle algorithms like Kabsch
> > alignment for a dataset with the same generic mechanism as for MCSS (I
> > am sure there will be more cases like this). I don't see the point of
> > inventing a specific solution for a single case, while it could be
> > handled in a generic way (I agree with Christoph's earlier comment on
> > that).
>
> This is not a specific solution but a very general one - one which
> addresses a basic need within any chemistry API - which is to
> represent sets of features independently of compounds.
>

We have features (ot:Feature) independent of compounds (ot:Compound). What
makes most sense in modelling is to have a relationship between features
and compounds (the values).

What you are implying is that substructures are both features and compounds
- which they are not, and mixing them leads to errors and confusion.

If you have the substructure "C" and use it for SMARTS searching, it will
look for a single carbon atom. If you have a compound defined by the SMILES
"C", it implies CH4, which is different. Mixing both is not a good idea;
making the difference explicit makes it harder to misinterpret things.


>
> > Besides, the model is definitely not just for storing substructures; it
> > can and will be used for predictions on new compounds (whether they have
> > those substructures) in a uniform way (POST a new compound to the MCSS
> > model and you'll learn whether its MCSS substructures are among the
> > existing ones, or whether it is different and far away from that
> > dataset).
>
> What if I have a better graph comparator algorithm for fingerprinting
> - will that take a model as input just to extract features?
>

No, define /algorithm/myfingerprint, which takes feature_uris[] as an input
parameter:

curl -X POST /algorithm/myfingerprint -d \
  "feature_uris[]=/model/mcss1/predicted"

This will work with any set of features.
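
For example (the feature IDs below are just placeholders), the same call
works with feature URIs collected from any source:

curl -X POST /algorithm/myfingerprint -d \
  "feature_uris[]=/feature/123" -d "feature_uris[]=/feature/456"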


Even better, if you would like to convert features to a dataset, define a
converter algorithm which takes features, verifies whether they are
substructures, and generates a dataset:

curl -X POST /algorithm/features2dataset -d \
  "feature_uris[]=/model/mcss1/predicted"
  -> returns /dataset/newdatasetfromfeatures

Then you are done; POST the dataset to other algorithms as usual.
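
A minimal sketch of that last step (the learning algorithm name is just a
placeholder; dataset_uri is the usual OpenTox input parameter):

curl -X POST /algorithm/someLearningAlgorithm -d \
  "dataset_uri=/dataset/newdatasetfromfeatures"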


>
> > The problem with the model approach is that
> >> 1) The substructures cannot be easily downloaded without accessing the
> >> model
> >>
> >
> > They can - /model/id/predicted  gives you the list of features (see my
> > examples)
>
> Well of course a whole model infrastructure may provide a way to
> extract the predicted feature set. But that would imply giving the
> model as input to a third-party fingerprinter.
>
>
Not necessarily, see above


>
> > And also - this is exactly the advantage - you don't have just a set of
> > substructures you don't know where they are coming from; everything is
> > explicitly defined - the substructures are the result of applying a
> > given algorithm to a given dataset.
>
> We know that from the dc:source and dc:description
>

No, we don't. These are annotation properties, not object properties.  They
might provide hints for human readers, while the whole framework strives to
provide explicit relationships for automatic processing.


> >
> >
> >> 2) The set of substructures cannot be given to a better fingerprinter
> >> (maybe with a faster graph comparator)
> >>
> >>
> > Of course they can - once we have a SMARTS representation of the
> > ot:Substructure, what is the obstacle to feeding them into any other
> > algorithm?
>
> Again the question - are we going to use a model as an input to
> another algorithm to extract features?
>

No, see above - features are already available as /model/id/predicted -
this is just a set of features.
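
For example, for the model mentioned in the quoted message below (assuming
RDF/XML, the usual OpenTox representation):

curl -H "Accept: application/rdf+xml" \
  https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted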


> >
> >
> >
> >> The fingerprinter in such a case becomes a separate algorithm which
> >> can take a dataset as input as well as a "featureset" - which is
> >> actually a dummy dataset with the full list of features.
> >>
> >
> > A fingerprinter should indeed be an algorithm - this is how the OpenTox
> > API is designed. Any processing should be instantiated as an algorithm.
>
> In your case the fingerprinter is a model ....
>

No, the fingerprinter itself (/algorithm/mcss) is an ot:Algorithm. Only
after it is applied to a specific dataset does it become a model of exactly
that dataset.
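
For illustration, a minimal sketch of applying it (the dataset URI is a
placeholder; dataset_uri is the usual OpenTox input parameter):

curl -X POST https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss -d \
  "dataset_uri=https://ambit.uni-plovdiv.bg:8443/ambit2/dataset/XYZ"
  -> returns the URI of the generated model, e.g. /model/26469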

Well, my point of view is that an algorithm applied to specific data with
specific parameters should be considered a model (descriptor calculations
included). An algorithm is just an abstract sequence of steps; when one
applies it to data with specific parameters, a model is generated. This
would make the API much more consistent (currently some algorithms generate
a model, while the results of other algorithms are datasets, which is quite
confusing for external developers). But at this point I am not insisting on
changing the API that far ;)

Regards,
Nina




>
>
> Regards
> Surajit
>
> > Regards,
> > Nina
> >
> >
> >>
> >> Cheers
> >> Surajit
> >>
> >>
> >> On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com>
> >> wrote:
> >> > Dear Christoph, Surajit, All,
> >> >
> >> > This discussion is very useful.
> >> >
> >> > As a result of trying to understand both points of view, we now
> >> > have an MCSS algorithm as an AMBIT service (thanks to the CDK SMSD
> >> > package).
> >> >
> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
> >> >
> >> > It can be applied to a dataset and generates a model, where predicted
> >> > features (MCSS in this case) are available via ot:predictedVariables
> >> > (example:
> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted).
> >> > The features use the current API without any change (although having
> >> > ot:Substructure as a subclass of ot:Feature would make it clearer).
> >> >
> >> > All the MCSS substructures can be used by any learning algorithm, as
> >> > they are standard ot:Features.
> >> >
> >> > Here are more details and a proposal (start from the *Substructure
> >> > API proposal* heading):
> >> >
> >> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
> >> >
> >> > Best regards,
> >> > Nina
> >> >
> >> > P.S. Please note the /mcss algorithm might be slow for large
> >> > datasets; there are several performance improvements that we'll be
> >> > applying, but this will not change the API.
> >> >
> >> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch>
> wrote:
> >> >
> >> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100
> 2010:
> >> >>
> >> >> > > This type of representation (we are using it internally) has
> >> >> > > served well for our datasets, which might also contain several
> >> >> > > (10-100) thousand substructures for a few thousand compounds. I
> >> >> > > also do not think that the representation is redundant:
> >> >> > >        - each compound is represented once
> >> >> > >        - each substructure is represented once
> >> >> > >        - each association between compound and substructure is
> >> >> > >          represented once
> >> >> > > Please correct me if I am missing something obvious.
> >> >> >
> >> >> > According to this representation each dataEntry for a compound will
> >> >> > have to have all substructure features that were found in it.
> >> >> > Therefore each dataEntry may have 1000-10000 feature/featureValue
> >> >> > pairs. For 500 data entries that means on average
> >> >> > 500*5000 (assuming 5000 substructures) = 2,500,000
> >> >> > feature/featureValue pairs - that's 2.5 million!
> >> >>
> >> >> In our case it is a lot less (I am not completely sure about your
> >> >> feature types), because only a very small subset of features occurs
> >> >> in a single compound.
> >> >>
> >> >> > versus just having a featureset with
> >> >> > 5000 feature entries. You can imagine the difference in cost of
> >> >> > bandwidth, computation etc.
> >> >>
> >> >> I am not sure if I get you right, but where do you want to store the
> >> >> relationships between features and compounds? If there are really 2.5
> >> >> million associations you have to assert them somewhere. And having
> >> >> features without compounds seems quite useless to me.
> >> >>
> >> >> > >
> >> >> > > Adding "false" occurrences would not violate the current API (but
> >> >> > > would add redundant information). Keep in mind that the dataset
> >> >> > > representation is mainly for exchanging datasets between services
> >> >> > > - internally you can use any data structure that is efficient for
> >> >> > > your purposes (we also do that in our services). So if you need
> >> >> > > fingerprints internally, extract them from the dataset.
> >> >> >
> >> >> > Internalizing an intermediate step completely serves the purpose
> >> >> > but leads to less flexible design paradigms. If we internalize the
> >> >> > workflow from substructure extraction to fingerprinting, we will
> >> >> > lose the ability to provide the data to a third-party server for an
> >> >> > independent workflow. Of course the reasoning could be "who needs
> >> >> > it?" - well, you never know!
> >> >>
> >> >> I am very interested in exchanging "fingerprints" with other
> >> >> services, but that can already be done with the current API. I see
> >> >> fingerprints as sets of features that are present in a compound (also
> >> >> using set operations to calculate similarities), and find it fairly
> >> >> straightforward to parse/serialize them to/from datasets.
> >> >>
> >> >> >
> >> >> > >> I still suggest having a FeatureSet/SubstructureSet type object
> >> >> > >> within the API to make it convenient to group features without
> >> >> > >> compound representations.
> >> >> > >
> >> >> > > I prefer to keep the API as generic as possible and not to
> >> >> > > introduce ad-hoc objects (or optimizations) for special purposes
> >> >> > > - otherwise it will be difficult to maintain services in the long
> >> >> > > term. Why don't you use ontologies for grouping features?
> >> >> >
> >> >> > Grouping features using ontologies groups the features, not the
> >> >> > feature values.
> >> >>
> >> >> But you cannot have feature values without relating features to
> >> >> compounds. If you use the representation I proposed, feature values
> >> >> are "true" anyway.
> >> >>
> >> >> > So how do we know which compound the mcss3 occurring in compound X
> >> >> > is with respect to? As you said, we can have arbitrary fields in
> >> >> > the feature definitions (for MCSS) - but that would be outside the
> >> >> > API definitions.
> >> >>
> >> >> features:
> >> >>        mcss3:
> >> >>                ot:compounds:
> >> >>                        - compound2
> >> >>                        - compound3
> >> >>                ot:smarts: smarts3
> >> >>
> >> >> In my understanding you can add any annotation you want to a feature.
> >> >>
> >> >>
> >> > Yes, you can, but if this is not an agreed annotation, no other
> >> > service will understand it.
> >> >
> >> > Best regards,
> >> > Nina
> >> >
> >> >
> >> >>  Best regards,
> >> >> Christoph
>
>
>
> --
> Surajit Ray
> Partner
> www.rareindianart.com
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>


