[OTDev] Datasets with Features for multi entity relationships ?

Tue Nov 30 10:00:36 CET 2010

Hi,

On 30 November 2010 14:15, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> Hi,
>
>
> On 30 November 2010 10:29, surajit ray <mr.surajit.ray at gmail.com> wrote:
>
>> Hi,
>>
>> Another method that does not break the API as well as captures feature
>> sets is to create a dataset with one compound (maybe a C or CC) and
>> assign all the substructure features to it (with value as false or
>> true). In the dc:source of the dataset we can mention the dataset from
>> which it was derived. And in the description we can describe it as a
>> dataset to store MCSS features from dataset (or whatever the
>> relationship with the mother dataset).
>>
>> I think this would be a simpler method than creating a new Model just
>> for storing substructures.
>
>
> It might seem simpler, but it is definitely less consistent, as it implies
> a different meaning for datasets, properties and their relationships. There
> will be no explicit relationship to the algorithm/model doing the
> processing, which makes MCSS a special case and breaks the OpenTox API, where
> algorithms and models are the procedures that process data, and this is
> explicitly stored in the generated data objects.

The relationship is defined in the dc:description of the "featureset", so
it is explicit. A reference to the algorithm that generated it can also
be stored in the description.

>
> With the current scheme, it is easy to handle algorithms like Kabsch
> alignment for a dataset with the same generic mechanism as for MCSS (I am
> sure there will be more cases like this). I don't see the point of inventing
> a specific solution for a single case when it could be handled in a generic
> way (I agree with the earlier comment by Christoph on that).

This is not a specific solution but a very general one - one which
addresses a basic need within any chemistry API, which is to
represent sets of features independently of compounds.
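
A minimal sketch of the "featureset" idea described above, written in Python. Only the idea itself comes from this thread (one placeholder compound carrying all features, with dc:source and dc:description recording the origin). The dictionary layout, the `make_featureset` and `features_of` helpers, and the example.org URIs are assumptions made for illustration and are not part of the OpenTox API.

```python
from typing import Dict, List


def make_featureset(source_dataset: str, feature_uris: List[str],
                    description: str, placeholder: str = "C") -> Dict:
    """Build a dummy dataset: one placeholder compound carrying every feature."""
    return {
        "dc:source": source_dataset,        # dataset the features were derived from
        "dc:description": description,      # relationship to the mother dataset
        "dataEntries": [
            {
                "compound": placeholder,    # e.g. "C" or "CC", as suggested in the thread
                # All features hang off the single placeholder compound;
                # only their presence matters, not the boolean value.
                "values": {uri: True for uri in feature_uris},
            }
        ],
    }


def features_of(featureset: Dict) -> List[str]:
    """Recover the plain (sorted) list of features from the dummy dataset."""
    return sorted(featureset["dataEntries"][0]["values"])


fs = make_featureset(
    "http://example.org/dataset/1",  # hypothetical URI
    ["http://example.org/feature/mcss2", "http://example.org/feature/mcss1"],
    "MCSS features derived from dataset 1",
)
print(features_of(fs))
```

A fingerprinter could then accept such a featureset alongside an ordinary dataset, without needing a model as input.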

> Besides, the model is definitely not just for storing substructures; it can
> and will be used for predictions on new compounds (if they have those
> substructures) in a uniform way (POST a new compound to the MCSS model
> and you'll learn whether its MCSS substructures are among the existing ones,
> or whether it is different and far away from that dataset).

What if I have a better graph comparator algorithm for fingerprinting
- will it have to take a model as input just to extract the features?

> The problem with the model approach is that
>> 1) The substructures cannot be easily downloaded without accessing the
>> model
>>
>
> They can - /model/id/predicted gives you the list of features (see my
> examples)

Well, of course a whole model infrastructure may provide a way to
extract the predicted feature set. But that would imply giving the
model as an input to a third-party fingerprinter.

> And also - this is exactly the advantage - you don't just have a set of
> substructures whose origin you don't know; everything is
> explicitly defined - the substructures are the result of applying a given
> algorithm to a given dataset.

We know that from the dc:source and dc:description.
>
>
>> 2) The set of substructures cannot be given to a better finger printer
>> (maybe with a faster graph comparator)
>>
>>
> Of course they can - once we have smarts representation of the
> ot:Substructure - what is the obstacle of feeding them into any other
> algorithm ?

Again the question - are we going to use a model as an input to
another algorithm to extract features ?
>
>
>
>> The fingerprinter in such a case becomes a separate algorithm which
>> can take a dataset as input as well as a "featureset" - which is
>> actually a dummy dataset with the full list of features.
>>
>
> A fingerprinter should be indeed an algorithm - this is how OpenTox API is
> designed.   Any processing should be instantiated as an algorithm.

In your case the fingerprinter is a model ....

Regards
Surajit

> Regards,
> Nina
>
>
>>
>> Cheers
>> Surajit
>>
>>
>> On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com>
>> wrote:
>> > Dear Christoph, Surajit, All,
>> >
>> > This discussion is very useful.
>> >
>> > As a result of trying to understand both points of view, we now have
>> > an MCSS algorithm as an AMBIT service (thanks to the CDK SMSD package).
>> >
>> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
>> >
>> > It can be applied to a dataset and generates a model, where predicted
>> > features (MCSS in this case) are available via ot:predictedVariables
>> > (example  https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted)
>> > The features use current API, without any change (although having
>> > ot:Substructure subclass of ot:Feature will make it more clear).
>> >
>> > All the MCSS substructures can be used by any learning algorithm, as
>> > they are standard ot:Features.
>> >
>> > Here are more details and a proposal (start from the *Substructure API
>> > proposal* heading):
>> >
>> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
>> >
>> > Best regards,
>> > Nina
>> >
>> > P.S. Please note the /mcss algorithm might be slow for large datasets.
>> > There are several performance improvements that we'll be applying, but
>> > this will not change the API.
>> >
>> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch> wrote:
>> >
>> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
>> >>
>> >> > > This type of representation (we are using it internally) has served
>> >> > > well for our datasets, which might also contain several (10-100)
>> >> > > thousand substructures for a few thousand compounds. I also do not
>> >> > > think that the representation is redundant:
>> >> > >        - each compound is represented once
>> >> > >        - each substructure is represented once
>> >> > >        - each association between compound and substructure is
>> >> > >          represented once
>> >> > > Please correct me if I am missing something obvious.
>> >> >
>> >> > According to this representation each dataEntry for a compound will
>> >> > have to have all the substructure features that were found in it.
>> >> > Therefore each dataEntry may have 1000-10000 feature/featureValue
>> >> > pairs. For 500 dataEntries that means, on average,
>> >> > 500*5000 (assuming 5000 substructures) = 2,500,000 feature/featureValue
>> >> > pairs - that's 2.5 million!
>> >>
>> >> In our case it is a lot less (not completely sure about your feature
>> >> types), because only a very small subset of features occurs in a single
>> >> compound.
>> >>
>> >> > versus just having a featureset with
>> >> > 5000 feature entries. You can imagine the difference in cost of
>> >> > bandwidth, computation etc.
>> >>
>> >> I am not sure if I get you right, but where do you want to store the
>> >> relationships between features and compounds? If there are really 2.5
>> >> million associations you have to assert them somewhere. And having
>> >> features without compounds seems to be quite useless to me.
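
The two size estimates in this exchange can be compared with a short calculation, sketched here in Python. The figures 500 and 5000 come from the thread. The average of 20 features per compound is purely an assumed value to illustrate the sparse case ("only a very small subset of features occurs in a single compound"), and both function names are hypothetical.

```python
def dense_pairs(n_compounds: int, n_features: int) -> int:
    """Every compound carries a value (true or false) for every feature."""
    return n_compounds * n_features


def sparse_pairs(n_compounds: int, avg_features_per_compound: int) -> int:
    """Only the features actually present in a compound are asserted."""
    return n_compounds * avg_features_per_compound


print(dense_pairs(500, 5000))  # 2500000, the 2.5 million quoted in the thread
print(sparse_pairs(500, 20))   # 10000 (20 is an assumed average, not from the thread)
```

A bare featureset, by contrast, holds only the 5000 feature entries, but it carries no compound/feature associations at all.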
>> >>
>> >> > >
>> >> > > Adding "false" occurrences would not violate the current API (but
>> >> > > would add redundant information). Keep in mind that the dataset
>> >> > > representation is mainly for exchanging datasets between services -
>> >> > > internally you can use any data structure that is efficient for your
>> >> > > purposes (we also do that in our services). So if you need
>> >> > > fingerprints internally, extract them from the dataset.
>> >> >
>> >> > Internalizing an intermediate step completely serves the purpose but
>> >> > leads to less flexible design paradigms. If we internalize the
>> >> > workflow from substructure extraction to fingerprinting - we will lose
>> >> > the ability to provide the data to a third party server for an
>> >> > independent workflow. Of course the reasoning could be "who needs it
>> >> > ?" - well you never know !!
>> >>
>> >> I am very interested in exchanging "fingerprints" with other services,
>> >> but that can be done already with the current API. I see fingerprints as
>> >> sets of features that are present in a compound (also using set
>> >> operations to calculate similarities), and find it fairly
>> >> straightforward to parse/serialize them to/from datasets.
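
A minimal sketch, in Python, of treating fingerprints as sets of features and comparing them with set operations. The (compound, feature) pair layout and the helper names are assumptions for illustration. The Tanimoto (Jaccard) coefficient is one common choice of set-based similarity and is not named in the thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def fingerprints(associations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (compound, feature) associations into one feature set per compound."""
    fps: Dict[str, Set[str]] = defaultdict(set)
    for compound, feature in associations:
        fps[compound].add(feature)
    return dict(fps)


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Shared features divided by all features present in either set."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Each association is asserted once, as in the dataset representation.
associations = [
    ("compound1", "mcss1"), ("compound1", "mcss2"),
    ("compound2", "mcss2"), ("compound2", "mcss3"),
]
fps = fingerprints(associations)

# One shared feature (mcss2) out of three distinct features.
print(tanimoto(fps["compound1"], fps["compound2"]))
```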
>> >>
>> >> >
>> >> > >> I still suggest having a FeatureSet/SubstructureSet type object
>> >> > >> within the API to make it convenient to club features without
>> >> > >> compound representations.
>> >> > >
>> >> > > I prefer to keep the API as generic as possible and not to introduce
>> >> > > ad-hoc objects (or optimizations) for special purposes - otherwise
>> >> > > it will be difficult to maintain services in the long term. Why
>> >> > > don't you use ontologies for grouping features?
>> >> >
>> >> > Grouping features using ontologies is clubbing the features, not the
>> >> > feature values.
>> >>
>> >> But you cannot have feature values without relating features to
>> >> compounds. If you use the representation I proposed feature values are
>> >> "true" anyway.
>> >>
>> >> > So how do we know with respect to which compound mcss3, occurring in
>> >> > compound X, was found? As you said, we can have arbitrary fields in the
>> >> > feature definitions (for MCSS) - but that would be outside the API
>> >> > definitions.
>> >>
>> >> features:
>> >>        mcss3:
>> >>                ot:compounds:
>> >>                        - compound2
>> >>                        - compound3
>> >>                ot:smarts: smarts3
>> >>
>> >> In my understanding you can add any annotation you want to a feature.
>> >>
>> >>
>> > Yes, you can, but if this is not an agreed annotation,  no other service
>> > will understand it.
>> >
>> > Best regards,
>> > Nina
>> >
>> >
>> >>  Best regards,
>> >> Christoph
>> >> _______________________________________________
>> >> Development mailing list
>> >> Development at opentox.org
>> >> http://www.opentox.org/mailman/listinfo/development

-- 
Surajit Ray
Partner
www.rareindianart.com