[OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms

surajit ray mr.surajit.ray at gmail.com
Tue Nov 30 13:20:34 CET 2010


Hi Nina,

On 30 November 2010 14:52, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> Hi Surajit,
>
> On 30 November 2010 11:00, surajit ray <mr.surajit.ray at gmail.com> wrote:
>
>> Hi,
>>
>>
>> On 30 November 2010 14:15, Nina Jeliazkova <jeliazkova.nina at gmail.com>
>> wrote:
>> > Hi,
>> >
>> >
>> > On 30 November 2010 10:29, surajit ray <mr.surajit.ray at gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> Another method that does not break the API as well as captures feature
>> >> sets is to create a dataset with one compound (maybe a C or CC) and
>> >> assign all the substructure features to it (with value as false or
>> >> true). In the dc:source of the dataset we can mention the dataset from
>> >> which it was derived. And in the description we can describe it as a
>> >> dataset to store MCSS features from dataset (or whatever the
>> >> relationship with the mother dataset).
>> >>
>> >> I think this would be a simpler method than creating a new Model just
>> >> for storing substructures.
>> >
>> >
>> > It might seem simpler, but is definitely less consistent, as it implies
>> > different meaning of dataset and properties and their relationships.
>>  There
>> > will be no explicit relationship to the algorithm/model , doing the
>> > processing, which makes MCSS a specific case and breaks OpenTox API ,
>> where
>> > algorithms and models are the procedures, that process data, and this is
>> > explicitly stored in the generated data objects.
>>
>> The relationship is defined in dc:description of the "featureset". It
>> is explicit. Secondly a reference to the algorithm which generated
>> this can also be stored in the description.
>>
>
>
> dc:description is an annotation property and does not define any
> relationship between classes.

Agreed

> Besides, this breaks OpenTox API, as it differs from the way other
> relationships are defined.

We are also discussing enhancing the API. The enhancement we were
seeking was about making featuresets assignable to datasets. In the
absence of that, a dummy dataset with a single compound (not a
fragment), with the substructures assigned as features, can be
considered a feature set.
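To make the workaround concrete, here is a minimal sketch in plain Python of what such a dummy "featureset" dataset could look like. The key names loosely mirror OpenTox dataset/feature terminology, but none of these structures or URIs are part of the official API; they are illustrative assumptions only.

```python
# Illustrative sketch only: a plain-Python stand-in for the proposed
# "featureset as dummy dataset" workaround. Keys loosely mirror
# OpenTox naming (dc:source, dataEntry); the URI is hypothetical.

def make_featureset_dataset(source_dataset_uri, smarts_features):
    """Build a dummy dataset holding one placeholder compound whose
    feature values carry the substructure (SMARTS) feature set."""
    return {
        "dc:source": source_dataset_uri,   # the "mother" dataset
        "dc:description": "MCSS feature set derived from " + source_dataset_uri,
        "dataEntry": [{
            "compound": "C",               # single placeholder compound
            "values": {smarts: True for smarts in smarts_features},
        }],
    }

featureset = make_featureset_dataset(
    "/dataset/42",                         # hypothetical dataset URI
    ["c1ccccc1", "[OX2H]", "C(=O)N"],
)
```

The point of the sketch is that the substructures travel as ordinary feature/value pairs on a single dummy compound, so no new object type is required.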

>>
>> >
>> > With the current scheme, it is easy to handle algorithms like Kabsh
>> > alignment for a dataset with the same  generic mechanism as for MCSS (I
>> am
>> > sure there will be more cases like this). I don't see the point of
>> inventing
>> > specific solution for a single case , while it could be handled in a
>> generic
>> > way (agree with earlier comment by Christoph on that ).
>>
>> This is not a specific solution but a very general one - one which
>> addresses a basic need within any chemistry api - which is to
>> represent sets of features independently of compounds.
>>
>
> We have features(ot:Feature)  independent of compounds (ot:Compound) . What
> makes most sense in modeling , is to have relationship between features and
> compounds (the values).

Having sets of features also makes perfect sense. Numerous
fingerprinting libraries carry sets of SMARTS for features without
them being assigned to any compound. The CDK fingerprinter also uses
such sets.
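A small sketch of that idea: the feature set is just a standalone list of SMARTS patterns, and a fingerprint is a bit vector over it. The compounds, SMARTS patterns, and precomputed match table below are all hypothetical; a real fingerprinter (e.g. the CDK one) would run actual substructure searches instead of the stubbed lookup.

```python
# Illustrative sketch: a feature set is a collection of SMARTS
# patterns, independent of any compound. Substructure matching is
# stubbed out with a hypothetical precomputed table.

SMARTS_FEATURES = ["c1ccccc1", "[OX2H]", "C(=O)N"]  # standalone feature set

# Hypothetical precomputed substructure matches per compound.
MATCHES = {
    "phenol":    {"c1ccccc1", "[OX2H]"},
    "benzamide": {"c1ccccc1", "C(=O)N"},
}

def fingerprint(compound):
    """Bit list: 1 where the compound contains the i-th SMARTS feature."""
    return [1 if s in MATCHES[compound] else 0 for s in SMARTS_FEATURES]

def tanimoto(fp1, fp2):
    """Set-style similarity over two bit lists."""
    both = sum(a & b for a, b in zip(fp1, fp2))
    either = sum(a | b for a, b in zip(fp1, fp2))
    return both / either if either else 0.0
```

Nothing in the feature set itself refers to a compound; the association only appears when a fingerprint is computed, which is exactly the separation being argued for.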

> What you are implying is that substructures are both features and compounds
> - which they are not, and mixing them is leading to errors and confusion.

I have not implied that. What I have implied is having a single
COMPOUND in the dataset with the substructures assigned as features.

> If you have substructure "C" and use it for SMARTS searching, it will look
> for a single carbon atom.  If you have a compound , defined by smiles "C" ,
> it implies CH4 , which is different.  Mixing both is not a good idea, making
> the difference explicit makes harder to misinterpret things.
>

Where's the question of mixing?

>>
>> > Besides, the model is definitely not for just storing substructures, it
>> can
>> > and will be used  for predictions of new compounds (if they have those
>> > substructures ) in an uniform way  (POST a new compound to the MCSS model
>> > and you'll get if its MCSS substructures are one of existing ones, or it
>> is
>> > different and far way from that dataset).
>>
>> What if I have a better graph comparator algorithm for fingerprinting
>> - will that take a model as an input just to extract features ?
>>
>
> No, define  /algorithm/myfingerprint  , which takes feature_uris[] as input
> parameters

So some other process has to go to the model, extract the features
from its set of predicted variables, and then supply that information
to the fingerprinter. IMHO it is simpler to just supply a featureset
(of substructures/SMARTS) to the fingerprinter.
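For clarity, here is a sketch of the indirection under discussion, reusing the URIs from Nina's examples (`/model/mcss1/predicted`, `/algorithm/myfingerprint`). No request is actually sent; this only shows how a client would assemble the two calls.

```python
# Sketch of the two-step indirection: features live behind a model
# URI, and a client form-encodes them as feature_uris[] parameters
# for a (hypothetical) fingerprint algorithm.
from urllib.parse import urlencode

MODEL_PREDICTED = "/model/mcss1/predicted"   # step 1: where the features live

# Step 2: build the POST body handed to the fingerprinter.
body = urlencode([("feature_uris[]", MODEL_PREDICTED)])

# Equivalent curl call from the thread:
#   curl -X POST /algorithm/myfingerprint -d "feature_uris[]=/model/mcss1/predicted"
```

The sketch shows that the client never downloads the feature set itself; it only passes the model's `predicted` URI along, which is the indirection being weighed against a first-class featureset object.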

> curl -X POST /algorithm/myfingerprint -d
> "feature_uris[]=/model/mcss1/predicted"
>
> This will work with any set of features.
>
>
> Even better, if you would like to convert features to a dataset,  define a
> converter algorithm, which takes features, verifies if they are
> substructures and generates a dataset.

A dataset comprising what compounds? You mean a dataset comprising
just features? I thought that was not possible.

> curl -X POST /algorithm/features2dataset -d
> "feature_uris[]=/model/mcss1/predicted"   ->
> /dataset/newdatasetfromfeatures
>
> Then you are done, POST the dataset into other algorithms as usual.

I am sorry but I could not understand how a dataset will be created in
this case.

>
>>
>> > The problem with the model approach is that
>> >> 1) The substructures cannot be easily downloaded without accessing the
>> >> model
>> >>
>> >
>> > They can - /model/id/predicted  give you the list of features  (see my
>> > examples)
>>
>> Well of course a whole model infrastructure may provide a way to
>> extract the predicted feature set. But that would mean giving a
>> model as an input to a third party fingerprinter. Lets see how the
>> present API plays out in the long term ...
>>
>>
> Not necessarily, see above
>
>
>>
>> > And also - this is exactly the advantage - you don't have just a set of
>> > substructures you don't know when they are coming from, but everything is
>> > explicitly defined - the substructures are result of applying given
>> > algorithm on given dataset.
>>
>> We know that from the dc:source and dc:description
>>
>
> No, we don't. These are annotation properties, not object properties.  They
> might provide hints for human readers, while the whole framework strives to
> provide explicit relationships for automatic processing.

We can have an ot:source field for datasets as well, for the
algorithm/model/logic that created it.

>
>> >
>> >
>> >> 2) The set of substructures cannot be given to a better finger printer
>> >> (maybe with a faster graph comparator)
>> >>
>> >>
>> > Of course they can - once we have smarts representation of the
>> > ot:Substructure - what is the obstacle of feeding them into any other
>> > algorithm ?
>>
>> Again the question - are we going to use a model as an input to
>> another algorithm to extract features ?
>>
>
> No, see above - features are already available as /model/id/predicted  -
> this is just set of features.
>
>
>> >
>> >
>> >
>> >> The fingerprinter in such a case becomes a separate algorithm which
>> >> can take a dataset as input as well as a "featureset" - which is
>> >> actually a dummy dataset with the full list of features.
>> >>
>> >
>> > A fingerprinter should be indeed an algorithm - this is how OpenTox API
>> is
>> > designed.   Any processing should be instantiated as an algorithm.
>>
>> In your case the fingerprinter is a model ....
>>
>
> No, the fingerprinter itself (/algorithm/mcss ) is ot:Algorithm. Only after
> it is applied to specific dataset, it becomes a model of exactly that
> dataset.

All this versus just having a featureset and assigning it to a
dataset. IMHO it is not simple to understand for a third party
developer either.

All we need is a method to assign a set of features to a dataset. For
that matter, every feature value assigned to a compound should also
have an explicitly mentioned relationship through a model etc. But we
are happily neglecting that.

Anyway, all things said and done - I guess those who need an
explicit, ot-defined relationship between featureset and dataset can
use the model approach. Those who just need a featureset (with a
reference to the logic and source dataset) to work with can use the
dummy datasets as a replacement. Neither would break the present API.

Ideally however I still maintain its important to have featuresets.

A generic API might be great as a flexible resource, but most
development kits do provide more specific resources. Representing
simple information in a complicated manner may make everything
compliant within our work group, but it will be one of the hurdles
external developers have to overcome. That's my two cents on this
discussion ....

Regards
Surajit


> Well, my point of view is that an algorithm applied to specific data with
> specific parameters should be considered a model (descriptor calculations
> included).  An algorithm is just abstract sequence of steps, when one
> applies it to data with specific parameters, then a model is generated.
> This will make the API much more consistent (now some algorithms generate a
> model, and results of other algorithms is a dataset, which is quite
> confusing for external developers). But at this point  I am not insisting on
> changing the   API that far ;)
>
> Regards,
> Nina
>
>
>
>
>>
>>
>> Regards
>> Surajit
>>
>> > Regards,
>> > Nina
>> >
>> >
>> >>
>> >> Cheers
>> >> Surajit
>> >>
>> >>
>> >> On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com>
>> >> wrote:
>> >> > Dear Christoph, Surajit, All,
>> >> >
>> >> > This discussion is very useful.
>> >> >
>> >> > As a result of myself trying to understand both points of view,  now
>> we
>> >> have
>> >> > MCSS algorithm as ambit service  (thanks to CDK SMSD package).
>> >> >
>> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
>> >> >
>> >> > It can be applied to a dataset and generates a model, where predicted
>> >> > features (MCSS in this case) are available via ot:predictedVariables
>> >> > (example
>> https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted)
>> >> > The features use current API, without any change (although having
>> >> > ot:Substructure subclass of ot:Feature will make it more clear).
>> >> >
>> >> > All the MCSS substructures can be used by any learning algorithm , as
>> >> they
>> >> > are standard ot:Features.
>> >> >
>> >> > Here are more details and proposal (start from *Substructure API
>> proposal
>> >> > heading *)
>> >> >
>> >> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
>> >> >
>> >> > Best regards,
>> >> > Nina
>> >> >
>> >> > P.S. Please note the /mcss algorithm might be slow for large datasets,
>> >> there
>> >> > are several improvements that we'll be applying  performance wise, but
>> >> this
>> >> > will not change the API .
>> >> >
>> >> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch>
>> wrote:
>> >> >
>> >> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100
>> 2010:
>> >> >>
>> >> >> > > This type of representation (we are using it internally) has
>> served
>> >> >> well
>> >> >> > > for our datasets which might contain also several (10-100)
>> thousand
>> >> >> > > substructures for a few thousands compounds. I also do not think,
>> >> that
>> >> >> > > the representation is redundant:
>> >> >> > >        - each compound is represented once
>> >> >> > >        - each substructure is represented once
>> >> >> > >        - each association between compound and substructure is
>> >> >> represented once
>> >> >> > > Please correct me, if I am missing something obvious.
>> >> >> >
>> >> >> > According to this representation each dataEntry for a compound will
>> >> >> > have to have all substructure features that were found in them.
>> >> >> > Therefore each dataEntry may have 1000-10000 feature/featureValue
>> >> >> > pairs . For 500 datasentries that means on an average of
>> >> >> > 500*5000(assuming 5000 substructures) = 2,500,000
>> feature/featureValue
>> >> >> > pairs - thats 2.5 million !
>> >> >>
>> >> >> In our case it is a lot less (not completely sure about your feature
>> >> >> types), because only a very small subset of features occurs in a
>> single
>> >> >> compound.
>> >> >>
>> >> >> > versus just having a featureset with a
>> >> >> > 5000 feature entries. You can imagine the difference in cost of
>> >> >> > bandwidth,computation etc.
>> >> >>
>> >> >> I am not sure, if I get you right, but where do you want to store the
>> >> >> relationships between features and compounds? If there are really 2.5
>> >> >> million associations you have to assert them somewhere. And having
>> >> features
>> >> >> without compounds seems to be quite useless for me.
>> >> >>
>> >> >> > >
>> >> >> > > Adding "false" occurences would not violate the current API (but
>> >> would
>> >> >> > > add redundant information). Keep in mind that the dataset
>> >> >> representation
>> >> >> > > is mainly for exchanging datasets between services - internally
>> you
>> >> can
>> >> >> > > use any datastructure that is efficient for your purposes (we
>> also
>> >> do
>> >> >> > > that in our services). So if you need fingerprints internally,
>> >> extract
>> >> >> > > them from the dataset.
>> >> >> >
>> >> >> > Internalizing an intermediate step completely serves the purpose
>> but
>> >> >> > leads to less flexible design paradigms. If we internalize the
>> >> >> > workflow from substructure extraction to fingerprinting - we will
>> lose
>> >> >> > the ability to provide the data to a third party server for an
>> >> >> > independent workflow. Of course the reasoning could be "who needs
>> it
>> >> >> > ?" - well you never know !!
>> >> >>
>> >> >> I am very interested in exchanging "fingerprints" with other
>> services,
>> >> >> but that can be done already with the current API. I see fingerprints
>> as
>> >> >> sets of features that are present in a compound (also using set
>> >> >> operations to calculate similarities), and find it fairly
>> >> >> straightforward to parse/serialize them to/from datasets.
>> >> >>
>> >> >> >
>> >> >> > >> I still suggest having a FeatureSet/SubstructureSet type object
>> >> within
>> >> >> > >> the API to make it convenient to club features without compound
>> >> >> > >> representations.
>> >> >> > >
>> >> >> > > I prefer to keep the API as generic as possible and not to
>> introduce
>> >> >> > > ad-hoc objects (or optimizations) for special purposes -
>> otherwise
>> >> it
>> >> >> > > will be difficult to maintain services in the long term. Why
>> don't
>> >> you
>> >> >> > > use ontologies for grouping features?
>> >> >> >
>> >> >> > Grouping features using ontologies is clubbing the features Not the
>> >> >> > feature values
>> >> >>
>> >> >> But you cannot have feature values without relating features to
>> >> >> compounds. If you use the representation I proposed feature values
>> are
>> >> >> "true" anyway.
>> >> >>
>> >> >> > So how do we know mcss3 occuring in compound X is with respect to
>> >> >> > which compound. As you said we can have arbitary fields in the
>> feature
>> >> >> > definitions (for MCSS) - but that would be outside API definitions.
>> >> >>
>> >> >> features:
>> >> >>        mcss3:
>> >> >>                ot:componds:
>> >> >>                        - compound2
>> >> >>                        - compound3
>> >> >>                ot:smarts: smarts3
>> >> >>
>> >> >> In my understanding you can add any annotation you want to a feature.
>> >> >>
>> >> >>
>> >> > Yes, you can, but if this is not an agreed annotation,  no other
>> service
>> >> > will understand it.
>> >> >
>> >> > Best regards,
>> >> > Nina
>> >> >
>> >> >
>> >> >>  Best regards,
>> >> >> Christoph
>> >> >> _______________________________________________
>> >> >> Development mailing list
>> >> >> Development at opentox.org
>> >> >> http://www.opentox.org/mailman/listinfo/development
>> >> >>
>> >> > _______________________________________________
>> >> > Development mailing list
>> >> > Development at opentox.org
>> >> > http://www.opentox.org/mailman/listinfo/development
>> >> >
>> >> _______________________________________________
>> >> Development mailing list
>> >> Development at opentox.org
>> >> http://www.opentox.org/mailman/listinfo/development
>> >>
>> > _______________________________________________
>> > Development mailing list
>> > Development at opentox.org
>> > http://www.opentox.org/mailman/listinfo/development
>> >
>>
>>
>>
>> --
>> Surajit Ray
>> Partner
>> www.rareindianart.com
>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development
>>
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>



-- 
Surajit Ray
Partner
www.rareindianart.com


