[OTDev] Datasets with Features for multi entity relationships ?

surajit ray mr.surajit.ray at gmail.com
Tue Nov 30 09:29:55 CET 2010


Hi,

Another method that does not break the API and still captures feature
sets is to create a dataset with a single compound (maybe C or CC) and
assign all the substructure features to it (with the value true or
false). In the dc:source of the dataset we can mention the dataset from
which it was derived, and in the description we can describe it as a
dataset that stores the MCSS features of that dataset (or whatever its
relationship with the mother dataset is).
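
A rough sketch of how such a dummy dataset could be put together, in
Python. The dictionary layout, the helper name make_featureset and all
the example.org URIs are made up for illustration; they are not part of
the OpenTox API.

```python
def make_featureset(source_dataset_uri, feature_uris, dummy_compound="C"):
    """Build a dummy dataset that only carries a list of features.

    Hypothetical layout: one placeholder compound with every
    substructure feature assigned to it, plus a pointer back to the
    mother dataset in dc:source.
    """
    return {
        "dc:source": source_dataset_uri,
        "dc:description": "MCSS features derived from " + source_dataset_uri,
        "compounds": [dummy_compound],
        "data_entries": {
            # Every feature is attached to the single placeholder compound.
            dummy_compound: {uri: True for uri in feature_uris},
        },
    }


featureset = make_featureset(
    "http://example.org/dataset/1",  # hypothetical mother dataset
    [
        "http://example.org/feature/mcss1",  # hypothetical feature URIs
        "http://example.org/feature/mcss2",
    ],
)
print(sorted(featureset["data_entries"]["C"]))
```

The size of such a dataset grows with the number of features only, not
with the number of compounds in the mother dataset.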

I think this would be a simpler method than creating a new Model just
for storing substructures. The problems with the model approach are:
1) The substructures cannot be easily downloaded without accessing the model
2) The set of substructures cannot be given to a better fingerprinter
(maybe one with a faster graph comparator)

The fingerprinter in such a case becomes a separate algorithm which
takes a dataset as input together with a "featureset", which is
actually a dummy dataset with the full list of features.
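
To make the idea concrete, here is a minimal sketch of such a
standalone fingerprinter in Python. The function names and the data
layout are assumptions for illustration only. The matcher is a toy
substring test on SMILES strings, not a real substructure search; a real
service would plug in a proper graph comparator at that point.

```python
def fingerprint(compounds, features, matches):
    """Return {compound: {feature: True/False}} for every pair.

    `matches(feature, compound)` is the pluggable comparator, so a
    faster or better one can be swapped in without touching the rest.
    """
    return {
        c: {f: bool(matches(f, c)) for f in features}
        for c in compounds
    }


def toy_matcher(feature, compound):
    # Stand-in only: plain substring test on SMILES strings.
    # This is NOT chemically correct substructure matching.
    return feature in compound


compounds = ["CCO", "CCN", "c1ccccc1"]  # the input dataset
features = ["CC", "N"]  # would be read from the dummy "featureset" dataset
fp = fingerprint(compounds, features, toy_matcher)
print(fp["CCN"])
```

Because the comparator is passed in as an argument, the same featureset
can be handed to a different fingerprinter without regenerating the
substructures.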

Cheers
Surajit


On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> Dear Christoph, Surajit, All,
>
> This discussion is very useful.
>
> As a result of my trying to understand both points of view, we now have an
> MCSS algorithm as an AMBIT service (thanks to the CDK SMSD package).
>
> https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
>
> It can be applied to a dataset and generates a model, where predicted
> features (MCSS in this case) are available via ot:predictedVariables
> (example  https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted )
> The features use the current API, without any change (although having
> ot:Substructure as a subclass of ot:Feature would make it clearer).
>
> All the MCSS substructures can be used by any learning algorithm, as they
> are standard ot:Features.
>
> Here are more details and a proposal (start from the *Substructure API
> proposal* heading)
>
> http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
>
> Best regards,
> Nina
>
> P.S. Please note that the /mcss algorithm might be slow for large datasets.
> There are several performance improvements that we'll be applying, but they
> will not change the API.
>
> On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch> wrote:
>
>> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
>>
>> > > This type of representation (we are using it internally) has served well
>> > > for our datasets, which might also contain several (10-100) thousand
>> > > substructures for a few thousand compounds. I also do not think that
>> > > the representation is redundant:
>> > >        - each compound is represented once
>> > >        - each substructure is represented once
>> > >        - each association between compound and substructure is represented once
>> > > Please correct me if I am missing something obvious.
>> >
>> > According to this representation each dataEntry for a compound will
>> > have to have all substructure features that were found in it.
>> > Therefore each dataEntry may have 1000-10000 feature/featureValue
>> > pairs. For 500 data entries that means an average of
>> > 500*5000 (assuming 5000 substructures) = 2,500,000 feature/featureValue
>> > pairs - that's 2.5 million!
>>
>> In our case it is a lot less (not completely sure about your feature
>> types), because only a very small subset of features occurs in a single
>> compound.
>>
>> > versus just having a featureset with
>> > 5000 feature entries. You can imagine the difference in cost of
>> > bandwidth, computation etc.
>>
>> I am not sure if I get you right, but where do you want to store the
>> relationships between features and compounds? If there are really 2.5
>> million associations you have to assert them somewhere. And having features
>> without compounds seems to be quite useless to me.
>>
>> > >
>> > > Adding "false" occurrences would not violate the current API (but would
>> > > add redundant information). Keep in mind that the dataset representation
>> > > is mainly for exchanging datasets between services - internally you can
>> > > use any data structure that is efficient for your purposes (we also do
>> > > that in our services). So if you need fingerprints internally, extract
>> > > them from the dataset.
>> >
>> > Internalizing an intermediate step completely serves the purpose but
>> > leads to less flexible design paradigms. If we internalize the
>> > workflow from substructure extraction to fingerprinting - we will lose
>> > the ability to provide the data to a third party server for an
>> > independent workflow. Of course the reasoning could be "who needs it
>> > ?" - well you never know !!
>>
>> I am very interested in exchanging "fingerprints" with other services,
>> but that can be done already with the current API. I see fingerprints as
>> sets of features that are present in a compound (also using set
>> operations to calculate similarities), and find it fairly
>> straightforward to parse/serialize them to/from datasets.
>>
>> >
>> > >> I still suggest having a FeatureSet/SubstructureSet type object within
>> > >> the API to make it convenient to club features without compound
>> > >> representations.
>> > >
>> > > I prefer to keep the API as generic as possible and not to introduce
>> > > ad-hoc objects (or optimizations) for special purposes - otherwise it
>> > > will be difficult to maintain services in the long term. Why don't you
>> > > use ontologies for grouping features?
>> >
>> > Grouping features using ontologies is clubbing the features, not the
>> > feature values.
>>
>> But you cannot have feature values without relating features to
>> compounds. If you use the representation I proposed feature values are
>> "true" anyway.
>>
>> > So how do we know with respect to which compound the mcss3 occurring
>> > in compound X is defined? As you said, we can have arbitrary fields in
>> > the feature definitions (for MCSS) - but that would be outside the API
>> > definitions.
>>
>> features:
>>        mcss3:
>>                ot:compounds:
>>                        - compound2
>>                        - compound3
>>                ot:smarts: smarts3
>>
>> In my understanding you can add any annotation you want to a feature.
>>
>>
> Yes, you can, but if this is not an agreed annotation,  no other service
> will understand it.
>
> Best regards,
> Nina
>
>
>>  Best regards,
>> Christoph
>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development
>>


