[OTDev] Datasets with Features for multi entity relationships - Models & Algorithms

surajit ray mr.surajit.ray at gmail.com
Thu Dec 2 14:34:08 CET 2010


Hi Nina,

To organise the discussions better, I have created a new page to
capture the discussion on featuresets:
http://opentox.org/dev/apis/api-1.2/featureset-and-workarounds

I have moved your discussion points onto this page from the
substructure wishlist page ...

On 30 November 2010 18:43, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> No problem to extend the API to be able to group features. (In fact we have
> this implemented , even with hierarchical grouping
> http://apps.ideaconsult.net:8080/ambit2/template/Taxonomy , turns to be
> quite useful for ToXML representation ). Could be documented and included in
> the API.

Yeah, let's have featuresets please ...

> IMHO there is no sense to assign a feature or a featureset to a dataset ,
> without specifying what is the relationship between dataset and features.
> This is perfectly served by algorithm/model approach so far.
>

We can capture an explicit relationship in a "FeaturesetValue" every
time we assign a "Featureset" to a dataset. It's explicit, simpler,
and we can even put the URI of the creating algorithm in the
FeaturesetValue.
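To make the idea concrete, here is a rough sketch in Python. This is purely illustrative - the class and field names (Featureset, FeaturesetValue, creator_algorithm_uri, relationship) are my own invention for this proposal, not part of the current OpenTox API:

```python
# Sketch of the proposed objects; all names here are hypothetical,
# not part of the existing OpenTox 1.1/1.2 API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Featureset:
    """A named group of feature URIs, independent of any dataset."""
    uri: str
    feature_uris: List[str] = field(default_factory=list)

@dataclass
class FeaturesetValue:
    """Explicit link between a dataset and a featureset."""
    dataset_uri: str
    featureset: Featureset
    relationship: str            # e.g. "calculated-from"
    creator_algorithm_uri: str   # provenance: which algorithm produced the features

mcss = Featureset("/featureset/1", ["/feature/mcss1", "/feature/mcss2"])
link = FeaturesetValue("/dataset/42", mcss, "calculated-from", "/algorithm/mcss")
print(link.creator_algorithm_uri)
```

The point is that the provenance Nina is worried about losing (which algorithm, which dataset) lives on the link object itself, so no model is needed in the middle.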

> The dummy dataset suggestion is a hack , which lack consistency and I am not
> in favour of it.

I guess the same can be said of Christoph's method of assigning
substructures to compounds just to capture the substructure set in
one dataset. On the flip side, every API that I have worked with
(Google Maps API, Facebook API, Facebook Graph API, Flex/Flash)
has a "hack" which became the norm. IMHO, in this case though, the
best solution is to have a Featureset with a FeaturesetValue to
explicitly outline the relationship to the dataset.


>
> See above for sets of features.  What we would like to have more than other
> libraries in OpenTox is to be able to tell how these features have been
> calculated.
>
> The CDK fingerprinter ( if you mean
> org.openscience.cdk.fingerprint.Fingerprinter ) is not a good example here,
> since it uses hashed fingerprints, which is almost impossible to translate
> to SMARTS.
>
>  The CDK does very good job for specifying descriptors metadata via
> ontology, but this is not (yet?) done for fingerprinting (as far as I know),
> although fingerprinter algorithm could be included in BlueObelisk or ChemInf
> ontology the same way descriptor algorithms are.
>
>

>
> Sorry, this was my impression from earlier discussions.
>
> Look at my examples, this is exactly what comes from the MCSS model.
>
> Having single compound with substructures, assigned as features is
> inconsistent for the following reason.
>
> The meaning of the set of substructures is that they have been obtained by
> MCSS (or fingerprinting algorithm) , and are MCSS structures for the entire
> dataset.  Assigning them to a single dummy compound means all this
> information is lost.

W.r.t. the hack - yes, we lose the information - and yet it's many
times simpler than creating a model just to represent a set of
features (substructures).

A Featureset with a FeaturesetValue solves this problem well, without
resorting to a needless model-building step or the "hack".
>

>
> Having fragments submitted to another fingerprinter algorithm, which by
> definition works on whole compounds is essentially mixing substructures with
> compounds.  What if the fingerprinter algorithm starts to normalize the
> fragments as if they are compounds?

The fingerprinter in this case is going to take two inputs - a dataset
and a featureset (substructure set). Again, where's the question of
mixing the two?
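A minimal sketch of what I mean by two separate inputs, with substructure matching faked by substring tests purely for illustration (a real service would use SMARTS matching via a toolkit; the function name and inputs are hypothetical):

```python
# Illustrative only: compounds and substructures stay in two separate
# collections, so nothing ever treats a fragment as a whole compound.
def fingerprint(compounds, substructures):
    """Return one presence bit-vector per compound.

    Matching is a naive substring test here, standing in for real
    SMARTS substructure matching.
    """
    return {
        c: [1 if s in c else 0 for s in substructures]
        for c in compounds
    }

dataset = ["CCO", "CCN", "CCC"]    # stand-ins for compound SMILES
featureset = ["CC", "N", "O"]      # stand-ins for substructure SMARTS
print(fingerprint(dataset, featureset))
# {'CCO': [1, 0, 1], 'CCN': [1, 1, 0], 'CCC': [1, 0, 0]}
```

The fingerprinter never has a reason to normalize the fragments as compounds, because they arrive through a separate parameter.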

>

>>
>
> curl -X GET /model/id/predicted  gives you list of features, URLs or RDF
> .There is no need to extract anything.  We could easily add a new mime type
> to support SMARTS for feature representation (whenever relevant) and you'll
> get list of smarts by something like
>
> curl -X GET -H "Accept:chemical/x-smarts"  /model/id/predicted
>
> (Hm,  is there MIME format for SMARTS )
>
> Besides,  the current scheme supports ANY kind of fingerprinter, regardless
> if it extracts fragments in the form of SMILES/SMARTS or just report some
> encoded strings (as PubChem fingerprinter) or un-interpretable bits (as
> hashed fingerprints).

A SMARTS MIME type on the /model/id/predicted URL can provide a list
of features - but it is a non-generic way of representing a set of
features. It is also non-generic to expect a fingerprinting algorithm
to take /model/id/predicted as input - especially since the "model"
may have no relationship with the fingerprinter.
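For what it's worth, the content negotiation in Nina's quoted curl example could be sketched like this; the chemical/x-smarts media type is tentative (the thread itself asks whether a MIME type for SMARTS even exists), and the function and field names are made up for illustration:

```python
# Sketch of serving a feature list under content negotiation,
# as in the quoted "Accept: chemical/x-smarts" curl example.
def serialize_features(features, accept):
    """Render a feature list according to the requested media type."""
    if accept == "chemical/x-smarts":
        # plain list of SMARTS patterns, one per line
        return "\n".join(f["smarts"] for f in features)
    # fall back to a plain list of feature URIs
    return "\n".join(f["uri"] for f in features)

features = [
    {"uri": "/feature/mcss1", "smarts": "c1ccccc1"},
    {"uri": "/feature/mcss2", "smarts": "C(=O)O"},
]
print(serialize_features(features, "chemical/x-smarts"))
```

Even with this in place, my objection stands: the feature list is still anchored to a model URI rather than to a first-class featureset resource.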


>
> No, not a dataset comprising of features, but a dataset, comprising of
> compounds.
>
> You could define an algorithm to have list of features as input  into a new
> dataset of compounds , if there is meaningful way to do so (e.g. for smarts
> based features).  This means there will be no assumption features are
> compounds, but a documented service that does the conversion in a known way.

Why would we need to convert features to compounds?

>
>
>>
>> > curl -X POST /algorithm/features2dataset -d
>> > "feature_uris[]=/model/mcss1/predicted"   ->
>> > /dataset/newdatasetfromfeatures
>> >
>> > Then you are done, POST the dataset into other algorithms as usual.
>>
>> I am sorry but I could not understand how a dataset will be created in
>> this case.
>>
>>
> Setup an algorithm service, which will  read the features, find if a feature
> is a substructure,  generate compounds for them (e.g. SDF  file)  and post
> the SDF content to a dataset service - thus it will create a dataset.

Again, I could not see the need for such a service - I would like to
capture "substructures" (as features), not convert them to compounds.


>
> As well for individual compounds.
>
>

>
> IMHO it is not easy to understand  what means "assigning" feature set to a
> dataset?  It doesn't tell neither where the feature set came from, nor how
> it is related to a dataset.  This is all lost of information, which all
> contributes to the poor reproducibility of any models.

Yeah, a FeaturesetValue to capture the explicit value of the
relationship is just what is needed here.

> So far OpenTox has a very simple and logical API (yes, I have heard this
> from external developers) - Datasets are processed by algorithms/models and
> written to datasets - that's all basically.
>

And yet you want to build this complex logic of making a Model
essentially to store a set of features?

>
>
>
> Just need a feature set , regardless of how it was obtained ... that's how
> irreproducible cheminformatics models are born...

Most industry chemoinformatics is quite irreproducible - but it is
paid for and viable! Also, the onus to reproduce the results is with
the user - not with the datasets and algorithms in the services.

>
>
> If you use dummy dataset with dummy compounds, you are introducing a mess
> into datasets service.  Because those dummy compounds and features, which
> are not really  features for that compound will appear as a result of
> searches , hitting that compound.

So essentially you are saying we cannot have any datasets with some
feature values set to "false" or 0 to denote absence in a compound?
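This ties back to the storage argument earlier in the thread. A quick back-of-envelope comparison of the dense representation (a value for every compound-substructure pair, "false" included) versus the sparse one (only the "true" hits), using the numbers quoted above; the 50 hits per compound is my own assumption for illustration:

```python
# Dense: every compound-substructure pair gets a value (true/false).
# Sparse: only the substructures actually present in a compound.
compounds = 500
substructures = 5000
hits_per_compound = 50   # assumption: few substructures occur per compound

dense_pairs = compounds * substructures
sparse_pairs = compounds * hits_per_compound
print(dense_pairs, sparse_pairs)
# 2500000 25000
```

The dense figure reproduces the 2.5 million pairs mentioned earlier in the thread; the gap between the two is the whole "false values" debate in numbers.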

>
>> Ideally however I still maintain its important to have featuresets.
>>
>
> Feature sets alone are fine, see above.

Then let's go for it. We have thought a lot about it, and the indirect
methods suggested till now just seem to be an attempt to stonewall any
big changes in the API - which, from the perspective of an API, is
hara-kiri. Google Maps API released 3 major versions in 5 years. By
comparison, our API upgrades are just moving from 1.1 to 1.2 in 2
years, with barely any changes ...

>
> A generic API means it could be applied to great variety of problems.
> Specific solutions introduce incompatibility.
> It's a generic computer science approach - try to abstract things, break
> larger problem into smaller pieces, find the commonalities.  That's how IT
> works ...

So are we doing chemoinformatics (vs. just toxicity)? Are we
representing atomic features? If your answer to both is "no", then we
can safely say we have not abstracted enough!

>
> Having your simple information represented in a way specific for your
> problem doesn't make things compatible ... Now everybody could use
> https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss to retrieve MCSS
> structures from dataset of their choice and then run it through any of Weka
> algorithms available by any partner.  Will your approach do anything similar
> ?

I can't see a point to debate here ... I am looking for a generic
solution for collecting features. This can be achieved quite simply
by having a Featureset and a FeaturesetValue (when assigning to a
dataset) to explicitly capture the value of the relationship. The
beauty is that we do not need a "model" in the middle just to capture
some explicit relationships.


Regards
Surajit

> Regards,
> Nina
>
>
>>
>> Regards
>> Surajit
>>
>>
>> > Well, my point of view is that an algorithm applied to specific data with
>> > specific parameters should be considered a model (descriptor calculations
>> > included).  An algorithm is just abstract sequence of steps, when one
>> > applies it to data with specific parameters, then a model is generated.
>> > This will make the API much more consistent (now some algorithms generate
>> a
>> > model, and results of other algorithms is a dataset, which is quite
>> > confusing for external developers). But at this point  I am not insisting
>> on
>> > changing the   API that far ;)
>> >
>> > Regards,
>> > Nina
>> >
>> >
>> >
>> >
>> >>
>> >>
>> >> Regards
>> >> Surajit
>> >>
>> >> > Regards,
>> >> > Nina
>> >> >
>> >> >
>> >> >>
>> >> >> Cheers
>> >> >> Surajit
>> >> >>
>> >> >>
>> >> >> On 29 November 2010 14:05, Nina Jeliazkova <
>> jeliazkova.nina at gmail.com>
>> >> >> wrote:
>> >> >> > Dear Christoph, Surajit, All,
>> >> >> >
>> >> >> > This discussion is very useful.
>> >> >> >
>> >> >> > As a result of myself trying to understand both points of view,
>>  now
>> >> we
>> >> >> have
>> >> >> > MCSS algorithm as ambit service  (thanks to CDK SMSD package).
>> >> >> >
>> >> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
>> >> >> >
>> >> >> > It can be applied to a dataset and generates a model, where
>> predicted
>> >> >> > features (MCSS in this case) are available via
>> ot:predictedVariables
>> >> >> > (example
>> >> https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted)
>> >> >> > The features use current API, without any change (although having
>> >> >> > ot:Substructure subclass of ot:Feature will make it more clear).
>> >> >> >
>> >> >> > All the MCSS substructures can be used by any learning algorithm ,
>> as
>> >> >> they
>> >> >> > are standard ot:Features.
>> >> >> >
>> >> >> > Here are more details and proposal (start from *Substructure API
>> >> proposal
>> >> >> > heading *)
>> >> >> >
>> >> >> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Nina
>> >> >> >
>> >> >> > P.S. Please note the /mcss algorithm might be slow for large
>> datasets,
>> >> >> there
>> >> >> > are several improvements that we'll be applying  performance wise,
>> but
>> >> >> this
>> >> >> > will not change the API .
>> >> >> >
>> >> >> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch>
>> >> wrote:
>> >> >> >
>> >> >> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100
>> >> 2010:
>> >> >> >>
>> >> >> >> > > This type of representation (we are using it internally) has
>> >> served
>> >> >> >> well
>> >> >> >> > > for our datasets which might contain also several (10-100)
>> >> thousand
>> >> >> >> > > substructures for a few thousands compounds. I also do not
>> think,
>> >> >> that
>> >> >> >> > > the representation is redundant:
>> >> >> >> > >        - each compound is represented once
>> >> >> >> > >        - each substructure is represented once
>> >> >> >> > >        - each association between compound and substructure is
>> >> >> >> represented once
>> >> >> >> > > Please correct me, if I am missing something obvious.
>> >> >> >> >
>> >> >> >> > According to this representation each dataEntry for a compound
>> will
>> >> >> >> > have to have all substructure features that were found in them.
>> >> >> >> > Therefore each dataEntry may have 1000-10000
>> feature/featureValue
>> >> >> >> > pairs . For 500 datasentries that means on an average of
>> >> >> >> > 500*5000(assuming 5000 substructures) = 2,500,000
>> >> feature/featureValue
>> >> >> >> > pairs - thats 2.5 million !
>> >> >> >>
>> >> >> >> In our case it is a lot less (not completely sure about your
>> feature
>> >> >> >> types), because only a very small subset of features occurs in a
>> >> single
>> >> >> >> compound.
>> >> >> >>
>> >> >> >> > versus just having a featureset with a
>> >> >> >> > 5000 feature entries. You can imagine the difference in cost of
>> >> >> >> > bandwidth,computation etc.
>> >> >> >>
>> >> >> >> I am not sure, if I get you right, but where do you want to store
>> the
>> >> >> >> relationships between features and compounds? If there are really
>> 2.5
>> >> >> >> million associations you have to assert them somewhere. And having
>> >> >> features
>> >> >> >> without compounds seems to be quite useless for me.
>> >> >> >>
>> >> >> >> > >
>> >> >> >> > > Adding "false" occurences would not violate the current API
>> (but
>> >> >> would
>> >> >> >> > > add redundant information). Keep in mind that the dataset
>> >> >> >> representation
>> >> >> >> > > is mainly for exchanging datasets between services -
>> internally
>> >> you
>> >> >> can
>> >> >> >> > > use any datastructure that is efficient for your purposes (we
>> >> also
>> >> >> do
>> >> >> >> > > that in our services). So if you need fingerprints internally,
>> >> >> extract
>> >> >> >> > > them from the dataset.
>> >> >> >> >
>> >> >> >> > Internalizing an intermediate step completely serves the purpose
>> >> but
>> >> >> >> > leads to less flexible design paradigms. If we internalize the
>> >> >> >> > workflow from substructure extraction to fingerprinting - we
>> will
>> >> lose
>> >> >> >> > the ability to provide the data to a third party server for an
>> >> >> >> > independent workflow. Of course the reasoning could be "who
>> needs
>> >> it
>> >> >> >> > ?" - well you never know !!
>> >> >> >>
>> >> >> >> I am very interested in exchanging "fingerprints" with other
>> >> services,
>> >> >> >> but that can be done already with the current API. I see
>> fingerprints
>> >> as
>> >> >> >> sets of features that are present in a compound (also using set
>> >> >> >> operations to calculate similarities), and find it fairly
>> >> >> >> straightforward to parse/serialize them to/from datasets.
>> >> >> >>
>> >> >> >> >
>> >> >> >> > >> I still suggest having a FeatureSet/SubstructureSet type
>> object
>> >> >> within
>> >> >> >> > >> the API to make it convenient to club features without
>> compound
>> >> >> >> > >> representations.
>> >> >> >> > >
>> >> >> >> > > I prefer to keep the API as generic as possible and not to
>> >> introduce
>> >> >> >> > > ad-hoc objects (or optimizations) for special purposes -
>> >> otherwise
>> >> >> it
>> >> >> >> > > will be difficult to maintain services in the long term. Why
>> >> don't
>> >> >> you
>> >> >> >> > > use ontologies for grouping features?
>> >> >> >> >
>> >> >> >> > Grouping features using ontologies is clubbing the features Not
>> the
>> >> >> >> > feature values
>> >> >> >>
>> >> >> >> But you cannot have feature values without relating features to
>> >> >> >> compounds. If you use the representation I proposed feature values
>> >> are
>> >> >> >> "true" anyway.
>> >> >> >>
>> >> >> >> > So how do we know mcss3 occuring in compound X is with respect
>> to
>> >> >> >> > which compound. As you said we can have arbitary fields in the
>> >> feature
>> >> >> >> > definitions (for MCSS) - but that would be outside API
>> definitions.
>> >> >> >>
>> >> >> >> features:
>> >> >> >>        mcss3:
>> >> >> >>                ot:componds:
>> >> >> >>                        - compound2
>> >> >> >>                        - compound3
>> >> >> >>                ot:smarts: smarts3
>> >> >> >>
>> >> >> >> In my understanding you can add any annotation you want to a
>> feature.
>> >> >> >>
>> >> >> >>
>> >> >> > Yes, you can, but if this is not an agreed annotation,  no other
>> >> service
>> >> >> > will understand it.
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Nina
>> >> >> >
>> >> >> >
>> >> >> >>  Best regards,
>> >> >> >> Christoph
>> >> >> >> _______________________________________________
>> >> >> >> Development mailing list
>> >> >> >> Development at opentox.org
>> >> >> >> http://www.opentox.org/mailman/listinfo/development
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Surajit Ray
>> >> Partner
>> >> www.rareindianart.com
>> >>
>> >
>>
>>
>>
>>
>


