[OTDev] Datasets with Features for multi entity relationships - Models & Algorithms

Nina Jeliazkova jeliazkova.nina at gmail.com
Thu Dec 2 14:44:39 CET 2010


Hi Surajit,


"Most industry chemoinformatics is quite irreproducible - but it is paid
for and viable! Also the onus to reproduce the results is with the
user - not with the datasets and algorithms in the services."


This is exactly what we are struggling to overcome in OpenTox - thus, if we
don't agree on this point, I don't see the point of the discussion and will
leave the consensus for others.

Having a hack for something that has been demonstrated to have a solution
within the current API doesn't make sense to me (IMHO).

Regards,
Nina

On 2 December 2010 15:34, surajit ray <mr.surajit.ray at gmail.com> wrote:

> Hi Nina,
>
> To organise the discussions better, I have created a new page to
> capture the discussion on featuresets.
> http://opentox.org/dev/apis/api-1.2/featureset-and-workarounds
>
> I have moved your discussion points onto this page from the
> substructure wishlist page ....
>
> On 30 November 2010 18:43, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> > No problem to extend the API to be able to group features. (In fact we have
> > this implemented, even with hierarchical grouping:
> > http://apps.ideaconsult.net:8080/ambit2/template/Taxonomy ; it turns out to be
> > quite useful for the ToXML representation.) It could be documented and included
> > in the API.
>
> Yeah, let's have featuresets please ...
>
> > IMHO there is no sense in assigning a feature or a featureset to a dataset
> > without specifying what the relationship between the dataset and the features is.
> > This is perfectly served by the algorithm/model approach so far.
> >
>
> We can capture an explicit relationship in a "FeaturesetValue" every
> time we assign a "Featureset" to a dataset. It's explicit, simpler, and
> we can even put the URI of the creating algorithm in the
> FeaturesetValue.
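A minimal sketch of what such a FeaturesetValue record could carry. All field names and URIs below are hypothetical illustrations of the proposal in this thread, not part of the current OpenTox API:

```python
# Hypothetical sketch of the proposed dataset<->featureset link.
# None of these keys or URIs exist in the OpenTox API; they only
# illustrate what an explicit FeaturesetValue might record.

def make_featureset_value(dataset_uri, featureset_uri, algorithm_uri, relation):
    """Bundle the dataset, the featureset, the creating algorithm,
    and the nature of the relationship into one explicit record."""
    return {
        "dataset": dataset_uri,
        "featureset": featureset_uri,
        "creatingAlgorithm": algorithm_uri,  # provenance, for reproducibility
        "relation": relation,                # e.g. "mcss-of", "fingerprint-of"
    }

fsv = make_featureset_value(
    "/dataset/42",
    "/featureset/7",
    "/algorithm/mcss",
    "mcss-of",
)
print(fsv["creatingAlgorithm"])
```

Recording the creating algorithm's URI is what would address Nina's provenance concern while keeping the assignment explicit.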
>
> > The dummy dataset suggestion is a hack which lacks consistency, and I am
> > not in favour of it.
>
> I guess the same can be said of Christoph's method of assigning
> substructures to compounds to actually just capture the substructure
> set in one dataset. On the flip side - every API that I have worked
> with (Google Maps API, Facebook API, Facebook Graphs API, Flex/Flash)
> has a "hack" that became the norm. IMHO in this case though the best
> solution is to have a Featureset with a FeaturesetValue to explicitly
> outline the relationship to the dataset.
>
>
> >
> > See above for sets of features.  What we would like to have in OpenTox,
> > more than other libraries, is to be able to tell how these features have
> > been calculated.
> >
> > The CDK fingerprinter (if you mean
> > org.openscience.cdk.fingerprint.Fingerprinter) is not a good example here,
> > since it uses hashed fingerprints, which are almost impossible to translate
> > to SMARTS.
> >
> > The CDK does a very good job of specifying descriptor metadata via an
> > ontology, but this is not (yet?) done for fingerprinting (as far as I know),
> > although a fingerprinter algorithm could be included in the BlueObelisk or
> > ChemInf ontology the same way descriptor algorithms are.
> >
> >
>
> >
> > Sorry, this was my impression from earlier discussions.
> >
> > Look at my examples, this is exactly what comes from the MCSS model.
> >
> > Having a single compound with substructures assigned as features is
> > inconsistent for the following reason.
> >
> > The meaning of the set of substructures is that they have been obtained by
> > MCSS (or a fingerprinting algorithm), and are MCSS structures for the entire
> > dataset.  Assigning them to a single dummy compound means all this
> > information is lost.
>
> W.r.t. the hack - yes, we lose the information - and yet it's many times
> simpler than creating a model just to represent a set of features
> (substructures).
>
> A featureset with a featuresetValue solves this problem well, without
> resorting to a needless model-building step or the "hack".
> >
>
> >
> > Having fragments submitted to another fingerprinter algorithm, which by
> > definition works on whole compounds, is essentially mixing substructures
> > with compounds.  What if the fingerprinter algorithm starts to normalize
> > the fragments as if they were compounds?
>
> The fingerprinter in this case is going to take two inputs - a dataset
> and a featureset (substructure set). Again, where does the question of
> mixing the two arise?
>
> >
>
> >>
> >
> > curl -X GET /model/id/predicted  gives you a list of features, as URLs or RDF.
> > There is no need to extract anything.  We could easily add a new MIME type
> > to support SMARTS for feature representation (whenever relevant) and you'll
> > get a list of SMARTS by something like
> >
> > curl -X GET -H "Accept:chemical/x-smarts"  /model/id/predicted
> >
> > (Hm, is there a MIME format for SMARTS?)
> >
> > Besides, the current scheme supports ANY kind of fingerprinter, regardless
> > of whether it extracts fragments in the form of SMILES/SMARTS or just
> > reports some encoded strings (as the PubChem fingerprinter) or
> > un-interpretable bits (as hashed fingerprints).
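The content negotiation described above can be exercised from any HTTP client. A small sketch in Python that only builds the request (it is not sent anywhere; the model URI is the example from this thread, and `chemical/x-smarts` is speculative - as noted above, there is no registered MIME type for SMARTS yet):

```python
import urllib.request

# Build (but do not send) a request for the predicted features of a model,
# asking for a SMARTS representation via the Accept header.
# "chemical/x-smarts" is a speculative media type from this discussion.
url = "https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted"
req = urllib.request.Request(url, headers={"Accept": "chemical/x-smarts"})

print(req.get_header("Accept"))  # the requested representation
print(req.get_method())          # GET, matching the curl example
```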
>
> A SMARTS MIME type on the model/id/predicted URL can provide a list of
> features - but it is a non-generic way of representing the set of
> features. It is non-generic to imagine that a fingerprinting algorithm
> takes a model/id/predicted as input - especially since the "model" may
> not have any relationship with the fingerprinter.
>
>
> >
> > No, not a dataset comprising features, but a dataset comprising compounds.
> >
> > You could define an algorithm that takes a list of features as input and
> > creates a new dataset of compounds, if there is a meaningful way to do so
> > (e.g. for SMARTS-based features).  This means there will be no assumption
> > that features are compounds, but a documented service that does the
> > conversion in a known way.
>
> Why would we need to convert features to compounds?
>
> >
> >
> >>
> >> > curl -X POST /algorithm/features2dataset -d
> >> > "feature_uris[]=/model/mcss1/predicted"   ->
> >> > /dataset/newdatasetfromfeatures
> >> >
> >> > Then you are done, POST the dataset into other algorithms as usual.
> >>
> >> I am sorry but I could not understand how a dataset will be created in
> >> this case.
> >>
> >>
> > Set up an algorithm service which will read the features, find whether a
> > feature is a substructure, generate compounds for them (e.g. an SDF file)
> > and post the SDF content to a dataset service - thus it will create a
> > dataset.
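The conversion service outlined above could look roughly like this. The SMARTS-to-structure step is only stubbed; real code would use a cheminformatics toolkit such as the CDK, and all names here are hypothetical:

```python
# Hypothetical sketch of a features2dataset service's core logic:
# read features, keep the ones that carry a SMARTS pattern, and emit
# records that a dataset service could ingest (e.g. as an SDF file).

def features_to_dataset(features):
    """features: list of dicts with 'uri' and optionally 'smarts'.
    Returns the subset that can be turned into compound records."""
    convertible = []
    for f in features:
        smarts = f.get("smarts")
        if smarts:  # only SMARTS-based features map to structures
            # A real service would build a molecule from the SMARTS here
            # (e.g. via the CDK) and serialize it into SDF before POSTing.
            convertible.append({"feature": f["uri"], "structure": smarts})
    return convertible

records = features_to_dataset([
    {"uri": "/feature/mcss1", "smarts": "c1ccccc1"},
    {"uri": "/feature/hashed-bit-17"},  # no SMARTS: cannot be converted
])
print(len(records))
```

Note how hashed-fingerprint features drop out naturally, which matches Nina's point that only SMARTS-based features can be converted in a meaningful way.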
>
> Again, I do not see the requirement for such a service - I would
> like to capture "substructures" (as features), not convert them to
> compounds.
>
>
> >
> > As well for individual compounds.
> >
> >
>
> >
> > IMHO it is not easy to understand what "assigning" a feature set to a
> > dataset means.  It tells neither where the feature set came from, nor how
> > it is related to the dataset.  This is all loss of information, which
> > contributes to the poor reproducibility of any models.
>
> Yeah, a featuresetValue to capture that explicit value of the
> relationship is just what is needed here.
>
> > So far OpenTox has a very simple and logical API (yes, I have heard this
> > from external developers) - datasets are processed by algorithms/models
> > and written to datasets - that's all, basically.
> >
>
> And yet you want to build this complex logic of making a Model
> essentially to store a set of features?
>
> >
> >
> >
> > Just need a feature set, regardless of how it was obtained ... that's how
> > irreproducible cheminformatics models are born...
>
> Most industry chemoinformatics is quite irreproducible - but it is paid
> for and viable! Also the onus to reproduce the results is with the
> user - not with the datasets and algorithms in the services.
>
> >
> >
> > If you use a dummy dataset with dummy compounds, you are introducing a mess
> > into the dataset service, because those dummy compounds, and features which
> > are not really features of that compound, will appear as results of
> > searches hitting that compound.
>
> So essentially you are saying we cannot have any datasets with some
> feature values set to "false" or 0 to denote absence in a compound?
>
> >
> >> Ideally, however, I still maintain it's important to have featuresets.
> >>
> >
> > Feature sets alone are fine, see above.
>
> Then let's go for it. We have thought a lot about it, and the indirect
> methods suggested till now just seem to be an attempt to stonewall
> any big changes in the API - which, from the perspective of an API, is
> hara-kiri. Google Maps API released 3 major versions in 5 years. By
> comparison our API upgrades are just moving from 1.1 to 1.2 in 2
> years, with barely any changes ...
>
> >
> > A generic API means it can be applied to a great variety of problems.
> > Specific solutions introduce incompatibility.
> > It's a generic computer-science approach - try to abstract things, break
> > the larger problem into smaller pieces, find the commonalities.  That's
> > how IT works ...
>
> So are we doing chemoinformatics (vs just toxicity)?  Are we
> representing atomic features?  If both your answers are "no", then we
> can safely say we have not abstracted enough!
>
> >
> > Having your simple information represented in a way specific to your
> > problem doesn't make things compatible ... Now everybody can use
> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss to retrieve MCSS
> > structures from a dataset of their choice and then run them through any of
> > the Weka algorithms available from any partner.  Will your approach do
> > anything similar?
>
> I can't see a point to debate here ... I am looking for a generic
> solution for collecting features. This can be achieved quite simply
> by having a featureset and a featuresetValue (when assigning to a
> dataset) to explicitly capture the value of the relationship. The
> beauty is we do not need a "model" in the middle just to capture some
> explicit relationships.
>
>
> Regards
> Surajit
>
> > Regards,
> > Nina
> >
> >
> >>
> >> Regards
> >> Surajit
> >>
> >>
> >> > Well, my point of view is that an algorithm applied to specific data
> >> > with specific parameters should be considered a model (descriptor
> >> > calculations included).  An algorithm is just an abstract sequence of
> >> > steps; when one applies it to data with specific parameters, a model is
> >> > generated.  This will make the API much more consistent (now some
> >> > algorithms generate a model, and the results of other algorithms are a
> >> > dataset, which is quite confusing for external developers). But at this
> >> > point I am not insisting on changing the API that far ;)
> >> >
> >> > Regards,
> >> > Nina
> >> >
> >> >
> >> >
> >> >
> >> >>
> >> >>
> >> >> Regards
> >> >> Surajit
> >> >>
> >> >> > Regards,
> >> >> > Nina
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Cheers
> >> >> >> Surajit
> >> >> >>
> >> >> >>
> >> >> >> On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> >> >> >> > Dear Christoph, Surajit, All,
> >> >> >> >
> >> >> >> > This discussion is very useful.
> >> >> >> >
> >> >> >> > As a result of trying to understand both points of view myself, we
> >> >> >> > now have the MCSS algorithm as an ambit service (thanks to the CDK
> >> >> >> > SMSD package).
> >> >> >> >
> >> >> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
> >> >> >> >
> >> >> >> > It can be applied to a dataset and generates a model, where
> >> >> >> > predicted features (MCSS in this case) are available via
> >> >> >> > ot:predictedVariables (example:
> >> >> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted).
> >> >> >> > The features use the current API, without any change (although
> >> >> >> > having ot:Substructure as a subclass of ot:Feature would make it
> >> >> >> > clearer).
> >> >> >> >
> >> >> >> > All the MCSS substructures can be used by any learning algorithm,
> >> >> >> > as they are standard ot:Features.
> >> >> >> >
> >> >> >> > Here are more details and a proposal (start from the *Substructure
> >> >> >> > API proposal* heading):
> >> >> >> >
> >> >> >> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
> >> >> >> >
> >> >> >> > Best regards,
> >> >> >> > Nina
> >> >> >> >
> >> >> >> > P.S. Please note the /mcss algorithm might be slow for large
> >> >> >> > datasets; there are several performance improvements that we'll be
> >> >> >> > applying, but this will not change the API.
> >> >> >> >
> >> >> >> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch>
> >> >> wrote:
> >> >> >> >
> >> >> >> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
> >> >> >> >>
> >> >> >> >> > > This type of representation (we are using it internally) has
> >> >> >> >> > > served well for our datasets, which might contain several
> >> >> >> >> > > (10-100) thousand substructures for a few thousand compounds.
> >> >> >> >> > > I also do not think that the representation is redundant:
> >> >> >> >> > >        - each compound is represented once
> >> >> >> >> > >        - each substructure is represented once
> >> >> >> >> > >        - each association between compound and substructure is
> >> >> >> >> > >          represented once
> >> >> >> >> > > Please correct me, if I am missing something obvious.
> >> >> >> >> >
> >> >> >> >> > According to this representation, each dataEntry for a compound
> >> >> >> >> > will have to have all substructure features that were found in
> >> >> >> >> > it.  Therefore each dataEntry may have 1000-10000
> >> >> >> >> > feature/featureValue pairs.  For 500 data entries that means on
> >> >> >> >> > average 500*5000 (assuming 5000 substructures) = 2,500,000
> >> >> >> >> > feature/featureValue pairs - that's 2.5 million!
> >> >> >> >>
> >> >> >> >> In our case it is a lot less (not completely sure about your
> >> >> >> >> feature types), because only a very small subset of features
> >> >> >> >> occurs in a single compound.
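The worst-case arithmetic above versus the sparse reality Christoph describes can be made concrete. The 2% occupancy figure below is an illustrative assumption, not measured data:

```python
# Worst case from the thread: every compound carries every substructure.
compounds = 500
substructures = 5000
dense_pairs = compounds * substructures  # 2,500,000 feature/featureValue pairs
print(dense_pairs)

# Christoph's point: only features actually present in a compound are
# asserted.  With, say, 2% of substructures occurring per compound
# (an illustrative assumption), the exchanged representation shrinks a lot.
occupancy = 0.02
sparse_pairs = int(compounds * substructures * occupancy)
print(sparse_pairs)  # 50x fewer associations to serialize
```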
> >> >> >> >>
> >> >> >> >> > versus just having a featureset with 5000 feature entries. You
> >> >> >> >> > can imagine the difference in cost of bandwidth, computation
> >> >> >> >> > etc.
> >> >> >> >>
> >> >> >> >> I am not sure if I get you right, but where do you want to store
> >> >> >> >> the relationships between features and compounds? If there are
> >> >> >> >> really 2.5 million associations you have to assert them somewhere.
> >> >> >> >> And having features without compounds seems quite useless to me.
> >> >> >> >>
> >> >> >> >> > >
> >> >> >> >> > > Adding "false" occurrences would not violate the current API
> >> >> >> >> > > (but would add redundant information). Keep in mind that the
> >> >> >> >> > > dataset representation is mainly for exchanging datasets
> >> >> >> >> > > between services - internally you can use any data structure
> >> >> >> >> > > that is efficient for your purposes (we also do that in our
> >> >> >> >> > > services). So if you need fingerprints internally, extract
> >> >> >> >> > > them from the dataset.
> >> >> >> >> >
> >> >> >> >> > Internalizing an intermediate step completely serves the
> >> >> >> >> > purpose but leads to less flexible design paradigms. If we
> >> >> >> >> > internalize the workflow from substructure extraction to
> >> >> >> >> > fingerprinting - we will lose the ability to provide the data to
> >> >> >> >> > a third-party server for an independent workflow. Of course the
> >> >> >> >> > reasoning could be "who needs it?" - well, you never know!!
> >> >> >> >>
> >> >> >> >> I am very interested in exchanging "fingerprints" with other
> >> >> >> >> services, but that can be done already with the current API. I see
> >> >> >> >> fingerprints as sets of features that are present in a compound
> >> >> >> >> (also using set operations to calculate similarities), and find it
> >> >> >> >> fairly straightforward to parse/serialize them to/from datasets.
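The view of fingerprints as feature sets maps directly onto set operations. A small sketch of Tanimoto/Jaccard similarity over such sets (the feature names are made up for illustration):

```python
# Fingerprints as sets of present features, as described above.
# Similarity then falls out of plain set operations.
fp_a = {"mcss1", "mcss3", "smarts7"}
fp_b = {"mcss3", "smarts7", "smarts9"}

def tanimoto(a, b):
    """|A n B| / |A u B| - 1.0 for identical sets, 0.0 for disjoint ones."""
    if not (a or b):
        return 1.0  # two empty fingerprints are trivially identical
    return len(a & b) / len(a | b)

print(tanimoto(fp_a, fp_b))  # 2 shared features out of 4 total
```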
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > >> I still suggest having a FeatureSet/SubstructureSet type
> >> >> >> >> > >> object within the API to make it convenient to club features
> >> >> >> >> > >> without compound representations.
> >> >> >> >> > >
> >> >> >> >> > > I prefer to keep the API as generic as possible and not to
> >> >> >> >> > > introduce ad-hoc objects (or optimizations) for special
> >> >> >> >> > > purposes - otherwise it will be difficult to maintain services
> >> >> >> >> > > in the long term. Why don't you use ontologies for grouping
> >> >> >> >> > > features?
> >> >> >> >> >
> >> >> >> >> > Grouping features using ontologies is clubbing the features,
> >> >> >> >> > not the feature values.
> >> >> >> >>
> >> >> >> >> But you cannot have feature values without relating features to
> >> >> >> >> compounds. If you use the representation I proposed, feature
> >> >> >> >> values are "true" anyway.
> >> >> >> >>
> >> >> >> >> > So how do we know which compound the mcss3 occurring in
> >> >> >> >> > compound X is with respect to? As you said, we can have
> >> >> >> >> > arbitrary fields in the feature definitions (for MCSS) - but
> >> >> >> >> > that would be outside the API definitions.
> >> >> >> >>
> >> >> >> >> features:
> >> >> >> >>        mcss3:
> >> >> >> >>                ot:compounds:
> >> >> >> >>                        - compound2
> >> >> >> >>                        - compound3
> >> >> >> >>                ot:smarts: smarts3
> >> >> >> >>
> >> >> >> >> In my understanding you can add any annotation you want to a feature.
> >> >> >> >>
> >> >> >> >>
> >> >> >> > Yes, you can, but if this is not an agreed annotation, no other
> >> >> >> > service will understand it.
> >> >> >> >
> >> >> >> > Best regards,
> >> >> >> > Nina
> >> >> >> >
> >> >> >> >
> >> >> >> >>  Best regards,
> >> >> >> >> Christoph
> >> >> >> >> _______________________________________________
> >> >> >> >> Development mailing list
> >> >> >> >> Development at opentox.org
> >> >> >> >> http://www.opentox.org/mailman/listinfo/development
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Surajit Ray
> >> >> Partner
> >> >> www.rareindianart.com
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Surajit Ray
> >> Partner
> >> www.rareindianart.com
> >>
> >
>


