[OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms

surajit ray mr.surajit.ray at gmail.com
Thu Dec 2 14:56:04 CET 2010


Hi Nina,

Well, well ... I have suggested a method other than my previous "hack":
a featureset with a featuresetValue. Did you miss that part of my
mail?


On 2 December 2010 19:14, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> Hi Surajit,
>
>
> "Most industry chemoinformatics is quite irreproducible - but it is
> paid for and viable! Also, the onus to reproduce the results is with
> the user - not with the datasets and algorithms in the services."
>
>
> This is exactly what we are struggling to overcome in OpenTox - thus, if we
> don't agree on this point, I don't see the point of the discussion and will
> leave the consensus for others.

Are we? Has our focus shifted from providing REACH-compliant toxicity
prediction to generic chemoinformatics abstractions and automatic
reproducibility (which, in any case, we are still very far from)?

A comparable example: in a wet lab, the onus to reproduce the results
is with the researcher, not with the test tubes, pipettes and petri
dishes.

> Having a hack for something which is demonstrated to have a solution with
> the current API doesn't make sense to me (IMHO, IMHO).

Please check the rest of the previous mail for my suggested solution.

And your solution, although ingenious and well thought out, introduces
unnecessarily complicated machinery for something that could very
easily be represented more simply.

So the question, after all this, is: are we open to "any" changes, or
are we putting a cap on the API at 1.2? If it is the latter, then you
are actually setting the stage for all manner of hacks. Ours was just
one use case that does not fit the present API easily; there will be
more. And, by the way, making a model to store features is also a
"hack" - albeit a clever one.

Regards
Surajit

> Regards,
> Nina
>
> On 2 December 2010 15:34, surajit ray <mr.surajit.ray at gmail.com> wrote:
>
>> Hi Nina,
>>
>> To organise the discussions better, I have created a new page to
>> capture the discussion on featuresets.
>> http://opentox.org/dev/apis/api-1.2/featureset-and-workarounds
>>
>> I have moved your discussion points onto this page from the
>> substructure wishlist page.
>>
>> On 30 November 2010 18:43, Nina Jeliazkova <jeliazkova.nina at gmail.com>
>> wrote:
>> > No problem to extend the API to be able to group features. (In fact we
>> > have this implemented, even with hierarchical grouping:
>> > http://apps.ideaconsult.net:8080/ambit2/template/Taxonomy - it turns
>> > out to be quite useful for the ToXML representation.) This could be
>> > documented and included in the API.
>>
>> Yes, let's have featuresets, please.
>>
>> > IMHO it makes no sense to assign a feature or a featureset to a
>> > dataset without specifying what the relationship between the dataset
>> > and the features is. This is perfectly served by the algorithm/model
>> > approach so far.
>> >
>>
>> We can capture an explicit relationship in a "FeaturesetValue" every
>> time we assign a "Featureset" to a dataset. It is explicit, simpler,
>> and we can even put the URI of the creating algorithm in the
>> FeaturesetValue.
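A minimal sketch of what the proposed pair could carry (Python; all class names and URIs below are hypothetical - nothing like this exists in API 1.1/1.2):

```python
from dataclasses import dataclass

@dataclass
class Featureset:
    # Hypothetical resource that groups existing feature URIs.
    uri: str
    feature_uris: list

@dataclass
class FeaturesetValue:
    # Hypothetical link object: records *why* a featureset is attached
    # to a dataset, including the URI of the creating algorithm.
    featureset_uri: str
    dataset_uri: str
    creating_algorithm_uri: str
    relationship: str  # e.g. "mcss"

fs = Featureset("http://example.org/featureset/1",
                ["http://example.org/feature/mcss1",
                 "http://example.org/feature/mcss2"])
link = FeaturesetValue(fs.uri, "http://example.org/dataset/42",
                       "http://example.org/algorithm/mcss", "mcss")
```

The point of the sketch is that the provenance (the creating algorithm) lives on the link object, not on the dataset or the features themselves.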
>>
>> > The dummy dataset suggestion is a hack, which lacks consistency, and
>> > I am not in favour of it.
>>
>> I guess the same can be said of Christoph's method of assigning
>> substructures to compounds just to capture the substructure set in
>> one dataset. On the flip side, every API that I have worked with
>> (Google Maps API, Facebook API, Facebook Graph API, Flex/Flash) has a
>> "hack" which became the norm. IMHO, in this case though, the best
>> solution is to have a Featureset with a FeaturesetValue to explicitly
>> outline the relationship to the dataset.
>>
>>
>> >
>> > See above for sets of features.  What we would like to have more than
>> other
>> > libraries in OpenTox is to be able to tell how these features have been
>> > calculated.
>> >
>> > The CDK fingerprinter (if you mean
>> > org.openscience.cdk.fingerprint.Fingerprinter) is not a good example
>> > here, since it uses hashed fingerprints, which are almost impossible
>> > to translate to SMARTS.
>> >
>> > The CDK does a very good job of specifying descriptor metadata via
>> > an ontology, but this is not (yet?) done for fingerprinting (as far
>> > as I know), although the fingerprinter algorithm could be included in
>> > the BlueObelisk or ChemInf ontology the same way descriptor
>> > algorithms are.
>> >
>> >
>>
>> >
>> > Sorry, this was my impression from earlier discussions.
>> >
>> > Look at my examples, this is exactly what comes from the MCSS model.
>> >
>> > Having single compound with substructures, assigned as features is
>> > inconsistent for the following reason.
>> >
>> > The meaning of the set of substructures is that they have been obtained
>> by
>> > MCSS (or fingerprinting algorithm) , and are MCSS structures for the
>> entire
>> > dataset.  Assigning them to a single dummy compound means all this
>> > information is lost.
>>
>> With respect to the hack: yes, we lose the information, and yet it is
>> many times simpler than creating a model just to represent a set of
>> features (substructures).
>>
>> A featureset with a featuresetValue solves this problem well, without
>> resorting to a needless model-building step or the "hack".
>> >
>>
>> >
>> > Having fragments submitted to another fingerprinter algorithm, which
>> > by definition works on whole compounds, is essentially mixing
>> > substructures with compounds. What if the fingerprinter algorithm
>> > starts to normalize the fragments as if they were compounds?
>>
>> The fingerprinter in this case is going to take two inputs: a dataset
>> and a featureset (substructure set). Again, where is the question of
>> mixing the two?
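A rough sketch of such a two-input fingerprinter (Python; the substring test on SMILES strings below is only a stand-in for real SMARTS matching via a chemistry toolkit, and all URIs are invented):

```python
def fingerprint(dataset, featureset):
    """Sketch of a fingerprinter taking two separate inputs: a dataset
    of compounds and a featureset of substructures.  Plain substring
    matching on SMILES stands in for real SMARTS matching."""
    result = {}
    for compound_uri, smiles in dataset.items():
        result[compound_uri] = {
            feature_uri: pattern in smiles  # stand-in for a SMARTS match
            for feature_uri, pattern in featureset.items()
        }
    return result

dataset = {"/compound/1": "c1ccccc1O", "/compound/2": "CCO"}
featureset = {"/feature/benzene": "c1ccccc1", "/feature/cc": "CC"}
fp = fingerprint(dataset, featureset)
```

The two inputs never get mixed: compounds stay in the dataset, substructures stay in the featureset, and the output relates the two.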
>>
>> >
>>
>> >>
>> >
>> > curl -X GET /model/id/predicted  gives you list of features, URLs or RDF
>> > .There is no need to extract anything.  We could easily add a new mime
>> type
>> > to support SMARTS for feature representation (whenever relevant) and
>> you'll
>> > get list of smarts by something like
>> >
>> > curl -X GET -H "Accept:chemical/x-smarts"  /model/id/predicted
>> >
>> > (Hm,  is there MIME format for SMARTS )
>> >
>> > Besides, the current scheme supports ANY kind of fingerprinter,
>> > regardless of whether it extracts fragments in the form of
>> > SMILES/SMARTS or just reports some encoded strings (as the PubChem
>> > fingerprinter) or un-interpretable bits (as hashed fingerprints).
>>
>> A SMARTS MIME type on the model/id/predicted URL can provide a list
>> of features, but it is a non-generic way of representing the set of
>> features. It is also non-generic to expect a fingerprinting algorithm
>> to take a model/id/predicted as input, especially since the "model"
>> may not have any relationship with the fingerprinter.
>>
>>
>> >
>> > No, not a dataset comprising features, but a dataset comprising
>> > compounds.
>> >
>> > You could define an algorithm that takes a list of features as input
>> > and produces a new dataset of compounds, if there is a meaningful way
>> > to do so (e.g. for SMARTS-based features). This means there would be
>> > no assumption that features are compounds, but a documented service
>> > that does the conversion in a known way.
>>
>> Why would we need to convert features to compounds?
>>
>> >
>> >
>> >>
>> >> > curl -X POST /algorithm/features2dataset -d
>> >> > "feature_uris[]=/model/mcss1/predicted"   ->
>> >> > /dataset/newdatasetfromfeatures
>> >> >
>> >> > Then you are done, POST the dataset into other algorithms as usual.
>> >>
>> >> I am sorry but I could not understand how a dataset will be created in
>> >> this case.
>> >>
>> >>
>> > Set up an algorithm service, which will read the features, find
>> > whether a feature is a substructure, generate compounds for them
>> > (e.g. an SDF file) and post the SDF content to a dataset service -
>> > thus it will create a dataset.
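A sketch of the conversion service described above (Python; `smarts_to_molblock` and `post_dataset` are hypothetical stand-ins, injected here so the chemistry conversion and the dataset-service POST stay out of the sketch):

```python
def features_to_dataset(feature_records, smarts_to_molblock, post_dataset):
    # feature_records: iterable of (feature_uri, metadata_dict) pairs.
    # Only features carrying an ot:smarts annotation count as
    # substructures; anything else (e.g. a numeric descriptor) is skipped.
    molblocks = []
    for feature_uri, meta in feature_records:
        smarts = meta.get("ot:smarts")
        if smarts:
            molblocks.append(smarts_to_molblock(smarts))
    sdf = "$$$$\n".join(molblocks)  # assemble an SDF-style payload
    return post_dataset(sdf)        # dataset service returns the new URI

# Usage with stub callables standing in for the real services:
captured = {}
def stub_convert(smarts):
    return "M  " + smarts + "\n"
def stub_post(sdf):
    captured["sdf"] = sdf
    return "/dataset/new"

uri = features_to_dataset(
    [("/feature/1", {"ot:smarts": "c1ccccc1"}),
     ("/feature/2", {"title": "logP"}),
     ("/feature/3", {"ot:smarts": "CC"})],
    stub_convert, stub_post)
```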
>>
>> Again, I do not see the requirement for such a service. I would like
>> to capture "substructures" (as features), not convert them to
>> compounds.
>>
>>
>> >
>> > As well for individual compounds.
>> >
>> >
>>
>> >
>> > IMHO it is not easy to understand what "assigning" a feature set to
>> > a dataset means. It tells neither where the feature set came from,
>> > nor how it is related to a dataset. This is all loss of information,
>> > which contributes to the poor reproducibility of any models.
>>
>> Yes, a featuresetValue to capture that explicit value of the
>> relationship is just what is needed here.
>>
>> > So far OpenTox has a very simple and logical API (yes, I have heard this
>> > from external developers) - Datasets are processed by algorithms/models
>> and
>> > written to datasets - that's all basically.
>> >
>>
>> And yet you want to build this complex logic of making a Model
>> essentially to store a set of features?
>>
>> >
>> >
>> >
>> > Just need a feature set , regardless of how it was obtained ... that's
>> how
>> > irreproducible cheminformatics models are born...
>>
>> Most industry chemoinformatics is quite irreproducible - but it is
>> paid for and viable! Also, the onus to reproduce the results is with
>> the user - not with the datasets and algorithms in the services.
>>
>> >
>> >
>> > If you use dummy dataset with dummy compounds, you are introducing a mess
>> > into datasets service.  Because those dummy compounds and features, which
>> > are not really  features for that compound will appear as a result of
>> > searches , hitting that compound.
>>
>> So essentially you are saying we cannot have datasets with some
>> feature values set to "false" or 0 to denote absence in a compound?
>>
>> >
>> >> Ideally however I still maintain its important to have featuresets.
>> >>
>> >
>> > Feature sets alone are fine, see above.
>>
>> Then let's go for it. We have thought a lot about this, and the
>> indirect methods suggested so far just seem to be an attempt to
>> stonewall any big changes in the API. From the perspective of an API,
>> that is hara-kiri. The Google Maps API released 3 major versions in 5
>> years; by comparison, our API upgrades amount to moving from 1.1 to
>> 1.2 in 2 years, with barely any changes.
>>
>> >
>> > A generic API means it can be applied to a great variety of
>> > problems; specific solutions introduce incompatibility. It's a
>> > generic computer science approach - try to abstract things, break
>> > the larger problem into smaller pieces, find the commonalities.
>> > That's how IT works ...
>>
>> So are we doing chemoinformatics (vs. just toxicity)? Are we
>> representing atomic features? If both your answers are "no", then we
>> can safely say we have not abstracted enough!
>>
>> >
>> > Having your simple information represented in a way specific to your
>> > problem doesn't make things compatible ... Now everybody can use
>> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss to retrieve
>> > MCSS structures from a dataset of their choice and then run them
>> > through any of the Weka algorithms made available by any partner.
>> > Will your approach do anything similar?
>>
>> I can't see a point to debate here. I am looking for a generic
>> solution for collecting features. This can be achieved quite simply
>> by having a featureset and a featuresetValue (when assigning to a
>> dataset) to explicitly capture the value of the relationship. The
>> beauty is that we do not need a "model" in the middle just to capture
>> some explicit relationships.
>>
>>
>> Regards
>> Surajit
>>
>> > Regards,
>> > Nina
>> >
>> >
>> >>
>> >> Regards
>> >> Surajit
>> >>
>> >>
>> >> > Well, my point of view is that an algorithm applied to specific
>> >> > data with specific parameters should be considered a model
>> >> > (descriptor calculations included). An algorithm is just an
>> >> > abstract sequence of steps; when one applies it to data with
>> >> > specific parameters, a model is generated. This would make the API
>> >> > much more consistent (now some algorithms generate a model, while
>> >> > the results of other algorithms are a dataset, which is quite
>> >> > confusing for external developers). But at this point I am not
>> >> > insisting on changing the API that far ;)
>> >> >
>> >> > Regards,
>> >> > Nina
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> Regards
>> >> >> Surajit
>> >> >>
>> >> >> > Regards,
>> >> >> > Nina
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> Cheers
>> >> >> >> Surajit
>> >> >> >>
>> >> >> >>
>> >> >> >> On 29 November 2010 14:05, Nina Jeliazkova <
>> >> jeliazkova.nina at gmail.com>
>> >> >> >> wrote:
>> >> >> >> > Dear Christoph, Surajit, All,
>> >> >> >> >
>> >> >> >> > This discussion is very useful.
>> >> >> >> >
>> >> >> >> > As a result of myself trying to understand both points of view,
>> >>  now
>> >> >> we
>> >> >> >> have
>> >> >> >> > MCSS algorithm as ambit service  (thanks to CDK SMSD package).
>> >> >> >> >
>> >> >> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
>> >> >> >> >
>> >> >> >> > It can be applied to a dataset and generates a model, where
>> >> predicted
>> >> >> >> > features (MCSS in this case) are available via
>> >> ot:predictedVariables
>> >> >> >> > (example
>> >> >> https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted)
>> >> >> >> > The features use current API, without any change (although
>> having
>> >> >> >> > ot:Substructure subclass of ot:Feature will make it more clear).
>> >> >> >> >
>> >> >> >> > All the MCSS substructures can be used by any learning algorithm
>> ,
>> >> as
>> >> >> >> they
>> >> >> >> > are standard ot:Features.
>> >> >> >> >
>> >> >> >> > Here are more details and proposal (start from *Substructure API
>> >> >> proposal
>> >> >> >> > heading *)
>> >> >> >> >
>> >> >> >> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
>> >> >> >> >
>> >> >> >> > Best regards,
>> >> >> >> > Nina
>> >> >> >> >
>> >> >> >> > P.S. Please note the /mcss algorithm might be slow for large
>> >> datasets,
>> >> >> >> there
>> >> >> >> > are several improvements that we'll be applying  performance
>> wise,
>> >> but
>> >> >> >> this
>> >> >> >> > will not change the API .
>> >> >> >> >
>> >> >> >> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch>
>> >> >> wrote:
>> >> >> >> >
>> >> >> >> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19
>> +0100
>> >> >> 2010:
>> >> >> >> >>
>> >> >> >> >> > > This type of representation (we are using it internally)
>> has
>> >> >> served
>> >> >> >> >> well
>> >> >> >> >> > > for our datasets which might contain also several (10-100)
>> >> >> thousand
>> >> >> >> >> > > substructures for a few thousands compounds. I also do not
>> >> think,
>> >> >> >> that
>> >> >> >> >> > > the representation is redundant:
>> >> >> >> >> > >        - each compound is represented once
>> >> >> >> >> > >        - each substructure is represented once
>> >> >> >> >> > >        - each association between compound and substructure
>> is
>> >> >> >> >> represented once
>> >> >> >> >> > > Please correct me, if I am missing something obvious.
>> >> >> >> >> >
>> >> >> >> >> > According to this representation, each dataEntry for a
>> >> >> >> >> > compound will have to have all the substructure features
>> >> >> >> >> > that were found in it. Therefore each dataEntry may have
>> >> >> >> >> > 1000-10000 feature/featureValue pairs. For 500 dataEntries
>> >> >> >> >> > that means, on average, 500*5000 (assuming 5000
>> >> >> >> >> > substructures) = 2,500,000 feature/featureValue pairs -
>> >> >> >> >> > that's 2.5 million!
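The arithmetic above, with a sparse alternative for comparison (the 50-features-per-compound sparsity figure is an assumption for illustration, not a number from the mail):

```python
# Dense: one feature/featureValue pair asserted for every
# (compound, substructure) cell, present or absent.
compounds = 500
substructures = 5000
dense_pairs = compounds * substructures  # 2,500,000, as in the mail

# Sparse (as in Christoph's representation): only substructures
# actually present in a compound are asserted.  Assumed sparsity:
present_per_compound = 50
sparse_pairs = compounds * present_per_compound  # 25,000
```

Under that sparsity assumption the dense representation asserts 100 times as many pairs, which is the crux of the disagreement here.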
>> >> >> >> >>
>> >> >> >> >> In our case it is a lot less (not completely sure about your
>> >> feature
>> >> >> >> >> types), because only a very small subset of features occurs in
>> a
>> >> >> single
>> >> >> >> >> compound.
>> >> >> >> >>
>> >> >> >> >> > versus just having a featureset with a
>> >> >> >> >> > 5000 feature entries. You can imagine the difference in cost
>> of
>> >> >> >> >> > bandwidth,computation etc.
>> >> >> >> >>
>> >> >> >> >> I am not sure, if I get you right, but where do you want to
>> store
>> >> the
>> >> >> >> >> relationships between features and compounds? If there are
>> really
>> >> 2.5
>> >> >> >> >> million associations you have to assert them somewhere. And
>> having
>> >> >> >> features
>> >> >> >> >> without compounds seems to be quite useless for me.
>> >> >> >> >>
>> >> >> >> >> > >
>> >> >> >> >> > > Adding "false" occurences would not violate the current API
>> >> (but
>> >> >> >> would
>> >> >> >> >> > > add redundant information). Keep in mind that the dataset
>> >> >> >> >> representation
>> >> >> >> >> > > is mainly for exchanging datasets between services -
>> >> internally
>> >> >> you
>> >> >> >> can
>> >> >> >> >> > > use any datastructure that is efficient for your purposes
>> (we
>> >> >> also
>> >> >> >> do
>> >> >> >> >> > > that in our services). So if you need fingerprints
>> internally,
>> >> >> >> extract
>> >> >> >> >> > > them from the dataset.
>> >> >> >> >> >
>> >> >> >> >> > Internalizing an intermediate step completely serves the
>> purpose
>> >> >> but
>> >> >> >> >> > leads to less flexible design paradigms. If we internalize
>> the
>> >> >> >> >> > workflow from substructure extraction to fingerprinting - we
>> >> will
>> >> >> lose
>> >> >> >> >> > the ability to provide the data to a third party server for
>> an
>> >> >> >> >> > independent workflow. Of course the reasoning could be "who
>> >> needs
>> >> >> it
>> >> >> >> >> > ?" - well you never know !!
>> >> >> >> >>
>> >> >> >> >> I am very interested in exchanging "fingerprints" with other
>> >> >> services,
>> >> >> >> >> but that can be done already with the current API. I see
>> >> fingerprints
>> >> >> as
>> >> >> >> >> sets of features that are present in a compound (also using set
>> >> >> >> >> operations to calculate similarities), and find it fairly
>> >> >> >> >> straightforward to parse/serialize them to/from datasets.
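That set-based view can be sketched directly; the Tanimoto/Jaccard coefficient below is one standard set-based similarity, though the mail does not commit to a specific measure:

```python
def tanimoto(features_a, features_b):
    """Similarity of two compounds, each represented as the set of
    features present in it: |A & B| / |A | B| (Tanimoto/Jaccard)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / len(a | b)

# Two compounds sharing 2 of 4 distinct features:
sim = tanimoto({"f1", "f2", "f3"}, {"f2", "f3", "f4"})  # 2/4 = 0.5
```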
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> > >> I still suggest having a FeatureSet/SubstructureSet type
>> >> object
>> >> >> >> within
>> >> >> >> >> > >> the API to make it convenient to club features without
>> >> compound
>> >> >> >> >> > >> representations.
>> >> >> >> >> > >
>> >> >> >> >> > > I prefer to keep the API as generic as possible and not to
>> >> >> introduce
>> >> >> >> >> > > ad-hoc objects (or optimizations) for special purposes -
>> >> >> otherwise
>> >> >> >> it
>> >> >> >> >> > > will be difficult to maintain services in the long term.
>> Why
>> >> >> don't
>> >> >> >> you
>> >> >> >> >> > > use ontologies for grouping features?
>> >> >> >> >> >
>> >> >> >> >> > Grouping features using ontologies clubs the features,
>> >> >> >> >> > not the feature values.
>> >> >> >> >>
>> >> >> >> >> But you cannot have feature values without relating features to
>> >> >> >> >> compounds. If you use the representation I proposed feature
>> values
>> >> >> are
>> >> >> >> >> "true" anyway.
>> >> >> >> >>
>> >> >> >> >> > So how do we know with respect to which compound mcss3
>> >> >> >> >> > occurs in compound X? As you said, we can have arbitrary
>> >> >> >> >> > fields in the feature definitions (for MCSS) - but that
>> >> >> >> >> > would be outside the API definitions.
>> >> >> >> >>
>> >> >> >> >> features:
>> >> >> >> >>        mcss3:
>> >> >> >> >>                ot:compounds:
>> >> >> >> >>                        - compound2
>> >> >> >> >>                        - compound3
>> >> >> >> >>                ot:smarts: smarts3
>> >> >> >> >>
>> >> >> >> >> In my understanding you can add any annotation you want to a
>> >> feature.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> > Yes, you can, but if this is not an agreed annotation,  no other
>> >> >> service
>> >> >> >> > will understand it.
>> >> >> >> >
>> >> >> >> > Best regards,
>> >> >> >> > Nina
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >>  Best regards,
>> >> >> >> >> Christoph
>> >> >> >> >> _______________________________________________
>> >> >> >> >> Development mailing list
>> >> >> >> >> Development at opentox.org
>> >> >> >> >> http://www.opentox.org/mailman/listinfo/development
>> >> >> >> >>
>> >> >> >> >
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Surajit Ray
>> >> >> Partner
>> >> >> www.rareindianart.com
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Surajit Ray
>> >> Partner
>> >> www.rareindianart.com
>> >>
>> >
>>
>



-- 
Surajit Ray
Partner
www.rareindianart.com


