[OTDev] Datasets with Features for multi entity relationships - Models & Algorithms

Nina Jeliazkova jeliazkova.nina at gmail.com
Tue Nov 30 14:13:33 CET 2010


Hi,


> >>
> >>
> >> On 30 November 2010 14:15, Nina Jeliazkova <jeliazkova.nina at gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> >
> >> > On 30 November 2010 10:29, surajit ray <mr.surajit.ray at gmail.com>
> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> Another method that does not break the API as well as captures
> feature
> >> >> sets is to create a dataset with one compound (maybe a C or CC) and
> >> >> assign all the substructure features to it (with value as false or
> >> >> true). In the dc:source of the dataset we can mention the dataset
> from
> >> >> which it was derived. And in the description we can describe it as a
> >> >> dataset to store MCSS features from dataset (or whatever the
> >> >> relationship with the mother dataset).
> >> >>
> >> >> I think this would be a simpler method than creating a new Model just
> >> >> for storing substructures.
> >> >
> >> >
> >> > It might seem simpler, but is definitely less consistent, as it
> implies
> >> > different meaning of dataset and properties and their relationships.
> >>  There
> >> > will be no explicit relationship to the algorithm/model , doing the
> >> > processing, which makes MCSS a specific case and breaks OpenTox API ,
> >> where
> >> > algorithms and models are the procedures, that process data, and this
> is
> >> > explicitly stored in the generated data objects.
> >>
> >> The relationship is defined in dc:description of the "featureset". It
> >> is explicit. Secondly a reference to the algorithm which generated
> >> this can also be stored in the description.
> >>
> >
> >
> > dc:description is an annotation property and does not define any
> > relationship between classes.
>
> Agreed
>
> > Besides, this breaks OpenTox API, as it differs from the way other
> > relationships are defined.
>
> We are discussing also - enhancing the API. The enhancement we were
> seeking was about having featuresets assignable to datasets. In the
> absence of which having a dummy dataset with a single compound (not a
> fragment) - with the substructures assigned as features can be
> considered a feature set.
>
No problem to extend the API to be able to group features. (In fact we have
this implemented, even with hierarchical grouping
http://apps.ideaconsult.net:8080/ambit2/template/Taxonomy , which turns out to
be quite useful for the ToXML representation.) This could be documented and
included in the API.

IMHO there is no sense in assigning a feature or a feature set to a dataset
without specifying what the relationship between the dataset and the features
is.  This is served perfectly well by the algorithm/model approach so far.
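A minimal sketch may make the provenance point concrete; the classes and URIs below are hypothetical illustrations, not the actual OpenTox implementation:

```python
from dataclasses import dataclass, field
from typing import List

# Sketch (hypothetical classes): a Model records which Algorithm produced it
# and from which Dataset, so every predicted Feature keeps its provenance.

@dataclass
class Feature:
    uri: str
    smarts: str  # e.g. a substructure pattern

@dataclass
class Model:
    uri: str
    algorithm_uri: str         # the procedure that produced the features
    training_dataset_uri: str  # the data it was applied to
    predicted: List[Feature] = field(default_factory=list)

mcss = Model(
    uri="/model/mcss1",
    algorithm_uri="/algorithm/mcss",
    training_dataset_uri="/dataset/1",
    predicted=[Feature("/feature/101", "c1ccccc1")],
)

# The relationship is explicit: from any set of predicted features we can
# trace how it was calculated and from which dataset it was derived.
assert mcss.algorithm_uri == "/algorithm/mcss"
assert mcss.predicted[0].smarts == "c1ccccc1"
```

A feature set "assigned" to a dataset with no model in between would carry neither of these two links.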

The dummy dataset suggestion is a hack which lacks consistency, and I am not
in favour of it.



> >>
> >> >
> >> > With the current scheme, it is easy to handle algorithms like Kabsch
> >> > alignment for a dataset with the same  generic mechanism as for MCSS
> (I
> >> am
> >> > sure there will be more cases like this). I don't see the point of
> >> inventing
> >> > specific solution for a single case , while it could be handled in a
> >> generic
> >> > way (agree with earlier comment by Christoph on that ).
> >>
> >> This is not a specific solution but a very general one - one which
> >> addresses a basic need within any chemistry api - which is to
> >> represent sets of features independently of compounds.
> >>
> >
> > We have features(ot:Feature)  independent of compounds (ot:Compound) .
> What
> > makes most sense in modeling , is to have relationship between features
> and
> > compounds (the values).
>
> Having sets of features also makes perfect sense.
> Numerous
> fingerprinting libraries carry sets of smarts for features without
> them being assigned to any compound. The CDK fingerprinter also uses
> such sets.
>

See above for sets of features.  What we would like to have in OpenTox, beyond
what other libraries offer, is the ability to tell how these features have
been calculated.

The CDK fingerprinter (if you mean
org.openscience.cdk.fingerprint.Fingerprinter) is not a good example here,
since it uses hashed fingerprints, which are almost impossible to translate
back to SMARTS.
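The information loss can be illustrated in a few lines. This is only a toy stand-in (the real CDK Fingerprinter hashes DFS atom paths over a much larger bit space), but the many-to-one mapping is the same in kind:

```python
import zlib
from collections import defaultdict

# Toy hashed fingerprint: each atom path is reduced to a bit position.
FP_SIZE = 4  # deliberately tiny, to force collisions

def bit_for(path: str) -> int:
    # deterministic stand-in for the fingerprinter's hash function
    return zlib.crc32(path.encode()) % FP_SIZE

paths = ["C-C-O", "C=C-N", "c:c:c", "C-N-C", "O=C-O", "C-C-C-C"]
fingerprint = {bit_for(p) for p in paths}

# Which paths set each bit?  With more paths than bits, distinct paths
# must share bits (pigeonhole), so a set bit identifies no unique
# substructure - the SMARTS information is gone.
owners = defaultdict(list)
for p in paths:
    owners[bit_for(p)].append(p)

assert len(paths) > FP_SIZE
assert any(len(ps) > 1 for ps in owners.values())
```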

The CDK does a very good job of specifying descriptor metadata via an
ontology, but this is not (yet?) done for fingerprints (as far as I know),
although a fingerprinter algorithm could be included in the BlueObelisk or
ChemInf ontology the same way descriptor algorithms are.


>
> > What you are implying is that substructures are both features and
> compounds
> > - which they are not, and mixing them is leading to errors and confusion.
>
> I have not implied that. What I have implied is having a single
> COMPOUND in the dataset with the substructures assigned as features.
>

Sorry, this was my impression from earlier discussions.

Look at my examples, this is exactly what comes from the MCSS model.

Having a single compound with substructures assigned as features is
inconsistent for the following reason.

The meaning of the set of substructures is that they have been obtained by
MCSS (or a fingerprinting algorithm) and are the MCSS structures for the
entire dataset.  Assigning them to a single dummy compound means all this
information is lost.


>
> > If you have substructure "C" and use it for SMARTS searching, it will
> look
> > for a single carbon atom.  If you have a compound , defined by smiles "C"
> ,
> > it implies CH4 , which is different.  Mixing both is not a good idea,
> making
> > the difference explicit makes harder to misinterpret things.
> >
>
> Wheres the question of mixing ?
>

Submitting fragments to another fingerprinter algorithm, which by definition
works on whole compounds, is essentially mixing substructures with compounds.
What if the fingerprinter algorithm starts to normalize the fragments as if
they were compounds?
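The "C" ambiguity above can be made concrete. This toy normalizer (hypothetical, not a real SMILES parser) shows what a service would silently do if it treated fragments as compounds:

```python
# A SMILES compound gets implicit hydrogens up to standard valence, while a
# SMARTS pattern "C" matches any aliphatic carbon atom in any bonding context.
# (Toy model; a real toolkit handles charges, aromaticity, etc.)

STANDARD_VALENCE = {"C": 4, "N": 3, "O": 2}

def implicit_hydrogens(element: str, explicit_bonds: int = 0) -> int:
    """Hydrogens a SMILES reader would add to an atom."""
    return STANDARD_VALENCE[element] - explicit_bonds

# The SMILES compound "C" is really CH4:
assert implicit_hydrogens("C") == 4

# The SMARTS pattern "C" asserts nothing about hydrogens - it matches the
# carbon in ethanol ("CCO") just as well as the one in methane.  A service
# that "normalizes" the fragment "C" into CH4 has changed its meaning.
```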


>
> >>
> >> > Besides, the model is definitely not for just storing substructures,
> it
> >> can
> >> > and will be used  for predictions of new compounds (if they have those
> >> > substructures ) in an uniform way  (POST a new compound to the MCSS
> model
> >> > and you'll get if its MCSS substructures are one of existing ones, or
> it
> >> is
> >> > different and far way from that dataset).
> >>
> >> What if I have a better graph comparator algorithm for fingerprinting
> >> - will that take a model as an input just to extract features ?
> >>
> >
> > No, define  /algorithm/myfingerprint  , which takes feature_uris[] as
> input
> > parameters
>
> So some other process has to go to the model and extract the features
> from its set of predicted variables and then supply that information
> to the fingerprinter. IHMO its simpler to just supply a featureset(of
> substructures/smarts) to the fingerprinter.
>

curl -X GET /model/id/predicted  gives you the list of features, as URLs or
RDF.  There is no need to extract anything.  We could easily add a new MIME
type to support SMARTS as a feature representation (whenever relevant), and
you would get the list of SMARTS with something like

curl -X GET -H "Accept:chemical/x-smarts"  /model/id/predicted

(Hm, is there a MIME type for SMARTS?)

Besides, the current scheme supports ANY kind of fingerprinter, regardless of
whether it extracts fragments in the form of SMILES/SMARTS, reports encoded
strings (like the PubChem fingerprinter), or produces uninterpretable bits
(like hashed fingerprints).
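The client side of that content negotiation is trivial to sketch. Note that "chemical/x-smarts" is speculative (per the parenthetical above, no registered MIME type for SMARTS is known), and the server behaviour is an assumption:

```python
# Build (not send) the request a client would issue to fetch a model's
# predicted features in a chosen representation via the Accept header.

def predicted_features_request(model_uri: str,
                               accept: str = "application/rdf+xml") -> dict:
    return {
        "method": "GET",
        "url": f"{model_uri}/predicted",
        "headers": {"Accept": accept},
    }

req = predicted_features_request("/model/mcss1", accept="chemical/x-smarts")
assert req["url"] == "/model/mcss1/predicted"
assert req["headers"]["Accept"] == "chemical/x-smarts"
```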



>
> > curl -X POST /algorithm/myfingerprint -d
> > "feature_uris[]=/model/mcss1/predicted"
> >
> > This will work with any set of features.
> >
> >
> > Even better, if you would like to convert features to a dataset,  define
> a
> > converter algorithm, which takes features, verifies if they are
> > substructures and generates a dataset.
>
> A dataset comprising of what compounds ? You mean a dataset comprising
> just of features - I thought that was not possible ?
>

No, not a dataset comprising features, but a dataset comprising compounds.

You could define an algorithm that takes a list of features as input and
generates a new dataset of compounds, if there is a meaningful way to do so
(e.g. for SMARTS-based features).  This means there is no assumption that
features are compounds, but a documented service that does the conversion in
a known way.


>
> > curl -X POST /algorithm/features2dataset -d
> > "feature_uris[]=/model/mcss1/predicted"   ->
> > /dataset/newdatasetfromfeatures
> >
> > Then you are done, POST the dataset into other algorithms as usual.
>
> I am sorry but I could not understand how a dataset will be created in
> this case.
>
>
Set up an algorithm service that reads the features, determines whether each
feature is a substructure, generates compounds for them (e.g. an SDF file),
and posts the SDF content to a dataset service - thus it creates a dataset.
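Such a converter could look roughly like this. The feature fields and the to_molblock helper are hypothetical, and a real service would use a chemistry toolkit to generate valid V2000 molblocks:

```python
# Hedged sketch of a hypothetical /algorithm/features2dataset service:
# keep the features that are substructures with a structural representation,
# emit an SDF, and POST it to the dataset service.

def to_molblock(smiles: str) -> str:
    """Stub: a real implementation would generate a proper V2000 block."""
    return f"{smiles}\n  (molblock placeholder)\nM  END"

def features_to_sdf(features) -> str:
    records = [to_molblock(f["smiles"])
               for f in features
               if f.get("type") == "substructure" and "smiles" in f]
    return "\n$$$$\n".join(records) + "\n$$$$\n" if records else ""

features = [
    {"uri": "/feature/1", "type": "substructure", "smiles": "c1ccccc1"},
    {"uri": "/feature/2", "type": "hashed-bit"},  # skipped: not convertible
]
sdf = features_to_sdf(features)
assert sdf.count("$$$$") == 1  # one compound in the new dataset

# The service would then do something like:
#   POST /dataset  (Content-Type: chemical/x-mdl-sdfile)  body=sdf
```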


> >
> >>
> >> > The problem with the model approach is that
> >> >> 1) The substructures cannot be easily downloaded without accessing
> the
> >> >> model
> >> >>
> >> >
> >> > They can - /model/id/predicted  give you the list of features  (see my
> >> > examples)
> >>
> >> Well of course a whole model infrastructure may provide a way to
> >> extract the predicted feature set. But that would mean giving a
> >> model as an input to a third party fingerprinter. Lets see how the
> >> present API plays out in the long term ...
> >>
> >>
> > Not necessarily, see above
> >
> >
> >>
> >> > And also - this is exactly the advantage - you don't have just a set
> of
> >> > substructures you don't know when they are coming from, but everything
> is
> >> > explicitly defined - the substructures are result of applying given
> >> > algorithm on given dataset.
> >>
> >> We know that from the dc:source and dc:description
> >>
> >
> > No, we don't. These are annotation properties, not object properties.
>  They
> > might provide hints for human readers, while the whole framework strives
> to
> > provide explicit relationships for automatic processing.
>
> We can have a ot:source field for datasets as well for the
> algorithm/model/logic that created it.
>

As well as for individual compounds.


>
> >
> >> >
> >> >
> >> >> 2) The set of substructures cannot be given to a better finger
> printer
> >> >> (maybe with a faster graph comparator)
> >> >>
> >> >>
> >> > Of course they can - once we have smarts representation of the
> >> > ot:Substructure - what is the obstacle of feeding them into any other
> >> > algorithm ?
> >>
> >> Again the question - are we going to use a model as an input to
> >> another algorithm to extract features ?
> >>
> >
> > No, see above - features are already available as /model/id/predicted  -
> > this is just set of features.
> >
> >
> >> >
> >> >
> >> >
> >> >> The fingerprinter in such a case becomes a separate algorithm which
> >> >> can take a dataset as input as well as a "featureset" - which is
> >> >> actually a dummy dataset with the full list of features.
> >> >>
> >> >
> >> > A fingerprinter should be indeed an algorithm - this is how OpenTox
> API
> >> is
> >> > designed.   Any processing should be instantiated as an algorithm.
> >>
> >> In your case the fingerprinter is a model ....
> >>
> >
> > No, the fingerprinter itself (/algorithm/mcss ) is ot:Algorithm. Only
> after
> > it is applied to specific dataset, it becomes a model of exactly that
> > dataset.
>
> All this versus just having a featureset and assigning it to a
> dataset. IHMO - its not simple to understand either for a third part
> developer.
>

IMHO it is not easy to understand what "assigning" a feature set to a dataset
means.  It tells neither where the feature set came from, nor how it is
related to the dataset.  All this lost information contributes to the poor
reproducibility of models.

So far OpenTox has a very simple and logical API (yes, I have heard this from
external developers) - datasets are processed by algorithms/models and written
to datasets - that's basically all.




>
> All we need is a method to address a set of features to a dataset. For
> that matter every feature value assigned to a compound also should
> have an explicitly mentioned relationship through a model etc. But we
> are happily neglecting that.
>
> Anyway, all things said and done - I guess for those who need to have
> an explicit - ot defined relationship between featureset and dataset
> can use the model approach. For those that just need a featureset
>

Just need a feature set, regardless of how it was obtained ... that's how
irreproducible cheminformatics models are born ...


> (with reference to the logic and source dataset) to work with - can
> use the dummy datasets for replacement. Either would not break the
> present API.
>

If you use a dummy dataset with dummy compounds, you are introducing a mess
into the dataset service, because those dummy compounds, and features which
are not really features of that compound, will appear in the results of
searches hitting that compound.



> Ideally however I still maintain its important to have featuresets.
>

Feature sets alone are fine, see above.

>
> A generic API might be great as a flexible resource, but most
> development kits do provide more specific resources. Representing
> simple information in a complicated manner may make every thing
> complaint within our work group, but will be one of the hurdles to
> overcome for external developers. Thats my two cents on this
> discussion ....
>

A generic API means it can be applied to a great variety of problems.
Specific solutions introduce incompatibility.
It's the generic computer science approach - try to abstract things, break a
larger problem into smaller pieces, find the commonalities.  That's how IT
works ...


Having your simple information represented in a way specific to your problem
doesn't make things compatible ...  Now everybody can use
https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss to retrieve MCSS
structures from a dataset of their choice and then run them through any of
the Weka algorithms available from any partner.  Will your approach do
anything similar?

Regards,
Nina


>
> Regards
> Surajit
>
>
> > Well, my point of view is that an algorithm applied to specific data with
> > specific parameters should be considered a model (descriptor calculations
> > included).  An algorithm is just abstract sequence of steps, when one
> > applies it to data with specific parameters, then a model is generated.
> > This will make the API much more consistent (now some algorithms generate
> a
> > model, and results of other algorithms is a dataset, which is quite
> > confusing for external developers). But at this point  I am not insisting
> on
> > changing the   API that far ;)
> >
> > Regards,
> > Nina
> >
> >
> >
> >
> >>
> >>
> >> Regards
> >> Surajit
> >>
> >> > Regards,
> >> > Nina
> >> >
> >> >
> >> >>
> >> >> Cheers
> >> >> Surajit
> >> >>
> >> >>
> >> >> On 29 November 2010 14:05, Nina Jeliazkova <
> jeliazkova.nina at gmail.com>
> >> >> wrote:
> >> >> > Dear Christoph, Surajit, All,
> >> >> >
> >> >> > This discussion is very useful.
> >> >> >
> >> >> > As a result of myself trying to understand both points of view,
>  now
> >> we
> >> >> have
> >> >> > MCSS algorithm as ambit service  (thanks to CDK SMSD package).
> >> >> >
> >> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
> >> >> >
> >> >> > It can be applied to a dataset and generates a model, where
> predicted
> >> >> > features (MCSS in this case) are available via
> ot:predictedVariables
> >> >> > (example
> >> https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted)
> >> >> > The features use current API, without any change (although having
> >> >> > ot:Substructure subclass of ot:Feature will make it more clear).
> >> >> >
> >> >> > All the MCSS substructures can be used by any learning algorithm ,
> as
> >> >> they
> >> >> > are standard ot:Features.
> >> >> >
> >> >> > Here are more details and proposal (start from *Substructure API
> >> proposal
> >> >> > heading *)
> >> >> >
> >> >> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
> >> >> >
> >> >> > Best regards,
> >> >> > Nina
> >> >> >
> >> >> > P.S. Please note the /mcss algorithm might be slow for large
> datasets,
> >> >> there
> >> >> > are several improvements that we'll be applying  performance wise,
> but
> >> >> this
> >> >> > will not change the API .
> >> >> >
> >> >> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch>
> >> wrote:
> >> >> >
> >> >> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100
> >> 2010:
> >> >> >>
> >> >> >> > > This type of representation (we are using it internally) has
> >> served
> >> >> >> well
> >> >> >> > > for our datasets which might contain also several (10-100)
> >> thousand
> >> >> >> > > substructures for a few thousands compounds. I also do not
> think,
> >> >> that
> >> >> >> > > the representation is redundant:
> >> >> >> > >        - each compound is represented once
> >> >> >> > >        - each substructure is represented once
> >> >> >> > >        - each association between compound and substructure is
> >> >> >> represented once
> >> >> >> > > Please correct me, if I am missing something obvious.
> >> >> >> >
> >> >> >> > According to this representation each dataEntry for a compound
> will
> >> >> >> > have to have all substructure features that were found in them.
> >> >> >> > Therefore each dataEntry may have 1000-10000
> feature/featureValue
> >> >> >> > pairs . For 500 datasentries that means on an average of
> >> >> >> > 500*5000(assuming 5000 substructures) = 2,500,000
> >> feature/featureValue
> >> >> >> > pairs - thats 2.5 million !
> >> >> >>
> >> >> >> In our case it is a lot less (not completely sure about your
> feature
> >> >> >> types), because only a very small subset of features occurs in a
> >> single
> >> >> >> compound.
> >> >> >>
> >> >> >> > versus just having a featureset with a
> >> >> >> > 5000 feature entries. You can imagine the difference in cost of
> >> >> >> > bandwidth,computation etc.
> >> >> >>
> >> >> >> I am not sure, if I get you right, but where do you want to store
> the
> >> >> >> relationships between features and compounds? If there are really
> 2.5
> >> >> >> million associations you have to assert them somewhere. And having
> >> >> features
> >> >> >> without compounds seems to be quite useless for me.
> >> >> >>
> >> >> >> > >
> >> >> >> > > Adding "false" occurences would not violate the current API
> (but
> >> >> would
> >> >> >> > > add redundant information). Keep in mind that the dataset
> >> >> >> representation
> >> >> >> > > is mainly for exchanging datasets between services -
> internally
> >> you
> >> >> can
> >> >> >> > > use any datastructure that is efficient for your purposes (we
> >> also
> >> >> do
> >> >> >> > > that in our services). So if you need fingerprints internally,
> >> >> extract
> >> >> >> > > them from the dataset.
> >> >> >> >
> >> >> >> > Internalizing an intermediate step completely serves the purpose
> >> but
> >> >> >> > leads to less flexible design paradigms. If we internalize the
> >> >> >> > workflow from substructure extraction to fingerprinting - we
> will
> >> lose
> >> >> >> > the ability to provide the data to a third party server for an
> >> >> >> > independent workflow. Of course the reasoning could be "who
> needs
> >> it
> >> >> >> > ?" - well you never know !!
> >> >> >>
> >> >> >> I am very interested in exchanging "fingerprints" with other
> >> services,
> >> >> >> but that can be done already with the current API. I see
> fingerprints
> >> as
> >> >> >> sets of features that are present in a compound (also using set
> >> >> >> operations to calculate similarities), and find it fairly
> >> >> >> straightforward to parse/serialize them to/from datasets.
> >> >> >>
> >> >> >> >
> >> >> >> > >> I still suggest having a FeatureSet/SubstructureSet type
> object
> >> >> within
> >> >> >> > >> the API to make it convenient to club features without
> compound
> >> >> >> > >> representations.
> >> >> >> > >
> >> >> >> > > I prefer to keep the API as generic as possible and not to
> >> introduce
> >> >> >> > > ad-hoc objects (or optimizations) for special purposes -
> >> otherwise
> >> >> it
> >> >> >> > > will be difficult to maintain services in the long term. Why
> >> don't
> >> >> you
> >> >> >> > > use ontologies for grouping features?
> >> >> >> >
> >> >> >> > Grouping features using ontologies is clubbing the features Not
> the
> >> >> >> > feature values
> >> >> >>
> >> >> >> But you cannot have feature values without relating features to
> >> >> >> compounds. If you use the representation I proposed feature values
> >> are
> >> >> >> "true" anyway.
> >> >> >>
> >> >> >> > So how do we know mcss3 occuring in compound X is with respect
> to
> >> >> >> > which compound. As you said we can have arbitary fields in the
> >> feature
> >> >> >> > definitions (for MCSS) - but that would be outside API
> definitions.
> >> >> >>
> >> >> >> features:
> >> >> >>        mcss3:
> >> >> >>                ot:componds:
> >> >> >>                        - compound2
> >> >> >>                        - compound3
> >> >> >>                ot:smarts: smarts3
> >> >> >>
> >> >> >> In my understanding you can add any annotation you want to a
> feature.
> >> >> >>
> >> >> >>
> >> >> > Yes, you can, but if this is not an agreed annotation,  no other
> >> service
> >> >> > will understand it.
> >> >> >
> >> >> > Best regards,
> >> >> > Nina
> >> >> >
> >> >> >
> >> >> >>  Best regards,
> >> >> >> Christoph
> >> >> >> _______________________________________________
> >> >> >> Development mailing list
> >> >> >> Development at opentox.org
> >> >> >> http://www.opentox.org/mailman/listinfo/development
> >> >> >>
> >>
> >>
> >>
> >> --
> >> Surajit Ray
> >> Partner
> >> www.rareindianart.com
>
>
>
> --
> Surajit Ray
> Partner
> www.rareindianart.com


