[OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms

Mon Dec 13 09:09:54 CET 2010

Hi Surajit, All,

On 4 December 2010 14:26, surajit ray <mr.surajit.ray at gmail.com> wrote:

> Hi Nina,
>
> Here's a question that Christoph asked in the comments under Model API,
> which makes a good case for having feature sets and assigning them to
> datasets.
>
> To Quote ---->
>
> URI returned on Model POST
> Posted by Helma Christoph at Oct 01, 2009 09:07 PM
> My predictions return not only a prediction_feature, but a lot of
> additional information (similarities, neighbors, substructures with
> statistical significance, etc) that do not fit very well into our
> dataset definition (they are in fact an aggregation of datasets and
> features). Any suggestions how to deal with such a situation?
>
> ----- End of Quote
>
>
All this additional information could be represented as more features
and/or by annotating features with additional information.  There is no
restriction that a prediction returns only a single feature.  For example,
we return multiple features for all Toxtree modules.

For relationships to a compound: during the summer, in the context of QPRF,
Pantelis proposed a new entry in the OpenTox ontology representing related
compounds.  Such a relation could represent similarity, reaction products,
etc., and could again follow the same logic of pointing to a generating
algorithm.

It's similar to what one would do to represent all that information in an
Excel table - one would usually add yet another column and write down
similarity or substructures' statistical significance there.  It's just that
we can assign more information to the column header (Feature object), rather
than a single string as column title.
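
A minimal sketch of the column-header analogy, in Python.  The class names,
field names and URIs below are illustrative assumptions and not part of the
OpenTox API; the only point is that a column header can be an object that
carries its source, rather than a bare title string.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Feature:
    """Column header with metadata (hypothetical fields, for illustration)."""
    title: str
    has_source: str  # URI of the algorithm or model that produced the column
    units: str = ""


@dataclass
class Dataset:
    """Toy table: one row per compound URI, one column per Feature."""
    features: list = field(default_factory=list)
    rows: dict = field(default_factory=dict)  # compound URI -> {Feature: value}

    def add_column(self, feature, values):
        """Add one column: a Feature header plus a value per compound."""
        self.features.append(feature)
        for compound_uri, value in values.items():
            self.rows.setdefault(compound_uri, {})[feature] = value


# Made-up headers and values, only to show the shape of the data.
similarity = Feature("Tanimoto similarity", "/algorithm/similarity-example")
significance = Feature("Substructure p-value", "/algorithm/fragments-example")

table = Dataset()
table.add_column(similarity, {"/compound/1": 0.82, "/compound/2": 0.64})
table.add_column(significance, {"/compound/1": 0.01})

for compound_uri, cells in table.rows.items():
    for feature, value in cells.items():
        print(compound_uri, feature.title, value, "from", feature.has_source)
```

Adding "yet another column" is then a single add_column call, and every
value can be traced back to the algorithm named in its header.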

If it is similar structures we want to report, that fits the current scheme
quite well: a similarity algorithm generates the list of similar structures
(and the list will differ if a different algorithm is used), so it is
critical to record which algorithm generated them.  The similarity algorithm
takes a compound or dataset as an input parameter and returns a dataset with
the similar structures.  The feature that represents the similarity is a
regular feature with hasSource pointing to the algorithm.  What is missing
is a way to record the same information for the structures themselves, so a
new class representing the relationship between compounds would help.
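
The scheme above could be sketched roughly as follows.  This is only an
illustration under stated assumptions: the RelatedCompound class, the
algorithm URI and the toy fragment sets are hypothetical, and the Tanimoto
coefficient on fragment sets stands in for whatever similarity algorithm a
service actually implements.

```python
from dataclasses import dataclass

ALGORITHM_URI = "/algorithm/toy-tanimoto"  # hypothetical algorithm URI


@dataclass(frozen=True)
class RelatedCompound:
    """Hypothetical relation class: query -> related compound, plus origin."""
    query: str
    related: str
    relation: str
    has_source: str  # which algorithm generated this relation


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of fragment identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_structures(query, fingerprints, threshold=0.5):
    """Return similarity values and relations for compounds near the query."""
    values, relations = {}, []
    for compound, fragments in fingerprints.items():
        if compound == query:
            continue
        score = tanimoto(fingerprints[query], fragments)
        if score >= threshold:
            values[compound] = score  # value of the similarity feature
            relations.append(
                RelatedCompound(query, compound, "similarity", ALGORITHM_URI)
            )
    return values, relations


# Toy input dataset: compound URI -> set of fragments (made-up data).
fingerprints = {
    "/compound/1": {"c1ccccc1", "C=O", "N"},
    "/compound/2": {"c1ccccc1", "C=O"},
    "/compound/3": {"Cl"},
}

values, relations = similar_structures("/compound/1", fingerprints)
print(values)
print(relations)
```

Both the similarity values and the compound-to-compound relations carry the
algorithm URI, so a different similarity algorithm would produce results
that can be told apart.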

Representing substructures as features also makes it easy to assign
statistical significance: it is just a Feature Value for a specific feature
and compound (similar to the MCSS example, but a number instead of Yes/No).
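
As a rough illustration, a sparse table of feature values can hold both
kinds of value side by side.  The URIs, the numbers and the alpha cut-off
are made up for the example; nothing here is prescribed by the API.

```python
# Sparse table of feature values: (compound URI, feature URI) -> value.
# All entries are invented for illustration.
feature_values = {
    ("/compound/1", "/feature/mcss1"): True,   # MCSS-style Yes/No flag
    ("/compound/1", "/feature/frag7"): 0.003,  # numeric significance
    ("/compound/2", "/feature/frag7"): 0.20,
}


def significant_substructures(compound, values, alpha=0.05):
    """Features whose numeric value is below alpha for the given compound."""
    hits = []
    for (compound_uri, feature_uri), value in values.items():
        # Skip other compounds and plain Yes/No flags (bool is not a p-value).
        if compound_uri != compound or isinstance(value, bool):
            continue
        if value < alpha:
            hits.append(feature_uri)
    return sorted(hits)


print(significant_substructures("/compound/1", feature_values))
```

Only associations that actually exist are stored, so a substructure that
does not occur in a compound simply has no entry.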

Regards,
Nina

> Regards
> Surajit
>
> On 2 December 2010 19:48, Nina Jeliazkova <jeliazkova.nina at gmail.com>
> wrote:
> > On 2 December 2010 15:56, surajit ray <mr.surajit.ray at gmail.com> wrote:
> >
> >> Hi Nina,
> >>
> >> Well well ... I have suggested a method other than my previous "hack"
> >> ... which is a featureset with a featuresetValue. Did you forget to
> >> read that ?
> >>
> >>
> >> On 2 December 2010 19:14, Nina Jeliazkova <jeliazkova.nina at gmail.com>
> >> wrote:
> >> > Hi Surajit,
> >> >
> >> >
> >> > "Most industry chemoinformatics is quite irreproducible - but is paid
> >> > for and viable ! Also the onus to reproduce the results is with the
> >> > user - not with the datasets and algorithms in the services."
> >> >
> >> >
> >> > This is exactly what we are struggling to overcome in OpenTox - thus, if we
> >> > don't agree on this point, I don't see the point of the discussion and will
> >> > leave the consensus for others.
> >>
> >> Are we ? Has our focus shifted from providing REACH compliant toxicity
> >> prediction - to generic chemoinformatics abstractions and automatic
> >> reproducibility (which anyway we are very far from at present) ?
> >>
> >>
> > Yes, we are. Otherwise, there is no point in building an interoperable,
> > component-based system based on ontologies.  There are plenty of systems
> > that produce toxicity predictions; there is no point in building another
> > one if it doesn't try to overcome at least some of the known issues.
> >
> >> A comparable example is - the onus to reproduce the results is with
> >> the researcher (in a wet-lab) not with the test tubes and pipettes and
> >> petri dishes.
> >>
> >
> >
> > I disagree.  Reproducibility comes from the protocol (test tubes,
> > chemicals and how they are used), and for in-silico predictions the
> > protocol is described by the algorithms and data, how they are used, and
> > how unambiguously this is explained so that it can be repeated.
> >
> >
> >>
> >> > Having a hack for something which is demonstrated to have a solution with
> >> > the current API doesn't make sense to me (IMHO, IMHO).
> >>
> >> Please check the rest of the previous mail for my suggested solution.
> >>
> >
> > As I said, I will leave the final consensus to other partners, obviously I
> > am biased here.
> >
> >
> >>
> >> And your solution although ingenious and well thought out - introduces
> >> unnecessarily complicated methods that could be very easily
> >> represented more simply.
> >>
> >
> > No, it is exactly in line with the entire framework, nothing ingenious.
> >
> >>
> >> So the question after all this is are we open to "any" changes OR are
> >> we looking at a cap on API 1.2. If thats the case then you actually
> >> set the stage for all manner of hacks. Ours was just one use case
> >> which does not fit the present API easily. There will be more ... and
> >> btw making a model to store features is also a "hack". Albeit a clever
> >> one.
> >>
> >>
> > Again, this is not a hack, this is exactly how algorithms and models were
> > designed to work in this framework.
> > BTW, TUM fragment based algorithms have worked in a very similar manner for
> > almost a year now, there is nothing new in this respect.
> >
> > Would be nice to hear other opinions in this thread, I'll keep silent on
> > this topic.
> >
> > Regards,
> > Nina
> >
> >
> >> Regards
> >> Surajit
> >>
> >> > Regards,
> >> > Nina
> >> >
> >> > On 2 December 2010 15:34, surajit ray <mr.surajit.ray at gmail.com> wrote:
> >> >
> >> >> Hi Nina,
> >> >>
> >> >> To organise the discussions better, I have created a new page to
> >> >> capture the discussion on featuresets.
> >> >> http://opentox.org/dev/apis/api-1.2/featureset-and-workarounds
> >> >>
> >> >> I have moved your discussion points on to  this page from the
> >> >> substructure wishlist page ....
> >> >>
> >> >> On 30 November 2010 18:43, Nina Jeliazkova <jeliazkova.nina at gmail.com>
> >> >> wrote:
> >> >> > No problem to extend the API to be able to group features. (In fact we have
> >> >> > this implemented, even with hierarchical grouping
> >> >> > http://apps.ideaconsult.net:8080/ambit2/template/Taxonomy , turns out to be
> >> >> > quite useful for ToXML representation.) Could be documented and included in
> >> >> > the API.
> >> >>
> >> >> Yeah lets have featuresets please ...
> >> >>
> >> >> > IMHO there is no sense to assign a feature or a featureset to a dataset,
> >> >> > without specifying what is the relationship between dataset and features.
> >> >> > This is perfectly served by algorithm/model approach so far.
> >> >> >
> >> >>
> >> >> We can capture an explicit relationship in a "FeaturesetValue" every
> >> >> time we assign a "Featureset" to a dataset. Its explicit, simpler and
> >> >> we can even put the URI of the creating algorithm in the
> >> >> FeaturesetValue.
> >> >>
> >> >> > The dummy dataset suggestion is a hack, which lacks consistency, and I am not
> >> >> > in favour of it.
> >> >>
> >> >> I guess the same can be said of Christoph's method of assigning
> >> >> substructures to compounds to actually just capture the substructure
> >> >> set in one dataset. On the flip side - every API that I have worked
> >> >> with (Google Maps API, Facebook API, Facebook Graphs API, Flex/Flash)
> >> >> has a "hack" which became the norm. IMHO in this case though the best
> >> >> solution is to have a Featureset with a FeaturesetValue to explicitly
> >> >> outline the relationship to the dataset.
> >> >>
> >> >>
> >> >> >
> >> >> > See above for sets of features.  What we would like to have more than other
> >> >> > libraries in OpenTox is to be able to tell how these features have been
> >> >> > calculated.
> >> >> >
> >> >> > The CDK fingerprinter ( if you mean
> >> >> > org.openscience.cdk.fingerprint.Fingerprinter ) is not a good example here,
> >> >> > since it uses hashed fingerprints, which is almost impossible to translate
> >> >> > to SMARTS.
> >> >> >
> >> >> > The CDK does very good job for specifying descriptors metadata via
> >> >> > ontology, but this is not (yet?) done for fingerprinting (as far as I know),
> >> >> > although fingerprinter algorithm could be included in BlueObelisk or ChemInf
> >> >> > ontology the same way descriptor algorithms are.
> >> >> >
> >> >> >
> >> >>
> >> >> >
> >> >> > Sorry, this was my impression from earlier discussions.
> >> >> >
> >> >> > Look at my examples, this is exactly what comes from the MCSS model.
> >> >> >
> >> >> > Having single compound with substructures, assigned as features is
> >> >> > inconsistent for the following reason.
> >> >> >
> >> >> > The meaning of the set of substructures is that they have been obtained by
> >> >> > MCSS (or fingerprinting algorithm), and are MCSS structures for the entire
> >> >> > dataset.  Assigning them to a single dummy compound means all this
> >> >> > information is lost.
> >> >>
> >> >> W.R.T the hack - yes we lose the information - and yet its many times
> >> >> simpler than creating a model just to represent a set of features
> >> >> (substructures).
> >> >>
> >> >> A featureset with featuresetValue solves this problem well, without
> >> >> resorting to needless model building step or the "hack".
> >> >> >
> >> >>
> >> >> >
> >> >> > Having fragments submitted to another fingerprinter algorithm, which by
> >> >> > definition works on whole compounds is essentially mixing substructures with
> >> >> > compounds.  What if the fingerprinter algorithm starts to normalize the
> >> >> > fragments as if they are compounds?
> >> >>
> >> >> The fingerprinter in this case is going to take two inputs - dataset
> >> >> and featureset(substructureset). Again wheres the question of mixing
> >> >> the two ?
> >> >>
> >> >> >
> >> >>
> >> >> >>
> >> >> >
> >> >> > curl -X GET /model/id/predicted  gives you list of features, URLs or RDF.
> >> >> > There is no need to extract anything.  We could easily add a new mime type
> >> >> > to support SMARTS for feature representation (whenever relevant) and you'll
> >> >> > get list of smarts by something like
> >> >> >
> >> >> > curl -X GET -H "Accept:chemical/x-smarts"  /model/id/predicted
> >> >> >
> >> >> > (Hm,  is there MIME format for SMARTS )
> >> >> >
> >> >> > Besides,  the current scheme supports ANY kind of fingerprinter, regardless
> >> >> > if it extracts fragments in the form of SMILES/SMARTS or just report some
> >> >> > encoded strings (as PubChem fingerprinter) or un-interpretable bits (as
> >> >> > hashed fingerprints).
> >> >>
> >> >> Smarts Mime type on the model/id/predicted URL can provide a list of
> >> >> features - but it is a non-generic way of representing the set of
> >> >> features. It is non-generic to imagine that a fingerprinting algorithm
> >> >> takes a model/id/predicted as input - especially since the "model" may
> >> >> not have any relationship with the fingerprinter.
> >> >>
> >> >>
> >> >> >
> >> >> > No, not a dataset comprising of features, but a dataset, comprising of
> >> >> > compounds.
> >> >> >
> >> >> > You could define an algorithm to have list of features as input into a new
> >> >> > dataset of compounds, if there is meaningful way to do so (e.g. for smarts
> >> >> > based features).  This means there will be no assumption features are
> >> >> > compounds, but a documented service that does the conversion in a known way.
> >> >>
> >> >> Why would we need to convert features to compounds ?
> >> >>
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> > curl -X POST /algorithm/features2dataset -d
> >> >> >> > "feature_uris[]=/model/mcss1/predicted"   ->
> >> >> >> > /dataset/newdatasetfromfeatures
> >> >> >> >
> >> >> >> > Then you are done, POST the dataset into other algorithms as usual.
> >> >> >>
> >> >> >> I am sorry but I could not understand how a dataset will be created in
> >> >> >> this case.
> >> >> >>
> >> >> >>
> >> >> > Setup an algorithm service, which will  read the features, find if a feature
> >> >> > is a substructure,  generate compounds for them (e.g. SDF  file) and post
> >> >> > the SDF content to a dataset service - thus it will create a dataset.
> >> >> Again I could not get the requirement for such a service - I would
> >> >> like to capture "substructures" (as features) not convert them to
> >> >> compounds.
> >> >>
> >> >>
> >> >> >
> >> >> > As well for individual compounds.
> >> >> >
> >> >> >
> >> >>
> >> >> >
> >> >> > IMHO it is not easy to understand what "assigning" a feature set to a
> >> >> > dataset means.  It tells neither where the feature set came from, nor how
> >> >> > it is related to a dataset.  This is all loss of information, which
> >> >> > contributes to the poor reproducibility of any models.
> >> >>
> >> >> Yeah a featuresetValue to capture that explicit value of the
> >> >> relationship is just what is needed here.
> >> >>
> >> >> > So far OpenTox has a very simple and logical API (yes, I have heard this
> >> >> > from external developers) - Datasets are processed by algorithms/models and
> >> >> > written to datasets - that's all basically.
> >> >> >
> >> >>
> >> >> And yet you want to build this complex logic of making a Model to
> >> >> essentially store a set of features ?
> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > Just need a feature set, regardless of how it was obtained ... that's how
> >> >> > irreproducible cheminformatics models are born...
> >> >>
> >> >> Most industry chemoinformatics is quite irreproducible - but is paid
> >> >> for and viable ! Also the onus to reproduce the results is with the
> >> >> user - not with the datasets and algorithms in the services.
> >> >>
> >> >> >
> >> >> >
> >> >> > If you use dummy dataset with dummy compounds, you are introducing a mess
> >> >> > into datasets service.  Because those dummy compounds and features, which
> >> >> > are not really  features for that compound will appear as a result of
> >> >> > searches, hitting that compound.
> >> >>
> >> >> So essentially you are saying we cannot have any datasets with some
> >> >> feature values as "false" or 0 to denote absence in a compound ?
> >> >>
> >> >> >
> >> >> >> Ideally however I still maintain its important to have featuresets.
> >> >> >>
> >> >> >
> >> >> > Feature sets alone are fine, see above.
> >> >>
> >> >> Then lets go for it. We have thought a lot about it and the indirect
> >> >> methods suggested till now - just seem to be an attempt to stonewall
> >> >> any big changes in API. Which from the perspective of an API is hara
> >> >> kiri ? Google Maps API released 3 major versions in 5 years. By
> >> >> comparison our API upgrades are just moving from 1.1 to 1.2 in 2
> >> >> years, with barely any changes ...
> >> >>
> >> >> >
> >> >> > A generic API means it could be applied to great variety of problems.
> >> >> > Specific solutions introduce incompatibility.
> >> >> > It's a generic computer science approach - try to abstract things, break
> >> >> > larger problem into smaller pieces, find the commonalities.  That's how IT
> >> >> > works ...
> >> >>
> >> >> So are we doing chemoinformatics (vs just toxicity) ? Are we
> >> >> representing atomic features ? If both your answers are "no" then we
> >> >> can safely say we have not abstracted enough !
> >> >>
> >> >> >
> >> >> > Having your simple information represented in a way specific for your
> >> >> > problem doesn't make things compatible ... Now everybody could use
> >> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss to retrieve MCSS
> >> >> > structures from dataset of their choice and then run it through any of Weka
> >> >> > algorithms available by any partner.  Will your approach do anything similar
> >> >> > ?
> >> >>
> >> >> I can't see a point to debate here ... I am looking for a generic
> >> >> solution for collecting features. This  can be achieved quite simply
> >> >> by having a featureset and a featuresetValue (when assigning to a
> >> >> dataset) to explicitly capture the value of the relationship. The
> >> >> beauty is we do not need a "model" in the middle to just capture some
> >> >> explicit relationships.
> >> >>
> >> >>
> >> >> Regards
> >> >> Surajit
> >> >>
> >> >> > Regards,
> >> >> > Nina
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Regards
> >> >> >> Surajit
> >> >> >>
> >> >> >>
> >> >> >> > Well, my point of view is that an algorithm applied to specific data with
> >> >> >> > specific parameters should be considered a model (descriptor calculations
> >> >> >> > included).  An algorithm is just abstract sequence of steps, when one
> >> >> >> > applies it to data with specific parameters, then a model is generated.
> >> >> >> > This will make the API much more consistent (now some algorithms generate a
> >> >> >> > model, and results of other algorithms is a dataset, which is quite
> >> >> >> > confusing for external developers). But at this point  I am not insisting on
> >> >> >> > changing the   API that far ;)
> >> >> >> >
> >> >> >> > Regards,
> >> >> >> > Nina
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Regards
> >> >> >> >> Surajit
> >> >> >> >>
> >> >> >> >> > Regards,
> >> >> >> >> > Nina
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> Cheers
> >> >> >> >> >> Surajit
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com>
> >> >> >> >> >> wrote:
> >> >> >> >> >> > Dear Christoph, Surajit, All,
> >> >> >> >> >> >
> >> >> >> >> >> > This discussion is very useful.
> >> >> >> >> >> >
> >> >> >> >> >> > As a result of myself trying to understand both points of view, now we have
> >> >> >> >> >> > MCSS algorithm as ambit service  (thanks to CDK SMSD package).
> >> >> >> >> >> >
> >> >> >> >> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
> >> >> >> >> >> >
> >> >> >> >> >> > It can be applied to a dataset and generates a model, where predicted
> >> >> >> >> >> > features (MCSS in this case) are available via ot:predictedVariables
> >> >> >> >> >> > (example https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted )
> >> >> >> >> >> > The features use current API, without any change (although having
> >> >> >> >> >> > ot:Substructure subclass of ot:Feature will make it more clear).
> >> >> >> >> >> >
> >> >> >> >> >> > All the MCSS substructures can be used by any learning algorithm, as they
> >> >> >> >> >> > are standard ot:Features.
> >> >> >> >> >> >
> >> >> >> >> >> > Here are more details and proposal (start from *Substructure API proposal
> >> >> >> >> >> > heading*)
> >> >> >> >> >> >
> >> >> >> >> >> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
> >> >> >> >> >> >
> >> >> >> >> >> > Best regards,
> >> >> >> >> >> > Nina
> >> >> >> >> >> >
> >> >> >> >> >> > P.S. Please note the /mcss algorithm might be slow for large datasets, there
> >> >> >> >> >> > are several improvements that we'll be applying performance wise, but this
> >> >> >> >> >> > will not change the API .
> >> >> >> >> >> >
> >> >> >> >> >> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch> wrote:
> >> >> >> >> >> >
> >> >> >> >> >> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > > This type of representation (we are using it internally) has served well
> >> >> >> >> >> >> > > for our datasets which might contain also several (10-100) thousand
> >> >> >> >> >> >> > > substructures for a few thousands compounds. I also do not think, that
> >> >> >> >> >> >> > > the representation is redundant:
> >> >> >> >> >> >> > >        - each compound is represented once
> >> >> >> >> >> >> > >        - each substructure is represented once
> >> >> >> >> >> >> > >        - each association between compound and substructure is represented once
> >> >> >> >> >> >> > > Please correct me, if I am missing something obvious.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > According to this representation each dataEntry for a compound will
> >> >> >> >> >> >> > have to have all substructure features that were found in them.
> >> >> >> >> >> >> > Therefore each dataEntry may have 1000-10000 feature/featureValue
> >> >> >> >> >> >> > pairs . For 500 dataEntries that means on an average of
> >> >> >> >> >> >> > 500*5000(assuming 5000 substructures) = 2,500,000 feature/featureValue
> >> >> >> >> >> >> > pairs - thats 2.5 million !
> >> >> >> >> >> >>
> >> >> >> >> >> >> In our case it is a lot less (not completely sure about your feature
> >> >> >> >> >> >> types), because only a very small subset of features occurs in a single
> >> >> >> >> >> >> compound.
> >> >> >> >> >> >>
> >> >> >> >> >> >> > versus just having a featureset with a
> >> >> >> >> >> >> > 5000 feature entries. You can imagine the difference in cost of
> >> >> >> >> >> >> > bandwidth,computation etc.
> >> >> >> >> >> >>
> >> >> >> >> >> >> I am not sure, if I get you right, but where do you want to store the
> >> >> >> >> >> >> relationships between features and compounds? If there are really 2.5
> >> >> >> >> >> >> million associations you have to assert them somewhere. And having features
> >> >> >> >> >> >> without compounds seems to be quite useless for me.
> >> >> >> >> >> >>
> >> >> >> >> >> >> > >
> >> >> >> >> >> >> > > Adding "false" occurrences would not violate the current API (but would
> >> >> >> >> >> >> > > add redundant information). Keep in mind that the dataset representation
> >> >> >> >> >> >> > > is mainly for exchanging datasets between services - internally you can
> >> >> >> >> >> >> > > use any datastructure that is efficient for your purposes (we also do
> >> >> >> >> >> >> > > that in our services). So if you need fingerprints internally, extract
> >> >> >> >> >> >> > > them from the dataset.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Internalizing an intermediate step completely serves the purpose but
> >> >> >> >> >> >> > leads to less flexible design paradigms. If we internalize the
> >> >> >> >> >> >> > workflow from substructure extraction to fingerprinting - we will lose
> >> >> >> >> >> >> > the ability to provide the data to a third party server for an
> >> >> >> >> >> >> > independent workflow. Of course the reasoning could be "who needs it
> >> >> >> >> >> >> > ?" - well you never know !!
> >> >> >> >> >> >>
> >> >> >> >> >> >> I am very interested in exchanging "fingerprints" with other services,
> >> >> >> >> >> >> but that can be done already with the current API. I see fingerprints as
> >> >> >> >> >> >> sets of features that are present in a compound (also using set
> >> >> >> >> >> >> operations to calculate similarities), and find it fairly
> >> >> >> >> >> >> straightforward to parse/serialize them to/from datasets.
> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > >> I still suggest having a FeatureSet/SubstructureSet type object within
> >> >> >> >> >> >> > >> the API to make it convenient to club features without compound
> >> >> >> >> >> >> > >> representations.
> >> >> >> >> >> >> > >
> >> >> >> >> >> >> > > I prefer to keep the API as generic as possible and not to introduce
> >> >> >> >> >> >> > > ad-hoc objects (or optimizations) for special purposes - otherwise it
> >> >> >> >> >> >> > > will be difficult to maintain services in the long term. Why don't you
> >> >> >> >> >> >> > > use ontologies for grouping features?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Grouping features using ontologies is clubbing the features Not the
> >> >> >> >> >> >> > feature values
> >> >> >> >> >> >>
> >> >> >> >> >> >> But you cannot have feature values without relating features to
> >> >> >> >> >> >> compounds. If you use the representation I proposed feature values are
> >> >> >> >> >> >> "true" anyway.
> >> >> >> >> >> >>
> >> >> >> >> >> >> > So how do we know mcss3 occurring in compound X is with respect to
> >> >> >> >> >> >> > which compound. As you said we can have arbitrary fields in the feature
> >> >> >> >> >> >> > definitions (for MCSS) - but that would be outside API definitions.
> >> >> >> >> >> >>
> >> >> >> >> >> >> features:
> >> >> >> >> >> >>        mcss3:
> >> >> >> >> >> >>                ot:compounds:
> >> >> >> >> >> >>                        - compound2
> >> >> >> >> >> >>                        - compound3
> >> >> >> >> >> >>                ot:smarts: smarts3
> >> >> >> >> >> >>
> >> >> >> >> >> >> In my understanding you can add any annotation you want to a feature.
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> > Yes, you can, but if this is not an agreed annotation,  no other service
> >> >> >> >> >> > will understand it.
> >> >> >> >> >> >
> >> >> >> >> >> > Best regards,
> >> >> >> >> >> > Nina
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >>  Best regards,
> >> >> >> >> >> >> Christoph
> >> >> >> >> >> >> _______________________________________________
> >> >> >> >> >> >> Development mailing list
> >> >> >> >> >> >> Development at opentox.org
> >> >> >> >> >> >> http://www.opentox.org/mailman/listinfo/development
> >> >> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Surajit Ray
> >> >> >> >> Partner
> >> >> >> >> www.rareindianart.com