[OTDev] Uploading non-standard datasets

surajit ray mr.surajit.ray at gmail.com
Fri Oct 1 17:41:39 CEST 2010


Hi,

It seems I have yet to get the point across in its entirety. So let's first
examine the feature approach you are suggesting.

If a feature can only be assigned to a compound, then the following will
happen:

a) the source Dataset will create a set of substructure features (maybe
10,000 or more) via stage 1 of the MaxTox algorithm.
b) each feature will be a substructure (represented in compound form?)
associated with multiple compounds.
c) each feature will be in a Dataset if the feature (substructure) appeared
while running stage 1 (MaxTox) on that dataset.
d) in stage 2, each of the compounds in the Dataset will have to be
fingerprinted by checking which of the features (from the stage 1 result)
appear in the compound (see the sketch after this list).
e) this fingerprint will then be another feature of the compound, stored in
another dataset.
f) the fingerprints are then used for model building and prediction.
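
To make step d) concrete, here is a minimal sketch of what the stage 2
check amounts to, assuming RDKit for the substructure matching (MaxTox
itself may use different tooling, and the names here are just for
illustration):

    from rdkit import Chem

    def fingerprint(compound_smiles, substructure_smarts):
        # one boolean bit per stage 1 substructure
        mol = Chem.MolFromSmiles(compound_smiles)
        if mol is None:
            raise ValueError("unparseable SMILES: " + compound_smiles)
        return [mol.HasSubstructMatch(Chem.MolFromSmarts(s))
                for s in substructure_smarts]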

The other way to do this would be to:
1) get all the substructures (as compounds?) from stage 1 and store them in
a dataset,
2) use this dataset to fingerprint the compounds within the source Dataset
(a rough sketch follows this list), and
3) use the fingerprints (features) in model building and prediction.
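
A rough sketch of that flow, in Python with the requests library; the
service root, file name and MIME types below are placeholders, not the
real deployment values:

    import requests

    SERVICE = "http://someserver.com"  # placeholder service root

    # step 1: store all stage 1 substructures as one dataset, in a single
    # request, instead of one feature-creation request per substructure
    with open("substructures.sdf", "rb") as f:
        resp = requests.post(SERVICE + "/dataset", data=f,
                             headers={"Content-Type": "chemical/x-mdl-sdfile"})
    substructure_dataset = resp.text.strip()  # URI of the new dataset

    # steps 2 and 3: fetch the substructures once, then fingerprint the
    # source compounds locally and use the fingerprints for modelling
    substructures = requests.get(
        substructure_dataset,
        headers={"Accept": "chemical/x-daylight-smiles"}).text.splitlines()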


The disadvantage of method 1 is that it is convoluted and counterintuitive:
1) each time a set of substructures is generated, a corresponding number of
features will have to be generated first.
2) after this, each feature will have to be checked to see whether an
existing representation (in feature form) is already there.
3) if yes, the feature will have to be sought in all the compounds of the
dataset, and the compounds that contain it will have to be associated with
the feature using an ot:DataEntry.
4) if no, the feature will have to be created first, then sought in all the
compounds, and the respective compounds updated.
5) the dataset resulting from the above operations will be the source
Dataset's compounds with a variable bunch of features (substructures)
associated with them.
6) when we try to do stage 2 of MaxTox, this dataset (from the previous
step) will have to be parsed to retrieve all the features and arrange them
sequentially (any order will do, maybe by atom size; see the sketch after
this list).
7) each compound will then get a fingerprint (feature) based on the
presence/absence of each substructure, which will be used for the
predictions.
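
For step 6, a hypothetical ordering rule (reading "atom size" as atom
count, again assuming RDKit):

    from rdkit import Chem

    def ordered_features(features_smarts):
        # fix a canonical bit order so that every compound is
        # fingerprinted against the same sequence of substructures
        return sorted(features_smarts,
                      key=lambda s: Chem.MolFromSmarts(s).GetNumAtoms())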

As you can see, the intermediate dataset created is not much help, except
maybe for speeding up the fingerprinting process (since the substructures
corresponding to a compound will be stored). Even then there is the
additional step of retrieving each feature from the intermediate dataset
and matching it against the ones stored in sequence to find the correct
fingerprint of the compound.

Secondly, consider that each feature (out of many thousands) would be
checked/created over the internet, with multiple accesses performed; that
will degrade the reliability of the process by many degrees.
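
The per-feature traffic pattern I am objecting to would look roughly like
this; the lookup endpoint and its parameters are hypothetical, since the
current API does not define a feature-by-structure search:

    import requests

    SERVICE = "http://someserver.com"          # placeholder
    substructure_smarts = ["C=O", "c1ccccc1"]  # stage 1 output, abbreviated

    for smarts in substructure_smarts:         # ~10,000 iterations per run
        # hypothetical existence check: one round trip per substructure
        resp = requests.get(SERVICE + "/feature", params={"search": smarts})
        if not resp.text.strip():
            # and another round trip to create the missing feature
            requests.post(SERVICE + "/feature", data={"smarts": smarts})
        # ...plus further requests to associate the feature with every
        # compound that contains it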



On 30 September 2010 13:48, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:

> Hi All,
>
> On Mon, Sep 27, 2010 at 6:03 PM, chung <chvng at mail.ntua.gr> wrote:
>
> > On Mon, 2010-09-27 at 20:03 +0530, surajit ray wrote:
> > > Hi,
> > >
> > > Well having them as features will not cut it - for the simple reason
> > > that the "feature" in this case belongs to the input dataset (or
> > > whichever set is being worked upon).
> >
>
> No, a feature does not belong to a dataset, but might be associated with
> one
> or more datasets (or none).
>
>
A feature (which in our case is another dataset consisting of fragments)
needs to "belong" to a dataset, because that immensely simplifies the
process and is intuitive with respect to the original workflow.


> >
> > As far as I know, as it is generally conceived in OpenTox and as far as
> > the implementation in AMBIT is concerned, features are separated from
> > datasets and can be standalone.
>
>
> Exactly. Features are standalone objects; they can be created by POSTing
> to the /feature service, and the only connection is via ot:DataEntry, used
> in ot:Datasets. A feature could be used by multiple datasets.
>
> > That is, you can have a feature that does not appear in any datasets.
> > You might only have a pointer to a dataset using the object property
> > 'ot:hasSource' but this does not somehow bind the feature to the
> > dataset. However, I'm not sure if I understood well.
> >
>
>
> This is correct.
>
>
> >
> > > The compound itself may not have
> > > a substructure but it may be a part of a dataset which when examined
> > > will have the substructure appearing while doing the pairwise
> > > comparisons.
> >
>
> In the current setup, the substructures are features, like any other
> properties. If a compound has no such substructure, it may simply have no
> such feature assigned via ot:DataEntry. This gives universal access to
> learning algorithms, which can work with any kind of features, including
> substructures (e.g. there is no need to tell the SVM algorithm whether a
> feature is a substructure or anything else).
>
>
>
Let's consider a setup of around 10,000 features. To manage the above, each
compound will have to be checked against the 10,000 features (generated or
found) over the internet, i.e. 10,000 times the number of compounds
requests in total (for 500 compounds, that is already 5,000,000 requests).
I don't think the internet, even in first-world countries, is stable enough
for this to work reliably; it will more often fail.

For us, the really useful feature is the fingerprint.



> >
> > If you need to establish a relationship between such a feature and a
> > compound (so that given the feature you can retrieve the
> > fragment/compound to which it refers in any supported MIME type), then
> > we can extend the range of the property ot:hasSource to include also
> > ot:Compound and assign a compound URI to such features. i.e. something
> > like:
> >
> > /feature/123
> >        a ot:Feature
> >        ot:hasSource /compound/435
> >
>
>
> We can simply use the current construct and declare ot:hasSource to point
> to the algorithm that verifies the presence of substructures, or finds
> substructures by any other means.  Recall that the purpose of ot:hasSource
> is to be able to regenerate the feature when applied to a new compound.
> If it points to an algorithm, regenerating the feature is straightforward;
> if it points to a compound, that will not be possible.
>
>
In this case both the algorithm and the source dataset would determine the
source of the features (substructures).


> >
> > But then I'm not sure whether the following are also needed:
> >
> > 1. Declare that the feature above is a ot:SubstructureFeature (new) or
> > at least declare that it is boolean.
> >
>
> Extension of ot:Feature is reasonable, but it would be better if the
> algorithms generating substructures are described in an ontology, like any
> other algorithms (e.g. descriptor calculation ones).
>
>
A descriptor-calculation algorithm ties in neatly with the present
framework. But what about algorithms that generate intermediate data, which
only leads to descriptor calculation via another algorithm?



>
> >
> > 2. Make it explicit that the above compound is a ot:Fragment (new)
> >
>
> >
> > Maybe we can go without introducing extra classes.
> >
> > >
> > > Using the features system in this manner is not (IMHO) the solution to
> > > this problem. It will really be cumbersome to maintain the feature
> > > URIs, which may number in the many thousands and will be extremely
> > > transient. In effect it will be a lot of resources being hogged by a
> > > system which could do with a much simpler implementation.
> >
> > That is not really a problem. A feature is a very small entry in a
> > database. There are enterprises that maintain databases of some tens of
> > terabytes or even more.
> >
>
> Indeed.
>
> Besides, having substructures as features effectively introduces caching
> of substructures, which allows 1) avoiding multiple calculations of the
> same feature, 2) seeing whether the same substructures are used/generated
> by different algorithms, and even doing some comparison, and
> 3) visualisation and statistics over features, without having to have all
> of them in memory.
>
>
>
I guess we are forgetting that our API is internet-based, and for an
algorithm to check and create more than 10,000 features per run (on
average) is going to severely clamp down on the available bandwidth and
server resources. Plus, in such a scenario, suppose we send a search
request to a server hosting the substructure features, asking: does
substructure A exist in the database? To give an answer, a million or more
checks will have to be made by the server per request (considering that all
known substructures may run into the millions). So for the server hosting
the features, each algorithm run will generate a number of "grep" runs
equal to 10,000 * (number of compounds) * 1,000,000 (the total number of
substructures).

For a set of 500 compounds that number is 10,000 * 500 * 1,000,000 =
5,000,000,000,000 searches, which I feel is moving into Cray computer
territory.


> >
> > > Moreover a certain feature in such a system will be a part of a
> > > compound if it's a part of Dataset A and may not be a part of the same
> > > compound when examined in Dataset B.
> > >
> >
> > This is true. For example if a compound does not contain C=O it is
> > obvious it will not contain CC=O or in general RC=O.
> >
>
> This is no problem currently; features might be present in one dataset and
> not in another.
>
>
I would like to have a dataset (of substructures) which is a feature of the
source Dataset, with ot:hasSource pointing to stage 1 of the MaxTox
algorithm. This immensely simplifies the process and makes life easier for
all of us.
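
In the notation of the snippet earlier in this thread, something like the
following (the URIs are hypothetical, and whether a feature's value may be
a dataset URI is exactly the point under discussion):

    /feature/789
           a ot:Feature
           ot:hasSource /algorithm/maxtox_stage1
    # its value for the source Dataset would then be /dataset/456,
    # the dataset of substructures produced by stage 1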


>
> >
> > > Summing up, here are a few things I would like in the next API:
> > >
> > > a) Ability to upload bulk compounds from scratch, using a dataset
> > > construct (and not posting single compounds)
> >
> > I think this is supported. You can POST a dataset with a set of new
> > compounds. If one or more compounds are not found in the database of the
> > server they should be created.
> >
>
>
> Indeed, this is the operation most often used now.  Compounds are created,
> if not found in the database.
>
Great!


>
> >
> > > b) Ability to assign features to datasets
> >
> > You mean "to append" features or have some structured meta information
> > about the dataset itself?
> >
>
> If you PUT an RDF containing new features to an existing dataset, they
> will be appended to the dataset, and /dataset/id/feature will return the
> full set of features.
>
>
Does that mean the feature will be associated with the Dataset and not with
the individual compounds?



> >
> > > c) Ability to have non-standard datasets/compounds which contain
> > > substructures rather than molecules.
> >
>
> In fact, if SMILES/MOL/SDF files containing substructures are uploaded,
> they will be saved as they are, and they will be available via the
> /compound services.
>
> However, here are some discussion points:
> - I don't think substructures and compounds should use the same dataset
> and compound API; these are semantically different resources.

We could use another property of a compound (maybe called isFragment) or
something similar to indicate the type of data (see the sketch below).
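
For example, mirroring the snippet earlier in the thread (isFragment is a
hypothetical property, not part of the current ontology):

    /compound/435
           a ot:Compound
           ot:isFragment "true"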



> - If we consider substructures as properties of the compounds, it's more
> logical to have substructures as features, as we try to model any
> properties
> via this construct.
>

That makes sense logically, but for our case, and in the interest of
bandwidth utilization and load on the server hosting these features, this
method is pretty much a no-go.


> - What would be the benefit of having substructures as a dataset?  I am
> not sure there is a standard way to distinguish whether a given SMILES
> should be represented as a substructure rather than as an entire compound.
>

Another field (property) called isFragment, as sketched above?


> - As I understood, the substructure dataset is something used internally
> by the MaxTox algorithm, and not necessarily exposed to the end user.
> Thus, is it really necessary to have it available via a dataset service?
> Perhaps it is explained better in the manual Tobias is preparing, as it
> covers a similar case.
>
>
I am trying to expose the steps so as to enable an end user to create new
MaxTox models.

Regards
Surajit



> Best regards,
> Nina
>
>
> > >
> > > Regards
> > > Surajit
> >
> > Best regards,
> > Pantelis
> > >
> > > On 27 September 2010 18:31, chung <chvng at mail.ntua.gr> wrote:
> > > > Hi Surajit,
> > > >   As far as I can understand you have a problem similar to the one I
> > > > was discussing with Alexey from IBMC. You need a way to define which
> > > > substructures are present in a certain structure. For this purpose you
> > > > have to use features and not compounds. So you need a collection of
> > > > features, each one of which corresponds to a certain substructure.
> > > > However, in Ambit you can create a new compound by POSTing it
> > > > to /compound in a supported MIME type (e.g. SMILES, SDF etc.), for
> > > > example 'curl -X POST --data-binary @/path/to/file.sdf -H
> > > > Content-type:blah/blah+sdf http://someserver.com/compound'. What is
> > > > needed in OpenTox though is a collection of substructures in a feature
> > > > service and a way to look up a certain feature according to its
> > > > structure (e.g. providing its SMILES representation).
> > > >
> > > > Best Regards,
> > > > Pantelis
> > > >
> > > > On Mon, 2010-09-27 at 14:18 +0530, surajit ray wrote:
> > > >
> > > >> Hi Nina,
> > > >>
> > > >> Need to upload some fragments (which have SMILES representations)
> > > >> into a dataset. Is this possible in the current framework?
> > > >>
> > > >> To be more elaborate -
> > > >> Currently I am uploading a dataset with compounds as links to the
> > > >> respective compound URIs (which happens at the end of the online
> > > >> MaxtoxTest service). How would I upload new compounds (with
> > > >> SMILES/MOL representations)? And secondly, if these (the uploaded
> > > >> set) happen to be fragments (and not molecules), is there a way to
> > > >> store such information using the Ambit dataset service?
> > > >>
> > > >> Thanx
> > > >> Surajit
> --
>
> Dr. Nina Jeliazkova
> Technical Manager
> 4 A.Kanchev str.
> IdeaConsult Ltd.
> 1000 Sofia, Bulgaria
> Phone: +359 886 802011