[OTDev] Uploading non-standard datasets

Nina Jeliazkova jeliazkova.nina at gmail.com
Fri Oct 1 19:41:15 CEST 2010


Surajit, All,

Here is how I imagine the MaxTox implementation with the current OpenTox
API, in three flavours: with descriptor calculation hidden, with it exposed
as a service, and with the substructures stored at different levels of
detail. This text was initially inside the reply, but I hope having it in
front of the message will help to clarify the misunderstandings we have had
so far (and be read by others on the list as well ;) I believe this is
applicable not only to MaxTox models.

***************
Option 1. Hidden descriptor calculation
There is a single MaxTox algorithm resource encapsulating the entire MaxTox
procedure.
1.1. Model creation

A dataset D1 is POSTed to a MaxTox algorithm A. The algorithm internally
creates a dataset with fragments F1 and creates a new MaxTox model resource
M1, parameterized by dataset D1 (and eventually by other parameters, if
applicable). The fragments dataset is only internally visible to the MaxTox
model M1. The model M1 independent variables are empty; the prediction
variable(s) are created as features, POSTed to the feature service, and
stored in the model ot:predictedVariables.
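Something like this, as a minimal sketch (all URIs are made up; the
dataset_uri parameter follows the usual OpenTox algorithm API convention):

# POST the training dataset URI to the MaxTox algorithm resource; the
# service returns the URI of the new model M1 (possibly via an ot:Task,
# if the computation runs asynchronously)
curl -X POST -d "dataset_uri=http://host/dataset/D1" \
     http://host/algorithm/maxtox
# => http://host/model/M1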

1.2. Model prediction

The MaxTox model M1 checks whether the compound contains any of the (10000)
substructures it looks for, based on the internally visible set of fragments
F1, runs its random forest algorithm and assigns the prediction result.
Predictions are represented as features in RDF, and the RDF is POSTed to a
dataset service.
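Again as a sketch with made-up URIs, prediction is just another POST, this
time to the model resource:

# POST a dataset (or compound) URI to the model; the response is the URI
# of a dataset holding the predicted feature values
curl -X POST -d "dataset_uri=http://host/dataset/D3" \
     http://host/model/M1
# => http://host/dataset/R1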

***************

Option 2. MaxTox with descriptor calculation exposed as a service.
There are two types of MaxTox algorithm resources, representing stage 1
(descriptor calculation) and stage 2 (model building).
2.1. Descriptor calculation

2.1.1. Fragment generation: A dataset D2 is POSTed to the MaxTox descriptor
calculation algorithm A (i.e. stage 1). The algorithm internally creates a
dataset with fragments F2 and creates a new MaxTox algorithm resource A2,
parameterized by dataset D2 (and eventually by other parameters, if
applicable). The fragments dataset is only internally visible to the MaxTox
algorithm A2.

2.1.2. Descriptor calculation. Compounds or datasets are POSTed to the
MaxTox algorithm A2. The algorithm A2 checks for all 10000 features, based
on the (only internally visible) set of fragments F2, creates RDF with
features representing those fragments, and assigns 0/1 values to the
compounds in question. Finally, it POSTs the RDF to a dataset service (the
same or a different dataset).

2.2. Model creation
A dataset D2 with features created by the MaxTox algorithm A2 is POSTed to
the MaxTox random forest algorithm B (in fact, to any other learning
algorithm which can handle 0/1 values). A MaxTox model M2 is created and its
URL returned. The M2 independent variables point to the fragment features,
as generated in 2.1.2.
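A sketch of the stage 2 call (made-up URIs; prediction_feature is the usual
OpenTox parameter telling a learning algorithm which feature to predict):

# train the random forest on the descriptor dataset produced by A2
curl -X POST -d "dataset_uri=http://host/dataset/D2" \
     -d "prediction_feature=http://host/feature/activity" \
     http://host/algorithm/maxtox-rf
# => http://host/model/M2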

2.3. Model prediction
A compound or dataset D3 is POSTed to the MaxTox M2 model service. If the
compound(s) already have the fragment features calculated (the features are
known via ot:independentVariables), they are used directly; otherwise, the
A2 descriptor calculation algorithm is applied (there is a pointer to it in
each feature's ot:hasSource entry). When all features are in place, the
random forest algorithm is applied. The result contains the prediction
results as features, and the RDF is POSTed back to a dataset service (the
same or a new dataset).
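As a sketch (made-up URIs), this is the kind of fallback the model service
(or a client) could perform when the fragment features are missing:

# the model description lists its input features
curl -H "Accept: text/n3" http://host/model/M2
# => ... ot:independentVariables <http://host/feature/1001> , ... ;

# each feature points back to the algorithm that generates it
curl -H "Accept: text/n3" http://host/feature/1001
# => ... ot:hasSource <http://host/algorithm/maxtox/A2> ...

# so missing values can be filled in by invoking A2 before prediction
curl -X POST -d "dataset_uri=http://host/dataset/D3" \
     http://host/algorithm/maxtox/A2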

(Note this procedure essentially caches the substructures identified in
each compound. Thus, if 100 users try to run the same compound through
MaxTox M2, the substructures will be looked up only once, saving bandwidth
and processing time.)

*******************

Option 3. MaxTox with fingerprint calculation exposed as a service
There are two types of MaxTox algorithm resources, representing stage 1
(fingerprint calculation) and stage 2 (model building).
*It is essentially the same as Option 2, but stores a single feature
representing a fingerprint, instead of multiple features representing
individual substructures.*

3.1. Descriptor calculation
3.1.1. Fragment generation: A dataset D2 is POSTed to the MaxTox descriptor
calculation algorithm A (i.e. stage 1). The algorithm internally creates a
dataset with fragments F2 and creates a new MaxTox algorithm resource A2,
parameterized by dataset D2 (and eventually by other parameters, if
applicable). The fragments dataset is only internally visible to the MaxTox
algorithm A2.

3.1.2. Descriptor calculation. Compounds or datasets are POSTed to the
MaxTox algorithm A2. The algorithm A2 checks for all 10000 features, based
on the (only internally visible) set of fragments F2, *creates RDF with a
single feature representing a fingerprint over those fragments, and assigns
the fingerprint value to the compound in question.* Finally, it POSTs the
RDF to a dataset service (the same or a different dataset).
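A sketch of what A2 could POST back (made-up URIs; Turtle shown for
readability, though the services may expect RDF/XML; the bitstring encoding
of the fingerprint value is also just an assumption):

curl -X POST -H "Content-Type: text/n3" --data-binary @- \
     http://host/dataset/D2 <<'EOF'
@prefix ot: <http://www.opentox.org/api/1.1#> .
<http://host/dataset/D2> a ot:Dataset ;
  ot:dataEntry [ a ot:DataEntry ;
    ot:compound <http://host/compound/7> ;
    ot:values [ a ot:FeatureValue ;
      ot:feature <http://host/feature/maxtox-fp> ;
      ot:value "01011100" ] ] .
EOF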

3.2. Model creation
A dataset D2 with the feature created by the MaxTox algorithm A2 is POSTed
to the MaxTox random forest algorithm B (*here is a difference with Option
2, since only MaxTox will understand what the fingerprint feature means*).
A MaxTox model M2 is created and its URL returned. *The M2 independent
variables point to the fingerprint feature*, as generated in 3.1.2.

3.3. Model prediction
A compound or dataset D3 is POSTed to the MaxTox M2 model service. If the
compound(s) already have *the fingerprint feature* calculated (the feature
is known via ot:independentVariables), it is used directly; otherwise, the
A2 descriptor calculation algorithm is applied (there is a pointer to it in
the fingerprint feature's ot:hasSource entry). When the fingerprint is in
place, the random forest algorithm is applied. The result contains the
prediction results as features, and the RDF is POSTed back to a dataset
service (the same or a new dataset).

(Caching of the substructure calculations is still valid here, as in Option 2.)

Option 3 is indeed a good compromise, if there is no objective of allowing
the MaxTox fragment descriptors to be used by other learning algorithms.

*******************

More specific replies inline.

On Fri, Oct 1, 2010 at 6:41 PM, surajit ray <mr.surajit.ray at gmail.com> wrote:

> Hi,
>
> Seems like I have yet to get the point across in its entirety. So let's
> first examine the feature approach you are suggesting.
>
> If a feature can only be assigned to a compound then the following will
> happen :
>
> a) the source Dataset will create a set of substructure features (maybe
> 10000 or more) via stage 1 of the Maxtox algorithm.
>

OK

> b) Each feature will be a substructure (represented in compound form ?)
> associated with multiple compounds.
>

No, it is represented as a feature; it may contain additional information
(e.g. SMILES, SMARTS) useful for your algorithm.


> c) Each feature will be in a Dataset if the feature (substructure) appeared
> while doing stage 1 (Maxtox) on the dataset.
>

OK


> d) In stage 2 each of the compounds in the Dataset will have to be
> fingerprinted by checking which of the features (mentioned in the result of
> Stage 1) appears in the compound.
>

You mean there will be a Yes or No value (1/0) assigned to the feature and
compound pair.


> e) this fingerprint will then be another feature of the compound stored by
> another dataset.
>

No, it will be the same feature.  (I guess this is where the
misunderstanding comes in.)

It will just have a value of 1 or 0 assigned to the compound. This could be
in the same or another dataset, or in multiple datasets.


> f) the fingerprints are then used for model building and prediction.
>
> The other way to do this would be to :
> 1) get all the substructures (as compounds ?)


as features.


> from stage 1 and store them in
> a dataset.
>

BTW, I think what you describe as "stage 1" is what we call a "descriptor
calculation algorithm" in the OpenTox API.



> 2) use this dataset to fingerprint the compounds within the source Dataset
>

This should actually be done by a new instance of the MaxTox descriptor
calculation algorithm, parameterized by the dataset, which checks only for
the substructures extracted from this dataset. At the end of processing it
would generate RDF with features representing the substructures, assign 0/1
values to the compounds, and POST the RDF back.


> 3) use the fingerprints (features) in model building and prediction
>

This is indeed what corresponds to model building in the OpenTox API.


>
> The disadvantage of method 1 is that it is convoluted and counter intuitive
>

Well, it seems to be a matter of point of view. I would say a clean
algorithm is clearly separated into modules doing well-defined tasks.


> 1) each time a set of substructures is generated, a corresponding number
> of features will have to be generated first.
>




> 2) After this process each feature will have to be checked to see if there
> is an existing representation (in feature form )
>


> 3) if yes, then the feature will have to be sought in all the compounds of
> the dataset, and the compounds that contain this feature will have to be
> associated with the feature using a data entry
>


> 4) if no, then the feature will be created and then sought in all the
> compounds, and the respective compounds updated
>

Steps 2-4 are not necessary in this order and level of detail. You might
simply create all the features in RDF form and POST them to the feature
service; it will take care of checking whether the same features already
exist. The same goes for compounds and data entries: simply create RDF with
data entries and features and POST it to the dataset service. It will take
care of handling the existing entries.
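For illustration only (made-up URIs; Turtle shown for readability, though
the services may expect RDF/XML; how new features are named before the
service assigns URIs is glossed over here), the whole set of fragment
features could go out in a single request:

# one POST creates (or reuses) many features at once; no per-feature
# round trips are needed
curl -X POST -H "Content-Type: text/n3" --data-binary @- \
     http://host/feature <<'EOF'
@prefix ot: <http://www.opentox.org/api/1.1#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<urn:maxtox:frag1> a ot:Feature ;
  dc:title "substructure c1ccccc1" ;
  ot:hasSource <http://host/algorithm/maxtox/A2> .
<urn:maxtox:frag2> a ot:Feature ;
  dc:title "substructure C(=O)O" ;
  ot:hasSource <http://host/algorithm/maxtox/A2> .
EOF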

> 5) the resulting dataset from the above operations will be the source
> Dataset compounds with a variable bunch of features (substructures)
> associated with them.
> 6) when we try to do stage 2 of Maxtox - this dataset (from the previous
> step) will have to be parsed to retrieve all the features and arrange them
> sequentially (an order is ok - may be atom size)
> 7) the compounds will have a fingerprint (feature) based on the
> existence/absence of a certain substructure. Which will be used for the
> predictions
>
> As you can see, the intermediate dataset created is not much help,
> except maybe for speeding up the fingerprinting process (since the
> substructures corresponding to a compound will be stored). Even then there
> is the additional step of finding the feature from the intermediate dataset
> and matching it to the ones stored in sequence to find the correct
> fingerprint of the compound.
>

The intermediate step (essentially a descriptor calculation) is useful if
you want to allow other OpenTox service clients to use your (substructure)
features. If this is not the case, you might simply hide the entire
procedure and only expose the final prediction results as features (POST a
dataset with the final predictions only).



>
> Secondly, when you consider that each feature (out of many thousands) is
> checked/created over the internet and multiple accesses are performed - it
> will take down the reliability of the process by many degrees.
>
>
They might be created via a single POST operation; separate POSTs for every
feature are not necessary.


>
>
> On 30 September 2010 13:48, Nina Jeliazkova <jeliazkova.nina at gmail.com>
> wrote:
>
> > Hi All,
> >
> > On Mon, Sep 27, 2010 at 6:03 PM, chung <chvng at mail.ntua.gr> wrote:
> >
> > > On Mon, 2010-09-27 at 20:03 +0530, surajit ray wrote:
> > > > Hi,
> > > >
> > > > Well having them as features will not cut it - for the simple reason
> > > > that the "feature" in this case belongs to the input dataset (or
> > > > whichever set is being worked upon).
> > >
> >
> > No, a feature does not belong to a dataset, but might be associated
> > with one or more datasets (or none).
> >
> >
> A feature (which in our case is another dataset consisting of fragments)
> needs to "belong" to a dataset, because that immensely simplifies the
> process and is intuitive wrt the original workflow
>
>
A feature is not a dataset, at least not in the meaning it currently has.
In fact, I would say that the fragments dataset is something that should
belong to the particular instance of the MaxTox model or algorithm, neither
to a compound nor to a dataset.



>
> > >
> > > As far as I know as it is generally conceived in OpenTox and as far as
> > > the implementation in AMBIT is concerned, features are separated from
> > > datasets and can be standalone.
> >
> >
> > Exactly. Features are standalone objects; they can be created by
> > POSTing to the /feature service, and the only connection is via
> > ot:DataEntry, used in ot:Datasets. A feature could be used by multiple
> > datasets.
> >
> > > That is, you can have a feature that
> > > does not appear in any datasets.
> > >
> > > You might only have a pointer to a
> > > dataset using the object property 'ot:hasSource' but this does not
> > > somehow bind the feature to the dataset. However, I'm not sure if I
> > > understood well.
> > >
> >
> >
> > This is correct.
> >
> >
> > >
> > > > The compound itself may not have
> > > > a substructure, but it may be a part of a dataset which, when
> > > > examined, will have the substructure appearing while doing the
> > > > pairwise comparisons.
> > >
> >
> > In the current setup, the substructures are features, like any other
> > properties. If a compound has no such substructure, it may simply have
> > no such feature assigned via an ot:DataEntry. This gives universal
> > access to learning algorithms, which can work with any kind of features,
> > including substructures (e.g. there is no need to tell an SVM algorithm
> > whether a feature is a substructure or anything else).
> >
> >
> >
> let's consider a setup of around 10000 features. To manage the above,
> each compound will have to be checked against the 10000 features
> (generated or found) over the internet - which means 10000 times the
> number of compounds = the total number of requests over the internet. I
> don't think even in first world countries the internet is stable enough
> for this to work - it will more often fail.
>


I think there is again a misunderstanding. It is not necessary to check
the compound against 10000 features over the internet. It doesn't work this
way; on the contrary, the algorithm is applied to a compound and the
results are stored as features - the features are not checked one by one.
Features don't do any processing; algorithms and models do. Here is how it
looks (much simpler than the 7 steps above), in two flavours:

(The description has been moved to the top of the message.)


>
> Our real useful feature is the fingerprint.
>
>
OK, then store fingerprints as features, but assign ot:hasSource to them,
so clients know how to calculate them. This will be yet another variant, in
addition to options 1 and 2 above (added as Option 3 at the top of the
message).
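Such a fingerprint feature could look like this (made-up URIs; Turtle shown
for readability, though the services may expect RDF/XML):

curl -X POST -H "Content-Type: text/n3" --data-binary @- \
     http://host/feature <<'EOF'
@prefix ot: <http://www.opentox.org/api/1.1#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<urn:maxtox:fingerprint> a ot:Feature ;
  dc:title "MaxTox fragment fingerprint" ;
  ot:hasSource <http://host/algorithm/maxtox/A2> .
EOF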


>
>
> > >
> > > If you need to establish a relationship between a such a feature and a
> > > compound (so that given the feature you can retrieve the
> > > fragment/compound to which it refers in any supported MIME type), then
> > > we can extend the range of the property ot:hasSource to include also
> > > ot:Compound and assign a compound URI to such features. i.e. something
> > > like:
> > >
> > > /feature/123
> > >        a ot:Feature
> > >        ot:hasSource /compound/435
> > >
> >
> >
> > We can simply use the current construct and declare that ot:hasSource
> > points to the algorithm that verifies the presence of substructures, or
> > finds substructures by any other means. Recall that the purpose of
> > ot:hasSource is to be able to regenerate the feature when applied to a
> > new compound. If it points to an algorithm, regenerating the feature is
> > straightforward; if it points to a compound, it will not be possible.
> >
> >
> In this case both the algorithm and the source dataset will determine the
> source of the features (substructures)
>
>
This is exactly the same case as implemented by the TUM services.


>
> > >
> > > But then I'm not sure whether the following are also needed:
> > >
> > > 1. Declare that the feature above is a ot:SubstructureFeature (new) or
> > > at least declare that it is boolean.
> > >
> >
> > An extension of ot:Feature is reasonable, but it would be better if
> > the algorithms generating substructures were described in an ontology,
> > like any other algorithms (e.g. the descriptor calculation ones).
> >
> >
> A descriptor calculation algorithm ties in neatly with the present
> framework. But what about algorithms which generate intermediate data,
> which leads to descriptor calculation via another algorithm?
>
>
>
If the data is indeed intermediate, it doesn't need to be exposed by
services and can safely be encapsulated within the descriptor calculation
algorithm.

IMHO, a good check of whether something needs to be exposed as a
service/dataset/feature is whether it could be reused by some other
service, or whether it is just for the internal processing of a single
algorithm. If it is for internal processing, there is no need to expose it
as a service. If it could be reused, exposing it as a REST resource is
essential to making the framework work.



>
> >
> > >
> > > 2. Make it explicit that the above compound is a ot:Fragment (new)
> > >
> >
> >
> >
> >
> > >
> > > Maybe we can go without introducing extra classes.
> > >
> > > >
> > > > Using the feature system in this manner is not (IMHO) the solution
> > > > to this problem. It will really be cumbersome to maintain the
> > > > feature URIs, which may number in the many thousands and will be
> > > > extremely transient. In effect it will be a lot of resources being
> > > > hogged by a system which could do with a much simpler implementation.
> > >
> > > That is not really a problem. A feature is a very small entry in a
> > > database. There are enterprises that maintain databases of some tens of
> > > TeraBytes or even more.
> > >
> >
> > Indeed.
> >
> > Besides, having substructures as features effectively introduces
> > caching of substructures, which allows 1) avoiding multiple calculations
> > of the same feature; 2) seeing whether the same substructures are
> > used/generated by different algorithms, and even doing some comparison;
> > 3) visualisation and statistics over features, without having to have
> > all of them in memory.
> >
> >
> >
> I guess we are forgetting that our API is internet based, and for an
> algorithm to check and create more than 10000 features per run (on
> average) is going to severely clamp down on the available bandwidth and
> server resources. Plus, in such a scenario, let's say we give a search
> request to a server containing the substructure features - we ask, does
> substructure A exist in the database? To give an answer, a million or more
> checks will have to be made by the server per request (considering that
> all known substructures may run into the millions). So for the server
> hosting the features, each algorithm run will generate "grep" runs which
> will equal 10000 * number of compounds * a million (total number of
> substructures).
>

See the comments above. There was never any intention to check features
one by one; the algorithm checks them internally and just stores all the
results at once. There was never any intention to do substructure searches
in a database - that is what your algorithm should do.


>
> For a set of 500 compounds, that number is 10000 * 500 * 1000000 =
> 5,000,000,000,000 searches, which I feel is moving into Cray computer
> territory.
>
>
> > >
> > > > Moreover, a certain feature in such a system will be a part of a
> > > > compound if it's a part of Dataset A, and may not be a part of the
> > > > same compound when examined in Dataset B.
> > > >
> > >
> > > This is true. For example if a compound does not contain C=O it is
> > > obvious it will not contain CC=O or in general RC=O.
> > >
> >
> > This is no problem currently; features might be present in one dataset
> > and not in another.
> >
> >
> I would like to have a dataset (of substructures) which is a feature of
> the source Dataset, with hasSource being stage 1 of the MaxTox algorithm.
> This immensely simplifies the process - and makes life easier for all of
> us.
>
>
Well, perhaps the best approach would be to simplify things even further
and have MaxTox report only the final prediction results. It would hold the
substructures internally in any form suitable for it.


>
> >
> > >
> > > > Summing up, here are a few things I would like in the next API
> > > >
> > > > a) Ability to upload bulk compounds from scratch, using a dataset
> > > > construct (and not posting single compounds)
> > >
> > > I think this is supported. You can POST a dataset with a set of new
> > > compounds. If one or more compounds are not found in the database of
> > > the server, they should be created.
> > >
> >
> >
> > Indeed, this is the operation most often used now. Compounds are
> > created, if not found in the database.
> >
> Great!
>
>
> >
> > >
> > > > b) Ability to assign features to datasets
> > >
> > > You mean "to append" features or have some structured meta information
> > > about the dataset itself?
> > >
> >
> > If you PUT RDF containing new features to an existing dataset, they
> > will be appended to the dataset, and /dataset/id/feature will return
> > the full set of features.
> >
> >
> Does it mean that the feature will be associated with the Dataset and not
> with the individual compounds?
>
>
>
A feature itself is a standalone object. Note this is the feature itself,
not its value.

The value of the feature is assigned to a compound via a dataset entry.
There might be multiple values, assigned to multiple compounds, within one
or multiple datasets.
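As a sketch of the construct (made-up URIs; Turtle shown for readability,
though the services may expect RDF/XML):

# the dataset entry links the compound to feature *values*;
# the features themselves stay standalone resources
curl -X POST -H "Content-Type: text/n3" --data-binary @- \
     http://host/dataset/42 <<'EOF'
@prefix ot: <http://www.opentox.org/api/1.1#> .
<http://host/dataset/42> a ot:Dataset ;
  ot:dataEntry [ a ot:DataEntry ;
    ot:compound <http://host/compound/7> ;
    ot:values
      [ a ot:FeatureValue ;
        ot:feature <http://host/feature/1001> ;
        ot:value "1" ] ,
      [ a ot:FeatureValue ;
        ot:feature <http://host/feature/1002> ;
        ot:value "0" ] ] .
EOF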


>
> > >
> > > > c) Ability to have non-standard datasets/compounds which contain
> > > > substructures rather than molecules.
> > >
> >
> > In fact, if SMILES/MOL/SDF files containing substructures are
> > uploaded, they will be saved as-is and will be available via the
> > /compound services.
> >
> > However, here are some discussion points:
> > - I don't think substructures and compounds should use the same dataset
> > and compound API; these are semantically different resources
> >
>
> We could use another property of a compound (maybe called isFragment) or
> something to determine the type of data.
>
>
>
> > - If we consider substructures as properties of the compounds, it's more
> > logical to have substructures as features, as we try to model any
> > properties
> > via this construct.
> >
>
> Makes sense logically, but for our case, as well as in the interest of
> bandwidth utilization and server load (hosting these features), this
> method is pretty much a no-go.
>
>
> > - What would be the benefit of having substructures as a dataset? I am
> > not sure there is a standard way to distinguish whether a given SMILES
> > should be represented as a substructure rather than as an entire
> > compound.
> >
>
> Another field (property) called isFragment ?
>

How would the server recognise whether a file with SMILES or MOL structures
should be interpreted as compounds or as fragments?


>
>
> > - As I understood, the substructures dataset is something used
> > internally by the MaxTox algorithm, and not necessarily exposed to the
> > end user. Thus, is it really necessary to have it available via a
> > dataset service? Perhaps it is explained better in the manual Tobias is
> > preparing, as it covers a similar case.
> >
> >
> I am trying to expose the steps so as to enable an end user to create
> new MaxTox models.
>


OK, then what about indeed using only fingerprints: a single feature,
assigned to the compound?

Best regards,
Nina


>
> Regards
> Surajit
>
>
>
> > Best regards,
> > Nina
> >
> >
> > > >
> > > > Regards
> > > > Surajit
> > >
> > > Best regards,
> > > Pantelis
> > > >
> > > > On 27 September 2010 18:31, chung <chvng at mail.ntua.gr> wrote:
> > > > > Hi Surajit,
> > > > > As far as I can understand, you have a problem similar to the
> > > > > one I was discussing with Alexey from IBMC. You need a way to
> > > > > define which substructures are present in a certain structure. For
> > > > > this purpose you have to use features and not compounds. So you
> > > > > need a collection of features, each one of which corresponds to a
> > > > > certain substructure. However, in Ambit you can create a new
> > > > > compound by POSTing it to /compound in a supported MIME type (e.g.
> > > > > SMILES, SDF etc.), for example 'curl -X POST --data-binary
> > > > > @/path/to/file.sdf -H Content-type:blah/blah+sdf
> > > > > http://someserver.com/compound'. What is needed in OpenTox,
> > > > > though, is a collection of substructures in a feature service and
> > > > > a way to look up a certain feature according to its structure
> > > > > (e.g. by providing its SMILES representation).
> > > > >
> > > > > Best Regards,
> > > > > Pantelis
> > > > >
> > > > > On Mon, 2010-09-27 at 14:18 +0530, surajit ray wrote:
> > > > >
> > > > >> Hi Nina,
> > > > >>
> > > > >> Need to upload some fragments (I have SMILES representations)
> > > > >> into a dataset. Is this possible in the current framework?
> > > > >>
> > > > >> To be more elaborate -
> > > > >> Currently I am uploading a dataset with compounds as the links to
> > > > >> the respective compound URIs (which happens at the end of the
> > > > >> online MaxtoxTest service). How would I upload new compounds
> > > > >> (with SMILES/MOL representations)? And secondly, if these (the
> > > > >> upload set) happen to be fragments (and not molecules), is there
> > > > >> a way to store such information using the ambit dataset service?
> > > > >>
> > > > >> Thanx
> > > > >> Surajit
> > > > >>
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
> > --
> >
> > Dr. Nina Jeliazkova
> > Technical Manager
> > 4 A.Kanchev str.
> > IdeaConsult Ltd.
> > 1000 Sofia, Bulgaria
> > Phone: +359 886 802011
> >
>


