[OTDev] Uploading non-standard datasets

Thu Sep 30 13:48:14 CEST 2010

Hi All,

On Mon, Sep 27, 2010 at 6:03 PM, chung <chvng at mail.ntua.gr> wrote:

> On Mon, 2010-09-27 at 20:03 +0530, surajit ray wrote:
> > Hi,
> >
> > Well having them as features will not cut it - for the simple reason
> > that the "feature" in this case belongs to the input dataset (or
> > whichever set is being worked upon).
>

No, a feature does not belong to a dataset, but might be associated with one
or more datasets (or none).

>
> As far as I know as it is generally conceived in OpenTox and as far as
> the implementation in AMBIT is concerned, features are separated from
> datasets and can be standalone.

Exactly. Features are standalone objects that can be created by POSTing to
the /feature service; the only connection to a dataset is via ot:DataEntry,
used in ot:Dataset. A feature can be used by multiple datasets.
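
As a rough sketch of what creating a standalone feature could look like, the
Python snippet below builds (but does not send) a POST request to a feature
service. The server URL, the feature title, the text/n3 content type and the
ot: namespace URI are assumptions made for illustration, not values confirmed
in this thread.

```python
from urllib.request import Request

# Assumed values for illustration only; not taken from a real server.
FEATURE_SERVICE = "http://example.org/feature"
OT_NS = "http://www.opentox.org/api/1.1#"  # assumed OpenTox namespace
DC_NS = "http://purl.org/dc/elements/1.1/"


def feature_n3(title: str) -> str:
    """Return a minimal N3 description of a standalone ot:Feature."""
    return (
        f"@prefix ot: <{OT_NS}> .\n"
        f"@prefix dc: <{DC_NS}> .\n"
        f'[] a ot:Feature ;\n   dc:title "{title}" .\n'
    )


def build_feature_post(title: str) -> Request:
    """Build a POST request that would create the feature; nothing is sent."""
    body = feature_n3(title).encode("utf-8")
    return Request(
        FEATURE_SERVICE,
        data=body,
        headers={"Content-Type": "text/n3"},
        method="POST",
    )


request = build_feature_post("substructure C=O")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Note that the description mentions no dataset at all: the feature exists on
its own until some data entry refers to it.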

> That is, you can have a feature that
> does not appear in any datasets. You might only have a pointer to a
> dataset using the object property 'ot:hasSource' but this does not
> somehow bind the feature to the dataset. However, I'm not sure if I
> understood well.
>
>

This is correct.

>
> > The compound itself may not have
> > a substructure but it may be a part of a dataset which when examined
> > will have the substructure appearing while doing the pairwise
> > comparisons.
>

In the current setup, substructures are features, like any other
property. If a compound has no such substructure, it may simply have no
such feature assigned via ot:DataEntry. This gives learning algorithms
universal access: they can work with any kind of feature, including
substructures (e.g. there is no need to tell an SVM algorithm whether a
feature is a substructure or anything else).
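
To illustrate the point, here is a minimal Python sketch that flattens data
entries into a plain numeric table for a learner. The URIs and values are
made up, and filling a missing feature with 0 is an assumption of this
sketch, not something the API prescribes.

```python
# Hypothetical data entries: compound URI -> {feature URI: value}.
# A compound without a given substructure simply has no entry for it.
data_entries = {
    "/compound/1": {"/feature/10": 1, "/feature/11": 0.52},
    "/compound/2": {"/feature/11": 1.7},
}


def to_matrix(entries, missing=0):
    """Flatten data entries into (feature URIs, rows) for a learner.

    The learner only sees columns of numbers; it is never told which
    column is a substructure and which is some other property.
    """
    features = sorted({f for values in entries.values() for f in values})
    rows = [
        [entries[compound].get(f, missing) for f in features]
        for compound in sorted(entries)
    ]
    return features, rows


features, rows = to_matrix(data_entries)
print(features)
print(rows)
```

Here /feature/10 could be a substructure flag and /feature/11 a calculated
descriptor; the code that builds the table does not need to know.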

>
> If you need to establish a relationship between such a feature and a
> compound (so that given the feature you can retrieve the
> fragment/compound to which it refers in any supported MIME type), then
> we can extend the range of the property ot:hasSource to include also
> ot:Compound and assign a compound URI to such features. i.e. something
> like:
>
> /feature/123
>        a ot:Feature
>        ot:hasSource /compound/435
>

We can simply use the current construct and have ot:hasSource point to an
algorithm that verifies the presence of substructures, or finds substructures
by any other means. Recall that the purpose of ot:hasSource is to be able to
regenerate the feature when applied to a new compound. If it points to an
algorithm, generating the feature is straightforward; if it points to a
compound, it will not be possible.
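
A toy Python sketch of this argument follows. Everything in it is
hypothetical: the URIs, the in-memory dictionaries standing in for the
feature and algorithm services, and the naive text match standing in for a
real substructure search.

```python
# Hypothetical in-memory stand-ins for services; all URIs are made up.
def contains_carbonyl(smiles: str) -> bool:
    """Toy 'algorithm': naive text match, not a real substructure search."""
    return "C=O" in smiles


ALGORITHMS = {"/algorithm/carbonyl": contains_carbonyl}

FEATURES = {
    "/feature/123": {"title": "C=O present", "hasSource": "/algorithm/carbonyl"},
    "/feature/124": {"title": "fragment", "hasSource": "/compound/435"},
}


def regenerate(feature_uri: str, smiles: str):
    """Recompute a feature for a new compound by following hasSource.

    Returns None when the source is not a known algorithm, because there
    is then nothing to run.
    """
    source = FEATURES[feature_uri]["hasSource"]
    algorithm = ALGORITHMS.get(source)
    return None if algorithm is None else algorithm(smiles)


print(regenerate("/feature/123", "CC=O"))  # True
print(regenerate("/feature/124", "CC=O"))  # None
```

The second feature points to a compound, so the sketch has no way to compute
its value for a compound it has not seen before.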

>
> But then I'm not sure whether the following are also needed:
>
> 1. Declare that the feature above is a ot:SubstructureFeature (new) or
> at least declare that it is boolean.
>

Extending ot:Feature is reasonable, but it would be better if the
algorithms generating substructures were described in an ontology, like any
other algorithms (e.g. descriptor calculation ones).

>
> 2. Make it explicit that the above compound is a ot:Fragment (new)
>

>
> Maybe we can go without introducing extra classes.
>
> >
> > Using the features system in this manner is not (IMHO) the solution to
> > this problem. It will be really cumbersome to maintain the feature
> > URIs which may number in many thousands and will be extremely
> > transient. In effect it will be a lot of resources being hogged by a
> > system which could do with a much simpler implementation.
>
> That is not really a problem. A feature is a very small entry in a
> database. There are enterprises that maintain databases of some tens of
> TeraBytes or even more.
>

Indeed.

Besides, having substructures as features effectively introduces caching of
substructures, which allows:
1) avoiding repeated calculation of the same feature;
2) seeing whether the same substructures are used or generated by
different algorithms, and even doing some comparison;
3) visualisation and statistics over features, without having to hold all
of them in memory.
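
The caching idea can be sketched in a few lines of Python. The algorithm
URIs are made up, and the function that "finds" substructures is a
deliberately naive stand-in (it just slices the SMILES string); a real
service would store features in a database rather than in a dictionary.

```python
calls = 0  # counts how often the (pretend) expensive step runs


def find_substructures(algorithm_uri: str, smiles: str) -> frozenset:
    """Toy stand-in for a substructure algorithm (hypothetical)."""
    global calls
    calls += 1
    # Naive: report every two-character slice as a "substructure".
    return frozenset(smiles[i:i + 2] for i in range(len(smiles) - 1))


_cache = {}


def cached_substructures(algorithm_uri: str, smiles: str) -> frozenset:
    """Return stored results when the same pair was computed before."""
    key = (algorithm_uri, smiles)
    if key not in _cache:
        _cache[key] = find_substructures(algorithm_uri, smiles)
    return _cache[key]


first = cached_substructures("/algorithm/a", "CC=O")
second = cached_substructures("/algorithm/a", "CC=O")  # served from cache
other = cached_substructures("/algorithm/b", "CC=O")

print(calls)                  # 2: the repeated request was not recomputed
print(sorted(first & other))  # substructures both algorithms produced
```

Because the results are stored, comparing the output of two algorithms
becomes a simple set operation.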

>
> > Moreover a certain feature in such a system will be a part of a
> > compound if it's a part of Dataset A and may not be a part of the same
> > compound when examined in Dataset B.
> >
>
> This is true. For example if a compound does not contain C=O it is
> obvious it will not contain CC=O or in general RC=O.
>

This is not a problem currently; features may be present in one dataset and
not in another.

>
> > Summing up, here are a few things I would like in the next API
> >
> > a) Ability to upload bulk compounds from scratch, using a dataset
> > construct (and not posting single compounds)
>
> I think this is supported. You can POST a dataset with a set of new
> compounds. If one or more compounds are not found in the database of the
> server they should be created.
>

Indeed, this is the operation most often used now.  Compounds are created,
if not found in the database.

>
> > b) Ability to assign features to datasets
>
> You mean "to append" features or have some structured meta information
> about the dataset itself?
>

If you PUT an RDF document containing new features to an existing dataset,
they will be appended to the dataset, and /dataset/id/feature will return the
full set of features.
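
For illustration, the two requests involved could be built in Python as
follows. Nothing is sent; the dataset URI, the application/rdf+xml content
type and the empty placeholder payload are assumptions of this sketch.

```python
from urllib.request import Request

DATASET_URI = "http://example.org/dataset/1"  # hypothetical dataset


def build_append_request(rdf_xml: bytes) -> Request:
    """Build a PUT that would append the RDF content to the dataset."""
    return Request(
        DATASET_URI,
        data=rdf_xml,
        headers={"Content-Type": "application/rdf+xml"},
        method="PUT",
    )


def build_feature_list_request() -> Request:
    """Build a GET for the dataset's full feature list."""
    return Request(
        DATASET_URI + "/feature",
        headers={"Accept": "application/rdf+xml"},
        method="GET",
    )


# Placeholder payload: an empty RDF document, not real feature data.
payload = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'

append = build_append_request(payload)
listing = build_feature_list_request()
print(append.get_method(), append.full_url)
print(listing.get_method(), listing.full_url)
```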

>
> > c) Ability to have non-standard datasets/compounds which contain
> > substructures rather than molecules.
>

In fact, if SMILES / MOL / SDF files containing substructures are uploaded,
they will be saved as is, but they will be available via the /compound
services.

However, here are some discussion points:
- I don't think substructures and compounds should use the same dataset and
compound API; these are semantically different resources.
- If we consider substructures as properties of the compounds, it's more
logical to have substructures as features, as we try to model any property
via this construct.
- What would be the benefit of having substructures as a dataset? I am not
sure there is a standard way to distinguish whether a given SMILES should be
represented as a substructure rather than an entire compound.
- As I understood, the substructure datasets are something used internally
by the MaxTox algorithm, and not necessarily exposed to the end user. Thus, is
it really necessary to have them available via a dataset service? Perhaps it
is explained better in the manual Tobias is preparing, as it covers a
similar case.

Best regards,
Nina

> >
> > Regards
> > Surajit
>
> Best regards,
> Pantelis
> >
> > On 27 September 2010 18:31, chung <chvng at mail.ntua.gr> wrote:
> > > Hi Surajit,
> > >   As far as I can understand you have a problem similar to the one I
> > > was discussing with Alexey from IBMC. You need  a way to define which
> > > substructures are present in a certain structure. For this purpose you
> > > have to use features and not compounds. So you need a collection of
> > > features each one of which corresponds to a certain substructure.
> > > However in Ambit you can create a new compound by POSTing it
> > > to /compound in a supported MIME (e.g. SMILES, SDF etc) for example
> > > 'curl -X POST --data-binary @/path/to/file.sdf -H
> > > Content-type:blah/blah+sdf http://someserver.com/compound'. What is
> > > needed in OpenTox though is a collection of substructures in a feature
> > > service and a way to look up a certain feature according to its
> > > structure (e.g. providing its SMILES representation).
> > >
> > > Best Regards,
> > > Pantelis
> > >
> > > On Mon, 2010-09-27 at 14:18 +0530, surajit ray wrote:
> > >
> > >> Hi Nina,
> > >>
> > >> Need to upload some fragments (have SMILES representations) into a
> > >> dataset. Is this possible in the current framework ?
> > >>
> > >> To be more elaborate -
> > >> Currently I am uploading a dataset with compounds as the links to the
> > >> respective compound URIs (which happens at the end of the online
> > >> MaxtoxTest service). How would I upload new compounds (with SMILES/mol
> > >> representations) ? And secondly if these (the upload set) happen to be
> > >> fragments (and not molecules) is there a way to store such information
> > >> using the ambit dataset service ?
> > >>
> > >> Thanx
> > >> Surajit
> > >> _______________________________________________
> > >> Development mailing list
> > >> Development at opentox.org
> > >> http://www.opentox.org/mailman/listinfo/development
> > >>

-- 

Dr. Nina Jeliazkova
Technical Manager
4 A.Kanchev str.
IdeaConsult Ltd.
1000 Sofia, Bulgaria
Phone: +359 886 802011