[OTDev] Datasets with Features for multi entity relationships ?

Wed Nov 24 05:11:57 CET 2010

Hi Christoph,

On 23 November 2010 22:05, Christoph Helma <helma at in-silico.ch> wrote:
> Excerpts from surajit ray's message of Tue Nov 23 09:12:16 +0100 2010:
>> Hi Christoph,
>>
>> Scrolling through the mile-long RDF file, I could barely make out
>> what's going on!
>
> One of the big advantages of XML ;-)
>
>> Could you please outline, in a graphical/intuitive
>> description, what exactly is implemented in the RDF?
>
> You are better off having a look at the Turtle representation (it was
> attached to the previous post), which is easier to read.
>
> A dataset with 3 compounds and 3 substructures would have the following
> basic structure (assuming that
>        - featureX occurs in compound1 and compound2
>        - featureY occurs in compound2
>        - featureZ occurs in compound1 and compound3
> )
>
> compounds:
>        - compound1
>        - compound2
>        - compound3
>
> data_entries:
>        - compound1:
>                featureX: true
>                featureZ: true
>        - compound2:
>                featureY: true
>                featureX: true
>        - compound3:
>                featureZ: true
>
> features:
>        featureX:
>                ot:smarts: cN
>                ot:pValue: 0.97
>                ot:effect: activating
>                ot:hasSource: http://webservices.in-silico.ch/algorithm/fminer/bbrc
>                ot:parameters:
>                        dataset_uri: http://webservices.in-silico.ch/dataset/1
>        featureY:
>                ot:smarts: ccc
>                ot:pValue: 0.96
>                ot:effect: deactivating
>                ot:hasSource: http://webservices.in-silico.ch/algorithm/fminer/bbrc
>                ot:parameters:
>                        dataset_uri: http://webservices.in-silico.ch/dataset/1
>        featureZ:
>                ot:smarts: N(O)=O
>                ot:pValue: 0.99
>                ot:effect: activating
>                ot:hasSource: http://webservices.in-silico.ch/algorithm/fminer/bbrc
>                ot:parameters:
>                        dataset_uri: http://webservices.in-silico.ch/dataset/1
>

For a large dataset, the number of substructures mined by a given
algorithm may be large (in the range of thousands). According to this
representation, a substructure which occurs in 80% of the compounds
has to be associated with 80% of the dataset, which vastly increases
the size of the dataset representation. Iterating over all the
substructures may yield a dataset of gigantic proportions.
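
To make the size concern concrete, here is a minimal Python sketch
that counts the (compound, feature) entries such a representation has
to store. The function names and the "large dataset" numbers are my
own assumptions for illustration; they are not part of the OpenTox
API and not measured values.

```python
def count_true_entries(data_entries):
    """Count the (compound, feature) pairs a sparse dataset has to store."""
    return sum(len(features) for features in data_entries.values())


def estimated_entries(n_compounds, n_features, mean_occurrence):
    """Rough entry count if each feature occurs in the given fraction of compounds."""
    return round(n_compounds * n_features * mean_occurrence)


# The three-compound example quoted above.
data_entries = {
    "compound1": {"featureX": True, "featureZ": True},
    "compound2": {"featureY": True, "featureX": True},
    "compound3": {"featureZ": True},
}
example_count = count_true_entries(data_entries)

# Hypothetical sizes, chosen only to show how the count scales.
large_count = estimated_entries(n_compounds=1000, n_features=5000,
                                mean_occurrence=0.2)
print(example_count, large_count)
```

The entry count grows with the product of compounds and features, so
a few frequent substructures already dominate the dataset size.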

For our use case we do not really need this, as we are fingerprinting
each compound with the occurrence of the mined substructures anyway.
Furthermore, the present representation cannot be called a
fingerprint (of the compounds) with respect to the substructures, as we
would then have to fit in the "FALSE" occurrences as well (the
features which do not occur would have to be mentioned with a value of
false). Therefore this representation does not serve the fingerprint
functionality either, at least not without additional processing.
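
The "additional processing" could look roughly like the following
Python sketch, which fills in the missing FALSE values to get one
fixed-length fingerprint per compound. The function name and the
dict-based layout are assumptions for illustration, not an existing
OpenTox call.

```python
def to_fingerprints(compounds, features, data_entries):
    """Expand sparse 'true only' entries into dense boolean fingerprints.

    The order of `features` fixes the bit positions of every fingerprint.
    """
    return {
        compound: [bool(data_entries.get(compound, {}).get(feature, False))
                   for feature in features]
        for compound in compounds
    }


compounds = ["compound1", "compound2", "compound3"]
features = ["featureX", "featureY", "featureZ"]
data_entries = {
    "compound1": {"featureX": True, "featureZ": True},
    "compound2": {"featureY": True, "featureX": True},
    "compound3": {"featureZ": True},
}

fingerprints = to_fingerprints(compounds, features, data_entries)
for compound, bits in fingerprints.items():
    print(compound, bits)
```

Every absent feature becomes an explicit False here, which is exactly
the information the dataset representation leaves out.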

I still suggest having a FeatureSet/SubstructureSet type of object
within the API, to make it convenient to group features without
compound representations.
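
As a rough idea of what I mean, here is a Python sketch of such an
object. The class names, field names and methods are hypothetical
(nothing like this exists in the API yet); the field values are taken
from the feature list quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Feature:
    """One mined substructure with its metadata (hypothetical layout)."""
    smarts: str
    p_value: Optional[float] = None
    effect: Optional[str] = None
    source: Optional[str] = None


@dataclass
class FeatureSet:
    """A named group of features that carries no compound entries."""
    features: Dict[str, Feature] = field(default_factory=dict)

    def add(self, name: str, feature: Feature) -> None:
        self.features[name] = feature

    def smarts(self) -> List[str]:
        """SMARTS strings in insertion order."""
        return [f.smarts for f in self.features.values()]


bbrc = "http://webservices.in-silico.ch/algorithm/fminer/bbrc"
feature_set = FeatureSet()
feature_set.add("featureX", Feature("cN", 0.97, "activating", bbrc))
feature_set.add("featureY", Feature("ccc", 0.96, "deactivating", bbrc))
feature_set.add("featureZ", Feature("N(O)=O", 0.99, "activating", bbrc))
print(feature_set.smarts())
```

The set can be passed around and stored on its own; the compound
occurrences only come in when somebody actually asks for them.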

>> Also I have a question about mutually common relationships like MCSS.
>> MCSS is common to both compounds (being compared). So in your
>> representation would it be necessary to represent the relationship
>> twice ? That is once for each compound - or can it be represented just
>> once and be associated with both compounds ?
>
> I would do it like this:
>
> compounds:
>        - compound1
>        - compound2
>
> data_entries:
>        - compound1:
>                mcss_feature: true
>        - compound2:
>                mcss_feature: true
>
> features:
>        mcss_feature:
>                ot:smarts: c1cccc1(CC)
>                ot:hasSource: your_mcss_service_uri
>

Does this imply that the dataset will be locked? Without locking the
dataset onto the two compounds (whose MCSS is being represented),
this representation will not work, as it does not show the three-way
relationship. An MCSS can have a SMARTS string as its value and
"occur" in a compound, but it also needs a third entry: the second
compound it is being compared against. The above representation can
only "imply" this relationship if the dataset is locked on the two
compounds, which essentially brings us back to the original premise of
assigning such "relationship" features to locked datasets.

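The ambiguity can be shown with a small Python sketch. The helper
name and the third compound are assumptions for illustration; the
point is only that true/false entries alone cannot say which pair an
MCSS belongs to once the dataset is not locked.

```python
from itertools import combinations


def candidate_pairs(data_entries, feature):
    """All compound pairs that could be the pair the MCSS was computed for."""
    holders = sorted(compound for compound, feats in data_entries.items()
                     if feats.get(feature))
    return list(combinations(holders, 2))


# Dataset locked on the two compared compounds: the pair is implied.
locked = {
    "compound1": {"mcss_feature": True},
    "compound2": {"mcss_feature": True},
}
locked_pairs = candidate_pairs(locked, "mcss_feature")

# Hypothetical third compound that also contains the same substructure.
unlocked = dict(locked, compound3={"mcss_feature": True})
unlocked_pairs = candidate_pairs(unlocked, "mcss_feature")

print(locked_pairs)
print(unlocked_pairs)
```

With two compounds there is exactly one candidate pair; with three
there are three, so the pair has to be stored explicitly or the
dataset has to stay locked.
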
Regards
Surajit

> Best regards,
> Christoph