[OTDev] Datasets with Features for multi entity relationships ?

surajit ray mr.surajit.ray at gmail.com
Thu Nov 25 14:52:30 CET 2010


Another problem with grouping features using ontologies is that I cannot
use this method to then assign such feature sets to datasets, etc.

On 25 November 2010 19:19, surajit ray <mr.surajit.ray at gmail.com> wrote:
> Hi,
>
> On 25 November 2010 18:47, Christoph Helma <helma at in-silico.ch> wrote:
>> Surajit,
>>
>> Excerpts from surajit ray's message of Wed Nov 24 05:11:57 +0100 2010:
>>>
>>> For a large dataset, the number of substructures mined by a given
>>> algorithm may be large (in the range of thousands). Now according to this
>>> representation - a substructure which occurs in 80% of the compounds
>>> will have to be associated with 80% of the dataset - vastly increasing
>>> the size of the dataset representation. Iterating over all the
>>> substructures may yield a dataset of gigantic proportions.
>>
>> This type of representation (we are using it internally) has served well
>> for our datasets, which may also contain several (10-100) thousand
>> substructures for a few thousand compounds. I also do not think that
>> the representation is redundant:
>>        - each compound is represented once
>>        - each substructure is represented once
>>        - each association between compound and substructure is represented once
>> Please correct me, if I am missing something obvious.
>
> According to this representation, each dataEntry for a compound will
> have to contain all substructure features that were found in it.
> Therefore each dataEntry may have 1000-10000 feature/featureValue
> pairs. For 500 dataEntries that means, on average,
> 500*5000 (assuming 5000 substructures) = 2,500,000 feature/featureValue
> pairs. That's 2.5 million, versus just having a featureset with
> 5000 feature entries. You can imagine the difference in the cost of
> bandwidth, computation, etc.
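To make the count above concrete, here is a minimal sketch in Python. The figures 500 and 5000 are the ones used in the paragraph; the function names and the 20% occurrence rate are assumptions for illustration only, not part of the OpenTox API.

```python
def pairs_if_all_occur(n_entries: int, n_features: int) -> int:
    """Feature/featureValue pairs if every entry lists every substructure."""
    return n_entries * n_features


def pairs_if_some_occur(n_entries: int, n_features: int, percent: int) -> int:
    """Pairs if each substructure occurs in `percent` % of the entries
    and only the occurrences are stored."""
    return n_entries * n_features * percent // 100


n_entries, n_features = 500, 5000

print(pairs_if_all_occur(n_entries, n_features))       # 2500000
print(pairs_if_some_occur(n_entries, n_features, 20))  # 500000 (assumed rate)
print(n_features)                                      # 5000 entries in a bare feature set
```

The 2,500,000 figure is therefore an upper bound; the real number depends on how often each substructure actually occurs, but it stays far above the 5000 entries of a bare feature set.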
>
>> It can be problematic to serialize such datasets to OWL-DL (our
>> benchmarks showed that building the RDF graph is the main bottleneck),
>> but this is more a RDF/OWL problem than a problem with the basic dataset
>> structure. Omitting RDF libraries (and thus building RDF graphs) and
>> serializing directly to strings leads (so far) to impressive performance
>> gains.
>
> Well, RDF is already the set standard, and unless we are changing that,
> we will have to design improvements around the fact that an RDF
> representation is required.
>
>>> For our use case we do not really need this, as we are anyway
>>> fingerprinting each compound with the occurrence of the substructures
>>> mined. Furthermore, the present representation cannot be called a
>>> fingerprint (of the compounds) with respect to the substructures, as we
>>> would then have to fit in the "FALSE" occurrences as well (the
>>> features which do not occur would have to be mentioned with a value of
>>> false). Therefore this representation does not serve the fingerprint
>>> functionality either, without additional processing.
>>
>> Adding "false" occurrences would not violate the current API (but would
>> add redundant information). Keep in mind that the dataset representation
>> is mainly for exchanging datasets between services - internally you can
>> use any datastructure that is efficient for your purposes (we also do
>> that in our services). So if you need fingerprints internally, extract
>> them from the dataset.
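A minimal sketch of what "extract them from the dataset" could look like, in Python. The compound and substructure names are hypothetical, and the dictionary layout only mimics the data_entries sketches in this thread; it is not an API definition.

```python
from typing import Dict, List

# Hypothetical sparse dataset: only substructures that occur are listed.
data_entries: Dict[str, Dict[str, bool]] = {
    "compound1": {"sub1": True, "sub3": True},
    "compound2": {"sub2": True},
}
features: List[str] = ["sub1", "sub2", "sub3"]


def to_fingerprints(
    entries: Dict[str, Dict[str, bool]], feature_order: List[str]
) -> Dict[str, List[bool]]:
    """Expand sparse entries into fixed-length bit vectors.

    A feature missing from a compound's entry is treated as False,
    so the "FALSE" occurrences never have to be stored or exchanged.
    """
    return {
        compound: [bool(present.get(f, False)) for f in feature_order]
        for compound, present in entries.items()
    }


fingerprints = to_fingerprints(data_entries, features)
print(fingerprints["compound1"])  # [True, False, True]
print(fingerprints["compound2"])  # [False, True, False]
```

The fixed feature order is what turns the sparse entries into a fingerprint; it has to be agreed on by whoever consumes the vectors.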
>
> Internalizing an intermediate step completely serves the purpose but
> leads to a less flexible design. If we internalize the
> workflow from substructure extraction to fingerprinting, we will lose
> the ability to provide the data to a third-party server for an
> independent workflow. Of course the reasoning could be "who needs
> it?" Well, you never know!
>
>
>>> I still suggest having a FeatureSet/SubstructureSet type object within
>>> the API to make it convenient to club features without compound
>>> representations.
>>
>> I prefer to keep the API as generic as possible and not to introduce
>> ad-hoc objects (or optimizations) for special purposes - otherwise it
>> will be difficult to maintain services in the long term. Why don't you
>> use ontologies for grouping features?
>
> Grouping features using ontologies clubs the features, not the
> feature values.
>
>>> >> Also I have a question about mutually common relationships like MCSS.
>>> >> MCSS is common to both compounds (being compared). So in your
>>> >> representation would it be necessary to represent the relationship
>>> >> twice ? That is once for each compound - or can it be represented just
>>> >> once and be associated with both compounds ?
>>
>> You can of course put arbitrary data into the features representation, like:
>>
>> mcss_feature:
>>        ot:compounds:
>>                - compound1
>>                - compound2
>>        ot:smarts: c1cccc1(CC)
>>        ot:hasSource: your_mcss_service_uri
>>
>> But as a client I would expect to find the association between compounds
>> and features in the data_entries.
>
> Exactly
>
>>> Does this imply that the dataset will be locked? Without locking the
>>> dataset onto the two compounds (whose MCSS is being represented) -
>>> this representation will not work as it is not showing the three way
>>> relationship. MCSS can have a value of a smarts string and "occur" in
>>> a compound. But MCSS has to have a third entry - which is the second
>>> compound being compared to. The above representation can "imply" this
>>> relationship if the Dataset is locked on the two compounds. Which
>>> essentially brings us back to the original premise of assigning such
>>> "relationship" features to locked datasets.
>>
>> What do you mean by locked? You can of course represent multiple MCSSs in a single dataset:
>>
>> compounds:
>>        - compound1
>>        - compound2
>>        - compound3
>>
>> data_entries:
>>        - compound1:
>>                mcss1: true
>>                mcss2: true
>>        - compound2:
>>                mcss1: true
>>                mcss3: true
>>        - compound3:
>>                mcss2: true
>>                mcss3: true
>>
>> features:
>>        mcss1:
>>                ot:smarts: smarts1
>>        mcss2:
>>                ot:smarts: smarts2
>>        mcss3:
>>                ot:smarts: smarts3
>
> So how do we know which compound mcss3, occurring in compound X, is
> relative to? As you said, we can have arbitrary fields in the feature
> definitions (for MCSS), but that would be outside the API definitions.
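The ambiguity can be shown with a small Python sketch built on the three-compound example above. The dictionary layout and the helper name are assumptions for illustration; the extra mcss3 match on compound1 is a made-up case.

```python
from collections import defaultdict

data_entries = {
    "compound1": {"mcss1": True, "mcss2": True},
    "compound2": {"mcss1": True, "mcss3": True},
    "compound3": {"mcss2": True, "mcss3": True},
}


def compounds_per_feature(entries):
    """Invert compound -> features into feature -> sorted list of compounds."""
    inverted = defaultdict(list)
    for compound, feats in entries.items():
        for feature, present in feats.items():
            if present:
                inverted[feature].append(compound)
    return {feature: sorted(compounds) for feature, compounds in inverted.items()}


# With exactly two compounds per MCSS, the pair can be read off the entries.
print(compounds_per_feature(data_entries)["mcss3"])
# ['compound2', 'compound3']

# If mcss3 also matches compound1 as a plain substructure, the pair it was
# computed for can no longer be recovered from the data entries alone.
data_entries["compound1"]["mcss3"] = True
print(compounds_per_feature(data_entries)["mcss3"])
# ['compound1', 'compound2', 'compound3']
```

So the pair is only implied as long as each MCSS feature is recorded for exactly the two compounds it was derived from.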
>
> Regards
> Surajit
>
>> Best regards,
>> Christoph
>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development
>>
>



-- 
Surajit Ray
Partner
www.rareindianart.com


