[OTDev] Datasets with Features for multi entity relationships ?

Thu Nov 25 14:49:19 CET 2010

Hi,

On 25 November 2010 18:47, Christoph Helma <helma at in-silico.ch> wrote:
> Surajit,
>
> Excerpts from surajit ray's message of Wed Nov 24 05:11:57 +0100 2010:
>>
>> For a large dataset, the number of substructures mined by a given
>> algorithm may be large (in the range of thousands). Now according to
>> this representation, a substructure which occurs in 80% of the
>> compounds will have to be associated with 80% of the dataset, vastly
>> increasing the size of the dataset representation. Iterating over all
>> the substructures may yield a dataset of gigantic proportions.
>
> This type of representation (we are using it internally) has served well
> for our datasets, which may also contain several (10-100) thousand
> substructures for a few thousand compounds. I also do not think that
> the representation is redundant:
>        - each compound is represented once
>        - each substructure is represented once
>        - each association between compound and substructure is represented once
> Please correct me, if I am missing something obvious.

According to this representation, each dataEntry for a compound must
list every substructure feature found in that compound. Each dataEntry
may therefore hold 1000-10000 feature/featureValue pairs. For 500
dataEntries, assuming an average of 5000 substructures per dataEntry,
that means 500 * 5000 = 2,500,000 feature/featureValue pairs. That is
2.5 million, versus a feature set with just 5000 feature entries. You
can imagine the difference in the cost of bandwidth, computation, etc.
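
To make the arithmetic concrete, here is a rough Python sketch of the two
counts. The numbers are the ones used above; the function names are made up
for illustration, and the average of 5000 substructures per dataEntry is an
assumption, not a measurement.

```python
def data_entry_pairs(n_compounds, avg_features_per_compound):
    """Feature/featureValue pairs when each dataEntry lists its own features."""
    return n_compounds * avg_features_per_compound


def feature_set_entries(n_features):
    """Entries needed when the features are listed once, without compounds."""
    return n_features


pairs = data_entry_pairs(500, 5000)
entries = feature_set_entries(5000)

# Total pairs, feature set size, and the ratio between the two.
print(pairs, entries, pairs // entries)  # 2500000 5000 500
```

Under these assumptions the per-compound representation is 500 times larger
than the bare feature set, i.e. it grows with the number of compounds.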

> It can be problematic to serialize such datasets to OWL-DL (our
> benchmarks showed that building the RDF graph is the main bottleneck),
> but this is more a RDF/OWL problem than a problem with the basic dataset
> structure. Omitting RDF libraries (and thus building RDF graphs) and
> serializing directly to strings leads (so far) to impressive performance
> gains.

Well, RDF is already the agreed standard, and unless we change that,
we will have to design improvements around the fact that the RDF
representation is a necessity.

>> For our use case we do not really need this, as we are anyway
>> fingerprinting each compound with the occurrence of the substructures
>> mined. Furthermore, the present representation cannot be called a
>> fingerprint (of the compounds) with respect to the substructures, as we
>> would then have to fit in the "FALSE" occurrences as well (the
>> features which do not occur would have to be mentioned with a value
>> of false). Therefore this representation does not serve the fingerprint
>> functionality either, without additional processing.
>
> Adding "false" occurrences would not violate the current API (but would
> add redundant information). Keep in mind that the dataset representation
> is mainly for exchanging datasets between services - internally you can
> use any datastructure that is efficient for your purposes (we also do
> that in our services). So if you need fingerprints internally, extract
> them from the dataset.

Internalizing an intermediate step completely serves the purpose, but
leads to a less flexible design. If we internalize the workflow from
substructure extraction to fingerprinting, we lose the ability to
provide the data to a third-party server for an independent workflow.
Of course the response could be "who needs it?", but you never know!
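
For reference, the extraction step Christoph describes (building a
fingerprint from data entries that list only the occurring substructures)
could be sketched roughly as below. The dictionary layout and all names are
assumptions for illustration, not part of the OpenTox API.

```python
def fingerprint(data_entry, all_features):
    """Return one boolean per feature, in the order of all_features.

    data_entry maps feature names to values and lists only the features
    that occur; anything missing is treated as False.
    """
    return [bool(data_entry.get(f, False)) for f in all_features]


# Hypothetical data: only occurring substructures are stored.
features = ["sub1", "sub2", "sub3", "sub4"]
data_entries = {
    "compound1": {"sub1": True, "sub3": True},
    "compound2": {"sub2": True},
}

fingerprints = {c: fingerprint(e, features) for c, e in data_entries.items()}
print(fingerprints["compound1"])  # [True, False, True, False]
```

Note that this needs the complete feature list in addition to the data
entries, which is the "additional processing" mentioned earlier.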

>> I still suggest having a FeatureSet/SubstructureSet type object within
>> the API to make it convenient to club features without compound
>> representations.
>
> I prefer to keep the API as generic as possible and not to introduce
> ad-hoc objects (or optimizations) for special purposes - otherwise it
> will be difficult to maintain services in the long term. Why don't you
> use ontologies for grouping features?

Grouping features using ontologies clubs the features, not the
feature values.

>> >> Also I have a question about mutually common relationships like MCSS.
>> >> MCSS is common to both compounds (being compared). So in your
>> >> representation would it be necessary to represent the relationship
>> >> twice ? That is once for each compound - or can it be represented just
>> >> once and be associated with both compounds ?
>
> You can of course put arbitrary data into the features representation, like:
>
> mcss_feature:
>        ot:compounds:
>                - compound1
>                - compound2
>        ot:smarts: c1cccc1(CC)
>        ot:hasSource: your_mcss_service_uri
>
> But as a client I would expect to find the association between compounds
> and features in the data_entries.

Exactly

>> Does this imply that the dataset will be locked? Without locking the
>> dataset onto the two compounds (whose MCSS is being represented) -
>> this representation will not work as it is not showing the three way
>> relationship. MCSS can have a value of a smarts string and "occur" in
>> a compound. But MCSS has to have a third entry - which is the second
>> compound being compared to. The above representation can "imply" this
>> relationship if the Dataset is locked on the two compounds. Which
>> essentially brings us back to the original premise of assigning such
>> "relationship" features to locked datasets.
>
> What do you mean by locked? You can of course represent multiple MCSSs in a single dataset:
>
> compounds:
>        - compound1
>        - compound2
>        - compound3
>
> data_entries:
>        - compound1:
>                mcss1: true
>                mcss2: true
>        - compound2:
>                mcss1: true
>                mcss3: true
>        - compound3:
>                mcss2: true
>                mcss3: true
>
> features:
>        mcss1:
>                ot:smarts: smarts1
>        mcss2:
>                ot:smarts: smarts2
>        mcss3:
>                ot:smarts: smarts3

So how do we know which other compound mcss3, occurring in compound X,
refers to? As you said, we can have arbitrary fields in the feature
definitions (for MCSS), but that would be outside the API definitions.
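
To illustrate the question, here is a small Python sketch that inverts the
data_entries from the example above to find the compounds attached to each
MCSS feature. The dictionary layout and the function name are assumptions
for illustration only. The pair can be recovered only as long as each MCSS
is recorded for exactly the two compounds it was computed from.

```python
from collections import defaultdict

# Data entries copied from the example above.
data_entries = {
    "compound1": {"mcss1": True, "mcss2": True},
    "compound2": {"mcss1": True, "mcss3": True},
    "compound3": {"mcss2": True, "mcss3": True},
}


def compounds_per_feature(entries):
    """Invert compound -> features into feature -> sorted list of compounds."""
    index = defaultdict(list)
    for compound, values in entries.items():
        for feature, present in values.items():
            if present:
                index[feature].append(compound)
    return {f: sorted(cs) for f, cs in index.items()}


index = compounds_per_feature(data_entries)
print(index["mcss3"])  # ['compound2', 'compound3']

# If mcss3 also happened to occur in compound1 and was recorded there,
# the pair could no longer be recovered from the data entries alone.
data_entries["compound1"]["mcss3"] = True
print(compounds_per_feature(data_entries)["mcss3"])
# ['compound1', 'compound2', 'compound3']
```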

Regards
Surajit

> Best regards,
> Christoph

-- 
Surajit Ray
Partner
www.rareindianart.com