[OTDev] Datasets with Features for multi entity relationships ?
surajit ray mr.surajit.ray at gmail.com
Thu Nov 25 14:52:30 CET 2010
Another problem with grouping features using ontologies is that I cannot
use this method to further assign such feature sets to datasets etc.

On 25 November 2010 19:19, surajit ray <mr.surajit.ray at gmail.com> wrote:
> Hi,
>
> On 25 November 2010 18:47, Christoph Helma <helma at in-silico.ch> wrote:
>> Surajit,
>>
>> Excerpts from surajit ray's message of Wed Nov 24 05:11:57 +0100 2010:
>>>
>>> For a large dataset, the number of substructures mined by a given
>>> algorithm may be large (in the range of thousands). Now according to this
>>> representation - a substructure which occurs in 80% of the compounds
>>> will have to be associated with 80% of the dataset - vastly increasing
>>> the size of the dataset representation. Iterating over all the
>>> substructures may yield a dataset of gigantic proportions.
>>
>> This type of representation (we are using it internally) has served well
>> for our datasets which might contain also several (10-100) thousand
>> substructures for a few thousand compounds. I also do not think that
>> the representation is redundant:
>> - each compound is represented once
>> - each substructure is represented once
>> - each association between compound and substructure is represented once
>> Please correct me, if I am missing something obvious.
>
> According to this representation each dataEntry for a compound will
> have to have all substructure features that were found in it.
> Therefore each dataEntry may have 1000-10000 feature/featureValue
> pairs. For 500 data entries that means on average
> 500*5000 (assuming 5000 substructures) = 2,500,000 feature/featureValue
> pairs - that's 2.5 million! versus just having a feature set with
> 5000 feature entries. You can imagine the difference in cost of
> bandwidth, computation etc.
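The arithmetic in the reply above can be checked with a short sketch. It uses the figures from the message (500 data entries, 5000 substructures) and treats 2,500,000 as the upper bound where every substructure occurs in every compound. The 80% occurrence rate is borrowed from the earlier quoted message purely as an illustration, and the function name is hypothetical.

```python
def association_count(n_compounds: int, n_features: int,
                      occurrence_rate: float = 1.0) -> int:
    """Number of feature/featureValue pairs stored across all data entries."""
    return round(n_compounds * n_features * occurrence_rate)


n_compounds = 500
n_features = 5000

# Upper bound: every substructure is recorded in every data entry.
dense_pairs = association_count(n_compounds, n_features)

# Illustrative assumption: each substructure occurs in 80% of the compounds.
pairs_at_80 = association_count(n_compounds, n_features, 0.8)

# A bare feature set lists each substructure once, with no compound links.
feature_set_entries = n_features

print(dense_pairs, pairs_at_80, feature_set_entries)
```

The two representations are not equivalent: the feature set alone drops the compound/substructure associations, which is the point Christoph raises about what a client expects to find in the data entries.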
>
>> It can be problematic to serialize such datasets to OWL-DL (our
>> benchmarks showed that building the RDF graph is the main bottleneck),
>> but this is more an RDF/OWL problem than a problem with the basic dataset
>> structure. Omitting RDF libraries (and thus building RDF graphs) and
>> serializing directly to strings leads (so far) to impressive performance
>> gains.
>
> Well, RDF is the standard set already, and unless we are changing that
> - we will have to design improvements around the fact that RDF
> representation is a necessary phenomenon.
>
>>> For our use case we do not really need this as we are anyway
>>> fingerprinting each compound with the occurrence of the substructures
>>> mined. Furthermore the present representation cannot be called a
>>> fingerprint (of the compounds) with respect to the substructures as we
>>> would then have to fit in the "FALSE" occurrences as well (the
>>> features which do not occur would have to be mentioned with a value
>>> false). Therefore this representation is not serving the fingerprint
>>> functionality as well, without additional processing.
>>
>> Adding "false" occurrences would not violate the current API (but would
>> add redundant information). Keep in mind that the dataset representation
>> is mainly for exchanging datasets between services - internally you can
>> use any datastructure that is efficient for your purposes (we also do
>> that in our services). So if you need fingerprints internally, extract
>> them from the dataset.
>
> Internalizing an intermediate step completely serves the purpose but
> leads to less flexible design paradigms. If we internalize the
> workflow from substructure extraction to fingerprinting - we will lose
> the ability to provide the data to a third party server for an
> independent workflow. Of course the reasoning could be "who needs it
> ?" - well you never know !!
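The suggestion quoted above, to extract fingerprints from the dataset, could look roughly like the sketch below. It assumes the dataset has already been parsed into a plain Python dict that maps each compound to the set of features recorded as true. That in-memory layout, the toy feature names and the function name are all assumptions for illustration and not part of the OpenTox API.

```python
def to_fingerprints(data_entries, features):
    """Expand sparse 'true' occurrences into fixed-length 0/1 vectors.

    Features missing from a compound's entry become explicit 0s, i.e. the
    "FALSE" occurrences that the exchange format leaves implicit.
    """
    ordered = sorted(features)  # fixed column order shared by all compounds
    return {
        compound: [1 if feature in present else 0 for feature in ordered]
        for compound, present in data_entries.items()
    }


# Hypothetical toy data: three substructure features, two compounds.
features = {"sub1", "sub2", "sub3"}
data_entries = {
    "compound1": {"sub1", "sub3"},
    "compound2": {"sub2"},
}

fingerprints = to_fingerprints(data_entries, features)
print(fingerprints)
```

The sparse form stays on the wire and the dense form exists only inside the consuming service, which is the split between exchange format and internal data structure that Christoph describes.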
>
>>> I still suggest having a FeatureSet/SubstructureSet type object within
>>> the API to make it convenient to club features without compound
>>> representations.
>>
>> I prefer to keep the API as generic as possible and not to introduce
>> ad-hoc objects (or optimizations) for special purposes - otherwise it
>> will be difficult to maintain services in the long term. Why don't you
>> use ontologies for grouping features?
>
> Grouping features using ontologies is clubbing the features, not the
> feature values.
>
>>> >> Also I have a question about mutually common relationships like MCSS.
>>> >> MCSS is common to both compounds (being compared). So in your
>>> >> representation would it be necessary to represent the relationship
>>> >> twice ? That is once for each compound - or can it be represented just
>>> >> once and be associated with both compounds ?
>>
>> You can of course put arbitrary data into the features representation, like:
>>
>> mcss_feature:
>>   ot:compounds:
>>     - compound1
>>     - compound2
>>   ot:smarts: c1cccc1(CC)
>>   ot:hasSource: your_mcss_service_uri
>>
>> But as a client I would expect to find the association between compounds
>> and features in the data_entries.
>
> Exactly
>
>>> Does this imply that the dataset will be locked? Without locking the
>>> dataset onto the two compounds (whose MCSS is being represented) -
>>> this representation will not work as it is not showing the three way
>>> relationship. MCSS can have a value of a smarts string and "occur" in
>>> a compound. But MCSS has to have a third entry - which is the second
>>> compound being compared to. The above representation can "imply" this
>>> relationship if the Dataset is locked on the two compounds. Which
>>> essentially brings us back to the original premise of assigning such
>>> "relationship" features to locked datasets.
>>
>> What do you mean by locked?
>> You can of course represent multiple MCSSs in a single dataset:
>>
>> compounds:
>>   - compound1
>>   - compound2
>>   - compound3
>>
>> data_entries:
>>   - compound1:
>>       mcss1: true
>>       mcss2: true
>>   - compound2:
>>       mcss1: true
>>       mcss3: true
>>   - compound3:
>>       mcss2: true
>>       mcss3: true
>>
>> features:
>>   mcss1:
>>     ot:smarts: smarts1
>>   mcss2:
>>     ot:smarts: smarts2
>>   mcss3:
>>     ot:smarts: smarts3
>
> So how do we know which compound mcss3 occurring in compound X is
> with respect to? As you said we can have arbitrary fields in the feature
> definitions (for MCSS) - but that would be outside API definitions.
>
> Regards
> Surajit
>
>> Best regards,
>> Christoph
>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development

--
Surajit Ray
Partner
www.rareindianart.com
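The closing question can be explored against the three-compound example quoted in the message. The sketch below inverts the data entries to list the compounds in which each MCSS feature is true. In this toy case every feature maps to exactly two compounds, so the pair is recoverable. That holds only if an MCSS feature is recorded solely for the pair it was computed from, which is an assumption the thread does not settle. The dict layout and function name are hypothetical.

```python
from collections import defaultdict

# The example dataset from the message, as a plain dict (assumed layout).
data_entries = {
    "compound1": {"mcss1": True, "mcss2": True},
    "compound2": {"mcss1": True, "mcss3": True},
    "compound3": {"mcss2": True, "mcss3": True},
}


def compounds_per_feature(entries):
    """Invert compound -> features into feature -> sorted list of compounds."""
    index = defaultdict(list)
    for compound, values in entries.items():
        for feature, present in values.items():
            if present:
                index[feature].append(compound)
    return {feature: sorted(found) for feature, found in index.items()}


pairs = compounds_per_feature(data_entries)
print(pairs["mcss3"])
```

If the same SMARTS also matched a third compound as an ordinary substructure, the inverted index would return three compounds and the "with respect to which compound" information would be lost, which is the ambiguity raised in the reply.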