[OTDev] Datasets with Features for multi entity relationships ?
surajit ray mr.surajit.ray at gmail.com
Thu Nov 25 14:49:19 CET 2010
Hi,

On 25 November 2010 18:47, Christoph Helma <helma at in-silico.ch> wrote:
> Surajit,
>
> Excerpts from surajit ray's message of Wed Nov 24 05:11:57 +0100 2010:
>>
>> For a large dataset, the number of substructures mined by a given
>> algorithm may be large (in the range of thousands). Now according to this
>> representation - a substructure which occurs in 80% of the compounds
>> will have to be associated with 80% of the dataset - vastly increasing
>> the size of the dataset representation. Iterating over all the
>> substructures may yield a dataset of gigantic proportions.
>
> This type of representation (we are using it internally) has served well
> for our datasets which might contain also several (10-100) thousand
> substructures for a few thousand compounds. I also do not think that
> the representation is redundant:
> - each compound is represented once
> - each substructure is represented once
> - each association between compound and substructure is represented once
> Please correct me, if I am missing something obvious.

According to this representation, each dataEntry for a compound will have
to carry all the substructure features that were found in that compound.
Therefore each dataEntry may have 1000-10000 feature/featureValue pairs.
For 500 data entries that means on average 500*5000 (assuming 5000
substructures) = 2,500,000 feature/featureValue pairs - that's 2.5 million!
Compare that with just having a feature set with 5000 feature entries. You
can imagine the difference in the cost of bandwidth, computation etc.

> It can be problematic to serialize such datasets to OWL-DL (our
> benchmarks showed that building the RDF graph is the main bottleneck),
> but this is more a RDF/OWL problem than a problem with the basic dataset
> structure. Omitting RDF libraries (and thus building RDF graphs) and
> serializing directly to strings leads (so far) to impressive performance
> gains.

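As a rough check on the numbers in the reply above (500 data entries, 5000 substructures), here is a minimal Python sketch of the pair counts under three representations. The function name and the 10% occurrence rate are assumptions made purely for illustration; neither comes from the thread nor from the OpenTox API.

```python
def pair_counts(n_compounds, n_substructures, occurrence_percent):
    """Count feature/featureValue pairs under three representations."""
    # Full fingerprint: every compound lists every substructure (true or false).
    dense = n_compounds * n_substructures
    # Only associations that actually occur are listed in the data entries.
    sparse = dense * occurrence_percent // 100
    # A bare feature set with no compound associations at all.
    feature_set_only = n_substructures
    return {"dense": dense, "sparse": sparse, "feature_set_only": feature_set_only}


# 500 compounds and 5000 substructures as in the message;
# the 10% occurrence rate is an assumed value.
counts = pair_counts(500, 5000, occurrence_percent=10)
print(counts)
```

With these inputs the dense count is the 2,500,000 pairs quoted in the message. The sparse count depends entirely on how often substructures really occur, which is the point both sides are arguing about.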
Well, RDF is the standard that has already been set, and unless we are
changing that we will have to design improvements around the fact that an
RDF representation is a necessity.

>> For our use case we do not really need this as we are anyway
>> fingerprinting each compound with the occurrence of the substructures
>> mined. Furthermore the present representation cannot be called a
>> fingerprint (of the compounds) with respect to the substructures as we
>> would then have to fit in the "FALSE" occurrences as well ( the
>> features which do not occur would have to be mentioned with a value
>> false). Therefore this representation is not serving the fingerprint
>> functionality as well, without additional processing.
>
> Adding "false" occurrences would not violate the current API (but would
> add redundant information). Keep in mind that the dataset representation
> is mainly for exchanging datasets between services - internally you can
> use any datastructure that is efficient for your purposes (we also do
> that in our services). So if you need fingerprints internally, extract
> them from the dataset.

Internalizing an intermediate step completely serves the purpose but leads
to a less flexible design. If we internalize the workflow from substructure
extraction to fingerprinting, we will lose the ability to provide the data
to a third party server for an independent workflow. Of course the
reasoning could be "who needs it ?" - well you never know !!

>> I still suggest having a FeatureSet/SubstructureSet type object within
>> the API to make it convenient to club features without compound
>> representations.
>
> I prefer to keep the API as generic as possible and not to introduce
> ad-hoc objects (or optimizations) for special purposes - otherwise it
> will be difficult to maintain services in the long term. Why don't you
> use ontologies for grouping features?

Grouping features using ontologies clubs the features, not the feature
values.

>> >> Also I have a question about mutually common relationships like MCSS.
>> >> MCSS is common to both compounds (being compared). So in your
>> >> representation would it be necessary to represent the relationship
>> >> twice ? That is once for each compound - or can it be represented just
>> >> once and be associated with both compounds ?
>
> You can of course put arbitrary data into the features representation, like:
>
> mcss_feature:
>   ot:compounds:
>     - compound1
>     - compound2
>   ot:smarts: c1cccc1(CC)
>   ot:hasSource: your_mcss_service_uri
>
> But as a client I would expect to find the association between compounds
> and features in the data_entries.

Exactly.

>> Does this imply that the dataset will be locked. Without locking the
>> dataset onto the two compounds (whose MCSS is being represented) -
>> this representation will not work as it is not showing the three way
>> relationship. MCSS can have a value of a smarts string and "occur" in
>> a compound. But MCSS has to have a third entry - which is the second
>> compound being compared to. The above representation can "imply" this
>> relationship if the Dataset is locked on the two compounds. Which
>> essentially brings us back to the original premise of assigning such
>> "relationship" features to locked datasets.
>
> What do you mean by locked? You can of course represent multiple MCSSs in a single dataset:
>
> compounds:
>   - compound1
>   - compound2
>   - compound3
>
> data_entries:
>   - compound1:
>       mcss1: true
>       mcss2: true
>   - compound2:
>       mcss1: true
>       mcss3: true
>   - compound3:
>       mcss2: true
>       mcss3: true
>
> features:
>   mcss1:
>     ot:smarts: smarts1
>   mcss2:
>     ot:smarts: smarts2
>   mcss3:
>     ot:smarts: smarts3

So how do we know which other compound mcss3, occurring in compound X, was
computed with respect to? As you said, we can have arbitrary fields in the
feature definitions (for MCSS) - but that would be outside the API
definitions.

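To make the three-way relationship question concrete, here is a small Python sketch. It first recovers the partner compound from data entries shaped like the quoted example, then shows how that inference becomes ambiguous once another compound also lists the same feature. The `partners` and `mcss_of` helpers, the `compound4` entry and the pair-keyed table are hypothetical illustrations, not part of the OpenTox API.

```python
# Data entries shaped like the quoted example: compound -> {feature: True}
data_entries = {
    "compound1": {"mcss1": True, "mcss2": True},
    "compound2": {"mcss1": True, "mcss3": True},
    "compound3": {"mcss2": True, "mcss3": True},
}


def partners(feature, compound, entries):
    """Other compounds whose data entry also lists `feature` as true."""
    return sorted(
        name
        for name, features in entries.items()
        if name != compound and features.get(feature)
    )


# With exactly two compounds per MCSS feature the partner can be inferred.
print(partners("mcss3", "compound2", data_entries))

# Hypothetical fourth compound in which the mcss3 substructure merely occurs.
extended = dict(data_entries, compound4={"mcss3": True})
# Now the partner of compound2 for mcss3 is no longer unique.
print(partners("mcss3", "compound2", extended))

# One explicit alternative (an assumption, not an API proposal from the thread):
# key each MCSS by the unordered pair of compounds, so it is stored only once.
mcss_by_pair = {
    frozenset({"compound1", "compound2"}): "smarts1",
    frozenset({"compound1", "compound3"}): "smarts2",
    frozenset({"compound2", "compound3"}): "smarts3",
}


def mcss_of(a, b, table):
    """Look up the MCSS SMARTS for a pair, regardless of argument order."""
    return table[frozenset({a, b})]


print(mcss_of("compound3", "compound2", mcss_by_pair))
```

Keying on an unordered pair stores each MCSS once and makes both compounds explicit, at the cost of stepping outside the compound/feature/data_entries layout discussed in the thread.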
Regards
Surajit

> Best regards,
> Christoph
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>

--
Surajit Ray
Partner
www.rareindianart.com