[OTDev] Datasets with Features for multi entity relationships ?
surajit ray mr.surajit.ray at gmail.com
Tue Nov 30 10:00:36 CET 2010
- Previous message: [OTDev] Datasets with Features for multi entity relationships ?
- Next message: [OTDev] Ontology Development in support of Predictive Toxicology Use Cases & Services
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi,

On 30 November 2010 14:15, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> Hi,
>
> On 30 November 2010 10:29, surajit ray <mr.surajit.ray at gmail.com> wrote:
>
>> Hi,
>>
>> Another method that does not break the API as well as captures feature
>> sets is to create a dataset with one compound (maybe a C or CC) and
>> assign all the substructure features to it (with value as false or
>> true). In the dc:source of the dataset we can mention the dataset from
>> which it was derived. And in the description we can describe it as a
>> dataset to store MCSS features from dataset (or whatever the
>> relationship with the mother dataset).
>>
>> I think this would be a simpler method than creating a new Model just
>> for storing substructures.
>
> It might seem simpler, but is definitely less consistent, as it implies
> different meaning of dataset and properties and their relationships. There
> will be no explicit relationship to the algorithm/model doing the
> processing, which makes MCSS a specific case and breaks OpenTox API, where
> algorithms and models are the procedures that process data, and this is
> explicitly stored in the generated data objects.

The relationship is defined in dc:description of the "featureset". It is
explicit. Secondly, a reference to the algorithm which generated this can
also be stored in the description.

> With the current scheme, it is easy to handle algorithms like Kabsch
> alignment for a dataset with the same generic mechanism as for MCSS (I am
> sure there will be more cases like this). I don't see the point of inventing
> specific solution for a single case, while it could be handled in a generic
> way (agree with earlier comment by Christoph on that).

This is not a specific solution but a very general one - one which
addresses a basic need within any chemistry API - which is to represent
sets of features independently of compounds.
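To make the "featureset as a dummy dataset" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the dictionary layout, the function name make_featureset, the mcssN feature names and the example.org URIs are all hypothetical and not part of the OpenTox API; only dc:source, dc:description and ot:smarts are terms used in this thread.

```python
# Hypothetical in-memory layout, NOT an agreed OpenTox schema.
# A "featureset" is modelled as a dataset holding a single dummy compound
# that carries every substructure feature.

def make_featureset(source_dataset_uri, algorithm_uri, smarts_list,
                    dummy_compound="C"):
    """Wrap a list of substructure features in a one-compound dataset."""
    features = {
        f"mcss{i}": {"ot:smarts": smarts}
        for i, smarts in enumerate(smarts_list, start=1)
    }
    return {
        # Provenance: the mother dataset and the generating algorithm.
        "dc:source": source_dataset_uri,
        "dc:description": (
            f"MCSS features derived from {source_dataset_uri} "
            f"by {algorithm_uri}"
        ),
        "features": features,
        # One dummy data entry; every feature value is simply False here.
        "dataEntry": [
            {"compound": dummy_compound,
             "values": {name: False for name in features}}
        ],
    }


featureset = make_featureset(
    "http://example.org/dataset/1",        # placeholder URI
    "http://example.org/algorithm/mcss",   # placeholder URI
    ["c1ccccc1", "C(=O)O"],
)
print(sorted(featureset["features"]))
```

The point of the sketch is only that the list of features and its provenance travel together in an ordinary dataset, without any compound-to-feature associations.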
> Besides, the model is definitely not for just storing substructures, it can
> and will be used for predictions of new compounds (if they have those
> substructures) in a uniform way (POST a new compound to the MCSS model
> and you'll get if its MCSS substructures are one of existing ones, or it is
> different and far away from that dataset).

What if I have a better graph comparator algorithm for fingerprinting -
will that take a model as an input just to extract features?

>> The problem with the model approach is that
>> 1) The substructures cannot be easily downloaded without accessing the
>> model
>
> They can - /model/id/predicted give you the list of features (see my
> examples)

Well, of course a whole model infrastructure may provide a way to extract
the predicted feature set. But that would imply giving the model as an
input to a third-party fingerprinter.

> And also - this is exactly the advantage - you don't have just a set of
> substructures you don't know where they are coming from, but everything is
> explicitly defined - the substructures are result of applying given
> algorithm on given dataset.

We know that from the dc:source and dc:description.

>> 2) The set of substructures cannot be given to a better finger printer
>> (maybe with a faster graph comparator)
>
> Of course they can - once we have smarts representation of the
> ot:Substructure - what is the obstacle of feeding them into any other
> algorithm?

Again the question - are we going to use a model as an input to another
algorithm to extract features?

>> The fingerprinter in such a case becomes a separate algorithm which
>> can take a dataset as input as well as a "featureset" - which is
>> actually a dummy dataset with the full list of features.
>
> A fingerprinter should be indeed an algorithm - this is how OpenTox API is
> designed. Any processing should be instantiated as an algorithm.

In your case the fingerprinter is a model ....
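As a rough sketch of what "a fingerprinter that takes a dataset plus a featureset" could look like, consider the Python below. Everything in it is an assumption made for illustration: the function names are hypothetical, and naive_match is a plain substring test on the strings, used only as a stand-in for a real graph comparator (it is not substructure matching).

```python
from typing import Callable, Dict, List


def fingerprint(compounds: List[str],
                smarts_list: List[str],
                matches: Callable[[str, str], bool]) -> Dict[str, List[bool]]:
    """Return one bit vector per compound, one bit per feature.

    `matches` is the pluggable comparator: any third-party matcher with
    the signature (compound, pattern) -> bool can be swapped in.
    """
    return {c: [matches(c, s) for s in smarts_list] for c in compounds}


def naive_match(compound: str, pattern: str) -> bool:
    # Stand-in only: substring test on the text, NOT real SMARTS matching.
    return pattern in compound


fps = fingerprint(["CCO", "CCN"], ["CC", "O"], naive_match)
print(fps)  # {'CCO': [True, True], 'CCN': [True, False]}
```

The fingerprinter here needs only the compounds and the list of features; the comparator is an argument, so a faster one can replace it without touching whatever produced the features.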
Regards
Surajit

> Regards,
> Nina
>
>>
>> Cheers
>> Surajit
>>
>> On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
>> > Dear Christoph, Surajit, All,
>> >
>> > This discussion is very useful.
>> >
>> > As a result of myself trying to understand both points of view, now we have
>> > MCSS algorithm as ambit service (thanks to CDK SMSD package).
>> >
>> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
>> >
>> > It can be applied to a dataset and generates a model, where predicted
>> > features (MCSS in this case) are available via ot:predictedVariables
>> > (example https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted)
>> > The features use current API, without any change (although having
>> > ot:Substructure subclass of ot:Feature will make it more clear).
>> >
>> > All the MCSS substructures can be used by any learning algorithm, as they
>> > are standard ot:Features.
>> >
>> > Here are more details and proposal (start from *Substructure API proposal
>> > heading*)
>> >
>> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
>> >
>> > Best regards,
>> > Nina
>> >
>> > P.S. Please note the /mcss algorithm might be slow for large datasets, there
>> > are several improvements that we'll be applying performance wise, but this
>> > will not change the API.
>> >
>> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch> wrote:
>> >
>> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
>> >>
>> >> > > This type of representation (we are using it internally) has served well
>> >> > > for our datasets which might contain also several (10-100) thousand
>> >> > > substructures for a few thousands compounds.
>> >> > > I also do not think, that
>> >> > > the representation is redundant:
>> >> > > - each compound is represented once
>> >> > > - each substructure is represented once
>> >> > > - each association between compound and substructure is represented once
>> >> > > Please correct me, if I am missing something obvious.
>> >> >
>> >> > According to this representation each dataEntry for a compound will
>> >> > have to have all substructure features that were found in them.
>> >> > Therefore each dataEntry may have 1000-10000 feature/featureValue
>> >> > pairs. For 500 data entries that means on an average of
>> >> > 500*5000 (assuming 5000 substructures) = 2,500,000 feature/featureValue
>> >> > pairs - that's 2.5 million!
>> >>
>> >> In our case it is a lot less (not completely sure about your feature
>> >> types), because only a very small subset of features occurs in a single
>> >> compound.
>> >>
>> >> > versus just having a featureset with
>> >> > 5000 feature entries. You can imagine the difference in cost of
>> >> > bandwidth, computation etc.
>> >>
>> >> I am not sure, if I get you right, but where do you want to store the
>> >> relationships between features and compounds? If there are really 2.5
>> >> million associations you have to assert them somewhere. And having features
>> >> without compounds seems to be quite useless for me.
>> >>
>> >> > >
>> >> > > Adding "false" occurrences would not violate the current API (but would
>> >> > > add redundant information). Keep in mind that the dataset representation
>> >> > > is mainly for exchanging datasets between services - internally you can
>> >> > > use any datastructure that is efficient for your purposes (we also do
>> >> > > that in our services). So if you need fingerprints internally, extract
>> >> > > them from the dataset.
>> >> >
>> >> > Internalizing an intermediate step completely serves the purpose but
>> >> > leads to less flexible design paradigms.
>> >> > If we internalize the
>> >> > workflow from substructure extraction to fingerprinting - we will lose
>> >> > the ability to provide the data to a third party server for an
>> >> > independent workflow. Of course the reasoning could be "who needs it
>> >> > ?" - well you never know !!
>> >>
>> >> I am very interested in exchanging "fingerprints" with other services,
>> >> but that can be done already with the current API. I see fingerprints as
>> >> sets of features that are present in a compound (also using set
>> >> operations to calculate similarities), and find it fairly
>> >> straightforward to parse/serialize them to/from datasets.
>> >>
>> >> > >> I still suggest having a FeatureSet/SubstructureSet type object within
>> >> > >> the API to make it convenient to club features without compound
>> >> > >> representations.
>> >> > >
>> >> > > I prefer to keep the API as generic as possible and not to introduce
>> >> > > ad-hoc objects (or optimizations) for special purposes - otherwise it
>> >> > > will be difficult to maintain services in the long term. Why don't you
>> >> > > use ontologies for grouping features?
>> >> >
>> >> > Grouping features using ontologies is clubbing the features, not the
>> >> > feature values
>> >>
>> >> But you cannot have feature values without relating features to
>> >> compounds. If you use the representation I proposed feature values are
>> >> "true" anyway.
>> >>
>> >> > So how do we know mcss3 occurring in compound X is with respect to
>> >> > which compound. As you said we can have arbitrary fields in the feature
>> >> > definitions (for MCSS) - but that would be outside API definitions.
>> >>
>> >> features:
>> >>   mcss3:
>> >>     ot:componds:
>> >>       - compound2
>> >>       - compound3
>> >>     ot:smarts: smarts3
>> >>
>> >> In my understanding you can add any annotation you want to a feature.
>> >>
>> > Yes, you can, but if this is not an agreed annotation, no other service
>> > will understand it.
>> >
>> > Best regards,
>> > Nina
>> >
>> >> Best regards,
>> >> Christoph
>> >> _______________________________________________
>> >> Development mailing list
>> >> Development at opentox.org
>> >> http://www.opentox.org/mailman/listinfo/development

-- 
Surajit Ray
Partner
www.rareindianart.com