[OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms
Nina Jeliazkova jeliazkova.nina at gmail.com
Tue Nov 30 10:22:53 CET 2010
- Previous message: [OTDev] Should AA take place before the creation of a task?
- Next message: [OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Surajit,

On 30 November 2010 11:00, surajit ray <mr.surajit.ray at gmail.com> wrote:

> Hi,
>
> On 30 November 2010 14:15, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> > Hi,
> >
> > On 30 November 2010 10:29, surajit ray <mr.surajit.ray at gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Another method that does not break the API as well as captures feature
> >> sets is to create a dataset with one compound (maybe a C or CC) and
> >> assign all the substructure features to it (with value as false or
> >> true). In the dc:source of the dataset we can mention the dataset from
> >> which it was derived. And in the description we can describe it as a
> >> dataset to store MCSS features from dataset (or whatever the
> >> relationship with the mother dataset).
> >>
> >> I think this would be a simpler method than creating a new Model just
> >> for storing substructures.
> >
> > It might seem simpler, but is definitely less consistent, as it implies
> > different meaning of dataset and properties and their relationships. There
> > will be no explicit relationship to the algorithm/model doing the
> > processing, which makes MCSS a specific case and breaks OpenTox API, where
> > algorithms and models are the procedures that process data, and this is
> > explicitly stored in the generated data objects.
>
> The relationship is defined in dc:description of the "featureset". It
> is explicit. Secondly a reference to the algorithm which generated
> this can also be stored in the description.

dc:description is an annotation property and does not define any
relationship between classes. Besides, this breaks the OpenTox API, as it
differs from the way other relationships are defined.

> > With the current scheme, it is easy to handle algorithms like Kabsch
> > alignment for a dataset with the same generic mechanism as for MCSS (I am
> > sure there will be more cases like this).
> > I don't see the point of inventing a
> > specific solution for a single case, while it could be handled in a generic
> > way (agree with earlier comment by Christoph on that).
>
> This is not a specific solution but a very general one - one which
> addresses a basic need within any chemistry api - which is to
> represent sets of features independently of compounds.

We have features (ot:Feature) independent of compounds (ot:Compound). What
makes most sense in modeling is to have a relationship between features and
compounds (the values). What you are implying is that substructures are
both features and compounds - which they are not, and mixing them leads to
errors and confusion.

If you have substructure "C" and use it for SMARTS searching, it will look
for a single carbon atom. If you have a compound defined by SMILES "C", it
implies CH4, which is different. Mixing both is not a good idea; making the
difference explicit makes it harder to misinterpret things.

> > Besides, the model is definitely not for just storing substructures, it can
> > and will be used for predictions of new compounds (if they have those
> > substructures) in a uniform way (POST a new compound to the MCSS model
> > and you'll get if its MCSS substructures are one of existing ones, or it is
> > different and far away from that dataset).
>
> What if I have a better graph comparator algorithm for fingerprinting
> - will that take a model as an input just to extract features ?

No, define /algorithm/myfingerprint, which takes feature_uris[] as input
parameters:

curl -X POST /algorithm/myfingerprint -d "feature_uris[]=/model/mcss1/predicted"

This will work with any set of features.

Even better, if you would like to convert features to a dataset, define a
converter algorithm, which takes features, verifies if they are
substructures and generates a dataset.
curl -X POST /algorithm/features2dataset -d "feature_uris[]=/model/mcss1/predicted"
-> /dataset/newdatasetfromfeatures

Then you are done, POST the dataset into other algorithms as usual.

> >> The problem with the model approach is that
> >> 1) The substructures cannot be easily downloaded without accessing the
> >> model
> >
> > They can - /model/id/predicted gives you the list of features (see my
> > examples)
>
> Well of course a whole model infrastructure may provide a way to
> extract the predicted feature set. But that would imply giving the
> model as an input to a third party fingerprinter.

Not necessarily, see above.

> > And also - this is exactly the advantage - you don't have just a set of
> > substructures you don't know where they are coming from, but everything is
> > explicitly defined - the substructures are the result of applying a given
> > algorithm on a given dataset.
>
> We know that from the dc:source and dc:description

No, we don't. These are annotation properties, not object properties. They
might provide hints for human readers, while the whole framework strives to
provide explicit relationships for automatic processing.

> >> 2) The set of substructures cannot be given to a better finger printer
> >> (maybe with a faster graph comparator)
> >
> > Of course they can - once we have smarts representation of the
> > ot:Substructure - what is the obstacle of feeding them into any other
> > algorithm ?
>
> Again the question - are we going to use a model as an input to
> another algorithm to extract features ?

No, see above - features are already available as /model/id/predicted -
this is just a set of features.

> >> The fingerprinter in such a case becomes a separate algorithm which
> >> can take a dataset as input as well as a "featureset" - which is
> >> actually a dummy dataset with the full list of features.
> >
> > A fingerprinter should be indeed an algorithm - this is how OpenTox API is
> > designed.
> > Any processing should be instantiated as an algorithm.
>
> In your case the fingerprinter is a model ....

No, the fingerprinter itself (/algorithm/mcss) is an ot:Algorithm. Only
after it is applied to a specific dataset does it become a model of exactly
that dataset.

Well, my point of view is that an algorithm applied to specific data with
specific parameters should be considered a model (descriptor calculations
included). An algorithm is just an abstract sequence of steps; when one
applies it to data with specific parameters, a model is generated. This
would make the API much more consistent (now some algorithms generate a
model, and the result of other algorithms is a dataset, which is quite
confusing for external developers). But at this point I am not insisting on
changing the API that far ;)

Regards,
Nina

> Regards
> Surajit
>
> > Regards,
> > Nina
> >
> >> Cheers
> >> Surajit
> >>
> >> On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
> >> > Dear Christoph, Surajit, All,
> >> >
> >> > This discussion is very useful.
> >> >
> >> > As a result of myself trying to understand both points of view, now we have
> >> > MCSS algorithm as ambit service (thanks to CDK SMSD package).
> >> >
> >> > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
> >> >
> >> > It can be applied to a dataset and generates a model, where predicted
> >> > features (MCSS in this case) are available via ot:predictedVariables
> >> > (example https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted)
> >> > The features use current API, without any change (although having
> >> > ot:Substructure subclass of ot:Feature will make it more clear).
> >> >
> >> > All the MCSS substructures can be used by any learning algorithm, as they
> >> > are standard ot:Features.
> >> >
> >> > Here are more details and proposal (start from *Substructure API proposal
> >> > heading *)
> >> >
> >> > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
> >> >
> >> > Best regards,
> >> > Nina
> >> >
> >> > P.S. Please note the /mcss algorithm might be slow for large datasets, there
> >> > are several improvements that we'll be applying performance wise, but this
> >> > will not change the API .
> >> >
> >> > On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch> wrote:
> >> >
> >> >> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
> >> >>
> >> >> > > This type of representation (we are using it internally) has served well
> >> >> > > for our datasets which might contain also several (10-100) thousand
> >> >> > > substructures for a few thousands compounds. I also do not think, that
> >> >> > > the representation is redundant:
> >> >> > > - each compound is represented once
> >> >> > > - each substructure is represented once
> >> >> > > - each association between compound and substructure is represented once
> >> >> > > Please correct me, if I am missing something obvious.
> >> >> >
> >> >> > According to this representation each dataEntry for a compound will
> >> >> > have to have all substructure features that were found in them.
> >> >> > Therefore each dataEntry may have 1000-10000 feature/featureValue
> >> >> > pairs . For 500 datasentries that means on an average of
> >> >> > 500*5000(assuming 5000 substructures) = 2,500,000 feature/featureValue
> >> >> > pairs - thats 2.5 million !
> >> >>
> >> >> In our case it is a lot less (not completely sure about your feature
> >> >> types), because only a very small subset of features occurs in a single
> >> >> compound.
> >> >>
> >> >> > versus just having a featureset with a
> >> >> > 5000 feature entries. You can imagine the difference in cost of
> >> >> > bandwidth,computation etc.
> >> >>
> >> >> I am not sure, if I get you right, but where do you want to store the
> >> >> relationships between features and compounds? If there are really 2.5
> >> >> million associations you have to assert them somewhere. And having features
> >> >> without compounds seems to be quite useless for me.
> >> >>
> >> >> > >
> >> >> > > Adding "false" occurences would not violate the current API (but would
> >> >> > > add redundant information). Keep in mind that the dataset representation
> >> >> > > is mainly for exchanging datasets between services - internally you can
> >> >> > > use any datastructure that is efficient for your purposes (we also do
> >> >> > > that in our services). So if you need fingerprints internally, extract
> >> >> > > them from the dataset.
> >> >> >
> >> >> > Internalizing an intermediate step completely serves the purpose but
> >> >> > leads to less flexible design paradigms. If we internalize the
> >> >> > workflow from substructure extraction to fingerprinting - we will lose
> >> >> > the ability to provide the data to a third party server for an
> >> >> > independent workflow. Of course the reasoning could be "who needs it
> >> >> > ?" - well you never know !!
> >> >>
> >> >> I am very interested in exchanging "fingerprints" with other services,
> >> >> but that can be done already with the current API. I see fingerprints as
> >> >> sets of features that are present in a compound (also using set
> >> >> operations to calculate similarities), and find it fairly
> >> >> straightforward to parse/serialize them to/from datasets.
> >> >>
> >> >> > >> I still suggest having a FeatureSet/SubstructureSet type object within
> >> >> > >> the API to make it convenient to club features without compound
> >> >> > >> representations.
> >> >> > >
> >> >> > > I prefer to keep the API as generic as possible and not to introduce
> >> >> > > ad-hoc objects (or optimizations) for special purposes - otherwise it
> >> >> > > will be difficult to maintain services in the long term. Why don't you
> >> >> > > use ontologies for grouping features?
> >> >> >
> >> >> > Grouping features using ontologies is clubbing the features Not the
> >> >> > feature values
> >> >>
> >> >> But you cannot have feature values without relating features to
> >> >> compounds. If you use the representation I proposed feature values are
> >> >> "true" anyway.
> >> >>
> >> >> > So how do we know mcss3 occuring in compound X is with respect to
> >> >> > which compound. As you said we can have arbitary fields in the feature
> >> >> > definitions (for MCSS) - but that would be outside API definitions.
> >> >>
> >> >> features:
> >> >>   mcss3:
> >> >>     ot:componds:
> >> >>       - compound2
> >> >>       - compound3
> >> >>     ot:smarts: smarts3
> >> >>
> >> >> In my understanding you can add any annotation you want to a feature.
> >> >>
> >> >
> >> > Yes, you can, but if this is not an agreed annotation, no other service
> >> > will understand it.
> >> >
> >> > Best regards,
> >> > Nina
> >> >
> >> >> Best regards,
> >> >> Christoph
> >> >> _______________________________________________
> >> >> Development mailing list
> >> >> Development at opentox.org
> >> >> http://www.opentox.org/mailman/listinfo/development
>
> --
> Surajit Ray
> Partner
> www.rareindianart.com
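
The two curl calls in the reply above (POSTing feature_uris[] to /algorithm/myfingerprint and to /algorithm/features2dataset) can be sketched in Python as follows. This is only an illustration under stated assumptions: the algorithm paths and the feature_uris[] parameter come from the message, while the base URL https://example.org and the helper name feature_uris_request are hypothetical placeholders. The requests are built but not sent, since neither algorithm is confirmed to exist on any server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: placeholder host; the message only gives relative paths.
BASE = "https://example.org"


def feature_uris_request(algorithm_path, feature_uris, base=BASE):
    """Build (but do not send) a form-encoded POST that mirrors
    curl -X POST <algorithm_path> -d "feature_uris[]=<uri>".
    The helper name is hypothetical, not part of the OpenTox API."""
    # One feature_uris[] form field per feature URI.
    body = urlencode([("feature_uris[]", uri) for uri in feature_uris])
    return Request(
        base + algorithm_path,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Features predicted by the MCSS model, as referenced in the message.
predicted = ["/model/mcss1/predicted"]

# Feed the features to another fingerprinting algorithm.
fingerprint_req = feature_uris_request("/algorithm/myfingerprint", predicted)

# Or convert the features into a dataset first.
convert_req = feature_uris_request("/algorithm/features2dataset", predicted)

for req in (fingerprint_req, convert_req):
    print(req.get_method(), req.full_url, req.data.decode("ascii"))
```

Passing a request to urllib.request.urlopen would actually send it. According to the message, the converter call would return a new dataset URI, which could then be POSTed to other algorithms as usual.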