[OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms
Nina Jeliazkova jeliazkova.nina at gmail.comThu Dec 2 14:44:39 CET 2010
Hi Surajit,

"Most industry chemoinformatics is quite irreproducible - but is paid
for and viable! Also the onus to reproduce the results is with the
user - not with the datasets and algorithms in the services."

This is exactly what we are struggling to overcome in OpenTox - thus,
if we don't agree on this point, I don't see the point of the
discussion and will leave the consensus to others. Having a hack for
something which is demonstrated to have a solution with the current
API doesn't make sense to me (IMHO).

Regards,
Nina

On 2 December 2010 15:34, surajit ray <mr.surajit.ray at gmail.com> wrote:

> Hi Nina,
>
> To organise the discussion better, I have created a new page to
> capture the discussion on feature sets:
> http://opentox.org/dev/apis/api-1.2/featureset-and-workarounds
>
> I have moved your discussion points onto this page from the
> substructure wishlist page.
>
> On 30 November 2010 18:43, Nina Jeliazkova
> <jeliazkova.nina at gmail.com> wrote:
>
> > No problem to extend the API to be able to group features. (In
> > fact we have this implemented, even with hierarchical grouping -
> > http://apps.ideaconsult.net:8080/ambit2/template/Taxonomy - which
> > turns out to be quite useful for the ToXML representation.) It
> > could be documented and included in the API.
>
> Yes, let's have feature sets, please.
>
> > IMHO there is no sense in assigning a feature or a feature set to
> > a dataset without specifying what the relationship between the
> > dataset and the features is. This is perfectly served by the
> > algorithm/model approach so far.
>
> We can capture an explicit relationship in a "FeaturesetValue" every
> time we assign a "Featureset" to a dataset. It's explicit, simpler,
> and we can even put the URI of the creating algorithm in the
> FeaturesetValue.
>
> > The dummy dataset suggestion is a hack which lacks consistency,
> > and I am not in favour of it.
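[Editor's note] The Featureset/FeaturesetValue pairing discussed above can be made concrete with a minimal data-structure sketch. This is illustrative only: the class names, fields, and example URIs below are assumptions drawn from the thread, not part of any published OpenTox API.

```python
from dataclasses import dataclass

# Illustrative sketch of the proposed objects. The names and fields
# are assumptions from the mailing-list discussion, not the OpenTox
# 1.1/1.2 API.

@dataclass
class Featureset:
    uri: str
    feature_uris: list  # URIs of the member ot:Feature resources

@dataclass
class FeaturesetValue:
    """Explicit link between a Featureset and a dataset, recording
    which algorithm created the features (for reproducibility)."""
    featureset_uri: str
    dataset_uri: str
    creating_algorithm_uri: str

fs = Featureset(uri="/featureset/1",
                feature_uris=["/feature/mcss1", "/feature/mcss2"])
link = FeaturesetValue(featureset_uri=fs.uri,
                       dataset_uri="/dataset/42",
                       creating_algorithm_uri="/algorithm/mcss")
print(link.creating_algorithm_uri)  # -> /algorithm/mcss
```

The `creating_algorithm_uri` field is precisely Nina's reproducibility concern: the assignment records how the features were obtained, not merely that they exist.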
> I guess the same can be said of Christoph's method of assigning
> substructures to compounds just to capture the substructure set in
> one dataset. On the flip side - every API that I have worked with
> (Google Maps API, Facebook API, Facebook Graph API, Flex/Flash) has
> a "hack" which became the norm. IMHO, in this case though, the best
> solution is to have a Featureset with a FeaturesetValue to
> explicitly outline the relationship to the dataset.
>
> > See above for sets of features. What we would like to have in
> > OpenTox, more than other libraries offer, is to be able to tell
> > how these features have been calculated.
> >
> > The CDK fingerprinter (if you mean
> > org.openscience.cdk.fingerprint.Fingerprinter) is not a good
> > example here, since it uses hashed fingerprints, which are almost
> > impossible to translate to SMARTS.
> >
> > The CDK does a very good job of specifying descriptor metadata via
> > an ontology, but this is not (yet?) done for fingerprinting (as
> > far as I know), although a fingerprinter algorithm could be
> > included in the BlueObelisk or ChemInf ontology the same way
> > descriptor algorithms are.
> >
> > Sorry, this was my impression from earlier discussions.
> >
> > Look at my examples; this is exactly what comes from the MCSS
> > model.
> >
> > Having a single compound with substructures assigned as features
> > is inconsistent for the following reason: the meaning of the set
> > of substructures is that they have been obtained by MCSS (or a
> > fingerprinting algorithm), and are MCSS structures for the entire
> > dataset. Assigning them to a single dummy compound means all this
> > information is lost.
>
> W.r.t. the hack - yes, we lose the information - and yet it is many
> times simpler than creating a model just to represent a set of
> features (substructures).
>
> A Featureset with a FeaturesetValue solves this problem well,
> without resorting to a needless model-building step or the "hack".
> > Having fragments submitted to another fingerprinter algorithm,
> > which by definition works on whole compounds, is essentially
> > mixing substructures with compounds. What if the fingerprinter
> > algorithm starts to normalize the fragments as if they were
> > compounds?
>
> The fingerprinter in this case is going to take two inputs - a
> dataset and a featureset (substructure set). Again, where is the
> question of mixing the two?
>
> > curl -X GET /model/id/predicted gives you the list of features, as
> > URLs or RDF. There is no need to extract anything. We could easily
> > add a new MIME type to support SMARTS for feature representation
> > (whenever relevant), and you would get a list of SMARTS by
> > something like
> >
> > curl -X GET -H "Accept:chemical/x-smarts" /model/id/predicted
> >
> > (Hm, is there a MIME type for SMARTS?)
> >
> > Besides, the current scheme supports ANY kind of fingerprinter,
> > regardless of whether it extracts fragments in the form of
> > SMILES/SMARTS, or just reports some encoded strings (as the
> > PubChem fingerprinter does) or un-interpretable bits (as hashed
> > fingerprints do).
>
> A SMARTS MIME type on the model/id/predicted URL can provide a list
> of features - but it is a non-generic way of representing the set of
> features. It is non-generic to imagine that a fingerprinting
> algorithm takes a model/id/predicted as input - especially since the
> "model" may not have any relationship with the fingerprinter.
>
> > No, not a dataset comprising features, but a dataset comprising
> > compounds.
> >
> > You could define an algorithm that takes a list of features as
> > input and produces a new dataset of compounds, if there is a
> > meaningful way to do so (e.g. for SMARTS-based features). This
> > means there will be no assumption that features are compounds, but
> > a documented service that does the conversion in a known way.
>
> Why would we need to convert features to compounds?
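[Editor's note] The content-negotiation idea above can be illustrated from the client side. A minimal sketch: the request is built but not sent, and `chemical/x-smarts` is the hypothetical MIME type floated in the message, not a registered one.

```python
from urllib.request import Request

# Build (but do not send) a GET for a model's predicted features,
# asking for a SMARTS representation via content negotiation.
# "chemical/x-smarts" is a MIME type proposed in the discussion, not
# a registered one; the URL is the thread's own example.
req = Request(
    "https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted",
    headers={"Accept": "chemical/x-smarts"},
)
print(req.get_method(), req.get_header("Accept"))  # -> GET chemical/x-smarts
```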
> > > > curl -X POST /algorithm/features2dataset -d
> > > > "feature_uris[]=/model/mcss1/predicted" ->
> > > > /dataset/newdatasetfromfeatures
> > > >
> > > > Then you are done; POST the dataset into other algorithms as
> > > > usual.
> > >
> > > I am sorry, but I could not understand how a dataset will be
> > > created in this case.
> >
> > Set up an algorithm service which will read the features, find
> > whether a feature is a substructure, generate compounds for them
> > (e.g. an SDF file) and post the SDF content to a dataset service -
> > thus it will create a dataset.
>
> Again, I could not see the need for such a service - I would like to
> capture "substructures" (as features), not convert them to
> compounds.
>
> > As well for individual compounds.
> >
> > IMHO it is not easy to understand what "assigning" a feature set
> > to a dataset means. It tells neither where the feature set came
> > from, nor how it is related to the dataset. This is all loss of
> > information, which contributes to the poor reproducibility of any
> > models.
>
> A FeaturesetValue to capture that explicit value of the relationship
> is just what is needed here.
>
> > So far OpenTox has a very simple and logical API (yes, I have
> > heard this from external developers) - datasets are processed by
> > algorithms/models and written to datasets - that's all, basically.
>
> And yet you want to build this complex logic of making a Model to
> essentially store a set of features?
>
> > Just needing a feature set, regardless of how it was obtained...
> > that's how irreproducible cheminformatics models are born...
>
> Most industry chemoinformatics is quite irreproducible - but is paid
> for and viable! Also, the onus to reproduce the results is with the
> user - not with the datasets and algorithms in the services.
>
> > If you use a dummy dataset with dummy compounds, you are
> > introducing a mess into the dataset service.
> > Because those dummy compounds, and features which are not really
> > features of that compound, will appear as results of searches
> > hitting that compound.
>
> So essentially you are saying we cannot have any datasets with some
> feature values set to "false" or 0 to denote absence in a compound?
>
> > > Ideally, however, I still maintain it is important to have
> > > feature sets.
> >
> > Feature sets alone are fine; see above.
>
> Then let's go for it. We have thought a lot about it, and the
> indirect methods suggested until now just seem to be an attempt to
> stonewall any big changes in the API - which, from the perspective
> of an API, is hara-kiri. The Google Maps API released 3 major
> versions in 5 years. By comparison, our API upgrades are just moving
> from 1.1 to 1.2 in 2 years, with barely any changes...
>
> > A generic API means it can be applied to a great variety of
> > problems. Specific solutions introduce incompatibility. It's the
> > generic computer science approach - try to abstract things, break
> > the larger problem into smaller pieces, find the commonalities.
> > That's how IT works...
>
> So are we doing cheminformatics (vs. just toxicity)? Are we
> representing atomic features? If both your answers are "no", then we
> can safely say we have not abstracted enough!
>
> > Having your simple information represented in a way specific to
> > your problem doesn't make things compatible... Now everybody can
> > use https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss to
> > retrieve MCSS structures from a dataset of their choice, and then
> > run it through any of the Weka algorithms made available by any
> > partner. Will your approach do anything similar?
>
> I can't see a point to debate here... I am looking for a generic
> solution for collecting features. This can be achieved quite simply
> by having a Featureset, and a FeaturesetValue (when assigning it to
> a dataset) to explicitly capture the value of the relationship.
> The beauty is we do not need a "model" in the middle just to capture
> some explicit relationships.
>
> Regards
> Surajit
>
> > Regards,
> > Nina
> >
> > > Regards
> > > Surajit
> > >
> > > > Well, my point of view is that an algorithm applied to
> > > > specific data with specific parameters should be considered a
> > > > model (descriptor calculations included). An algorithm is just
> > > > an abstract sequence of steps; when one applies it to data
> > > > with specific parameters, a model is generated. This would
> > > > make the API much more consistent (now some algorithms
> > > > generate a model, while the result of other algorithms is a
> > > > dataset, which is quite confusing for external developers).
> > > > But at this point I am not insisting on changing the API that
> > > > far ;)
> > > >
> > > > Regards,
> > > > Nina
> > > >
> > > > > Regards
> > > > > Surajit
> > > > >
> > > > > > Regards,
> > > > > > Nina
> > > > > >
> > > > > > > Cheers
> > > > > > > Surajit
> > > > > > >
> > > > > > > On 29 November 2010 14:05, Nina Jeliazkova
> > > > > > > <jeliazkova.nina at gmail.com> wrote:
> > > > > > > > Dear Christoph, Surajit, All,
> > > > > > > >
> > > > > > > > This discussion is very useful.
> > > > > > > >
> > > > > > > > As a result of myself trying to understand both points
> > > > > > > > of view, we now have the MCSS algorithm as an AMBIT
> > > > > > > > service (thanks to the CDK SMSD package):
> > > > > > > > https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
> > > > > > > >
> > > > > > > > It can be applied to a dataset and generates a model,
> > > > > > > > where the predicted features (MCSS in this case) are
> > > > > > > > available via ot:predictedVariables (example:
> > > > > > > > https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted).
> > > > > > > > The features use the current API, without any change
> > > > > > > > (although having ot:Substructure as a subclass of
> > > > > > > > ot:Feature would make it clearer).
> > > > > > > > All the MCSS substructures can be used by any learning
> > > > > > > > algorithm, as they are standard ot:Features.
> > > > > > > >
> > > > > > > > Here are more details and a proposal (start from the
> > > > > > > > *Substructure API proposal* heading):
> > > > > > > > http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Nina
> > > > > > > >
> > > > > > > > P.S. Please note the /mcss algorithm might be slow for
> > > > > > > > large datasets; there are several performance
> > > > > > > > improvements we will be applying, but they will not
> > > > > > > > change the API.
> > > > > > > >
> > > > > > > > On 25 November 2010 18:13, Christoph Helma
> > > > > > > > <helma at in-silico.ch> wrote:
> > > > > > > >
> > > > > > > > > Excerpts from surajit ray's message of Thu Nov 25
> > > > > > > > > 14:49:19 +0100 2010:
> > > > > > > > >
> > > > > > > > > > > This type of representation (we are using it
> > > > > > > > > > > internally) has served well for our datasets,
> > > > > > > > > > > which might also contain several (10-100)
> > > > > > > > > > > thousand substructures for a few thousand
> > > > > > > > > > > compounds. I also do not think that the
> > > > > > > > > > > representation is redundant:
> > > > > > > > > > > - each compound is represented once
> > > > > > > > > > > - each substructure is represented once
> > > > > > > > > > > - each association between compound and
> > > > > > > > > > >   substructure is represented once
> > > > > > > > > > > Please correct me if I am missing something
> > > > > > > > > > > obvious.
> > > > > > > > > >
> > > > > > > > > > According to this representation, each dataEntry
> > > > > > > > > > for a compound will have to have all substructure
> > > > > > > > > > features that were found in it. Therefore each
> > > > > > > > > > dataEntry may have 1000-10000 feature/featureValue
> > > > > > > > > > pairs.
> > > > > > > > > > For 500 data entries that means on average
> > > > > > > > > > 500*5000 (assuming 5000 substructures) = 2,500,000
> > > > > > > > > > feature/featureValue pairs - that's 2.5 million!
> > > > > > > > >
> > > > > > > > > In our case it is a lot less (I am not completely
> > > > > > > > > sure about your feature types), because only a very
> > > > > > > > > small subset of features occurs in a single compound.
> > > > > > > > >
> > > > > > > > > > versus just having a featureset with 5000 feature
> > > > > > > > > > entries. You can imagine the difference in cost of
> > > > > > > > > > bandwidth, computation etc.
> > > > > > > > >
> > > > > > > > > I am not sure if I get you right, but where do you
> > > > > > > > > want to store the relationships between features and
> > > > > > > > > compounds? If there really are 2.5 million
> > > > > > > > > associations, you have to assert them somewhere. And
> > > > > > > > > having features without compounds seems quite
> > > > > > > > > useless to me.
> > > > > > > > >
> > > > > > > > > > > Adding "false" occurrences would not violate the
> > > > > > > > > > > current API (but would add redundant
> > > > > > > > > > > information). Keep in mind that the dataset
> > > > > > > > > > > representation is mainly for exchanging datasets
> > > > > > > > > > > between services - internally you can use any
> > > > > > > > > > > data structure that is efficient for your
> > > > > > > > > > > purposes (we also do that in our services). So
> > > > > > > > > > > if you need fingerprints internally, extract
> > > > > > > > > > > them from the dataset.
> > > > > > > > > >
> > > > > > > > > > Internalizing an intermediate step completely
> > > > > > > > > > serves the purpose, but leads to less flexible
> > > > > > > > > > design paradigms. If we internalize the workflow
> > > > > > > > > > from substructure extraction to fingerprinting, we
> > > > > > > > > > will lose the ability to provide the data to a
> > > > > > > > > > third-party server for an independent workflow.
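[Editor's note] The bandwidth argument above can be made concrete with back-of-the-envelope arithmetic. The 500 and 5000 figures are Surajit's; the ~1% occurrence density is an illustrative assumption standing in for Christoph's point that only a small subset of features occurs in any single compound.

```python
# Worst case (dense): every data entry asserts a feature/featureValue
# pair for every substructure, present or absent.
compounds = 500
substructures = 5000
dense_pairs = compounds * substructures
print(dense_pairs)  # -> 2500000 (the "2.5 million" in the thread)

# Christoph's counter-point: the occurrence matrix is sparse. If only
# ~1% of substructures occur in a given compound (an assumed figure),
# serializing just the "true" entries is two orders of magnitude less.
per_compound = substructures // 100  # ~1% of 5000 = 50 features
sparse_pairs = compounds * per_compound
print(sparse_pairs)  # -> 25000
```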
> > > > > > > > > > Of course the reasoning could be "who needs it?" -
> > > > > > > > > > well, you never know!
> > > > > > > > >
> > > > > > > > > I am very interested in exchanging "fingerprints"
> > > > > > > > > with other services, but that can be done already
> > > > > > > > > with the current API. I see fingerprints as sets of
> > > > > > > > > features that are present in a compound (also using
> > > > > > > > > set operations to calculate similarities), and find
> > > > > > > > > it fairly straightforward to parse/serialize them
> > > > > > > > > to/from datasets.
> > > > > > > > >
> > > > > > > > > > > > I still suggest having a
> > > > > > > > > > > > FeatureSet/SubstructureSet type object within
> > > > > > > > > > > > the API to make it convenient to club features
> > > > > > > > > > > > without compound representations.
> > > > > > > > > > >
> > > > > > > > > > > I prefer to keep the API as generic as possible
> > > > > > > > > > > and not to introduce ad-hoc objects (or
> > > > > > > > > > > optimizations) for special purposes - otherwise
> > > > > > > > > > > it will be difficult to maintain services in the
> > > > > > > > > > > long term. Why don't you use ontologies for
> > > > > > > > > > > grouping features?
> > > > > > > > > >
> > > > > > > > > > Grouping features using ontologies is clubbing the
> > > > > > > > > > features, not the feature values.
> > > > > > > > >
> > > > > > > > > But you cannot have feature values without relating
> > > > > > > > > features to compounds. If you use the representation
> > > > > > > > > I proposed, feature values are "true" anyway.
> > > > > > > > >
> > > > > > > > > > So how do we know which compound mcss3 (occurring
> > > > > > > > > > in compound X) is with respect to? As you said, we
> > > > > > > > > > can have arbitrary fields in the feature
> > > > > > > > > > definitions (for MCSS) - but that would be outside
> > > > > > > > > > the API definitions.
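[Editor's note] Christoph's reading of fingerprints as sets of present features maps directly onto set operations; the similarity he alludes to is conventionally the Tanimoto (Jaccard) coefficient. A minimal sketch, with made-up feature URIs:

```python
# Fingerprints as sets of the features present in a compound
# (Christoph's reading); the feature URIs are invented for
# illustration.
fp_a = {"/feature/mcss1", "/feature/mcss2", "/feature/mcss3"}
fp_b = {"/feature/mcss2", "/feature/mcss3", "/feature/mcss4"}

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient: |intersection| / |union|."""
    return len(a & b) / len(a | b)

print(tanimoto(fp_a, fp_b))  # 2 shared out of 4 distinct -> 0.5
```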
> > > > > > > > > features:
> > > > > > > > >   mcss3:
> > > > > > > > >     ot:compounds:
> > > > > > > > >       - compound2
> > > > > > > > >       - compound3
> > > > > > > > >     ot:smarts: smarts3
> > > > > > > > >
> > > > > > > > > In my understanding you can add any annotation you
> > > > > > > > > want to a feature.
> > > > > > > >
> > > > > > > > Yes, you can, but if this is not an agreed annotation,
> > > > > > > > no other service will understand it.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Nina
> > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Christoph
> > > > > > > > > _______________________________________________
> > > > > > > > > Development mailing list
> > > > > > > > > Development at opentox.org
> > > > > > > > > http://www.opentox.org/mailman/listinfo/development
> > >
> > > --
> > > Surajit Ray
> > > Partner
> > > www.rareindianart.com