[OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms
surajit ray mr.surajit.ray at gmail.com
Thu Dec 2 14:34:08 CET 2010
- Previous message: [OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms
- Next message: [OTDev] Datasets with Features for multi entity relationships ? Models & Algorithms
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Nina,

To organise the discussion better, I have created a new page to capture the discussion on featuresets:

http://opentox.org/dev/apis/api-1.2/featureset-and-workarounds

I have moved your discussion points onto this page from the substructure wishlist page.

On 30 November 2010 18:43, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:

> No problem to extend the API to be able to group features. (In fact we have
> this implemented, even with hierarchical grouping -
> http://apps.ideaconsult.net:8080/ambit2/template/Taxonomy - which turns out
> to be quite useful for the ToXML representation.) It could be documented and
> included in the API.

Yes, let's have featuresets please.

> IMHO there is no sense in assigning a feature or a featureset to a dataset
> without specifying what the relationship between the dataset and the
> features is. This is perfectly served by the algorithm/model approach so far.

We can capture an explicit relationship in a "FeaturesetValue" every time we assign a "Featureset" to a dataset. It is explicit, simpler, and we can even put the URI of the creating algorithm in the FeaturesetValue.

> The dummy dataset suggestion is a hack, which lacks consistency, and I am
> not in favour of it.

I guess the same can be said of Christoph's method of assigning substructures to compounds just to capture the substructure set in one dataset. On the flip side, every API that I have worked with (Google Maps API, Facebook API, Facebook Graph API, Flex/Flash) has a "hack" which became the norm. IMHO, though, the best solution in this case is to have a Featureset with a FeaturesetValue to explicitly outline the relationship to the dataset.

> See above for sets of features. What we would like to have in OpenTox, more
> than other libraries offer, is to be able to tell how these features have
> been calculated.
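To make the proposal concrete, here is a minimal sketch of what a Featureset plus a FeaturesetValue could look like. All class and field names here are hypothetical - none of this is part of the current OpenTox API; it only illustrates how the relationship and the creating algorithm's URI could be recorded without a dummy model or dummy dataset.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Featureset:
    # Hypothetical: a named collection of feature URIs, e.g. MCSS substructures.
    uri: str
    feature_uris: List[str] = field(default_factory=list)

@dataclass
class FeaturesetValue:
    # Hypothetical: the explicit relationship created when a Featureset is
    # assigned to a dataset - it records *how* the features relate to the data,
    # including the URI of the algorithm that produced them.
    featureset_uri: str
    dataset_uri: str
    relationship: str        # e.g. "mcss-substructures-of"
    creating_algorithm: str  # provenance: URI of the creating algorithm

# Example: assigning a set of MCSS substructures to a dataset.
fs = Featureset("/featureset/42", ["/feature/mcss1", "/feature/mcss2"])
fsv = FeaturesetValue(
    featureset_uri=fs.uri,
    dataset_uri="/dataset/112",
    relationship="mcss-substructures-of",
    creating_algorithm="/algorithm/mcss",
)
print(fsv.creating_algorithm)  # -> /algorithm/mcss
```

The point of the sketch is that provenance (which algorithm created the features, for which dataset) survives without a model-building step in the middle.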
> The CDK fingerprinter (if you mean
> org.openscience.cdk.fingerprint.Fingerprinter) is not a good example here,
> since it uses hashed fingerprints, which are almost impossible to translate
> to SMARTS.
>
> The CDK does a very good job of specifying descriptor metadata via an
> ontology, but this is not (yet?) done for fingerprinting (as far as I
> know), although the fingerprinter algorithm could be included in the
> BlueObelisk or ChemInf ontology the same way descriptor algorithms are.

>> Sorry, this was my impression from earlier discussions.

> Look at my examples, this is exactly what comes from the MCSS model.
>
> Having a single compound with substructures assigned as features is
> inconsistent for the following reason.
>
> The meaning of the set of substructures is that they have been obtained by
> MCSS (or a fingerprinting algorithm), and are MCSS structures for the
> entire dataset. Assigning them to a single dummy compound means all this
> information is lost.

W.r.t. the hack - yes, we lose the information - and yet it is many times simpler than creating a model just to represent a set of features (substructures). A featureset with a FeaturesetValue solves this problem well, without resorting to a needless model-building step or the "hack".

> Having fragments submitted to another fingerprinter algorithm, which by
> definition works on whole compounds, is essentially mixing substructures
> with compounds. What if the fingerprinter algorithm starts to normalize the
> fragments as if they were compounds?

The fingerprinter in this case is going to take two inputs - a dataset and a featureset (substructure set). Again, where is the question of mixing the two?

> curl -X GET /model/id/predicted gives you a list of features, as URLs or
> RDF. There is no need to extract anything.
> We could easily add a new MIME type
> to support SMARTS for feature representation (whenever relevant) and you'll
> get a list of SMARTS by something like
>
> curl -X GET -H "Accept:chemical/x-smarts" /model/id/predicted
>
> (Hm, is there a MIME format for SMARTS?)
>
> Besides, the current scheme supports ANY kind of fingerprinter, regardless
> of whether it extracts fragments in the form of SMILES/SMARTS or just
> reports some encoded strings (as the PubChem fingerprinter) or
> un-interpretable bits (as hashed fingerprints).

A SMARTS MIME type on the model/id/predicted URL can provide a list of features - but it is a non-generic way of representing the set of features. It is non-generic to imagine that a fingerprinting algorithm takes a model/id/predicted as input - especially since the "model" may not have any relationship with the fingerprinter.

> No, not a dataset comprising features, but a dataset comprising compounds.
>
> You could define an algorithm that takes a list of features as input and
> produces a new dataset of compounds, if there is a meaningful way to do so
> (e.g. for SMARTS-based features). This means there will be no assumption
> that features are compounds, but a documented service that does the
> conversion in a known way.

Why would we need to convert features to compounds?

>>> curl -X POST /algorithm/features2dataset -d
>>> "feature_uris[]=/model/mcss1/predicted" -> /dataset/newdatasetfromfeatures
>>>
>>> Then you are done, POST the dataset into other algorithms as usual.
>>
>> I am sorry but I could not understand how a dataset will be created in
>> this case.
>
> Set up an algorithm service which will read the features, find out whether
> a feature is a substructure, generate compounds for them (e.g. an SDF file)
> and post the SDF content to a dataset service - thus it will create a
> dataset.

Again, I could not see the requirement for such a service - I would like to capture "substructures" (as features), not convert them to compounds.
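For clarity, here is a rough sketch of the features2dataset conversion Nina describes: read a list of features, keep those that carry a substructure (SMARTS) annotation, and build a new dataset from them. The function name, the dict layout, and the example feature URIs are all made up for illustration; a real service would render each SMARTS to a structure record (e.g. SDF) and POST it to a dataset service.

```python
# Hypothetical sketch of a features2dataset conversion service.
# Only SMARTS-annotated features can meaningfully become "compounds";
# hashed-fingerprint bits carry no structure and are skipped.

def features2dataset(features):
    """Build a dataset from the subset of features that are substructures."""
    compounds = [
        {"source_feature": f["uri"], "smarts": f["smarts"]}
        for f in features
        if f.get("smarts")  # skip features without a substructure annotation
    ]
    return {"dataEntries": compounds}

features = [
    {"uri": "/feature/mcss1", "smarts": "c1ccccc1"},  # SMARTS-annotated fragment
    {"uri": "/feature/bit7", "smarts": None},         # hashed bit: not convertible
]
dataset = features2dataset(features)
print(len(dataset["dataEntries"]))  # -> 1
```

The sketch also makes the objection above concrete: the conversion is lossy for any feature that is not SMARTS-based, which is one reason for wanting to keep substructures as features rather than turning them into compounds.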
> As well for individual compounds.
>
> IMHO it is not easy to understand what "assigning" a feature set to a
> dataset means. It tells neither where the feature set came from, nor how it
> is related to the dataset. This is all loss of information, which
> contributes to the poor reproducibility of any models.

Yes - a FeaturesetValue to capture that explicit value of the relationship is just what is needed here.

> So far OpenTox has a very simple and logical API (yes, I have heard this
> from external developers) - datasets are processed by algorithms/models and
> written to datasets - that's all, basically.

And yet you want to build this complex logic of making a Model essentially to store a set of features?

> Just need a feature set, regardless of how it was obtained ... that's how
> irreproducible cheminformatics models are born...

Most industry cheminformatics is quite irreproducible - but it is paid for and viable! Also, the onus to reproduce the results is on the user - not on the datasets and algorithms in the services.

> If you use a dummy dataset with dummy compounds, you are introducing a mess
> into the dataset service, because those dummy compounds and features, which
> are not really features of that compound, will appear as results of
> searches hitting that compound.

So essentially you are saying we cannot have any datasets with some feature values set to "false" or 0 to denote absence in a compound?

>> Ideally, however, I still maintain it is important to have featuresets.
>
> Feature sets alone are fine, see above.

Then let's go for it. We have thought a lot about it, and the indirect methods suggested until now just seem to be an attempt to stonewall any big changes in the API - which, from the perspective of an API, is hara-kiri. The Google Maps API released 3 major versions in 5 years. By comparison, our API upgrades are just moving from 1.1 to 1.2 in 2 years, with barely any changes ...
> A generic API means it can be applied to a great variety of problems.
> Specific solutions introduce incompatibility.
> It's a generic computer-science approach - try to abstract things, break
> the larger problem into smaller pieces, find the commonalities. That's how
> IT works ...

So, are we doing cheminformatics (vs. just toxicity)? Are we representing atomic features? If both your answers are "no", then we can safely say we have not abstracted enough!

> Having your simple information represented in a way specific to your
> problem doesn't make things compatible ... Now everybody can use
> https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss to retrieve MCSS
> structures from a dataset of their choice and then run them through any of
> the Weka algorithms available from any partner. Will your approach do
> anything similar?

I can't see a point to debate here ... I am looking for a generic solution for collecting features. This can be achieved quite simply by having a Featureset and a FeaturesetValue (when assigning to a dataset) to explicitly capture the value of the relationship. The beauty is that we do not need a "model" in the middle just to capture some explicit relationships.

Regards
Surajit

> Regards,
> Nina

>>> Well, my point of view is that an algorithm applied to specific data with
>>> specific parameters should be considered a model (descriptor calculations
>>> included). An algorithm is just an abstract sequence of steps; when one
>>> applies it to data with specific parameters, then a model is generated.
>>> This would make the API much more consistent (now some algorithms
>>> generate a model, and the result of other algorithms is a dataset, which
>>> is quite confusing for external developers).
>>> But at this point I am not insisting on changing the API that far ;)
>>>
>>> Regards,
>>> Nina

>>>> Cheers
>>>> Surajit
>>>>
>>>> On 29 November 2010 14:05, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
>>>>
>>>>> Dear Christoph, Surajit, All,
>>>>>
>>>>> This discussion is very useful.
>>>>>
>>>>> As a result of myself trying to understand both points of view, we now
>>>>> have an MCSS algorithm as an ambit service (thanks to the CDK SMSD
>>>>> package).
>>>>>
>>>>> https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss
>>>>>
>>>>> It can be applied to a dataset and generates a model, where the
>>>>> predicted features (MCSS in this case) are available via
>>>>> ot:predictedVariables (example:
>>>>> https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted).
>>>>> The features use the current API, without any change (although having
>>>>> ot:Substructure as a subclass of ot:Feature would make it clearer).
>>>>>
>>>>> All the MCSS substructures can be used by any learning algorithm, as
>>>>> they are standard ot:Features.
>>>>>
>>>>> Here are more details and a proposal (start from the *Substructure API
>>>>> proposal* heading):
>>>>>
>>>>> http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist
>>>>>
>>>>> Best regards,
>>>>> Nina
>>>>>
>>>>> P.S. Please note the /mcss algorithm might be slow for large datasets;
>>>>> there are several performance improvements that we'll be applying, but
>>>>> this will not change the API.
>>>>> On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch> wrote:
>>>>>
>>>>>> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
>>>>>>
>>>>>>>> This type of representation (we are using it internally) has served
>>>>>>>> well for our datasets, which might also contain several (10-100)
>>>>>>>> thousand substructures for a few thousand compounds. I also do not
>>>>>>>> think that the representation is redundant:
>>>>>>>> - each compound is represented once
>>>>>>>> - each substructure is represented once
>>>>>>>> - each association between compound and substructure is represented once
>>>>>>>> Please correct me if I am missing something obvious.
>>>>>>>
>>>>>>> According to this representation, each dataEntry for a compound will
>>>>>>> have to have all substructure features that were found in it.
>>>>>>> Therefore each dataEntry may have 1000-10000 feature/featureValue
>>>>>>> pairs. For 500 dataEntries that means on average
>>>>>>> 500*5000 (assuming 5000 substructures) = 2,500,000
>>>>>>> feature/featureValue pairs - that's 2.5 million!
>>>>>>
>>>>>> In our case it is a lot less (not completely sure about your feature
>>>>>> types), because only a very small subset of features occurs in a
>>>>>> single compound.
>>>>>>
>>>>>>> versus just having a featureset with
>>>>>>> 5000 feature entries. You can imagine the difference in cost of
>>>>>>> bandwidth, computation etc.
>>>>>>
>>>>>> I am not sure if I get you right, but where do you want to store the
>>>>>> relationships between features and compounds? If there really are 2.5
>>>>>> million associations, you have to assert them somewhere.
>>>>>> And having features without compounds seems to be quite useless to me.
>>>>>>
>>>>>>>> Adding "false" occurrences would not violate the current API (but
>>>>>>>> would add redundant information). Keep in mind that the dataset
>>>>>>>> representation is mainly for exchanging datasets between services -
>>>>>>>> internally you can use any data structure that is efficient for your
>>>>>>>> purposes (we also do that in our services). So if you need
>>>>>>>> fingerprints internally, extract them from the dataset.
>>>>>>>
>>>>>>> Internalizing an intermediate step completely serves the purpose but
>>>>>>> leads to less flexible design paradigms. If we internalize the
>>>>>>> workflow from substructure extraction to fingerprinting, we will lose
>>>>>>> the ability to provide the data to a third-party server for an
>>>>>>> independent workflow. Of course the reasoning could be "who needs
>>>>>>> it?" - well, you never know!
>>>>>>
>>>>>> I am very interested in exchanging "fingerprints" with other services,
>>>>>> but that can be done already with the current API. I see fingerprints
>>>>>> as sets of features that are present in a compound (also using set
>>>>>> operations to calculate similarities), and find it fairly
>>>>>> straightforward to parse/serialize them to/from datasets.
>>>>>>
>>>>>>> I still suggest having a FeatureSet/SubstructureSet type object
>>>>>>> within the API to make it convenient to club features without
>>>>>>> compound representations.
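The two viewpoints quoted above can be sketched side by side: fingerprints treated as sets of present features with a set-operation similarity (Christoph's view), and the dense vs. sparse association counts from the thread's arithmetic. The feature URIs and the 1% occurrence rate below are illustrative assumptions, not figures from the discussion.

```python
# Sketch: fingerprints as sets of present features, compared with a set
# operation (Tanimoto coefficient = |intersection| / |union|).

def tanimoto(fp_a, fp_b):
    """Similarity of two fingerprints represented as sets of feature URIs."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

compound1 = {"/feature/mcss1", "/feature/mcss2", "/feature/mcss3"}
compound2 = {"/feature/mcss2", "/feature/mcss3", "/feature/mcss4"}
print(tanimoto(compound1, compound2))  # -> 0.5 (2 shared out of 4 total)

# The thread's arithmetic: a dense encoding stores every feature value for
# every entry, present or not.
dense_pairs = 500 * 5000
print(dense_pairs)  # -> 2500000

# A sparse (set-based) encoding stores only the associations that are
# actually present; if, say, only ~1% of features occur per compound
# (an assumed rate), the count shrinks accordingly.
sparse_pairs = 500 * 50
print(sparse_pairs)  # -> 25000
```

This is why Christoph's count is "a lot less" in practice: the 2.5-million figure assumes every entry asserts every feature, while the sparse set representation asserts only the true occurrences.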
>>>>>>>> I prefer to keep the API as generic as possible and not to
>>>>>>>> introduce ad-hoc objects (or optimizations) for special purposes -
>>>>>>>> otherwise it will be difficult to maintain services in the long
>>>>>>>> term. Why don't you use ontologies for grouping features?
>>>>>>>
>>>>>>> Grouping features using ontologies is clubbing the features, not the
>>>>>>> feature values.
>>>>>>
>>>>>> But you cannot have feature values without relating features to
>>>>>> compounds. If you use the representation I proposed, feature values
>>>>>> are "true" anyway.
>>>>>>
>>>>>>> So how do we know which compound mcss3, occurring in compound X, is
>>>>>>> with respect to? As you said, we can have arbitrary fields in the
>>>>>>> feature definitions (for MCSS) - but that would be outside the API
>>>>>>> definitions.
>>>>>>
>>>>>> features:
>>>>>>   mcss3:
>>>>>>     ot:compounds:
>>>>>>       - compound2
>>>>>>       - compound3
>>>>>>     ot:smarts: smarts3
>>>>>>
>>>>>> In my understanding you can add any annotation you want to a feature.
>>>>>
>>>>> Yes, you can, but if this is not an agreed annotation, no other
>>>>> service will understand it.
>>>>> Best regards,
>>>>> Nina
>>>>>
>>>>>> Best regards,
>>>>>> Christoph

--
Surajit Ray
Partner
www.rareindianart.com

_______________________________________________
Development mailing list
Development at opentox.org
http://www.opentox.org/mailman/listinfo/development