[OTDev] descriptor recalculation
Christoph Helma helma at in-silico.ch
Thu Apr 29 23:33:21 CEST 2010
- Previous message: [OTDev] descriptor recalculation
- Next message: [OTDev] descriptor recalculation
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Excerpts from Nina Jeliazkova's message of Thu Apr 29 20:04:17 +0200 2010:

> Dear Christoph,
>
> Few comments from my point of view. I (partially) agree, the difference
> perhaps is I would like to keep both the "low level" services and
> "super" services aboard, not hiding everything in "super" services.

I agree that the "low level" services should still be available for
everyone - that would make it much easier to create innovative "super"
services, and it is also necessary for tasks other than creating models
and predictions.

> The examples are of course valid, however these are actually not single
> services, but workflows of services, rather simple ones.
>
> What we have right now are actually building blocks of services, and
> compositions of these are being built case by case. For example the
> mlr case might be
>
> training-dataset = data_preprocessing(raw-dataset)
> descriptors_dataset1 = cdk-descriptors(training-dataset)
> descriptors_dataset2 = other-descriptors(training-dataset)
> descriptors_dataset = merge(descriptors_dataset1, descriptors_dataset2)
> pca-descriptors = pca(descriptors_dataset)
> mlr-model = mlr(pca-descriptors)
> model.model = mlr-model
> model.descriptors = pca-descriptors
>
> (note that the outcome of each calculation is again a dataset, so in
> short we have
>
> training-dataset = data_preprocessing(raw-dataset)
> model = mlr(merge(cdk-descriptors(training-dataset), other-descriptors(training-dataset)))
>
> There could be multiple services in the chain, and it might not be
> sufficient just to submit services as parameters; there should be a way
> to specify the order of the processing.

Agreed. Any ideas how to represent sequences?

> However, it should be fine to wrap such a workflow in a "super" service.
>
> > The model learning task and the prediction task may utilize one or more
> > algorithms (or models - the separation blurs once again), but at the
> > high level I want to use only the "super" algorithms/models.
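Nina's building-block composition could be sketched in plain Python, with
ordinary functions standing in for the REST services (all names here are
illustrative, not part of the OpenTox API; in practice each step would be
a POST to an algorithm service returning a new dataset URI):

```python
# Stand-ins for the OpenTox building-block services (names hypothetical).

def data_preprocessing(raw_dataset):
    # e.g. cleanup / normalisation of the raw training data
    return {"compounds": raw_dataset["compounds"], "clean": True}

def cdk_descriptors(dataset):
    # one descriptor service: compound -> feature values
    return {c: {"logP": 1.0} for c in dataset["compounds"]}

def other_descriptors(dataset):
    # a second, independent descriptor service
    return {c: {"tpsa": 20.0} for c in dataset["compounds"]}

def merge(d1, d2):
    # union of feature columns per compound
    return {c: {**d1[c], **d2.get(c, {})} for c in d1}

def pca(descriptor_dataset):
    # placeholder: a real service would project onto principal components
    return descriptor_dataset

def mlr(descriptor_dataset):
    # placeholder model: remembers which descriptors it was built from
    return {"algorithm": "mlr", "descriptors": descriptor_dataset}

raw = {"compounds": ["c1", "c2"]}
training = data_preprocessing(raw)
model = mlr(pca(merge(cdk_descriptors(training), other_descriptors(training))))
print(sorted(model["descriptors"]["c1"]))  # → ['logP', 'tpsa']
```

Representing the sequence explicitly - e.g. as an ordered list of service
URIs inside a "super" algorithm - would be one possible answer to the
ordering question above.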
> For clear separation, I would like to propose the following
> interpretation -
>
> - "algorithm" means a generic "algorithm", or series of steps
> - "model" is the "algorithm", applied to the data, given relevant
>   parameters.
>
> Thus, in the kNN example, the "algorithm" is the theoretical procedure of
> finding the closest k neighbors and generating a prediction based on their
> values; while the "model" is the kNN algorithm, applied to a specific data
> set, distance metric and equation/procedure (e.g. averaging values).
>
> For eager learning algorithms there should not be confusion.

Thanks for the clarification - hope I memorize it this time ;-)

> > As a GUI developer I still want to have access to the underlying
> > algorithms, but they can be provided as parameters (our existing API is
> > quite flexible in this respect). An algorithm webservice could provide
> > e.g. a high-level regression algorithm that allows choosing descriptor
> > calculation, feature selection and modelling parameters by setting
> > parameters (and it should document and check internally which algorithms
> > play together). Future lazar versions e.g. will have the facility to
> > freely switch descriptor calculation services or to use datasets with
> > biological measurements. Maybe we should add the facility to represent
> > sub-algorithms in OWL-DL for "super" algorithms.
> >
> > According to our API the model knows about ot.Algorithm and
> > ot.IndependentVariables, but it would need to know the service to
> > calculate independent variables.
>
> It does actually - every feature (variable) has ot:hasSource, which
> points to the service it has been generated from (e.g. the descriptor
> calculation one) - and this is what we use in ToxPredict.

True, but that makes sense only for "simple" descriptor calculation
algorithms (i.e. descriptors that are independent of the training
activities, like phys-chem properties or substructures). If we use e.g.
supervised graph mining techniques, we need

(i) an algorithm (a model, because it is an algorithm applied to data?)
that mines features in the training dataset and creates a feature dataset
(e.g. fminer)

(ii) a simple substructure matching algorithm that determines if the mined
features are present in the compound to be predicted (e.g. the OpenBabel
SMARTS matcher)

My interpretation was that ot:hasSource should point to the graph mining
algorithm, but the model would need the substructure matcher for
predictions. How should we handle this?

> > This could be inferred from the
> > ot.Algorithm's sub-algorithms or stated explicitly. More importantly
> > the service would have to be able to call the necessary services
>
> Yes it already does - get the independent variable's ot:hasSource
> property, run POST and get the result
>
> > (of
> > course this has to be implemented, if you are using stock ML/DM tools -
> > but OpenTox should be more than just wrapping existing programs in a
> > REST interface). It would be a large waste of effort if every
> > developer had to implement descriptor calculation separately in
> > their webservice clients.
>
> Agree.
>
> But it will also be a waste if e.g. descriptor calculation services
> are hidden within model services and not accessible for reuse by
> other models.

I do not want to hide them completely - on the contrary, everyone should
be able to mash up descriptor calculation/selection and model learning
services. I was just arguing in favor of better encapsulation that makes
programming/experimenting easier (e.g. models that accept structures
instead of descriptor sets as input for predictions).

> For example, descriptors currently used in lazar don't seem to be
> accessible for use in other services (I might just not be aware though),
> while descriptors exposed as separate services by TUM and IDEA are.
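The two-algorithm situation above can be sketched as follows (all names
and URIs are hypothetical): the mined features carry a pointer to the
mining service via hasSource, while the model must separately record
which matching service to call at prediction time - information that
ot:hasSource alone cannot convey:

```python
# Sketch of the graph-mining case: one service mines features from the
# training set, a different service matches them at prediction time.
# All names/URIs are illustrative, not the actual OpenTox API.

def fminer(training_smiles):
    # stand-in for supervised graph mining: returns substructure features,
    # each pointing back to the service that generated it
    return [{"smarts": "c1ccccc1", "hasSource": "/algorithm/fminer"}]

def smarts_match(feature, compound_smiles):
    # stand-in for a SMARTS matcher (e.g. OpenBabel): naive substring test
    return feature["smarts"] in compound_smiles

features = fminer(["c1ccccc1O", "CCO"])

model = {
    "independentVariables": features,           # each has a hasSource link
    "featureCalculation": "/algorithm/smarts",  # matcher needed for prediction
}

def predict(model, compound_smiles):
    # at prediction time the model calls the *matcher*, not the miner
    return [smarts_match(f, compound_smiles)
            for f in model["independentVariables"]]

print(predict(model, "c1ccccc1O"))  # → [True]
```

The sketch makes the gap concrete: following hasSource from a feature
would lead back to the mining service, which cannot be re-run on a single
query compound.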
Descriptor calculation is currently performed by fminer, which is
available (and used by lazar) as an independent standalone service:

http://webservices.in-silico.ch/algorithm/fminer

It can easily be exchanged for other descriptor services. On the other
hand, Andreas uses e.g. fminer/last features to create SVM models.

> > To sum up my personal opinion:
> >
> > For ToxCreate I would like to handle two high-level objects/services:
> > training-algorithm (for creating models) and model (for predictions). I
> > do not want to have to care about implementation details for model
> > training and predictions, but would like to have access to the
> > underlying algorithms through parameters.
>
> Or links from the RDF representation of related objects.

Might also be a possibility to represent sequences of steps.

> > We might need minor API
> > changes for representing "super" algorithm services (i.e. algorithm
> > services that call other algorithm services) and for informing the
> > model service about the right descriptor calculation service.
>
> I am also comfortable with the idea of having "super" (proxy, composite)
> services to encapsulate workflows, like in your examples.
>
> BTW "super" service sounds better than the "proxy" service I was
> suggesting the other day.

Thanks for the compliment, it was the first thing that came into my mind.

Best regards,
Christoph