[OTDev] descriptor recalculation
Nina Jeliazkova nina at acad.bg
Wed Apr 28 11:07:07 CEST 2010
Hello All,

There is an updated REST tutorial
http://www.inf.usi.ch/faculty/pautasso/lectures/REST-Tutorial-WWW2010.pdf
from the WS-REST 2010 workshop (http://www.ws-rest.org/), which took place a
couple of days ago. Part 4, "Composite resources" (from slide 88), seems
quite relevant to the discussion in this thread.

Best regards,
Nina

Nina Jeliazkova wrote:
> Hi,
>
> Martin Guetlein wrote:
>> On Tue, Apr 20, 2010 at 7:52 PM, Nina Jeliazkova <nina at acad.bg> wrote:
>>
>>> Hi Tobias, All,
>>>
>>> I am trying to think of the API extension/change necessary to include
>>> the intermediate descriptor calculation service (preferably without
>>> making it mandatory).
>>>
>>> My suggestions:
>>>
>>> 1) For models and algorithms, there might be an (optional) parameter
>>> pointing to the URI of the new calculation service. If the parameter is
>>> missing and descriptor calculation is necessary, the model either
>>> initiates the calculations itself or returns an error.
>>>
>>> 2) The API of the new calculation service (is 'recalculation' the right
>>> name for it? Perhaps 'proxy calculation service' is closer in meaning):
>>>
>>> GET: RDF representation of an ot:Algorithm object, with the algorithm
>>> type as in 3)
>>>
>>> POST parameters:
>>> dataset_uri - as usual
>>> model_uri - the URI of the model
>>> or (alternatively) algorithm_uris[] - list of descriptor calculation
>>> algorithms
>>>
>>> POST result: returns the URI of the dataset with calculated descriptors
>>>
>>> 3) Eventually include a new type in the Algorithm type ontology for the
>>> proxy calculation service, and use it in the RDF representation of the
>>> proxy service; thus one could find out whether such a service exists
>>> within the list of algorithms.
>>>
>>> Will this be sufficient? I am not sure whether I am missing something
>>> important; discussion welcome!
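[Editor's note: a minimal sketch of the client side of the POST described in
point 2) above, using only the parameter names proposed in the thread
(dataset_uri, model_uri, algorithm_uris[]). The example URIs and the helper
function name are hypothetical; the real call would be an HTTP POST of this
form-encoded body to the proxy calculation service.]

```python
from urllib.parse import urlencode

def build_recalculation_request(dataset_uri, model_uri=None, algorithm_uris=None):
    """Build the form-encoded body for a POST to the proposed proxy
    calculation service. Exactly one of model_uri / algorithm_uris is
    given, matching the 'or (alternatively)' wording in the proposal."""
    if (model_uri is None) == (algorithm_uris is None):
        raise ValueError("give either model_uri or algorithm_uris, not both")
    params = [("dataset_uri", dataset_uri)]
    if model_uri is not None:
        params.append(("model_uri", model_uri))
    else:
        # algorithm_uris[] is repeated once per descriptor algorithm
        params.extend(("algorithm_uris[]", uri) for uri in algorithm_uris)
    return urlencode(params)

body = build_recalculation_request(
    "http://example.org/dataset/1",
    algorithm_uris=["http://example.org/algorithm/logP",
                    "http://example.org/algorithm/tpsa"],
)
print(body)
```

The POST response would then carry the URI of the dataset with the
calculated descriptors, as described in the proposal.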
>>>
>>> Best regards,
>>> Nina
>>
>> Hello all,
>>
>> I just discussed the 'PCA problem' that we discovered during the
>> GoToMeeting with Andreas M. Our proposal requires further API changes,
>> but it should work.
>>
>> ------------------------------------
>>
>> Problem description:
>> (We use variant 2 from Tobias' proposal.)
>>
>> Step 1 - Client builds a model with params:
>> * training dataset (only compounds, no features)
>> * feature-mining algorithm
>
> I guess you mean descriptor calculation here.
>
>> * feature-selection algorithm (PCA)
>>
>> The model service passes the params to the feature (re)calculation
>> service. The feature-recalculation service first applies the
>> feature-mining algorithm, then applies the feature-selection algorithm.
>> The feature-recalculation service passes a new feature dataset back to
>> the model service. The model service builds the model based on the
>> feature dataset. The model service passes the model URI to the client.
>
> This will not quite fit for algorithms which require no structures, but
> descriptors directly.
>
>> Step 2 - Client predicts a compound with the model
>> * uri: model-uri
>> * param: test compound (plain compounds, no features)
>>
>> The model service passes the feature-mining params (stored in the
>> model???)
>
> The independent variables are already stored in the model representation
> under the ot:independentVariables property.
>
>> to the (re)calculation service to calculate the features for the test
>> compound.
>> PROBLEM: The feature-selection algorithm PCA needs information from the
>> original training dataset to apply the same mechanism to the test
>> features!
>>
>> ------------------------------------
>>
>> Proposal:
>> * Distinguish between feature selection Algorithms and Models.
>
> OK
>
>> * Distinguish between a 'mine-feature step' (on training data) and a
>> 'check-feature step' (test compound).
> Why? In both cases the result is generating the descriptors, be it from
> some mining procedure or from reading pre-calculated descriptors from
> elsewhere.
> "Mining" is not always the right word, especially for the common class
> of models using calculated whole-molecule descriptors.
>
>> * The 'mine-feature step' is available from the algorithm and builds the
>> model (as well as the features for the dataset).
>
> This removes the separation of learning algorithms and descriptor
> calculations, which is not a good approach IMHO.
>
>> * The 'check-feature step' is available for models only.
>> * Feature selection algorithms where 'checking' is independent of
>> 'mining' are per definition models (like a chi-square filter).
>
> Yes indeed, feature selection algorithms like PCA might be considered
> models, because the result of such a model will be a dataset.
> However, other feature selection algorithms simply return a list of
> feature URIs.
> Then we need no API extension - PCA will just be an algorithm, building
> a model with specific parameters and a dataset, and storing the results
> in a dataset and related features. The prediction model itself will use
> the result dataset.
> We might need a kind of super-model encapsulating PCA + the prediction
> model in this case.
>
> In fact, I was advocating some months earlier that we need to
> distinguish between descriptor calculation algorithms and descriptor
> calculation models - there are several examples of descriptor
> calculation algorithms which use a set of parameters, and even datasets,
> as parameters. If we switch to having all features generated by models,
> not algorithms, then the API becomes a lot more consistent.
>
>> In the use case described above, the feature-recalculation service
>> would return a list of feature-selection-model URIs, which are stored
>> in the model and can be used for recalculating the features for the
>> test compound.
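[Editor's note: the "super-model encapsulating PCA + the prediction model"
idea mentioned above can be sketched as plain function composition. This is
a toy illustration, not OpenTox API code: both "models" are local callables
here, whereas in the API they would be model URIs invoked over REST, and
the stand-in transforms are invented for the example.]

```python
def make_super_model(feature_selection_model, prediction_model):
    """Compose a fitted feature-selection model (e.g. a PCA model that maps
    a dataset of raw descriptors to a dataset of selected/transformed
    features) with a prediction model that consumes those features.

    Because the same fitted feature_selection_model is reused at
    prediction time, the test compound is transformed with exactly the
    mechanism learned on the training data - which is the PCA problem
    described in this thread."""
    def super_model(dataset):
        features = feature_selection_model(dataset)  # training-time transform
        return prediction_model(features)            # prediction on features
    return super_model

# toy stand-ins: the "PCA" keeps the first two descriptor columns,
# and the prediction model sums them per compound
select_first_two = lambda rows: [row[:2] for row in rows]
sum_features = lambda rows: [sum(row) for row in rows]

model = make_super_model(select_first_two, sum_features)
print(model([[1, 2, 9], [3, 4, 9]]))  # [3, 7]
```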
> It seems to me the real case might be even more complex than having a
> "descriptor recalculation" service. The more generic solution would be
> to have a kind of complex service (workflow?), where one could specify a
> series of steps and dependencies necessary to build a model or get
> predictions. For example, data cleaning and sampling are also
> algorithms that could be used in the model-building process, and later
> in using models for prediction.
>
>> Alternatively, we could, for the beginning, not support feature
>> selection/preprocessing algorithms like PCA.
>
> This might be a reasonable option for the demo in Berlin.
>
> Best regards,
> Nina
>
>> What do you think?
>>
>> Best regards,
>> Martin
>>
>>> Tobias Girschick wrote:
>>>> Hi Nina,
>>>>
>>>> the green and the black lines are two possibilities to go through the
>>>> workflow. In the PDF the workflow has to be read from bottom to top
>>>> (more or less). Everything starts with some prediction application
>>>> (e.g. ToxPredict or a ValidationService, ...) that needs descriptors
>>>> to be recalculated for prediction. I added the third variant in red
>>>> arrows and split the one slide into three to make it easier to read.
>>>>
>>>> In variant 1 (black) no descriptor recalculation service is needed,
>>>> and every model service has to delegate the descriptor recalculation
>>>> to all descriptor calculation services.
>>>> In variant 2 (green) the descriptor recalculation service is called
>>>> by the model service. The recalc service delegates the necessary
>>>> descriptor calculations. In both cases the model service gets a
>>>> dataset that does not have all the descriptors needed to use the
>>>> model for predicting the dataset.
>>>> In variant 3 (red) the descriptor recalculation service is called
>>>> directly by the application, delegates the descriptor calculations,
>>>> and updates the dataset. This updated dataset is then submitted by
>>>> the application itself to the model service.
>>>>
>>>> I hope this clarifies my rough sketch from last week.
>>>>
>>>> regards,
>>>> Tobias
>>>>
>>>> On Tue, 2010-04-20 at 14:42 +0300, Nina Jeliazkova wrote:
>>>>> Hi Tobias,
>>>>>
>>>>> Could you tell me what the difference between the black and green
>>>>> lines in your schema is?
>>>>>
>>>>> I would suggest starting a new wiki page under API to discuss the
>>>>> descriptor calculator and its API.
>>>>>
>>>>> Best regards,
>>>>> Nina
>>>>>
>>>>> Tobias Girschick wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I attached one slide which illustrates the problem from my point of
>>>>>> view. The green and the black lines are the two possibilities. Note
>>>>>> that the "descriptor recalculator" has to be implemented only once
>>>>>> (if it is generic). Otherwise, every new algorithm that learns
>>>>>> models has to provide the whole functionality of calling all the
>>>>>> different descriptor calculation services.
>>>>>>
>>>>>> I think that wrapping the distribution to the different descriptor
>>>>>> calculation services makes things a lot easier.
>>>>>>
>>>>>> Just to again kick off the discussion.
>>>>>> regards,
>>>>>> Tobias
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> This body part will be downloaded on demand.
>>>
>>> _______________________________________________
>>> Development mailing list
>>> Development at opentox.org
>>> http://www.opentox.org/mailman/listinfo/development
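[Editor's note: a sketch of variant 3 (red) from Tobias' slides, in which
the application itself drives the recalculation and then submits the
updated dataset to the model service. The function names and URIs are
hypothetical stubs; a real implementation would replace them with HTTP
POSTs to the respective OpenTox services.]

```python
def recalculation_service(dataset_uri, feature_uris):
    """Stub for the descriptor recalculation service: it delegates the
    descriptor calculations and returns the URI of the updated dataset.
    Here we just fake an updated-dataset URI by appending the features."""
    return dataset_uri + "?feature_uris[]=" + "&feature_uris[]=".join(feature_uris)

def model_service_predict(model_uri, dataset_uri):
    """Stub for the prediction POST to the model service."""
    return "prediction:" + model_uri + ":" + dataset_uri

def predict_variant3(model_uri, dataset_uri, feature_uris):
    # In variant 3 the application makes both calls itself; the model
    # service never has to know about descriptor calculation services.
    updated_dataset = recalculation_service(dataset_uri, feature_uris)
    return model_service_predict(model_uri, updated_dataset)

result = predict_variant3("http://example.org/model/1",
                          "http://example.org/dataset/42",
                          ["http://example.org/feature/logP"])
print(result)
```

In variants 1 and 2 the first call would instead be made by the model
service, which is exactly the coupling this sketch avoids.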