[OTDev] Clustering and Scaling Algorithms

chung chvng at mail.ntua.gr
Fri Sep 3 19:37:14 CEST 2010


On Fri, 2010-09-03 at 20:22 +0300, Nina Jeliazkova wrote:

> On Fri, Sep 3, 2010 at 8:14 PM, chung <chvng at mail.ntua.gr> wrote:
> 
> > Dear All,
> >  We had a discussion here about new kinds of algorithms, such as
> > clustering and scaling ones, and I think we need to clarify some
> > details before proceeding with the implementation of such
> > functionality. First of all, some algorithms (e.g. SVM) have to be
> > fed with scaled training data (whose values vary between -1 and 1,
> > or 0 and 1 in some cases) in order to produce reliable results.
> > This requires not just a "scaling service" that accepts a dataset
> > as input and creates a new dataset with the scaled data; the
> > minimum and maximum values per feature of the dataset also have to
> > be stored somewhere. These values could be saved in the model,
> > stored in some new kind of resource (e.g.
> > under /scaling_parameters/123), or retrieved (dynamically) from an
> > existing dataset (e.g. from /dataset/{id}/minmax or something
> > equivalent). So this is something to be discussed. Note that the
> > SVM training algorithm produces high-quality results only if it
> > uses scaled data as input, and note also that a test dataset
> > applied to a model for prediction needs to be scaled with respect
> > to the min and max values of the training dataset. For
> > synchronization and data-consistency reasons I would suggest that
> > getting min/max from /dataset/{id}/minmax is the best way to go.
> >
> 
> I would prefer minmax to be per feature, not per dataset, and URIs like
> /feature/{id}/minmax to return the min/max values (or other relevant
> statistics). There might be a "statistics" algorithm that could return
> min, max, average, standard deviation, etc. Of course, it should run
> close to the dataset service for performance reasons.
> 


What we need are the min and max values of a feature restricted to a
specific dataset. The range of values for the feature could be declared
in the RDF of the feature as well. We could easily build a service for
getting min and max values out of a dataset, but this would involve
downloading and parsing the dataset, so, as you said, running close to
the dataset service would be more effective.
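To make the idea concrete, here is a minimal sketch (plain Python, with
hypothetical feature names and values, not the actual OpenTox API) of what
such a service would compute: per-feature min/max over a training dataset,
scaling into [-1, 1], and reuse of the same stored min/max to scale a test
dataset consistently.

```python
# Hypothetical sketch: per-feature min/max statistics and [-1, 1] scaling.
# The stats dict is what would be stored (in the model, under
# /scaling_parameters/{id}, or served from /dataset/{id}/minmax).

def feature_minmax(dataset):
    """Return {feature: (min, max)} over all compounds in the dataset."""
    stats = {}
    for compound in dataset:
        for feature, value in compound.items():
            lo, hi = stats.get(feature, (value, value))
            stats[feature] = (min(lo, value), max(hi, value))
    return stats

def scale(dataset, stats, lo=-1.0, hi=1.0):
    """Scale each feature value into [lo, hi] using the given min/max."""
    scaled = []
    for compound in dataset:
        row = {}
        for feature, value in compound.items():
            fmin, fmax = stats[feature]
            if fmax == fmin:  # constant feature: map to the midpoint
                row[feature] = (lo + hi) / 2.0
            else:
                row[feature] = lo + (hi - lo) * (value - fmin) / (fmax - fmin)
        scaled.append(row)
    return scaled

# Hypothetical descriptor data for two training compounds and one test compound.
training = [{"logP": 1.0, "MW": 100.0}, {"logP": 3.0, "MW": 300.0}]
test = [{"logP": 2.0, "MW": 200.0}]

stats = feature_minmax(training)   # the values that need to be stored/served
scaled_test = scale(test, stats)   # test scaled w.r.t. the TRAINING min/max
```

The key point the sketch illustrates is the last line: the test set is scaled
with the training set's min/max, which is why those values must be retrievable
after training.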


> For scaling one could have a "scaling" algorithm that produces a
> "scaled" dataset with the current API,
> 
> e.g.
> curl -X POST -d "dataset_uri=" /algorithm/scaling
> 
> returns dataset uri of the scaled dataset.
> 
> 
> >   Second, clustering algorithms are of high importance in predictive
> > toxicology, but it is unclear how a cluster can be represented in
> > OpenTox. We plan to implement a new training algorithm whose vital
> > component is a clustering routine, and we are wondering how this
> > could be materialized as an OpenTox web service. It needs to be
> > mentioned that a cluster is not just a dataset: a client should be
> > able to tell (using some web service) whether a compound belongs to
> > a given cluster, which is something different from belonging to a
> > dataset. There are lots of algorithms that could be introduced in
> > OT, and these could also be part of our discussion in Rhodes about
> > new services.
> >
> >
> We have had a clustering algorithm running since January; a cluster is
> just a feature, linked to the algorithm.
> 
> http://apps.ideaconsult.net:8080/ambit2/algorithm/SimpleKMeans
> 


Great! That will be very useful...
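If I understand the cluster-as-feature idea correctly, something like the
following sketch (plain Python, hypothetical centroids and descriptor values,
not the SimpleKMeans service itself) is what happens when the clustering model
is applied to a compound: the compound is assigned to the nearest centroid,
and that cluster index becomes the value of the cluster feature, which lets a
client check membership in a given cluster.

```python
# Hypothetical sketch: cluster membership exposed as a feature value.
# The centroids would come from a trained clustering model; here they
# are made up for illustration.
import math

def nearest_cluster(descriptors, centroids):
    """Return the index of the centroid closest to the compound."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)), key=lambda i: dist(descriptors, centroids[i]))

centroids = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical 2-cluster model
compound = (9.0, 11.0)                  # descriptor vector of one compound

cluster_feature = nearest_cluster(compound, centroids)

# A client asking "does this compound belong to cluster 1?" simply
# compares the predicted feature value against the cluster's index.
belongs_to_cluster_1 = (cluster_feature == 1)
```

That would answer the membership question without needing a separate cluster
resource: membership is just a feature value produced by the model.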



> Nina
> 
> 
> 
> > Best Regards,
> > NTUA development team :-)
> >
> >
> >
> > _______________________________________________
> > Development mailing list
> > Development at opentox.org
> > http://www.opentox.org/mailman/listinfo/development
> >
> >
> 
> 
