[OTDev] Clustering and Scaling Algorithms

Fri Sep 3 19:22:24 CEST 2010

On Fri, Sep 3, 2010 at 8:14 PM, chung <chvng at mail.ntua.gr> wrote:

> Dear All,
>  We had a discussion here about new kinds of algorithms, such as clustering
> and scaling ones, and I think we need to clarify some details to proceed
> with the implementation of such functionalities. First of all, some
> algorithms in order to produce reliable results (e.g. SVM) have to be
> fed with scaled data (whose values vary between -1 and 1 or 0 and 1 in
> some cases) as training sets. This requires not just a "scaling service"
> that accepts a dataset as input and creates a new dataset with the
> scaled data but the minimum and maximum values per feature of the
> dataset have to be stored also somewhere. These values could be either
> saved in the model, stored in some new kind of resource (e.g.
> under /scaling_parameters/123) or be retrieved (dynamically) from an
> existing dataset (e.g. from /dataset/{id}/minmax or something
> equivalent). So this is something to be discussed. Note that the SVM
> training algorithm produces high quality results but only if it uses
> scaled data as input and note also that a test dataset applied to a
> model for prediction needs to be scaled with respect to the min and max
> values of the training dataset. For synchronization and data consistency
> reasons I would suggest that getting min/max from /dataset/{id}/minmax
> is the best way to go.
>

I would prefer minmax to be per feature, not per dataset, with URIs like
/feature/{id}/minmax returning the min/max values (or other relevant
statistics). There might be a "statistics" algorithm that returns min,
max, average, standard deviation, etc. Of course, it would be best if it
ran close to the dataset service, for performance reasons.
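To make the idea concrete, here is a minimal sketch of what such a "statistics" algorithm might compute for a single feature. The function name and the returned keys are hypothetical and not part of the current API; missing values are assumed to arrive as None.

```python
import math


def feature_statistics(values):
    """Return min, max, mean and sample standard deviation for one feature.

    Hypothetical helper for illustration only; not an existing OpenTox call.
    """
    # Skip missing values (assumed to be represented as None).
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("feature has no values")
    n = len(present)
    mean = sum(present) / n
    # Sample variance (n - 1 in the denominator); 0.0 when only one value exists.
    variance = sum((v - mean) ** 2 for v in present) / (n - 1) if n > 1 else 0.0
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "std": math.sqrt(variance),
    }


stats = feature_statistics([1.0, 2.0, None, 3.0])
```

A resource such as /feature/{id}/minmax could then expose the "min" and "max" entries of a result like this.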

For scaling, one could have a "scaling" algorithm that produces a "scaled"
dataset with the current API.

e.g.
curl -X POST -d "dataset_uri=" /algorithm/scaling

It returns the URI of the scaled dataset.
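A rough sketch of the scaling step itself, assuming plain min-max scaling to [-1, 1] and a dataset held as a list of dicts keyed by feature. The function names and the feature names "logP" and "MW" are made up for illustration. The point is that the min/max come from the training dataset and are reused unchanged for the test dataset.

```python
def fit_minmax(rows):
    """Collect per-feature (min, max) pairs from the training rows."""
    params = {}
    for row in rows:
        for feature, value in row.items():
            lo, hi = params.get(feature, (value, value))
            params[feature] = (min(lo, value), max(hi, value))
    return params


def scale(rows, params, lower=-1.0, upper=1.0):
    """Map values into [lower, upper] using previously stored min/max pairs."""
    scaled = []
    for row in rows:
        out = {}
        for feature, value in row.items():
            lo, hi = params[feature]
            if hi == lo:
                # Constant feature: no range to scale over, so use the midpoint.
                out[feature] = (lower + upper) / 2.0
            else:
                out[feature] = lower + (upper - lower) * (value - lo) / (hi - lo)
        scaled.append(out)
    return scaled


# Hypothetical feature names and values, for illustration only.
training = [{"logP": 1.0, "MW": 100.0}, {"logP": 3.0, "MW": 300.0}]
test = [{"logP": 2.0, "MW": 400.0}]

params = fit_minmax(training)          # stored once, per feature
scaled_training = scale(training, params)
scaled_test = scale(test, params)      # same parameters, not refitted
```

Note that a test value outside the training range (MW = 400.0 here) ends up outside [-1, 1]; whether to clip it is a separate decision.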

>   Second, clustering algorithms are of high importance in predictive
> toxicology but it is unclear how a cluster can be represented in
> OpenTox. We plan to implement a new training algorithm whose vital
> component is a clustering routine and we are wondering how this could be
> materialized as an OpenTox web service. It needs to be mentioned that a
> cluster is not just a dataset and a client should be able to tell (using
> some web service) whether a compound belongs to a given cluster (which
> is something different compared to its belonging to a dataset). There
> are lots of algorithms that could be introduced in OT and these could be
> also part of our discussion in Rhodes about new services.
>
>
We have been running a clustering algorithm since January; a cluster is just
a feature, linked to the algorithm.

http://apps.ideaconsult.net:8080/ambit2/algorithm/SimpleKMeans
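As an illustration of the "cluster is just a feature" idea, here is a small sketch that is not the code behind the service above. It assumes the centroids are already known, assigns each compound to the nearest one, and stores the cluster index as an ordinary feature value. All names and data are hypothetical.

```python
def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def annotate(compounds, centroids, cluster_feature="cluster"):
    """Attach the cluster index to every compound as a feature value."""
    return {
        compound_id: {cluster_feature: nearest_cluster(descriptors, centroids)}
        for compound_id, descriptors in compounds.items()
    }


def in_cluster(annotations, compound_id, cluster_id, cluster_feature="cluster"):
    """Membership test: compare the stored feature value with the cluster index."""
    return annotations[compound_id][cluster_feature] == cluster_id


# Hypothetical centroids and compound descriptors, for illustration only.
centroids = [[0.0, 0.0], [10.0, 10.0]]
compounds = {"compound/1": [1.0, 1.0], "compound/2": [9.0, 8.0]}
annotations = annotate(compounds, centroids)
```

With this representation, asking whether a compound belongs to a given cluster is a lookup of one feature value for that compound.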

Nina

> Best Regards,
> NTUA development team :-)

-- 

Dr. Nina Jeliazkova
Technical Manager
4 A.Kanchev str.
IdeaConsult Ltd.
1000 Sofia, Bulgaria
Phone: +359 886 802011