[OTDev] On the ARFF mime type

Nina Jeliazkova nina at acad.bg
Thu Oct 1 18:03:23 CEST 2009


Dear Pantelis,

What are the advantages of having a separate service for data conversion,
rather than being able to request the ARFF mime type from the Dataset
resource (as Tobias suggested initially)?
IMHO the latter sounds more RESTful.

chung wrote:
> Dear All,
>    People that develop data mining and machine learning algorithms are
> well aware of ARFF files, as these are widely adopted by many
> software tools including - but not limited to - weka. So we (ntua) and
> partners from tum would be happy if we were handed a ready-to-use ARFF
> file so that we could exploit the data therein to build models. The same
> holds for any other developer that uses weka or other computational
> tools that support ARFF files. In fact there is no other way of building
> a model than having the data in an ARFF file. 
>   
Just my two cents, based on experience when developing Toxmatch (
http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?c=TOXMATCH ).
Toxmatch doesn't work with ARFF files (it neither reads nor writes them)
but uses clustering algorithms from weka. Weka instances are generated on
the fly when reading SDF. The same could be achieved by reading XML.
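For what it's worth, the on-the-fly idea is easy to sketch. Here is a minimal Python sketch (all names, URIs and values below are hypothetical, not part of any agreed API) that builds the ARFF text directly in memory from a list of compound URIs and feature values, instead of requiring a stored ARFF file:

```python
# Hypothetical sketch: build an ARFF representation "on the fly" from a
# dataset's compound URIs and numeric feature values. Names are illustrative.

def dataset_to_arff(relation, feature_names, rows):
    """rows: list of (compound_uri, [value, ...]) pairs."""
    lines = ["@relation %s" % relation, ""]
    # Keep the compound URI as a string attribute, so the correspondence
    # between ARFF instances and chemical compounds is not lost.
    lines.append("@attribute compound_uri string")
    for name in feature_names:
        lines.append("@attribute %s numeric" % name)
    lines += ["", "@data"]
    for uri, values in rows:
        lines.append(",".join(["'%s'" % uri] + [str(v) for v in values]))
    return "\n".join(lines)

arff = dataset_to_arff(
    "dataset_1230",
    ["XLogP", "MW"],
    [("http://www.myserver.com/compound/1", [2.1, 180.2]),
     ("http://www.myserver.com/compound/2", [0.4, 94.1])],
)
print(arff)
```

The point being: the conversion is cheap enough to do per request, so a persistent, separately generated ARFF resource is not obviously necessary.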
>    However in opentox we have defined the notion of a dataset and
> (almost) all the information that we need can be found there [list of
> compound URIs and their corresponding features]. So one solution is that
> every service internally generates this valuable ARFF file :-) and that
> will help all the developers meet halfway. That creates the need for
> a method that, given a dataset URI, generates an ARFF file. Now, if we
> agree on this internal generation of ARFF files, then every service that
> requires an ARFF file as input has to build it from the dataset URI,
> which is a non-trivial task.
>    So, what I propose is to build a utility algorithm
> (/algorithm/util/dataset2arff) and release this procedure as a web
> service. The service specifications can be close to:
>
> **** 1 ****
> GET /algorithm/util/dataset2arff
> Description: Get the XML representation of the ARFF generation algorithm. 
> Mime types: text/xml
> curl command example: curl -H 'Accept:text/xml'
> http://www.myserver.com/algorithm/util/dataset2arff 
>
>
> **** 2 ****
> POST /algorithm/util/dataset2arff
> Posted Parameters: dataset_uri
> Description: Generate an ARFF file from a given dataset URI. Returns the
> URI of the generated ARFF file.
> Mime types: text/plain
> curl command example: curl -X POST -d
> "dataset_uri=http://www.myserver.com/dataset/1230"
> http://www.myserver.com/algorithm/util/dataset2arff 
>
> This service returns a uri for the arff file:
> http://www.myserver.com/dataset/1230/arff
>
>
> **** 3 ****
> Get the arff representation of the dataset:
> curl -H 'Accept:text/arff' http://www.myserver.com/dataset/1230/arff
>
> This provides the ARFF file to other services as well, so there is no
> need to regenerate it. This ARFF file will also be
> useful to data miners who want to perform some other analysis using
> weka. 
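For concreteness, the text/arff representation returned in step 3 could look roughly like this (the attribute names and values below are purely illustrative, not part of any agreed schema):

```
@relation dataset_1230

@attribute compound_uri string
@attribute XLogP numeric
@attribute activity {active,inactive}

@data
'http://www.myserver.com/compound/1',2.1,active
'http://www.myserver.com/compound/2',0.4,inactive
```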
>
> Consider the alternative scenario where there is no service for the
> generation of an ARFF file and a client needs to build multiple models.
> Then *every* model training service would internally generate an ARFF
> file, which is (needless to say) inefficient! Of course we have to agree
> on the mime type for the ARFF representation (e.g. text/arff).
>
> I'm waiting for your opinions on that...
> Best Regards
> Pantelis
>   
There are several questions that I think are important in this context:

    * How is the correspondence between the ARFF file and the initial
      chemical compounds established and later used, e.g. in reporting?
    * How are the model results assigned to the chemical compounds?
    * What about models that require other "precious" file formats?
      Shall we introduce services /dataset2formatXXX ?  How RESTful
      will that be?
    * If one set of Model resources accepts only the ARFF format and
      others accept only other formats, how will it be possible to
      generate a generic application client, for example like Fastox,
      specified as one of the first OpenTox use cases to be developed?

    * Use case (less important): I would like to create a Model
      implementing the algorithm I've used for ToxCast data analysis
      (http://www.epa.gov/comptox/toxcast/files/summit/15%20Jeliazkova%20ToxCast%20TDAS.pdf).
      It uses a customized ARFF file + a custom configuration file. Will
      the dataset2arff service be of help?


    * Advanced topic. Suppose we succeed in introducing an OpenTox
      authorization and authentication scheme, allowing us to specify
      complex user- and role-based read/write access for datasets, as
      well as other objects.  Will access rules for the original dataset
      transfer to the generated ARFF file?  Will that be less complex
      than generating the ARFF file on the fly?

IMHO anything that needs to be placed under a /util resource should be
avoided. /util implies a procedure call, not a resource, and should be
refactored as a resource if we care about staying RESTful.  If we decide
not to be constrained by REST, that's another story.
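In other words, content negotiation on the Dataset resource itself would suffice. A hypothetical exchange (assuming we agree on text/arff as the mime type, and using the same illustrative server name as above):

```
GET /dataset/1230 HTTP/1.1
Host: www.myserver.com
Accept: text/arff

HTTP/1.1 200 OK
Content-Type: text/arff

@relation dataset_1230
...
```

The same resource could still serve its XML or RDF representation to clients that ask for those instead, with no extra /util endpoint.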

Best regards,
Nina

> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>   



