[OTDev] On the ARFF mime type

Thu Oct 1 17:20:22 CEST 2009

Dear All,
   People who develop data mining and machine learning algorithms are
well aware of ARFF files, as these are widely adopted by many
software tools including, but not limited to, weka. So we (ntua) and
partners from tum would be happy if we were handed a ready-to-use arff
file so that we could exploit the data therein to build models. The same
holds for any other developer who uses weka or other computational
tools that support arff files. In fact, for these tools there is
practically no other way of building a model than having the data in an
arff file.
   However, in opentox we have defined the notion of a dataset, and
(almost) all the information that we need can be found there [list of
compound URIs and their corresponding features]. So one solution is that
every service internally generates this valuable arff file :-) and that
will help all the developers meet halfway. This creates the need for
a method that, given a dataset uri, generates an arff file. Now, if we
agree on this internal generation of arff files, then every service that
requires an arff file as input has to build it from the dataset uri,
which is a non-trivial task.
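
To make the conversion step concrete, here is a minimal sketch of what such a method might do once the dataset has been fetched and parsed. The function name dataset_to_arff, the in-memory layout (a list of feature names with arff types, plus one dict per compound) and the compound uris are assumptions made for illustration only; they are not part of any agreed opentox API.

```python
def dataset_to_arff(relation, features, rows):
    """Render an already-parsed dataset as ARFF text.

    Assumed (hypothetical) inputs:
      relation: name used on the @relation line
      features: list of (name, arff_type) pairs, e.g. ("logP", "numeric")
      rows:     one dict per compound, mapping feature name -> value
                (a missing key or None is written as '?')
    """
    def quote(value):
        text = str(value)
        # Quote anything that holds whitespace or characters special to ARFF.
        if text == "" or any(ch in text for ch in " ,'\"%{}\t"):
            return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"
        return text

    lines = ["@relation " + quote(relation), ""]
    for name, arff_type in features:
        # arff_type is written as given, e.g. numeric, string or {a,b}
        lines.append("@attribute " + quote(name) + " " + arff_type)
    lines += ["", "@data"]
    for row in rows:
        cells = []
        for name, _ in features:
            value = row.get(name)
            cells.append("?" if value is None else quote(value))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical dataset with two compounds and two features.
    features = [("compound", "string"),
                ("logP", "numeric"),
                ("activity", "{active,inactive}")]
    rows = [
        {"compound": "http://www.myserver.com/compound/1",
         "logP": 2.5, "activity": "active"},
        {"compound": "http://www.myserver.com/compound/2",
         "logP": None, "activity": "inactive"},
    ]
    print(dataset_to_arff("dataset_1230", features, rows))
```

The hard part in practice is not this rendering step but retrieving every compound and feature value behind the dataset uri, which is why doing it once in a shared place looks attractive.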
   So, what I propose is to build a utility algorithm
(/algorithm/util/dataset2arff) and release this procedure as a web
service. The service specifications can be close to:

**** 1 ****
GET /algorithm/util/dataset2arff
Description: Get the xml representation of the arff generation algorithm. 
Mime types: text/xml
curl command example: curl -H 'Accept:text/xml'
http://www.myserver.com/algorithm/util/dataset2arff 

**** 2 ****
POST /algorithm/util/dataset2arff
Posted Parameters: dataset_uri
Description: Generate an arff file from a given dataset uri. Returns the
uri for the generated arff file.
Mime types: text/plain
curl command example: curl -X POST -d
"dataset_uri=http://www.myserver.com/dataset/1230"
http://www.myserver.com/algorithm/util/dataset2arff 

This service returns a uri for the arff file:
http://www.myserver.com/dataset/1230/arff

**** 3 ****
Get the arff representation of the dataset:
curl -H 'Accept:text/arff' http://www.myserver.com/dataset/1230/arff
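
For clients that are not shell scripts, steps 2 and 3 could be sketched in Python roughly as follows. This uses only the standard urllib modules; the helper names are hypothetical, the urls are the example ones from above, and it assumes the POST answers with the arff uri as plain text, as proposed in step 2.

```python
import urllib.parse
import urllib.request

# Example service url from the proposal above (not a live server).
SERVICE = "http://www.myserver.com/algorithm/util/dataset2arff"


def build_generate_request(dataset_uri, service=SERVICE):
    """Step 2: POST the dataset uri as a form parameter."""
    body = urllib.parse.urlencode({"dataset_uri": dataset_uri}).encode("ascii")
    return urllib.request.Request(service, data=body, method="POST")


def build_fetch_request(arff_uri):
    """Step 3: GET the arff representation, asking for text/arff."""
    return urllib.request.Request(arff_uri, headers={"Accept": "text/arff"})


def fetch_arff(dataset_uri, opener=urllib.request.urlopen):
    """Run both steps; assumes the POST body is the arff uri as plain text."""
    with opener(build_generate_request(dataset_uri)) as response:
        arff_uri = response.read().decode("utf-8").strip()
    with opener(build_fetch_request(arff_uri)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_arff("http://www.myserver.com/dataset/1230") would only work against a server that actually implements the proposed service; the opener parameter is there so the flow can be exercised without one.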

Exposing the arff file under its own uri makes it available to other
services as well, so there is no need to regenerate it. This arff file
will also be useful to data miners who want to perform some other
analysis using weka. 

Consider the alternative scenario where there is no service for the
generation of an arff file and a client needs to build multiple models.
Then *every* model training service would internally generate an arff
file, which is (needless to say) inefficient! Of course, we have to agree
on the mime type for the arff representation (e.g. text/arff).
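
The efficiency argument can be illustrated with a small sketch: if the arff text is produced once per dataset uri and then looked up, training several models triggers a single conversion. The class ArffStore and its generate callback are hypothetical names used only for this illustration, and a real service would persist the file rather than keep it in memory.

```python
class ArffStore:
    """Generate the arff text for a dataset uri once, then reuse it."""

    def __init__(self, generate):
        # generate: hypothetical callable, dataset uri -> arff text
        self._generate = generate
        self._by_dataset = {}
        self.conversions = 0  # how many times the costly step actually ran

    def arff_uri(self, dataset_uri):
        # Follows the uri pattern from the proposal: <dataset uri>/arff
        return dataset_uri.rstrip("/") + "/arff"

    def get(self, dataset_uri):
        if dataset_uri not in self._by_dataset:
            self._by_dataset[dataset_uri] = self._generate(dataset_uri)
            self.conversions += 1
        return self._by_dataset[dataset_uri]


if __name__ == "__main__":
    store = ArffStore(lambda uri: "@relation demo\n")  # stand-in converter
    for _model in ("model-a", "model-b", "model-c"):
        store.get("http://www.myserver.com/dataset/1230")
    print(store.conversions)
```

Without the shared store, each of the three training runs would have paid for its own conversion.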

I'm waiting for your opinions on that...
Best Regards
Pantelis