[OTDev] On the ARFF mime type
chung chvng at mail.ntua.grThu Oct 1 17:20:22 CEST 2009
Dear All,

People who develop data mining and machine learning algorithms are well aware of ARFF files, as these are widely adopted by many software tools including, but not limited to, weka. So we (ntua) and our partners from tum would be happy if we were handed a ready-to-use arff file, so that we could exploit the data therein to build models. The same holds for any other developer who uses weka or other computational tools that support arff files. In fact, for such tools there is no other practical way of building a model than having the data in an arff file.

However, in opentox we have defined the notion of a dataset, and (almost) all the information that we need can be found there [list of compound URIs and their corresponding features]. So one solution is that every service internally generates this valuable arff file :-) and that will help all the developers meet halfway. This creates the need for a method that, given a dataset uri, generates an arff file.

Now, if we agree on this internal generation of arff files, then every service that requires an arff file as input has to build it from the dataset uri, which is a non-trivial task. So what I propose is to build a utility algorithm (/algorithm/util/dataset2arff) and release this procedure as a web service. The service specifications can be close to:

**** 1 ****
GET /algorithm/util/dataset2arff
Description: Get the xml representation of the arff generation algorithm.
Mime types: text/xml
curl command example:
  curl -H 'Accept:text/xml' http://www.myserver.com/algorithm/util/dataset2arff

**** 2 ****
POST /algorithm/util/dataset2arff
Posted parameters: dataset_uri
Description: Generate an arff file from a given dataset uri. Returns the uri for the generated arff file.
Mime types: text/plain
curl command example:
  curl -X POST -d "dataset_uri=http://www.myserver.com/dataset/1230" http://www.myserver.com/algorithm/util/dataset2arff
This service returns a uri for the arff file: http://www.myserver.com/dataset/1230/arff

**** 3 ****
Get the arff representation of the dataset:
  curl -H 'Accept:text/arff' http://www.myserver.com/dataset/1230/arff

This makes the arff file available to other services as well, so there is no need to regenerate it. The arff file will also be useful to data miners who want to perform some other analysis using weka.

Consider the alternative scenario, where there is no service for the generation of an arff file and a client needs to build multiple models. Then *every* model training service would internally generate its own arff file from the same dataset, which is (needless to say) inefficient!

Of course, we have to agree on the mime type for the arff representation (e.g. text/arff). I'm waiting for your opinions on that...

Best Regards,
Pantelis
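
To make the proposed dataset-to-arff step more concrete, here is a minimal Python sketch of what the conversion inside such a service might look like. It is only an illustration under stated assumptions: the in-memory dataset shape (a mapping from compound URI to feature values), the function names, the compound URIs, and the feature names logP and MW are all hypothetical and not part of any opentox specification. It also assumes every feature is numeric and writes missing values as "?", the ARFF convention for missing data.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory dataset: compound URI -> {feature name -> value}.
Dataset = Dict[str, Dict[str, Optional[float]]]


def quote(text: str) -> str:
    """Single-quote a string for ARFF, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def dataset_to_arff(relation: str, features: List[str], dataset: Dataset) -> str:
    """Render the dataset as ARFF text, one data row per compound."""
    lines = ["@relation " + quote(relation), ""]
    # The compound URI is kept as a string attribute so rows stay traceable.
    lines.append("@attribute compound_uri string")
    for name in features:
        # Assumption: all features are numeric.
        lines.append("@attribute " + quote(name) + " numeric")
    lines += ["", "@data"]
    for uri, values in dataset.items():
        cells = [quote(uri)]
        for name in features:
            value = values.get(name)
            # "?" marks a missing value in ARFF.
            cells.append("?" if value is None else repr(float(value)))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical example data; these compound URIs are made up for illustration.
example = {
    "http://www.myserver.com/compound/1": {"logP": 1.5, "MW": 180.2},
    "http://www.myserver.com/compound/2": {"logP": 0.3},
}
arff_text = dataset_to_arff("dataset_1230", ["logP", "MW"], example)
print(arff_text)
```

A service along the lines of /algorithm/util/dataset2arff could run a step like this once per dataset and store the result, so that later GET requests for the arff representation return the stored text instead of rebuilding it.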