[OTDev] On the ARFF mime type

chung chvng at mail.ntua.gr
Thu Oct 1 18:53:55 CEST 2009


Dear Nina,

On Thu, 2009-10-01 at 19:03 +0300, Nina Jeliazkova wrote:
> Dear Pantelis,
> 
> What are the advantages of having separate service for data conversion,
> rather than being able to request ARFF mime type from Dataset resource
> (as Tobias suggested initially)?
> IMHO the latter sounds more RESTful.
> 

You mean that once we ask for the text/arff representation of a dataset,
it will be generated and returned to us without being saved on the
server? That's fine with me! Otherwise we would have to build it
internally...
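
Something like this, then (a sketch, assuming the text/arff MIME type
proposed below is agreed on):

curl -H 'Accept:text/arff' http://www.myserver.com/dataset/1230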

> chung wrote:
> > Dear All,
> >    People who develop data mining and machine learning algorithms are
> > well aware of ARFF files, as these are widely adopted by many
> > software tools including - but not limited to - Weka. So we (NTUA) and
> > partners from TUM would be happy if we were handed a ready-to-use ARFF
> > file, so that we could exploit the data therein to build models. The
> > same holds for any other developer who uses Weka or other computational
> > tools that support ARFF files. As things stand, there is no other way
> > for us to build a model than to have the data in an ARFF file.
> >   
> Just my two cents, based on my experience developing Toxmatch
> (http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?c=TOXMATCH).
> Toxmatch doesn't work with ARFF files (it neither reads nor writes them),
> but it uses clustering algorithms from Weka. Weka Instances are generated
> on the fly when reading SDF. The same could be achieved by reading XML.

That sounds interesting... I'll look into that option of having no ARFF
files at all... So you are saying we can develop a method that, given a
dataset URI, generates an Instances object that can be used as input to
the Weka algorithms? However, we also work with libraries other than
Weka, such as libSVM, which reads data only from a file that would, I
think, have to be generated on the fly.
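
A minimal sketch of what I have in mind, assuming a recent Weka Java API
(3.7+); the attribute names, URIs and values are illustrative, and a real
implementation would fill the rows from the parsed dataset representation:

import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class DatasetToInstances {

    public static Instances build() {
        // Declare the attributes: the compound URI as a string attribute,
        // followed by a numeric descriptor and the endpoint (illustrative).
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        attrs.add(new Attribute("compound_uri", (ArrayList<String>) null)); // string attribute
        attrs.add(new Attribute("descriptor_1"));
        attrs.add(new Attribute("endpoint"));

        // The relation name can carry the dataset URI (see the ARFF
        // sketch further down in this mail).
        Instances data = new Instances("http://someserver.com/dataset/101", attrs, 0);

        // One row per compound; the values would come from the dataset service.
        double[] row = new double[data.numAttributes()];
        row[0] = data.attribute(0).addStringValue("http://someserver.com/compound/1");
        row[1] = 5.2;
        row[2] = 1.1;
        data.add(new DenseInstance(1.0, row));

        // Mark the endpoint as the class attribute for the learners.
        data.setClassIndex(data.numAttributes() - 1);
        return data;
    }
}

For libSVM, such an Instances object could then be flushed to a temporary
file on the fly, e.g. with weka.core.converters.LibSVMSaver (after
removing the string attribute, which the libSVM format cannot carry), so
no ARFF file would be needed there either.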

> >    However, in OpenTox we have defined the notion of a dataset, and
> > (almost) all the information that we need can be found there [a list of
> > compound URIs and their corresponding features]. So one solution is that
> > every service internally generates this valuable ARFF file :-) and that
> > will help all the developers meet halfway. That creates the need for
> > a method that, given a dataset URI, generates an ARFF file. Now, if we
> > agree on this internal generation of ARFF files, then every service that
> > requires an ARFF file as input has to build it from the dataset URI,
> > which is a non-trivial task.
> >    So, what I propose is to build a utility algorithm
> > (/algorithm/util/dataset2arff) and release this procedure as a web
> > service. The service specification could look like the following:
> >
> > **** 1 ****
> > GET /algorithm/util/dataset2arff
> > Description: Get the XML representation of the ARFF generation algorithm.
> > MIME types: text/xml
> > curl command example: curl -H 'Accept:text/xml'
> > http://www.myserver.com/algorithm/util/dataset2arff 
> >
> >
> > **** 2 ****
> > POST /algorithm/util/dataset2arff
> > Posted parameters: dataset_uri
> > Description: Generate an ARFF file from a given dataset URI. Returns the
> > URI of the generated ARFF file.
> > MIME types: text/plain
> > curl command example: curl -X POST -d
> > "dataset_uri=http://www.myserver.com/dataset/1230"
> > http://www.myserver.com/algorithm/util/dataset2arff 
> >
> > This service returns a URI for the ARFF file:
> > http://www.myserver.com/dataset/1230/arff
> >
> >
> > **** 3 ****
> > Get the ARFF representation of the dataset:
> > curl -H 'Accept:text/arff' http://www.myserver.com/dataset/1230/arff
> >
> > This makes the ARFF file available to other services as well, so there
> > is no need to regenerate it. This ARFF file will also be useful to data
> > miners who want to perform some other analysis using Weka.
> >
> > Consider the alternative scenario where there is no service for the
> > generation of an ARFF file and a client needs to build multiple models.
> > Then *every* model training service would internally generate an ARFF
> > file, which is (needless to say) inefficient! Of course, we have to agree
> > on the MIME type for the ARFF representation (e.g. text/arff).
> >
> > I'm waiting for your opinions on that...
> > Best Regards
> > Pantelis
> >   
> There are several questions that I think are important in this context:
> 
>     * How is the correspondence between the ARFF file and the initial
>       chemical compounds established and later used, e.g. in reporting?

There are many options. The compound URIs can be included as comments, or
as the first string attribute of the ARFF file, e.g. (with illustrative
values):

@relation http://someserver.com/dataset/101
@attribute id string
@attribute descriptor_1 numeric
@attribute descriptor_2 numeric
@data
'http://someserver.com/compound/1',5.2,1.1
'http://someserver.com/compound/2',3.7,0.8

However, I'll also consider the no-ARFF-files scenario.

>     * How are the model results assigned to the chemical compounds?
>     * What about models that require other "precious" file formats?
>       Shall we introduce services like /dataset2formatXXX? How RESTful
>       would that be?

That's a good point, but I was thinking that if a developer needed a
specific file format, he/she could provide a converter dataset2formatX.
Of course this demands some effort, but I'm thinking of the efficiency
of the service as a whole.
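
Such a converter could simply mirror the dataset2arff specification
above; a sketch, where formatX stands for the hypothetical format:

POST /algorithm/util/dataset2formatX
Posted parameters: dataset_uri
Description: Generate a formatX file from a given dataset URI. Returns
the URI of the generated file, e.g.
http://www.myserver.com/dataset/1230/formatX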

>     * If one set of Model resources accepts only the ARFF format and
>       others accept only other formats, how will it be possible to
>       generate a generic application client, for example like Fastox,
>       specified as one of the first OpenTox use cases to be developed?
> 
>     * Use case (less important): I would like to create a Model
>       implementing the algorithm I've used for the ToxCast data analysis
>       (http://www.epa.gov/comptox/toxcast/files/summit/15%20Jeliazkova%20ToxCast%20TDAS.pdf).
>       It uses a customized ARFF file + a custom configuration file. Will
>       the dataset2arff service be of help?
> 
> 
>     * Advanced topic. Suppose we succeed in introducing an OpenTox
>       authorization and authentication scheme, allowing us to specify
>       complex user- and role-based read/write access for datasets, as
>       well as for other objects. Will the access rules for the original
>       dataset transfer to the generated ARFF file? Will that be less
>       complex than generating the ARFF file on the fly?

I don't have any advanced experience with A/A, so I can't answer that.

> 
> IMHO, anything that needs to be placed under a /util resource should be
> avoided. /util means a procedure call, not a resource, and should be
> refactored as a resource if we care about staying RESTful. If we decide
> not to be constrained by REST, that's another story.

Could you give an example of a service that you think should be placed
under /util? After all, I don't quite understand its utility.
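
If I understand the distinction correctly, it is between a
procedure-style call and a plain resource with content negotiation
(URIs as in the examples above):

Procedure style:
curl -X POST -d "dataset_uri=http://www.myserver.com/dataset/1230"
http://www.myserver.com/algorithm/util/dataset2arff

Resource style:
curl -H 'Accept:text/arff' http://www.myserver.com/dataset/1230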

> 
> Best regards,
> Nina

Best Regards,
Pantelis
