[OTDev] On the ARFF mime type

Nina Jeliazkova nina at acad.bg
Thu Oct 1 19:37:50 CEST 2009


Dear Pantelis,

chung wrote:
> Dear Nina,
>
> On Thu, 2009-10-01 at 19:03 +0300, Nina Jeliazkova wrote:
>   
>> Dear Pantelis,
>>
>> What are the advantages of having a separate service for data conversion,
>> rather than being able to request the ARFF mime type from the Dataset
>> resource (as Tobias suggested initially)?
>> IMHO the latter sounds more RESTful.
>>
>>     
>
> You mean that once we ask for the text/arff representation of a dataset,
> it is going to be generated and returned to us without being saved on
> the server? That's OK with me! Otherwise we would have to build it
> internally...
>
>   
Right. I had some code ready just before the Rome meeting
http://ambit.svn.sourceforge.net/viewvc/ambit/trunk/ambit2-all/ambit2-core/src/main/java/ambit2/core/io/ArffWriter.java?view=log
but it is not yet deployed on the server.
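The idea is simply to generate the ARFF text when it is requested, instead
of storing it. Roughly along these lines (a simplified sketch, not the
actual ArffWriter code; the feature names and the per-compound values are
assumed to come from wherever the dataset is stored):

import java.io.PrintWriter;
import java.util.List;
import java.util.Map;

public class ArffSketch {
    // Stream an ARFF representation of a dataset on request,
    // without saving any ARFF file on the server.
    public static void write(PrintWriter out, String datasetUri,
                             List<String> featureNames,
                             Map<String, double[]> valuesByCompound) {
        out.println("@relation " + datasetUri);
        out.println("@attribute id string");
        for (String feature : featureNames)
            out.println("@attribute " + feature + " numeric");
        out.println("@data");
        for (Map.Entry<String, double[]> row : valuesByCompound.entrySet()) {
            StringBuilder line = new StringBuilder("'" + row.getKey() + "'");
            for (double v : row.getValue())
                line.append(",").append(v);
            out.println(line);
        }
        out.flush();
    }
}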
>> chung wrote:
>>     
>>> Dear All,
>>>    People who develop data mining and machine learning algorithms are
>>> well aware of ARFF files, as these are widely adopted by many
>>> software tools, including - but not limited to - Weka. So we (NTUA) and
>>> our partners from TUM would be happy to be handed a ready-to-use ARFF
>>> file, so that we could exploit the data therein to build models. The same
>>> holds for any other developer who uses Weka or other computational
>>> tools that support ARFF files. In fact, there is no other way to build
>>> a model than to have the data in an ARFF file.
>>>   
>>>       
>> Just my two cents, based on experience from developing Toxmatch (
>> http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?c=TOXMATCH ).
>> Toxmatch doesn't work with ARFF files (it neither reads nor writes them),
>> but uses clustering algorithms from Weka. The Weka Instances are generated
>> on the fly when reading SDF. The same could be achieved by reading XML.
>>     
>
> That sounds interesting... I'll check out that option of no ARFF files
> at all... So you are saying we can develop a method that, given a dataset
> URI, generates an Instances object that can be used as input to the Weka
> algorithms? However, we also work with libraries other than Weka, such
> as libSVM, which reads data only from a file that, I think, has to be
> generated on the fly.
>
>   
It seems there is not much choice if only a file input is allowed. 
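For the Weka part, though, no intermediate file is needed at all. Something
along these lines should do (just a sketch against the Weka 3.5/3.6 API;
later versions replace FastVector and Instance with ArrayList<Attribute>
and DenseInstance, and the attribute names and values would of course come
from the parsed dataset, not be hardcoded):

import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

public class InMemoryInstances {
    // Build a Weka dataset in memory; no ARFF file is involved.
    public static Instances build(String datasetUri) {
        FastVector attrs = new FastVector();
        attrs.addElement(new Attribute("descriptor_1"));
        attrs.addElement(new Attribute("descriptor_2"));
        Instances data = new Instances(datasetUri, attrs, 0);
        // one row per compound; the values come from the dataset
        data.add(new Instance(1.0, new double[] { 1.25, 0.7 }));
        return data; // ready for any weka.clusterers / weka.classifiers
    }
}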
>>>    However, in OpenTox we have defined the notion of a dataset, and
>>> (almost) all the information that we need can be found there [a list of
>>> compound URIs and their corresponding features]. So one solution is that
>>> every service internally generates this valuable ARFF file :-) and that
>>> will help all the developers meet halfway. That creates the need for
>>> a method that, given a dataset URI, generates an ARFF file. Now, if we
>>> agree on this internal generation of ARFF files, then every service that
>>> requires an ARFF file as input has to build it from the dataset URI,
>>> which is a non-trivial task.
>>>    So, what I propose is to build a utility algorithm
>>> (/algorithm/util/dataset2arff) and release this procedure as a web
>>> service. The service specification could be close to:
>>>
>>> **** 1 ****
>>> GET /algorithm/util/dataset2arff
>>> Description: Get the XML representation of the ARFF generation algorithm.
>>> Mime types: text/xml
>>> curl command example: curl -H 'Accept:text/xml'
>>> http://www.myserver.com/algorithm/util/dataset2arff 
>>>
>>>
>>> **** 2 ****
>>> POST /algorithm/util/dataset2arff
>>> Posted Parameters: dataset_uri
>>> Description: Generate an ARFF file from a given dataset URI. Returns the
>>> URI of the generated ARFF file.
>>> Mime types: text/plain
>>> curl command example: curl -X POST -d
>>> "dataset_uri=http://www.myserver.com/dataset/1230"
>>> http://www.myserver.com/algorithm/util/dataset2arff 
>>>
>>> This service returns a URI for the ARFF file:
>>> http://www.myserver.com/dataset/1230/arff
>>>
>>>
>>> **** 3 ****
>>> Get the ARFF representation of the dataset:
>>> curl -H 'Accept:text/arff' http://www.myserver.com/dataset/1230/arff
>>>
>>> This also makes the ARFF file available to other services, so there is
>>> no need to regenerate it. The ARFF file will also be useful to data
>>> miners who want to perform some other analysis using Weka.
>>>
>>> Consider the alternative scenario where there is no service for the
>>> generation of an ARFF file and a client needs to build multiple models.
>>> Then *every* model training service would internally generate an ARFF
>>> file, which is (needless to say) inefficient! Of course, we have to agree
>>> on the mime type for the ARFF representation (e.g. text/arff).
>>>
>>> I'm waiting for your opinions on that...
>>> Best Regards
>>> Pantelis
>>>   
>>>       
>> There are several questions that I think are important in this context:
>>
>>     * How is the correspondence between the ARFF file and the initial
>>       chemical compounds established and later used in e.g. reporting?
>>     
>
> There are many options. The compound URIs can be included as comments or
> as the first string attribute of the ARFF file, i.e.
> @relation http://someserver.com/dataset/101
> @attribute id string
> @attribute descriptor_1 numeric
> .
> .
> .
>
> However, I'll consider the no-ARFF-files scenario as well.
>   
Ah, the dataset URI as the relation name looks very nice.  Could we have
URIs for the descriptors as well?  And the content of id could be the
compound URI.
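For example, something like this (the feature and compound URIs below are
purely illustrative):

@relation http://someserver.com/dataset/101
@attribute id string
@attribute http://someserver.com/feature/42 numeric
@attribute http://someserver.com/feature/43 numeric
@data
'http://someserver.com/compound/1',1.25,0.7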
>   
>>     * How are the model results assigned to the chemical compounds?
>>     * How about models that require other "precious" file formats?
>>       Shall we introduce services /dataset2formatXXX?  How RESTful
>>       would that be?
>>     
>
> That's a good point, but I was thinking that if a developer needed a
> specific file format, he/she could provide a converter dataset2formatX.
> Of course it demands some effort, but I'm thinking of the efficiency
> of the service as a whole.
>
>   
What I am concerned about is how these should be used by the client
application.  Let's look at the FastTox case.  The user specifies the
dataset (by drawing compounds, searching, uploading SDF, etc.).
Then the application shows the list of models (I am intentionally
skipping the endpoint selection step).  The user selects a few models to
be applied to his compounds and then presses the "Predict" button.

This should initiate POSTs to the Model resources, with the dataset URI as
a parameter.  Now the Models need to dereference the dataset URI,
transform the content into their internal format, do the calculations
and (according to the current API) return a URI to the newly calculated
features (the prediction results).  Here are the caveats:

If a Model expects a format X that is not supported by the Dataset,
everything will fail, unless:
    1) There is logic in the Model so that, on failure, it submits the
dataset to a transformation service.  The Model should know where such a
transformation service exists and hope it will do the conversion.
    2) Since it is not typical for such logic to be in the Model, the other
place (besides the dataset resource itself) is the client.  That means
the client application should handle the case where a Model fails to
process a dataset because it doesn't understand the format.  The client
app should find the transformation service for each Model (provided there
are several for different formats), get the results of the conversion and
submit them to the Models.

I would prefer the case where the Dataset supports several formats; the
Model can then first ask for its preferred format, provided that is more
efficient for processing, and fall back to a single common format
otherwise.  The client app then becomes quite simple :)
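In code, the Model side of that negotiation could be as simple as the
sketch below (the mime types are the ones discussed in this thread; I
assume the Dataset answers with an error status or a different content
type when it cannot serve the requested format):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class DatasetFetcher {
    // Ask the Dataset for the Model's preferred format first,
    // then fall back to a common format everybody supports.
    public static InputStream fetch(String datasetUri) throws Exception {
        String[] formats = { "text/arff", "text/xml" }; // preferred first
        for (String mime : formats) {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(datasetUri).openConnection();
            conn.setRequestProperty("Accept", mime);
            String served = conn.getContentType();
            if (conn.getResponseCode() == HttpURLConnection.HTTP_OK
                    && served != null && served.startsWith(mime))
                return conn.getInputStream();
            conn.disconnect();
        }
        throw new Exception(datasetUri
                + " supports none of the requested formats");
    }
}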

>>     * If one set of Model resources accepts only the ARFF format and
>>       others accept only other formats, how will it be possible to build
>>       a generic client application, for example like Fastox, specified as
>>       one of the first OpenTox use cases to be developed?
>>
>>     * Use case (less important): I would like to create a Model
>>       implementing the algorithm I used for the ToxCast data analysis
>>       http://www.epa.gov/comptox/toxcast/files/summit/15%20Jeliazkova%20ToxCast%20TDAS.pdf
>>       . It uses a customized ARFF file plus a custom configuration file.
>>       Will the dataset2arff service be of help?
>>
>>
>>     * Advanced topic. Suppose we succeed in introducing an OpenTox
>>       authorization and authentication scheme, allowing us to specify
>>       complex user- and role-based read/write access for datasets, as
>>       well as for other objects.  Will the access rules for the original
>>       dataset transfer to the generated ARFF file?  Will that be less
>>       complex than generating the ARFF file on the fly?
>>     
>
> I don't have any advanced experience with A/A, so I can't answer that.
>   
OK, A/A is a discussion for another thread.
>   
>> IMHO anything that needs to be placed under a /util resource should be
>> avoided. /util means a procedure call, not a resource, and should be
>> refactored into a resource if we care about staying RESTful.  If we
>> decide not to be constrained by REST, that's another story.
>>     
>
> Could you give an example of a service that you think should be placed
> under /util?  After all, I don't understand its utility.
>
>   
IMHO, in a pure RESTful approach nothing should be there.  Of course, I
might be missing some important use case.

Best regards,
Nina
>> Best regards,
>> Nina
>>     
>
> Best Regards,
> Pantelis
>
>   