[OTDev] ARFF mime type

Nina Jeliazkova nina at acad.bg
Fri Sep 25 17:34:22 CEST 2009


Hello Richard, Tobias, All,

richard apodaca wrote:
> Hello Tobias,
>
> Please pardon some naive comments below - I'm new to the discussion...
>   
Good to have you on the OpenTox list; it is important for us to hear
fresh views from outside the project (this is what "Open" is for ;)

OpenTox partners, please excuse me for repeating some thoughts I
already shared during the Rome meeting.
> Is this the format you're interested in:
>
> http://www.cs.waikato.ac.nz/~ml/weka/arff.html
>
> What kind of support exists for it? If support is sparse, who benefits most from exposing resources in that representation?
>   

ARFF files are very popular in machine learning, mostly because Weka is
the de facto standard open source software for machine learning.  ARFF
files would be a perfect choice if the OpenTox objective were a generic
platform for machine learning.

However, since the aim is predictive toxicology, with molecules as the
objects being modeled, the situation is somewhat different.  ARFF files
have no standard support for identifying objects, let alone complex ones
such as molecules. Even if we invent some convention, e.g. having
molecule identifiers of a certain type in the first column, the result
will be an "OpenTox ARFF file" rather than a generic "ARFF file".
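
To make the point concrete, here is a minimal sketch of such a convention
in Python. The attribute name "compound_uri", the example.org URIs and the
function name are all invented for illustration; nothing in ARFF itself
says the first column is an identifier, so a generic ARFF reader would
treat it as just another string attribute.

```python
import csv

# Hypothetical ARFF content. Using the first attribute as a molecule URI
# is an invented convention, not part of the ARFF format.
ARFF_TEXT = """\
@relation opentox_example
@attribute compound_uri string
@attribute logP numeric
@attribute activity {active,inactive}
@data
'http://example.org/compound/1',1.5,active
'http://example.org/compound/2',-0.3,inactive
"""


def read_arff_with_ids(text):
    """Map each molecule URI (first column) to its remaining attribute values.

    Handles only the small, dense subset of ARFF shown above; values are
    kept as strings.
    """
    names = []
    rows = {}
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                names.append(line.split()[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = next(csv.reader([line], quotechar="'"))
        rows[values[0]] = dict(zip(names[1:], values[1:]))
    return rows


table = read_arff_with_ids(ARFF_TEXT)
```

Only code that knows about the convention can recover the identifiers,
which is exactly what makes this an "OpenTox ARFF file".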
> After a quick peek, there doesn't seem to be anything there that can't be done with good ol' XHTML and JSON. Worse, ARFF doesn't look like it supports hypertext, a cornerstone of all RESTful APIs:
>
> http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven
>   
Exactly.  This is one of my main objections to having ARFF as a standard
format in OpenTox: there is no standard way to introduce molecule
identifiers (preferably in the form of URIs) in ARFF, nor URIs for the
descriptors/endpoints being used in the model.

We might have it as an "export" format, in the same way we could have e.g.
Microsoft Excel. For defining a MIME type, it would be good to coordinate
with the WEKA developers, if possible, or at least to send a message
to the WEKA mailing list.

Another point for discussion is whether a format that is linked only to
a specific implementation (WEKA) is appropriate; for example, there could
be services providing the same machine learning methods as WEKA, but
based on R, Matlab, etc.  Currently almost all services providing
machine learning algorithms are based on WEKA, hence the ARFF preference,
but this might change in the future.

Perhaps it would help if Tobias or other partners could explain the use
cases that would benefit from ARFF as a communication format between
services, rather than only as an internal format for services that are
based on WEKA.  For example, how would ARFF fit into the (very) simplified
use case below:

1) The end user specifies molecules in e.g. SDF format. This is uploaded
and becomes available as a dataset URI.
2) The dataset URI is submitted to a service calculating descriptors.
The descriptors have to be assigned to the molecules in the dataset.
3) The dataset URI (molecules and descriptors) is submitted to a
service offering a predictive model.
4) The model generates predictions, which need to be assigned to the
molecules in the dataset from 1).
5) The results (molecules and predictions) are reported in a
user-friendly format.
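
The five steps above can be sketched roughly as follows. This is an
in-memory stand-in written in Python, not the OpenTox API: all function
names and example.org URIs are hypothetical, plain strings stand in for
SDF records, and the "descriptor" and "model" are toys. The only point is
that every step has to keep values attached to molecule identifiers.

```python
import itertools

# In-memory stand-in for the web services; every URI and function name
# here is hypothetical and only illustrates the flow of data.
DATASETS = {}
_ids = itertools.count(1)


def upload_molecules(molecules):
    """Step 1: store the molecules and return a dataset URI."""
    uri = f"http://example.org/dataset/{next(_ids)}"
    DATASETS[uri] = {
        f"http://example.org/compound/{i}": {"structure": mol}
        for i, mol in enumerate(molecules, start=1)
    }
    return uri


def calculate_descriptors(dataset_uri):
    """Step 2: attach a toy descriptor (character count) to each molecule."""
    for entry in DATASETS[dataset_uri].values():
        entry["toy_descriptor"] = len(entry["structure"])
    return dataset_uri


def apply_model(dataset_uri, threshold=5):
    """Steps 3-4: a toy model; each prediction is stored with its molecule."""
    for entry in DATASETS[dataset_uri].values():
        above = entry["toy_descriptor"] > threshold
        entry["prediction"] = "active" if above else "inactive"
    return dataset_uri


def report(dataset_uri):
    """Step 5: one tab-separated line per molecule."""
    return [
        f"{compound_uri}\t{entry['structure']}\t{entry['prediction']}"
        for compound_uri, entry in DATASETS[dataset_uri].items()
    ]


uri = upload_molecules(["CCO", "c1ccccc1"])
lines = report(apply_model(calculate_descriptors(uri)))
```

Each service receives and returns only the dataset URI, and the values it
adds are keyed by compound URI. A format without a standard place for
those identifiers cannot carry this information between services without
an extra convention.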


> It's only been recently that folks have started to pay attention to hypertext when developing RESTful APIs (although browsers have worked this way from the beginning). A lot of the discussion is pretty abstract. For some examples that apply to science, see:
>
> The RESTful Chemical Tracking System Series:
> http://depth-first.com/articles/2009/08/07/the-restful-chemical-tracking-system-part-1-introduction
>
> The Chemcaster API:
> http://chemcaster.com/rest
>
> See also the Sun Cloud API:
> http://kenai.com/projects/suncloudapis/pages/Home
>
> In all three examples, you'll notice a high priority placed on crafting domain-specific media types based on standard data formats like XHTML and JSON.
>
> How does Open Tox approach this issue?
>   
Have a look at the dataset API ( http://opentox.org/dev/apis/dataset ). In
the current version, the dataset XML format is essentially a set of URIs
referring to molecules and features (descriptors, etc.).

There was a proposal to include <link ref="URI"/> in every XML document
representing an object in the OpenTox API. This might not be the case in
the current API, which is to be updated soon, and I do agree that
hyperlinks are essential for a fully RESTful design.
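
As a rough illustration of the idea (not the actual dataset schema), a
dataset document made only of links could be built and read as in the
Python sketch below. The element names "dataset", "compounds" and
"features" and the function names are hypothetical; only the
<link ref="URI"/> element comes from the proposal mentioned above.

```python
import xml.etree.ElementTree as ET


def dataset_to_xml(dataset_uri, compound_uris, feature_uris):
    """Build a dataset document that contains only links to other resources.

    Element names other than <link> are hypothetical placeholders.
    """
    root = ET.Element("dataset")
    ET.SubElement(root, "link", ref=dataset_uri)  # link to the dataset itself
    compounds = ET.SubElement(root, "compounds")
    for uri in compound_uris:
        ET.SubElement(compounds, "link", ref=uri)
    features = ET.SubElement(root, "features")
    for uri in feature_uris:
        ET.SubElement(features, "link", ref=uri)
    return ET.tostring(root, encoding="unicode")


def extract_links(xml_text):
    """Return every URI a client could follow, in document order."""
    return [el.get("ref") for el in ET.fromstring(xml_text).iter("link")]


xml_text = dataset_to_xml(
    "http://example.org/dataset/1",
    ["http://example.org/compound/1", "http://example.org/compound/2"],
    ["http://example.org/feature/logP"],
)
links = extract_links(xml_text)
```

A client that understands nothing but the link element can still discover
the molecules and features from the document, which is the hypertext
property Richard refers to.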

Best regards,
Nina
> Best,
> Rich
>
> ___________________________________
>
> Richard L. Apodaca
>
> http://depth-first.com      Blog
> http://metamolecular.com    Company
>
>
> --- On Fri, 9/25/09, Tobias Girschick <tobias.girschick at in.tum.de> wrote:
>
>   
>> From: Tobias Girschick <tobias.girschick at in.tum.de>
>> Subject: [OTDev] ARFF mime type
>> To: development at opentox.org
>> Date: Friday, September 25, 2009, 3:53 AM
>> Dear all,
>>
>> in Rome we were talking very shortly about a (at the
>> moment
>> non-existing) MIME type for arff files. Would it be
>> possible to agree on
>> something like 
>> text/arff 
>> although this type does officially not exist? Is there an
>> alternative? 
>> Cases where this MIME type will be of use are, for example
>> if I want to
>> retrieve a dataset via GET from /dataset/{id} in ARFF
>> format.
>>
>> Any comments?
>>
>> Regards,
>> Tobias
>> -- 
>> Dipl.-Bioinf. Tobias Girschick
>>
>> Technische Universität München
>> Institut für Informatik
>> Lehrstuhl I12 - Bioinformatik
>> Bolzmannstr. 3
>> 85748 Garching b. München, Germany
>>
>> Room: MI 01.09.042
>> Phone: +49 (89) 289-18002
>> Email: tobias.girschick at in.tum.de
>> Web: http://wwwkramer.in.tum.de/people/girschic
>>
>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development
>>
>>     


-- 
---------------------------------
Dr. Nina Jeliazkova
Technical Manager
IdeaConsult Ltd.
1000 Sofia, Bulgaria
Tel: +359 886 802011
ICQ: 10705013
www: http://ambit.sourceforge.net
---------------------------------                          
PGP Public Key
http://cert.acad.bg/pgp-keys/keys/nina-nikolova-0xEEABA669.asc
	8E99 8BAD D804 1A43 27B7  7F87 CF04 C7D1 EEAB A669
---------------------------------------------------------------



