[OTDev] On the ARFF mime type

Christoph Helma helma at in-silico.de
Thu Oct 1 22:10:43 CEST 2009


Excerpts from Nina Jeliazkova's message of Thu Oct 01 19:37:50 +0200 2009:
> >> What are the advantages of having separate service for data conversion,
> >> rather than being able to request ARFF mime type from Dataset resource
> >> (as Tobias suggested initially)?
> >> IMHO the latter sounds more RESTful.

I would not say that a data conversion service is un-RESTful per se:

POST /data-conversion file=filename, Content-Type: text/arff => returns a dataset_uri (converts the input file to our internal dataset representation and posts it to a dataset service)

GET /data-conversion/{dataset_uri}, Accept: text/arff => returns ARFF
for dataset_uri
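
For illustration, the round trip could look roughly like this in Python (a
sketch only; the base URL, the "file" parameter name and the error handling
are assumptions, not an agreed API):

    import requests

    BASE = "http://example.org/data-conversion"   # hypothetical service URL

    # Convert an uploaded ARFF file into the internal dataset representation;
    # the service answers with the URI of the newly created dataset.
    with open("training.arff", "rb") as f:
        resp = requests.post(BASE, files={"file": ("training.arff", f, "text/arff")})
    resp.raise_for_status()
    dataset_uri = resp.text.strip()

    # Ask the conversion service for the ARFF rendering of that dataset.
    arff = requests.get(
        BASE + "/" + requests.utils.quote(dataset_uri, safe=""),
        headers={"Accept": "text/arff"},
    ).text
    print(arff[:200])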

If the same data conversion routines are required in more than one
place, it would make sense to factor them out into a separate service to
avoid duplication. But if we focus on exchanging dataset_uris and
requesting data in our (not yet decided) canonical data exchange format,
the conversion can just as well be tied to the dataset component.

I would still insist that these conversion features are only for
communication with the outside world. For internal data exchange we
should use a single common format. If many developers need to convert it
into another format, say ARFF, that would be an argument for a separate
conversion service.

> What I am concerned about is how these should be used by the client
> application.  Let's look at the FastTox case.  The user specifies the
> dataset (by drawing compounds, searching, uploading an SDF, etc.).
> Then the application shows the list of models (I am intentionally
> skipping the endpoint selection step).  The user selects a few models to
> be applied to his compounds and then presses the "Predict" button.
> 
> This should initiate POSTs to the Model resources, with the dataset URI as a
> parameter.  Now the Models need to dereference the dataset URI,
> transform the content into their internal format, do the calculations
> and (according to the current API) return the URI of the newly calculated
> features (prediction results).  Here are the caveats:
> 
>  If the Models expect a format X that is not supported by the Dataset,
> everything will fail, unless:
>     1) There is logic in the Model so that on failure it submits the
> dataset to a transformation service.  The Model would have to know where
> such a transformation service exists and hope it can do the conversion.
>     2) Since it is not typical for such logic to be in the Model, the
> other place (besides the dataset resource itself) is the client.  That
> means the client application would have to handle the case where a Model
> fails to apply a dataset because it does not understand the format.  The
> client app would have to find the transformation service for each Model
> (provided there are several for different formats), get the conversion
> results and submit them to the Models.
> 
> I would prefer the case where the Dataset supports several formats; then
> the Model can first ask for its preferred format, provided that is more
> efficient for processing, and fall back to a single common format
> otherwise.  The client app then becomes quite simple :)
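
To make the quoted preference concrete, content negotiation against the
Dataset resource could look roughly like this (again only a sketch; the
URIs and the RDF/XML fallback are assumptions, not part of any agreed API):

    import requests

    dataset_uri = "http://example.org/dataset/42"   # hypothetical

    # The Model asks for its preferred format first and falls back to the
    # (not yet decided) common exchange format if the Dataset cannot serve it.
    resp = requests.get(
        dataset_uri,
        headers={"Accept": "text/arff, application/rdf+xml;q=0.5"},
    )
    resp.raise_for_status()
    content_type = resp.headers.get("Content-Type", "")
    if content_type.startswith("text/arff"):
        print("got the preferred ARFF representation")
    else:
        print("fell back to the common format:", content_type)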

If we define a model API like

POST /model/{id} dataset_id={dataset_id} => prediction_uri

and a dataset API

GET /dataset/{dataset-id} => internal dataset representation

the model should be able to work with the internal representation. How
it achieves this (work with the internal representation directly, convert
it internally to another format, or use a format-conversion service) is up
to the developers of the model webservice. Neither the client nor the
dataset service should have to know (or assume) anything about the
internals of the model webservice (even if that would make their life
easier ;-)).
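
From the client's point of view the whole prediction step then amounts to
something like this (a sketch; the URIs are made up, "dataset_id" is just
the parameter name from the example above, and how the model obtains and
parses the dataset stays hidden from the client):

    import requests

    model_uri = "http://example.org/model/7"        # hypothetical
    dataset_uri = "http://example.org/dataset/42"   # hypothetical

    # Apply the model to the dataset; the service answers with the URI of
    # the newly calculated prediction features.
    resp = requests.post(model_uri, data={"dataset_id": dataset_uri})
    resp.raise_for_status()
    prediction_uri = resp.text.strip()

    # Fetch the prediction results.
    print(requests.get(prediction_uri).text)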

Best regards,
Christoph


