[OTDev] API 1.1. extensions
Nina Jeliazkova nina at acad.bg
Tue Jan 5 10:57:07 CET 2010
- Previous message: [OTDev] Usecase development roadmaps
- Next message: [OTDev] API 1.1. extensions - Numeric and Nominal data type implemented
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello All, and Happy New Year 2010!

Please find below proposals and discussion points for the API 1.1 extension:

> ToDo:
> Nina (and other contributors) - Any required changes to the API as soon as
> possible and by Jan 5 latest

1) Feature data types

Proposal (based on Pantelis' suggestions and the Protege guide) at
http://opentox.org/data/documents/development/RDF%20files/Datatypes
Updated opentox.owl at
http://opentox.org/data/documents/development/RDF%20files/OpenToxOntology/view

2) Ontology service

Proposal at http://opentox.org/dev/apis/api-1.1/Ontology%20service .
Example queries for retrieving models, given an endpoint.

In addition, for quick searching when an Ontology service is not available,
introduce a ?query=URI-of-the-owl:sameAs-entry parameter for the algorithm,
model and feature services, which will return all algorithms (models,
features) for which owl:sameAs matches the query. Most services already have
similar functionality, but the parameter value is an arbitrary string and
therefore unknown to generic clients.

3) How to identify features generated by an algorithm and a specific set of
parameters

According to the current opentox.owl, a Feature can be assigned an Algorithm,
Model or Dataset as its origin (via the property ot:hasSource). There is no
support for Algorithm + Parameters, except that the specific case of a Model
can be regarded as an Algorithm + Parameters instance. One possible solution
could be:

- define a superclass A, which is determined by Algorithm + Parameters
- make Model a subclass of A
- define the domain of ot:hasSource as classes A and Dataset
- find a nice name for the superclass A :)

This will be searchable via the ontology service.

4) Compatibility between algorithms, datasets and models

There are different kinds of requirements on algorithms and models.

Input:
- All Algorithms accept a Dataset on input (which might be an empty one).
- All Models accept a Dataset as input.
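Since every Algorithm and Model accepts a Dataset on input, a generic client
could treat invocation uniformly. A minimal sketch in Python (the dataset_uri
parameter name and the example URIs are illustrative assumptions; the actual
HTTP POST is omitted):

```python
from urllib.parse import urlencode

def invoke(algorithm_uri, dataset_uri, **params):
    """Build the target URI and form-encoded body for a generic
    'run this algorithm/model on this dataset' POST.
    NOTE: parameter names here are illustrative, not fixed by the API."""
    body = {"dataset_uri": dataset_uri}
    body.update(params)          # extra algorithm-specific parameters
    return algorithm_uri, urlencode(body)

# Hypothetical example URIs:
uri, body = invoke("http://example.org/algorithm/logP",
                   "http://example.org/dataset/1")
print(body)  # dataset_uri=http%3A%2F%2Fexample.org%2Fdataset%2F1
```

The same helper would serve Models, since they share the same input
convention.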
Output:
- There are (at least) two kinds of Algorithms, according to the output
  generated: a Dataset for data processing algorithms (e.g. descriptor
  calculation), or a Model for model building algorithms.
- The outcome of all Models is a Dataset.

Input and output requirements are not yet reflected in the AlgorithmTypes
ontology. There is an attempt to introduce ot:hasInput and ot:hasOutput in
opentox.owl, but it needs to be refined.

Independent variables:
- Algorithms requiring only the chemical structure on input (e.g. descriptor
  calculation, rule-based predictions like Toxtree)
- Algorithms requiring input variables obtained elsewhere (e.g. from data
  services or descriptor calculation)
- Algorithms like MLR and SVM need a dataset containing only numeric values,
  declared to be numeric and with numeric entries exclusively. This could now
  be handled by requiring ot:NumericDataset for this kind of algorithm.

Dependent variables:
- Not required for some algorithms (e.g. descriptor calculation, rule-based
  predictions like Toxtree)
- Not required for unsupervised algorithms (clustering)
- Classification algorithms require a nominal target variable.
- Algorithms like MLR and SVM require a numeric target variable.

Prediction results have to be stored under features separate from the
dependent variables, in order not to overwrite/mix observed and predicted
values. Such features might be set via parameters, or created automatically
by the algorithm/model services.

5) Cleaning services (handling of missing values, removal of string features,
consistency checking and so on)

- Nominal variables. Given a dataset with string variables, there is no
  automatic way to recognise whether these should be nominal or plain
  strings. One rather crude heuristic is to consider a variable nominal if
  the number of its distinct values in a dataset is less than a threshold.
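The crude heuristic could be sketched as follows (Python; the default
threshold is an arbitrary choice, not something fixed by the proposal):

```python
def is_probably_nominal(values, threshold=10):
    """Crude heuristic: treat a string variable as nominal if it takes
    fewer distinct values than the threshold. The threshold is arbitrary."""
    return len(set(values)) < threshold

# A column repeating a few category labels looks nominal...
print(is_probably_nominal(["active", "inactive", "active"] * 50))    # True
# ...while a column of mostly unique strings (e.g. names) does not.
print(is_probably_nominal(["compound-%d" % i for i in range(200)]))  # False
```

As noted below, this is only a quick workaround; annotating features against
an ontology is the cleaner solution.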
A better approach is to annotate features and link them to ontologies, where
one can use enumerated classes for nominal variables. The first approach can
be adopted as a quick workaround till the end of February.

- Consistency checking. For this purpose one needs to be able to define the
  data format required by an algorithm or model service and compare it to a
  dataset. This could be done on the level of opentox.owl /
  algorithmtypes.owl, or on the API level. If on the API level, we might
  introduce

  GET /algorithm/compatibility?dataset_uri=...

  It would be better to be able to use not the entire dataset for the
  compatibility check, but only the RDF of the features and some metadata. A
  similar approach can be adopted for a model/dataset compatibility check.

- Handling of missing values and other transformation services. I could
  think of two approaches:
  a) a separate service, accepting a raw dataset on input and generating a
     transformed dataset according to some rules (better for flexibility)
  b) embedded in the model generation and configurable via some options
     (better for performance, e.g. available Weka filters can be used)

We might borrow some ideas from, or completely reuse, existing ontologies
for data mining. Some of the partners might be especially interested in (or
already aware of):

- KDDOnto: http://boole.diiga.univpm.it/paper/ida09.pdf
- Data Mining Ontology (DMO): http://www.e-lico.eu/?q=dmo
- OntoDM: http://kt.ijs.si/panovp/OntoDM
- KDDONTO: An Ontology for Discovery and Composition of KDD Algorithms
  <http://boole.diiga.univpm.it/paper/sokd09.pdf>
- Ontology-driven KDD Process Composition
  <http://boole.diiga.univpm.it/paper/ida09.pdf>

Best regards,
Nina