[OTDev] Descriptor Calculation Services
Nina Jeliazkova  nina at acad.bg
Wed Jan 13 09:06:37 CET 2010
Hi All,

chung wrote:
> Hi all,
> There has been a discussion on data preprocessing web services these
> days, so I'd like to summarize a few things here and then decide how
> we should proceed with the implementation of such services. First of
> all, there is no need for any changes in the current version of the API
> to implement API1.1-compliant services. Preprocessing services are
> algorithms; an RDF representation should be provided at:
>
> /algorithm/datacleanup
>
> and the method
>
> POST /algorithm/datacleanup
> Content-type: application/rdf+xml

To have this aligned with the current API and ontologies, I would suggest:

1) Use *dataset_uri* as an input parameter (as with other algorithms), rather than posting the content itself.

2) The URI follows the generic algorithm naming scheme /algorithm/{id}. A data cleanup algorithm is a subclass of *http://www.opentox.org/algorithms.owl#DataCleanup* from the AlgorithmTypes ontology, and the RDF representation should contain a proper rdf:type statement.

3) Agree on how the resulting URL is returned. There are several options (not mutually exclusive):
- if no content is returned, the URL is in the Location HTTP header (this is mandatory for redirect responses, like returning task IDs);
- if content is returned, it can be text/uri-list (obviously) or an RDF representation, containing perhaps only a simple RDF node with the URL as node identifier.

The http://opentox.org/dev/apis/api-1.1/Algorithm entry is updated accordingly. (Please note that for compatibility reasons, services should use "dataset_uri" to specify datasets and "prediction_feature" to denote the target variable.)

> returns the URI of the generated cleaned-up dataset. This is the way
> feature selection services should work too.
>
> First of all, I think we need a cleanup service that removes all string
> features from the dataset

You can do this right now with the following steps:

1) Get the features for a dataset via /dataset/{id}/feature or any other means (e.g.
looking through the entire dataset)

2) Select the string features (numeric features are denoted in the latest OpenTox ontology as ot:NumericFeature)

3) Form the URL for the reduced dataset as
/dataset/{id}?feature_uris[]=/mynumericfeature1&feature_uris[]=/mynumericfeature2&feature_uris[]=/mynumericfeature3&feature_uris[]=etc

A string-feature-dropping service will be just a convenient wrapper for the steps above.

> and a second service that handles the missing
> values of the dataset, substituting them with the average or the median
> of all other values for the same feature in the dataset.

This functionality is indeed missing.

> I have a
> proposal about a third preprocessing service where the missing values
> are calculated by some descriptor calculation service *if possible*.
>
> On Tue, 2010-01-12 at 19:23 +0200, Nina Jeliazkova wrote:
>> Christoph Helma wrote:
>>> Excerpts from Tobias Girschick's message of Mon Jan 11 10:05:23 +0100 2010:
>>>> Hi Pantelis, All,
>>>>
>>>> On Thu, 2010-01-07 at 18:49 +0200, chung wrote:
>>>>> Hi Tobias, All,
>>>>> While trying to train a model, the service may "find" some
>>>>> missing values for a specific feature.
>>>>
>>>> To obviate misunderstandings: you want to train a model with a data set
>>>> that contains missing values for a specific feature, and the service
>>>> detects the missing features before training, right?
>>>>
>>>>> Is there a way to use your
>>>>> services to obtain the missing value?
>>>>
>>>> If the feature with the missing values was produced by our descriptor
>>>> calculation service, yes. But you would have to build a dataset with all
>>>> the compounds where the value is missing and submit it to the descriptor
>>>> calculation service.
>>>> The question is whether a model training service should automatically
>>>> provide the functionality of "filling up" missing values.
>>>> I think this
>>>> is something that should be done in the preprocessing phase - in a
>>>> preprocessing/data cleaning service.
>>>
>>> I would be extremely careful with the addition of missing features, for
>>> several reasons:
>>>
>>> - Sometimes there are good physical/chemical/biological/algorithmic reasons why
>>> features are missing - calculating these features might give
>>> you a number, but it is very likely that it is meaningless.
>>
>> Agree.
>
> Yes, sometimes indeed. What about all the other times?

It might be an interesting topic to think about how we distinguish the two cases :)

> For instance, how
> useful is a dataset which contains a set of compounds and values for one
> and only one feature (the target), without a service that calculates the
> values for the other features?

Information about the descriptor calculation services used to generate existing values should be available for each Feature via the ot:hasSource property. It is then straightforward to use the URL of the service to launch a remote or local calculation.

> I believe that there are lots of reasons
> to have a service which searches for missing values in the dataset and
> tries to calculate them; after all, that service will not be bundled with
> the model training, and its use would be optional.

Why not just use descriptor calculation services as they currently exist? It is an implementation detail whether the service will prefer to calculate existing values once again or only perform calculations where these are not available (I would actually prefer the latter as the default implementation, purely for performance reasons).

>>> - A sameAs relationship does not guarantee that (calculated and
>>> measured) feature values are comparable (very frequently they are
>>> not).
>>
>> Right, this is the reason for having ot:hasSource for features, allowing
>> one to identify exactly the descriptor calculation service used.
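For illustration, resolving a feature's ot:hasSource and forming the (re)calculation request might look like the minimal Python sketch below. All URIs are invented, the feature metadata dict stands in for a parsed RDF representation, and the request is built but never sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical feature metadata, as it might look after parsing the
# feature's RDF representation (URIs are made up for illustration).
feature = {
    "uri": "http://example.org/feature/42",
    # ot:hasSource points at the descriptor calculation service
    # that produced the existing values for this feature.
    "ot:hasSource": "http://example.org/algorithm/xlogp",
}

def recalculation_request(feature_meta, dataset_uri):
    """Build (but do not send) a POST to the service that produced a
    feature, asking it to (re)calculate values for the given dataset."""
    service = feature_meta["ot:hasSource"]
    body = urlencode({"dataset_uri": dataset_uri}).encode("ascii")
    return Request(service, data=body, method="POST")

req = recalculation_request(feature, "http://example.org/dataset/1")
print(req.full_url)  # http://example.org/algorithm/xlogp
print(req.data)      # b'dataset_uri=http%3A%2F%2Fexample.org%2Fdataset%2F1'
```

The point of the sketch is only that ot:hasSource gives a client everything it needs to relaunch the calculation via the ordinary algorithm API, with dataset_uri as the input parameter.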
Replying to myself: see above for ot:hasSource usage.

>>> - Even if you find a measured value for the same feature, there is a
>>> good chance that it has been obtained by a different protocol and
>>> that it is not comparable with the other feature values.
>>
>> Agree.
>>
>>> I would suggest to add features only
>>>
>>> - if you have a clear understanding why a feature is missing
>
> Why should you understand this?

Because lack of understanding will reflect on the resulting model quality. For example, it is possible to calculate a logP value of 15 or even 100, but such a value is generally considered not meaningful, for various reasons. It would be better to remove compounds with such values when building a model.

> It is missing because no service
> calculated it, or for any other reason

It might be that a service has already attempted to calculate the values but failed for some reason. It might be an exotic atom type, which prevents the calculations from being done properly, or a lack of 3D structure, etc. We need a way to denote such cases.

I would propose an extension of opentox.owl with a subclass of ot:FeatureValue (e.g. ErrorValue or something alike) to denote the case of failure and the reason for it. A FeatureValue is composed of a feature and a value, where the feature will point to what calculation has been attempted (including the algorithm), and the value itself can be a string with a human-readable description of the reason for the failure, or a URL.

http://opentox.org/dev/apis/api-1.1/Feature

A descriptor calculation service requiring 3D structure will continuously fail to calculate values for structures lacking 3D coordinates. The BlueObelisk descriptor ontology provides means for stating descriptor calculation requirements (2D/3D coordinates) via the *http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#requires* property.
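To make the ErrorValue proposal above more concrete, here is a sketch of how a client might model it. This is only an illustration of the suggested ontology extension, not an existing class; the field names and URI are invented:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeatureValue:
    # URI of the feature; per the thread, the feature's ot:hasSource
    # points at the algorithm/service whose calculation was attempted.
    feature_uri: str
    value: Optional[float] = None

@dataclass
class ErrorValue(FeatureValue):
    """Sketch of the proposed ot:FeatureValue subclass recording a
    failed calculation; 'reason' carries a human-readable explanation
    (or a URL) instead of a numeric value."""
    reason: str = ""

# A 3D-descriptor service failing on a structure without coordinates
# could then record the failure instead of silently omitting the value:
ev = ErrorValue(
    feature_uri="http://example.org/feature/whim",  # hypothetical URI
    reason="no 3D coordinates available for this structure",
)
```

A client inspecting a dataset could then distinguish "never calculated" from "attempted and failed", which is exactly the distinction the thread is after.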
For anybody using descriptor calculation algorithms described in BO, please refer to the relevant algorithm; this will ensure we can use all the available information in the ontology service. For other descriptor calculation algorithms, it would be good if they could be described in terms of BO.

> I can't think of right now, but a
> client needs to calculate those values (if possible) and of course using
> a proper descriptor calculation service, to avoid calculating something
> else by mistake.

So it should be sufficient to locate the proper calculation service and run the calculations. I don't really see the reason for introducing another kind of calculation service. It is more likely we might need a kind of filtering/querying data service, allowing one to query a dataset for entries without values for specific features, and this should be easy to do via the current algorithm API.

>>> - if you can prove that the feature calculation algorithm creates values
>>> that are comparable with the original measurements (or calculation
>>> algorithm)
>
> No computational tool can reproduce measured values

Not a very optimistic comment for any modeling exercise/framework :)

Best regards,
Nina

> - I'm talking about
> descriptors which can be calculated given the structure of the chemical
> compound.
>
> Best regards,
> Pantelis

>>> - if you clearly document how and why the original dataset has been
>>> modified
>>
>> A user interface supporting the above (e.g. allowing the user to
>> document why something is modified) would be relevant for both Fastox
>> and Toxmodel.
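The filtering/querying idea mentioned earlier in this message (query a dataset for entries without values for specific features) can be sketched in a few lines. The dataset representation below is a toy stand-in for a parsed OpenTox dataset, with invented compound identifiers and feature URIs, and None marking a missing value:

```python
# Toy dataset: one dict of {feature_uri: value} per compound entry.
dataset = [
    {"compound": "c1", "f/logP": 1.2,  "f/mw": 180.2},
    {"compound": "c2", "f/logP": None, "f/mw": 94.1},
    {"compound": "c3", "f/logP": 3.4,  "f/mw": None},
]

def entries_missing(dataset, feature_uris):
    """Return the entries that lack a value for any of the given
    features -- the ones a client would resubmit to the descriptor
    calculation service identified via ot:hasSource."""
    return [
        entry for entry in dataset
        if any(entry.get(f) is None for f in feature_uris)
    ]

print([e["compound"] for e in entries_missing(dataset, ["f/logP"])])
# ['c2']
```

Such a query plus the existing descriptor calculation services would cover the "fill in missing values" use case without a new kind of calculation service.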
>> Best regards,
>> Nina
>>
>>> Best regards,
>>> Christoph
>>> _______________________________________________
>>> Development mailing list
>>> Development at opentox.org
>>> http://www.opentox.org/mailman/listinfo/development