[OTDev] Descriptor Calculation Services
chung <chvng at mail.ntua.gr>
Thu Jan 14 23:32:32 CET 2010
- Previous message: [OTDev] Descriptor Calculation Services
- Next message: [OTDev] validation webservice online
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, 2010-01-13 at 10:06 +0200, Nina Jeliazkova wrote:
> Hi All,
>
> chung wrote:
> > Hi all,
> > There has been a discussion on data preprocessing web services these
> > days, so I'd like to summarize a few things here and then decide how
> > we should proceed with the implementation of such services. First of
> > all, there is no need for any changes in the current version of the
> > API to implement API 1.1-compliant services. Preprocessing services
> > are algorithms, so an RDF representation should be provided at:
> >
> > /algorithm/datacleanup
> >
> > and the method
> >
> > POST /algorithm/datacleanup
> > Content-type: application/rdf+xml
>
> To have this aligned with the current API and ontologies, I would suggest:
> 1) Use dataset_uri as an input parameter (as with other algorithms),
> rather than posting the content itself

Agreed

> 2) The URI follows the generic algorithm naming scheme /algorithm/{id}.
> A data cleanup algorithm is a subclass of
> http://www.opentox.org/algorithms.owl#DataCleanup from the AlgorithmTypes
> ontology, and the RDF representation should contain a proper rdf:type
> statement.

There's already an example at http://opentox.ntua.gr:3000/algorithm/cleanup

> 3) Agree on how the resulting URL is returned. There are several
> options (not mutually exclusive):
> - if no content is returned, the URL is in the Location HTTP header
> (this is mandatory for redirect responses, like returning task IDs)
> - if content is returned, it can be text/uri-list (obviously) or an RDF
> representation, containing perhaps only a simple RDF node with the
> URL as the node identifier.
>
> The http://opentox.org/dev/apis/api-1.1/Algorithm entry is updated
> accordingly.
>
> (Please note that, for compatibility reasons, services should use
> "dataset_uri" to specify datasets and "prediction_feature" to denote
> the target variable.)
>
> > returns the URI of the generated cleaned-up dataset. This is the way
> > feature selection services should work too.
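The response convention sketched above (URL in the Location header for redirect-style responses, or a text/uri-list body) can be illustrated with a small client-side helper. This is only a sketch of the convention, not part of any OpenTox service; the function name and argument shape are my own.

```python
# Sketch only: extract the URI of the generated dataset from an algorithm
# response, following the two (non-exclusive) options discussed above.

def result_uri(status, headers, body, content_type):
    """Return the resulting dataset/task URI from an algorithm response."""
    if status in (201, 202, 303) and "Location" in headers:
        # Redirect-style responses (e.g. returning task IDs) must carry
        # the URI in the Location header.
        return headers["Location"]
    if content_type == "text/uri-list":
        # text/uri-list: one URI per line, lines starting with '#' are comments.
        uris = [ln.strip() for ln in body.splitlines()
                if ln.strip() and not ln.strip().startswith("#")]
        return uris[0] if uris else None
    return None
```

A client would call this on the response of `POST /algorithm/datacleanup` with the `dataset_uri` parameter.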
> > First of all, I think we need a cleanup service that removes all
> > string features from the dataset

> You can do this right now with the following steps:
>
> 1) Get the features for a dataset via /dataset/{id}/feature or by any
> other means (e.g. looking through the entire dataset)
> 2) Select the string features (numerics are denoted, as in the latest
> OpenTox ontology, as ot:NumericFeature)
> 3) Form the URL for the reduced dataset as
> /dataset/{id}?feature_uris[]=/mynumericfeature1&feature_uris[]=/mynumericfeature2&feature_uris[]=/mynumericfeature3&feature_uris[]=etc.
>
> A string-feature-dropping service will be just a convenient wrapper for
> the steps above.
>
> > and a second service that handles the missing
> > values of the dataset, substituting them with the average or the
> > median of all other values for the same feature in the dataset.
>
> This functionality is indeed missing.

In the next few days I will deploy a service which handles missing values
this way at http://opentox.ntua.gr:3000/algorithm/mvh1

> > I have a
> > proposal about a third preprocessing service where the missing values
> > are calculated by some descriptor calculation service *if possible*.
> >
> > On Tue, 2010-01-12 at 19:23 +0200, Nina Jeliazkova wrote:
> > > Christoph Helma wrote:
> > > > Excerpts from Tobias Girschick's message of Mon Jan 11 10:05:23 +0100 2010:
> > > > > Hi Pantelis, All,
> > > > >
> > > > > On Thu, 2010-01-07 at 18:49 +0200, chung wrote:
> > > > > > Hi Tobias, All,
> > > > > > While trying to train a model, it is possible for the service
> > > > > > to "find" some missing values for a specific feature.
> > > > >
> > > > > To obviate misunderstandings: you want to train a model with a
> > > > > data set that contains missing values for a specific feature,
> > > > > and the service detects the missing features before training,
> > > > > right?
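The three steps above, and the average/median substitution a missing-value handler would perform, can be sketched as follows. This is a rough illustration under my own naming, not the actual implementation of any of the services mentioned.

```python
# Sketch: build the reduced-dataset URL from the numeric feature URIs
# (step 3 above), and substitute missing values with the median (or mean)
# of the remaining values for the same feature.
from statistics import mean, median
from urllib.parse import urlencode

def numeric_dataset_url(dataset_uri, numeric_feature_uris):
    """Form the URL of the dataset restricted to the given feature URIs."""
    query = urlencode([("feature_uris[]", f) for f in numeric_feature_uris])
    return "%s?%s" % (dataset_uri, query)

def fill_missing(values, strategy=median):
    """Replace None entries with the median (or mean) of the known values."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]
```

Note that `urlencode` percent-encodes the brackets and slashes, which the server is expected to decode back to `feature_uris[]` parameters.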
> > > > > > Is there a way to use your
> > > > > > services to obtain the missing value?
> > > > >
> > > > > If the feature with the missing values was produced by our
> > > > > descriptor calculation service, yes. But you would have to build
> > > > > a dataset with all the compounds where the value is missing and
> > > > > submit it to the descriptor calculation service.
> > > > > The question is whether a model training service should
> > > > > automatically provide the functionality of "filling up" missing
> > > > > values. I think this is something that should be done in the
> > > > > preprocessing phase - in a preprocessing/data cleaning service.
> > > >
> > > > I would be extremely careful with the addition of missing features
> > > > for several reasons:
> > > >
> > > > - Sometimes there are good physical/chemical/biological/algorithmic
> > > > reasons why features are missing - calculating these features might
> > > > give you a number, but it is very likely that it is meaningless.
> > >
> > > Agree.
> >
> > Yes, sometimes indeed. What about all the other times?

> It might be an interesting topic to think about how we distinguish the
> two cases :)
>
> > For instance, how
> > useful is a dataset which contains a set of compounds and values for
> > one and only one feature (the target), without a service that
> > calculates the values for the other features?
>
> Information about the descriptor calculation services used to generate
> the existing values should be available for each Feature via the
> ot:hasSource property. It is then straightforward to use the URL of the
> service to launch a remote or local calculation.
>
> > I believe that there are lots of reasons
> > to have a service which searches for missing values in the dataset and
> > tries to calculate them; after all, that service will not be bundled
> > with the model training and its use would be optional.
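The ot:hasSource idea above can be made concrete with a small sketch: given which feature values are missing per compound, and the source service recorded for each feature, group the pending work per service so the right descriptor calculation service can be re-invoked. All names here are illustrative, not part of the API.

```python
# Sketch: group missing feature values by the descriptor calculation
# service recorded in each feature's ot:hasSource property.
from collections import defaultdict

def missing_by_service(missing, has_source):
    """missing: {compound_uri: [feature_uri, ...]}
    has_source: {feature_uri: service_uri}
    Returns {service_uri: {feature_uri: [compound_uri, ...]}}."""
    plan = defaultdict(lambda: defaultdict(list))
    for compound, features in missing.items():
        for feat in features:
            service = has_source.get(feat)
            if service is not None:  # features without a known source are skipped
                plan[service][feat].append(compound)
    return {s: dict(f) for s, f in plan.items()}
```

Each entry of the resulting plan corresponds to one dataset of compounds to submit to one service, as Tobias describes above.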
> Why not just use the descriptor calculation services as they currently
> exist? It is an implementation detail whether the service will prefer to
> recalculate existing values once again or only perform calculations
> where values are not available (I would actually prefer the latter as
> the default implementation, purely for performance reasons).
>
> > > > - A sameAs relationship does not guarantee that (calculated and
> > > > measured) feature values are comparable (very frequently they are
> > > > not).
> > >
> > > Right, this is the reason for having ot:hasSource for features,
> > > allowing one to identify exactly the descriptor calculation service
> > > used.
>
> Replying to myself: see above for ot:hasSource usage.
>
> > > > - Even if you find a measured value for the same feature, there is
> > > > a good chance that it has been obtained by a different protocol
> > > > and that it is not comparable with the other feature values.
> > >
> > > Agree.
>
> > > > I would suggest to add features only
> > > >
> > > > - if you have a clear understanding why a feature is missing
> >
> > Why should you understand this?
>
> Because a lack of understanding will reflect on the resulting model
> quality. For example, it is possible to calculate a logP value of 15 or
> even 100, but it is generally considered not meaningful for various
> reasons. It would be better to remove compounds with such values when
> building a model.

I don't really think one should seek some reasoning on whether a value is
acceptable or not. logP is a logarithmic quantity, so 100 is indeed "huge",
but it is really hard to establish a borderline, especially in
high-dimensional spaces. That's why we need the domain of applicability:
it is a way to tell that a prediction of logP=100000 is not acceptable in
a given context (it might be in some other, however).
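Nina's preferred default above, calculating only where values are absent, amounts to a merge step: keep every existing value and invoke the descriptor service only for the gaps. A minimal sketch, with a hypothetical `calculate` callable standing in for the remote service:

```python
# Sketch: fill only the missing values, keeping existing ones untouched
# (the performance-motivated default discussed above).

def merge_calculated(existing, calculate):
    """existing: {compound: value or None}
    calculate: callable(compounds) -> {compound: value} (e.g. a wrapper
    around a descriptor calculation service; hypothetical here).
    Returns the dataset with gaps filled where the service returned values."""
    gaps = [c for c, v in existing.items() if v is None]
    fresh = calculate(gaps) if gaps else {}
    return {c: (fresh.get(c) if v is None else v) for c, v in existing.items()}
```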
> > It is missing because no service
> > calculated it, or for any other reason

> It might be that a service had already attempted to calculate the
> values, but failed for some reason. It might be an exotic atom type,
> which prevents the calculations from being done properly, or a lack of
> 3D structure, etc. We need a way to denote such cases.
>
> I would propose an extension of opentox.owl with a subclass of
> ot:FeatureValue (e.g. ErrorValue or something similar) to denote the
> case of failure and the reason for it. A FeatureValue is composed of a
> feature and a value, where the feature will point to what calculation
> has been attempted (including the algorithm) and the value itself can
> be a string with a human-readable description of the reason for the
> failure, or a URL.
>
> http://opentox.org/dev/apis/api-1.1/Feature
>
> A descriptor calculation service requiring 3D structure will
> continuously fail to calculate values for structures lacking 3D
> coordinates. The BlueObelisk descriptor ontology provides means for
> specifying descriptor calculation requirements (2D/3D coordinates) via
> the http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#requires property.
>
> For anybody using descriptor calculation algorithms described in BO,
> please refer to the relevant algorithm; this will ensure we can use all
> the available information in the ontology service.
>
> For other descriptor calculation algorithms, it will be good if they
> can be described in terms of BO.
>
> > I can't think of one right now, but a
> > client needs to calculate those values (if possible), and of course
> > using a proper descriptor calculation service to avoid calculating
> > something else by mistake.
>
> So it should be sufficient to locate the proper calculation service
> and run the calculations. I don't really see the reason for
> introducing another kind of calculation service.
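To make the ErrorValue proposal above more tangible, a data entry carrying such a value might look roughly like the fragment below. Note that ot:ErrorValue is only proposed in this thread; it does not exist in opentox.owl, and the feature URI and message are invented for illustration.

```xml
<ot:DataEntry>
  <ot:values>
    <!-- ot:ErrorValue is a proposed subclass of ot:FeatureValue, not yet
         part of opentox.owl; this fragment is purely illustrative. -->
    <ot:ErrorValue>
      <ot:feature rdf:resource="http://host/feature/xlogp"/>
      <ot:value>3D coordinates missing; calculation not attempted</ot:value>
    </ot:ErrorValue>
  </ot:values>
</ot:DataEntry>
```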
> It is more likely that we
> might need a kind of filtering/querying data service, allowing one to
> query a dataset for entries without values for specific features, and
> this should be easy to do via the current algorithm API.
>
> > > > - if you can prove that the feature calculation algorithm creates
> > > > values that are comparable with the original measurements (or
> > > > calculation algorithm)
> >
> > No computational tool can reproduce measured values
>
> Not a very optimistic comment for any modeling exercise/framework :)

Why not? I just say that a model might be able to converge to the average
of the measured values in a certain way, but it will never be equal;
otherwise it wouldn't be a model, it would be a theory.

> Best regards,
> Nina
>
> > - I'm talking about
> > descriptors which can be calculated given the structure of the
> > chemical compound.
> >
> > Best regards,
> > Pantelis
> >
> > > > - if you clearly document how and why the original dataset has
> > > > been modified
> > >
> > > A user interface supporting the above (e.g. allowing the user to
> > > document why something is modified) would be relevant for both
> > > Fastox and Toxmodel.
> > >
> > > Best regards,
> > > Nina
> > >
> > > > Best regards,
> > > > Christoph
> > > > _______________________________________________
> > > > Development mailing list
> > > > Development at opentox.org
> > > > http://www.opentox.org/mailman/listinfo/development
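The filtering/querying service Nina suggests above, returning the dataset entries that lack values for specific features, can be sketched over an in-memory dataset. The data shape and function name are assumptions for illustration, not part of the current API.

```python
# Sketch: query a dataset ({compound: {feature: value}}) for the entries
# that lack a value for at least one of the given features.

def entries_without_values(dataset, features):
    """Return the compounds missing at least one of the given features."""
    return [compound for compound, vals in dataset.items()
            if any(vals.get(f) is None for f in features)]
```

The result is exactly the sub-dataset one would submit to a descriptor calculation service to fill the gaps.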