[OTDev] Descriptor Calculation Services
chung chvng at mail.ntua.gr
Mon Jan 11 19:55:50 CET 2010
Hi all,

There has been a discussion on data preprocessing web services these
days, so I'd like to summarize a few things here and then decide how we
should proceed with the implementation of such services.

First of all, there is no need for any changes in the current version of
the API to implement API1.1-compliant services. Preprocessing services
are algorithms, so an RDF representation should be provided at:

    /algorithm/datacleanup

and the method

    POST /algorithm/datacleanup
    Content-type: application/rdf+xml

returns the URI of the generated cleaned-up dataset. This is the way
feature selection services should work too.

To start with, I think we need a cleanup service that removes all string
features from the dataset, and a second service that handles the missing
values of the dataset by substituting them with the average or the median
of all other values for the same feature in the dataset. I also have a
proposal for a third preprocessing service where the missing values are
calculated by some descriptor calculation service *if possible*.

On Tue, 2010-01-12 at 19:23 +0200, Nina Jeliazkova wrote:
> Christoph Helma wrote:
> > Excerpts from Tobias Girschick's message of Mon Jan 11 10:05:23 +0100 2010:
> >
> >> Hi Pantelis, All,
> >>
> >> On Thu, 2010-01-07 at 18:49 +0200, chung wrote:
> >>
> >>> Hi Tobias, All,
> >>> While trying to train a model, the service is possible to "find" some
> >>> missing values for a specific feature.
> >>>
> >> To obviate misunderstandings: You want to train a model with a data set
> >> that contains missing values for a specific feature and the service
> >> detects the missing features before training, right?
> >>
> >>
> >>> Is there a way to use your
> >>> services to obtain the missing value?
> >>>
> >> If the feature with the missing values was produced from our descriptor
> >> calculation service, yes. But you would have to build a dataset with all
> >> the compounds where the value is missing and submit it to the descriptor
> >> calculation service.
> >> The question is, if a model training service should automatically
> >> provide the functionality of "filling up" missing values. I think this
> >> is something that should be done in the preprocessing phase - in a
> >> preprocessing/data cleaning service.
> >>
> >
> > I would be extremely careful with the addition of missing features for
> > several reasons:
> >
> > - Sometimes there are good physical/chemical/biological/algorithmic reasons why
> >   features are missing - calculating these features might give
> >   you a number but it is very likely that it is meaningless.
> >
> Agree.

Yes, sometimes indeed. But what about all the other times? For instance,
how useful is a dataset that contains a set of compounds and values for
one single feature (the target) without a service that calculates the
values for the other features? I believe there are lots of reasons to
have a service which searches for missing values in the dataset and tries
to calculate them; after all, that service will not be bundled with the
model training and its use would be optional.

> > - A sameAs relationship does not guarantee, that (calculated and
> >   measured) feature values are comparable (very frequently they are
> >   not).
> >
> Right, this is the reason of having ot:hasSource for features, allowing
> to identify exactly the descriptor calculation service used.
> > - Even if you find a measured value for the same feature, there is a
> >   good chance, that it has been obtained by a different protocol and
> >   that it is not comparable with the other feature values.
> >
> Agree.
> > I would suggest to add features only
> >
> > - if you have a clear understanding, why a feature is missing

Why should you understand this? It is missing because no service
calculated it, or for some other reason I can't think of right now.
Either way, a client needs to calculate those values (if possible), of
course using a proper descriptor calculation service to avoid calculating
something else by mistake.

> > - if you can prove that the feature calculation algorithm creates values
> >   that are comparable with the original measurements (or calculation
> >   algorithm)

No computational tool can reproduce measured values - I'm talking about
descriptors which can be calculated given the structure of the chemical
compound.

Best regards,
Pantelis

> > - if you clearly document how and why the original dataset has been
> >   modified
> >
> An user interface supporting the above (e.g. allowing the user to
> document why something is modified) would be relevant for both Fastox
> and Toxmodel.
>
> Best regards,
> Nina
> > Best regards,
> > Christoph
> > _______________________________________________
> > Development mailing list
> > Development at opentox.org
> > http://www.opentox.org/mailman/listinfo/development
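
To make the first two proposed cleanup services concrete, here is a minimal sketch in Python of what they would do to a dataset. It is only an illustration and not part of the OpenTox API: the in-memory layout (a dict mapping a compound identifier to its feature values, with None marking a missing value), the function names drop_string_features and fill_missing, and the sample values are all assumptions made for this example. A real service would read and write RDF datasets and return a dataset URI.

```python
from statistics import mean, median


def drop_string_features(dataset):
    """Return a copy of the dataset without any feature that holds a string."""
    # A feature counts as a string feature if any compound has a str value for it.
    string_features = {
        name
        for features in dataset.values()
        for name, value in features.items()
        if isinstance(value, str)
    }
    return {
        compound: {n: v for n, v in features.items() if n not in string_features}
        for compound, features in dataset.items()
    }


def fill_missing(dataset, strategy="mean"):
    """Replace missing values (None or absent) with the mean or median
    of the known values of the same feature."""
    aggregate = {"mean": mean, "median": median}[strategy]
    names = sorted({n for features in dataset.values() for n in features})
    fill = {}
    for name in names:
        known = [f[name] for f in dataset.values() if f.get(name) is not None]
        # A feature with no known value at all cannot be imputed; it stays None.
        fill[name] = aggregate(known) if known else None
    return {
        compound: {
            name: features[name] if features.get(name) is not None else fill[name]
            for name in names
        }
        for compound, features in dataset.items()
    }


# Hypothetical dataset: identifiers and values are made up for illustration.
dataset = {
    "compound/1": {"comment": "ok", "logP": 2.0, "MW": 78.0},
    "compound/2": {"comment": "check", "logP": None, "MW": 92.0},
    "compound/3": {"comment": "ok", "logP": 1.0, "MW": None},
}

cleaned = fill_missing(drop_string_features(dataset), strategy="mean")
print(cleaned)
```

Keeping the two steps as separate functions mirrors the proposal of two separate services: a client can drop string features without imputing, or choose "median" instead of "mean", and the input dataset is never modified in place.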