[OTDev] Descriptor Calculation Services

Tue Jan 12 18:16:52 CET 2010

Excerpts from Tobias Girschick's message of Mon Jan 11 10:05:23 +0100 2010:
> Hi Pantelis, All,
> 
> On Thu, 2010-01-07 at 18:49 +0200, chung wrote: 
> > Hi Tobias, All,
> >  While training a model, the service may "find" some
> > missing values for a specific feature. 
> 
> To avoid misunderstandings: You want to train a model with a data set
> that contains missing values for a specific feature, and the service
> detects the missing values before training, right?
> 
> > Is there a way to use your
> > services to obtain the missing value? 
> 
> If the feature with the missing values was produced by our descriptor
> calculation service, yes. But you would have to build a dataset with all
> the compounds where the value is missing and submit it to the descriptor
> calculation service.
> The question is whether a model training service should automatically
> provide the functionality of "filling in" missing values. I think this
> is something that should be done in the preprocessing phase - in a
> preprocessing/data cleaning service.

I would be extremely careful with the addition of missing features for
several reasons:

- Sometimes there are good physical/chemical/biological/algorithmic reasons why
  features are missing - calculating these features might give
  you a number but it is very likely that it is meaningless. 
- A sameAs relationship does not guarantee that (calculated and
  measured) feature values are comparable (very frequently they are
  not).
- Even if you find a measured value for the same feature, there is a
  good chance that it has been obtained by a different protocol and
  that it is not comparable with the other feature values.

I would suggest adding features only

- if you have a clear understanding why a feature is missing
- if you can prove that the feature calculation algorithm creates values
  that are comparable with the original measurements (or calculation
  algorithm)
- if you clearly document how and why the original dataset has been
  modified
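
As a rough sketch of how a preprocessing step could enforce these three
conditions: everything below is hypothetical (the function name
fill_missing, the row layout with a "compound" key, the calculate
callable and the error threshold are assumptions for illustration, not
part of any existing service). It refuses to fill values without a stated
reason, checks the calculated values against the known ones first, and
returns a log of every modification instead of changing the input.

```python
from typing import Callable, Optional


def fill_missing(
    rows: list[dict],
    feature: str,
    calculate: Callable[[str], float],
    reason: str,
    max_mean_abs_error: float,
) -> tuple[list[dict], list[dict]]:
    """Return (filled_rows, modification_log); the input rows are not modified."""
    # Condition 1: the caller must state why the values are missing.
    if not reason.strip():
        raise ValueError("a reason for the missing values is required")

    # Condition 2: calculated values must be comparable with the known ones.
    known = [r for r in rows if r.get(feature) is not None]
    if not known:
        raise ValueError("no known values to validate the calculation against")
    errors = [abs(calculate(r["compound"]) - r[feature]) for r in known]
    mean_abs_error = sum(errors) / len(errors)
    if mean_abs_error > max_mean_abs_error:
        raise ValueError(
            f"calculated values differ from known values "
            f"(mean absolute error {mean_abs_error:.3g})"
        )

    # Condition 3: document every value that was added.
    filled: list[dict] = []
    log: list[dict] = []
    for row in rows:
        new_row = dict(row)  # copy, so the original dataset stays untouched
        if new_row.get(feature) is None:
            new_row[feature] = calculate(new_row["compound"])
            log.append(
                {
                    "compound": new_row["compound"],
                    "feature": feature,
                    "source": "calculated",
                    "reason": reason,
                }
            )
        filled.append(new_row)
    return filled, log
```

A mean absolute error is only one possible comparability check; the
threshold has to come from domain knowledge about the feature, and a
passing check on the known values is evidence, not proof.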

Best regards,
Christoph