[OTDev] Missing values [was Re: DataSet]

Wed Oct 7 10:22:58 CEST 2009

Hi,

On Tue, 2009-10-06 at 14:54 -0400, Rajarshi Guha wrote:
> On Tue, Oct 6, 2009 at 1:41 PM, chung <chvng at mail.ntua.gr> wrote:
> 
> > Dear Nina, Christoph, All,
> >
> > Datasets with missing values are valid, 

I also think that datasets that contain missing values should be
considered valid.

> > however we have to bear in mind
> > some density/sparsity criteria, at least for the time being. It's
> > absolutely impossible to train a model (even a "bad" one) using the
> > following "diagonal" dataset:
> >
> 
> But wouldn't the model development stage involve data cleaning to remove (or
> impute) missing values? 

Something like that should definitely happen before the model is built.
IMO the question is rather whether we want to include this kind of data
cleaning in the first prototype. If so, I would propose something simple,
such as removing the affected descriptor or inserting a default or
easily calculated value.
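
To make the two simple options concrete, here is a minimal sketch in
plain Python. It assumes a dataset is a list of rows of descriptor
values with None marking a missing entry. The function names, the None
convention, and the choice of the column mean as the "easily calculated
value" are illustrative assumptions only, not part of any agreed API.

```python
from statistics import mean


def drop_incomplete_descriptors(rows):
    """Option 1: remove every descriptor (column) with a missing value.

    Assumption: rows is a list of equal-length lists, None = missing.
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    # Keep only the columns in which no row has a missing entry.
    complete = [j for j in range(n_cols)
                if all(row[j] is not None for row in rows)]
    return [[row[j] for j in complete] for row in rows]


def impute_with_mean(rows, default=0.0):
    """Option 2: replace each missing entry by the mean of its column.

    Assumption: 'default' is used when a column has no observed value.
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        fill.append(mean(observed) if observed else default)
    return [[row[j] if row[j] is not None else fill[j]
             for j in range(n_cols)]
            for row in rows]


# Hypothetical toy data: three compounds, three descriptors.
rows = [
    [1.0, None, 3.0],
    [2.0, 5.0, None],
    [3.0, 7.0, 9.0],
]
print(drop_incomplete_descriptors(rows))
print(impute_with_mean(rows))
```

Dropping descriptors loses information but never invents values; mean
imputation keeps every descriptor at the cost of flattening its
variance. Either would probably do for a first prototype.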

> And if there isn't sufficient information content,
> why would one build a model in the first place?

I agree here. The question is where to draw the line: at what percentage
of missing values do I say, "No, I won't build a model here"?
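
Whatever the cut-off turns out to be, the check itself is cheap. The
sketch below assumes rows of descriptor values with None for a missing
entry. The 30% limit is an arbitrary placeholder, not a recommendation,
and the function names are hypothetical. The 4x4 example is only my
reading of a "diagonal" dataset (one known value per compound).

```python
def missing_fraction(rows):
    """Return the fraction of entries that are missing (None)."""
    total = sum(len(row) for row in rows)
    if total == 0:
        # Treat an empty dataset as entirely missing.
        return 1.0
    missing = sum(1 for row in rows for value in row if value is None)
    return missing / total


def should_build_model(rows, max_missing=0.3):
    """Refuse to build a model when too many values are missing.

    Assumption: max_missing=0.3 is a placeholder threshold.
    """
    return missing_fraction(rows) <= max_missing


# Assumed shape of a "diagonal" dataset: only entry (i, i) is known.
diagonal = [[1.0 if i == j else None for j in range(4)]
            for i in range(4)]

print(missing_fraction(diagonal))
print(should_build_model(diagonal))
```

For an n x n diagonal dataset the missing fraction is 1 - 1/n, so it
fails any reasonable threshold as soon as n grows beyond a handful.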

Regards,
Tobias

-- 
Dipl.-Bioinf. Tobias Girschick

Technische Universität München
Institut für Informatik
Lehrstuhl I12 - Bioinformatik
Boltzmannstr. 3
85748 Garching b. München, Germany

Room: MI 01.09.042
Phone: +49 (89) 289-18002
Email: tobias.girschick at in.tum.de
Web: http://wwwkramer.in.tum.de/people/girschic