How are NaN values handled by MATLAB decision trees / ensemble learners?

Hey,
I could not find a satisfying answer to the question in the title. Does MATLAB exclude an entire observation when it detects a NaN value for one of its features, or does it treat NaN values as a category of their own (the latter would be desirable)?
Thanks for your help

Accepted Answer

Raunak Gupta on 7 Aug 2020
Hi,
I assume you are using fitctree to fit decision trees. From the fitctree documentation you may see that whenever a NaN value is encountered for any feature in an observation, that observation is discarded while fitting the decision tree, along with its label. As for the second option, it is very hard to assume or calculate a missing (NaN) value, because the feature may follow a general rule or may be an outlier for that particular observation, so it's better to discard the observation than to assume the value.
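If the fact that a value is missing carries information in itself, one option that avoids guessing the value is to hand the learner an explicit "was missing" indicator next to each predictor. The sketch below is only an illustration of that preprocessing idea, not something fitctree or fitcensemble do by default. The artificially removed entries, the variable names (Xaug, missingFlags) and the choice of bagging with 20 learners are all assumptions made for the example; 'Surrogate','on' is the documented templateTree option for surrogate splits.

```matlab
% Illustrative only: encode missingness as extra indicator predictors.
load fisheriris                        % provides meas (150x4) and species
rng(1);                                % reproducible toy example
X = meas;
X(rand(size(X)) < 0.2) = NaN;          % assumption: knock out about 20% of entries

missingFlags = double(isnan(X));       % 1 where the original value was missing
Xaug = [X, missingFlags];              % original predictors plus indicators

% Trees with surrogate splits, used as weak learners in a bagged ensemble.
t = templateTree('Surrogate', 'on');
ens = fitcensemble(Xaug, species, ...
    'Method', 'Bag', 'NumLearningCycles', 20, 'Learners', t);

labels = predict(ens, Xaug);           % one predicted label per row
```

Because the indicator columns never contain NaN, the trees can split on them directly, so "missing" behaves much like a category of its own.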
Hope this clarifies!

3 Comments

Dario Walter on 11 Aug 2020
Edited: Dario Walter on 11 Aug 2020
Hey Raunak,
thanks for your answer. In my case, the NaN values for certain features follow a general rule. That is why I would like to know how MATLAB handles such data in fitcensemble by default.
You say that it is better to discard than to assume the value. However, is this what MATLAB actually does with fitcensemble? When I look at the model after training, the NaN observations are still included.
Thanks Raunak! :)
Hi Dario,
Thanks for correcting me. So, fitctree discards only those observations in which all the features have NaN values. If an observation has some valid values, fitctree will try to find splits on the features for which those values are valid.
So, let's say there are 3 features and only feature 1 has valid values for all the observations; then fitctree will try to find a split based on feature 1.
"fitctree considers NaN values in X as missing values. fitctree does not use observations with all missing values for X in the fit. fitctree uses observations with some missing values for X to find splits on variables for which these observations have valid values."
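The quoted behavior can be checked on a small made-up data set along the lines of the 3-feature example. This is a rough sketch: the data, the variable names and the way the missing entries are placed are assumptions for illustration. NumObservations and CutPredictor are properties of the fitted tree; what they show on your own data may differ.

```matlab
% Illustrative only: toy data where the label depends on feature 1.
rng(0);                                % reproducible toy data
n = 100;
X = randn(n, 3);
Y = X(:, 1) > 0;                       % logical class labels from feature 1
X(1:2:end, 2) = NaN;                   % feature 2 missing in every second row
X(1:3:end, 3) = NaN;                   % feature 3 missing in every third row
X(n, :) = NaN;                         % one row with all predictors missing

tree = fitctree(X, Y);

% Compare the rows supplied with the observations the tree reports using.
allMissing = all(isnan(X), 2);
fprintf('Rows with all predictors missing: %d of %d\n', nnz(allMissing), n);
fprintf('Observations recorded by the tree: %d\n', tree.NumObservations);

% Predictors actually used for splits (leaf nodes have an empty entry).
used = tree.CutPredictor(~cellfun(@isempty, tree.CutPredictor));
disp(unique(used));
```

Rows with only some missing values stay in the training data, which matches the observation that NaN rows are still present in the model after training.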
Hope this clarifies! :)
Thank you Raunak!

Release: R2020a
Asked: 31 Jul 2020
Commented: 12 Aug 2020
