Lasso/Elastic Net feature selection with kFold crossvalidation

Question

Juliana Corlier on 18 Apr 2018

Direct link to this question

https://la.mathworks.com/matlabcentral/answers/395821-lasso-elastic-net-feature-selection-with-kfold-crossvalidation

Commented: Tyson on 23 Jul 2018

I want to understand how Lasso/Elastic Net regression selects the final features when using k-fold cross-validation with the function call [B, stats] = lasso(featData, classData, 'CV', 10) (from the Statistics and Machine Learning Toolbox).

In my understanding, if the model is trained 10 times on different subsets of the total sample, this may result in different features being selected/penalized in every fold. However, the cross-validated model output does not provide any insight into the variability of those features across folds. Is the best model simply chosen among all folds and applied to the entire training set? Or are features averaged/weighted based on their stability across folds?

There was a related question previously, but nobody ever answered it:

https://www.mathworks.com/matlabcentral/answers/125357-understanding-k-fold-cross-validation

Thanks for your help!

1 Comment

Tyson on 23 Jul 2018

This is an important thread. We are also looking for clarification on this exact question. We do not find any info about the beta values for the k-folds in the FitInfo, only a single set of beta values for each lambda. Exactly how were these betas determined?

Answer 1

Bernhard Suhm on 22 Apr 2018

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/395821-lasso-elastic-net-feature-selection-with-kfold-crossvalidation#answer_316558

Cross-validation applies only to assessing model performance. As described in the documentation, with k-fold cross-validation the average error across the k partitions is reported. The model itself is trained on the complete data set that you provide to the training function, in this case "lasso".
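To make this concrete, here is a minimal sketch on synthetic data (X, y, and betaTrue are made up for illustration and are not the poster's data). It relies on the documented lasso outputs: B holds one column of coefficients per Lambda value, and with 'CV' the FitInfo struct additionally reports the cross-validated MSE per Lambda along with fields such as IndexMinMSE and Index1SE.

```matlab
% Synthetic regression data (illustrative assumption, not the poster's data)
rng(0);                                  % reproducible folds and noise
n = 200; p = 10;
X = randn(n, p);
betaTrue = [3; -2; 1.5; zeros(p-3, 1)];  % only the first 3 predictors matter
y = X * betaTrue + 0.5 * randn(n, 1);

% 10-fold cross-validated lasso
[B, FitInfo] = lasso(X, y, 'CV', 10);

% B has one column per Lambda value; there is no extra dimension for folds
disp(size(B));

% Cross-validation contributes an error estimate per Lambda and two indices
idxMin = FitInfo.IndexMinMSE;   % Lambda with the minimum cross-validated MSE
idx1SE = FitInfo.Index1SE;      % sparsest model within one SE of that minimum

% Predictors with nonzero coefficients at each of the two Lambda values
selectedMin = find(B(:, idxMin) ~= 0)';
selected1SE = find(B(:, idx1SE) ~= 0)';
disp(selectedMin);
disp(selected1SE);
```

Here the folds are used to score each Lambda, and the coefficients you read off are the columns of B at the chosen index. lassoPlot(B, FitInfo, 'PlotType', 'CV') plots the same cross-validated error curve.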

3 Comments

Juliana Corlier on 23 Apr 2018

Thanks for your comment. However, the linked document clearly says that the original data set is partitioned and that only a subset (not the complete data set) is used to train the model. This is repeated 10 times (in my case), so the model would always be trained on slightly different subsets and, as a result, could select different features. I get the average error part, but in my understanding, the trained models per fold are still likely to be different. Here is the quote:

"[...] This is done by partitioning a dataset and using a subset to train the algorithm and the remaining data for testing. Because cross-validation does not use all of the data to build a model, it is a commonly used method to prevent overfitting during training.

Each round of cross-validation involves randomly partitioning the original dataset into a training set and a testing set. The training set is then used to train a supervised learning algorithm and the testing set is used to evaluate its performance. This process is repeated several times and the average cross-validation error is used as a performance indicator."

Please advise if I am missing something.

Bernhard Suhm on 30 Apr 2018

You are right, and I asked internally for additional clarification. If you use the kfold argument, you don't get a "final" model back with features weighted or averaged, but pointers to all k models, whose coefficients (or selected features) may differ slightly. If they do differ, that is a sign those features aren't very strong, so you wouldn't want them in your final model.

- You can get additional information on the various fitted models in the FitInfo field of the output object, but you have to analyze the variability across the different objects yourself.
- Alternatively, you can retrain the model without k-fold, which will give you the best features using the complete data set.
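The second option can be sketched roughly as follows. This is an illustration on synthetic data, not a recommended workflow: the variable names are hypothetical, and choosing FitInfo.Lambda1SE as the penalty is an assumption made here for the example (FitInfo.LambdaMinMSE would be another reasonable choice).

```matlab
% Synthetic data (illustrative assumption)
rng(1);
n = 150; p = 8;
X = randn(n, p);
y = 2*X(:, 1) - X(:, 2) + 0.5*randn(n, 1);

% Use cross-validation only to pick a penalty strength
[~, FitInfo] = lasso(X, y, 'CV', 10);
lambdaChosen = FitInfo.Lambda1SE;

% Refit on the complete data set, without cross-validation, at that Lambda
[bFinal, finalInfo] = lasso(X, y, 'Lambda', lambdaChosen);

% Features kept in the final model, and predictions from it
finalFeatures = find(bFinal ~= 0)';
yhat = X * bFinal + finalInfo.Intercept;
disp(finalFeatures);
```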

Juliana Corlier on 11 May 2018

Thanks for clarifying this! This is very helpful. I have a practical follow-up question:

I was looking for these pointers, but I can't seem to find them. In the FitInfo struct I only get coefficients for the 72 different Lambda values (which I also get if I don't run cross-validation). I would have expected a multidimensional struct/object covering the different folds, but my FitInfo is a 1x1 struct. Any ideas on that? Many thanks!
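For what it is worth, the documented FitInfo fields for lasso are per-Lambda quantities (Lambda, Intercept, DF, MSE, and with 'CV' also SE, LambdaMinMSE, Lambda1SE, IndexMinMSE, and Index1SE), which matches the 1x1 struct described here. One way to look at fold-to-fold variability yourself is to build the partition with cvpartition and fit each training fold separately. The sketch below does that on synthetic data. The fixed penalty lambda = 0.1 and all variable names are assumptions for illustration, and the folds will not be the same ones lasso drew internally.

```matlab
% Synthetic data (illustrative assumption)
rng(2);
n = 200; p = 10; k = 10;
X = randn(n, p);
y = X(:, 1:3) * [3; -2; 1.5] + 0.5 * randn(n, 1);

lambda = 0.1;                    % fixed penalty, chosen only for illustration
c = cvpartition(n, 'KFold', k);  % our own k-fold partition

% Fit the lasso on each training fold and store the coefficients per fold
foldB = zeros(p, k);
for i = 1:k
    trainIdx = training(c, i);   % logical index of training rows for fold i
    foldB(:, i) = lasso(X(trainIdx, :), y(trainIdx), 'Lambda', lambda);
end

% Stability summaries across folds
selectionFreq = mean(foldB ~= 0, 2)';  % fraction of folds selecting each predictor
coefStd = std(foldB, 0, 2)';           % spread of each coefficient across folds
disp(selectionFreq);
disp(coefStd);
```

Predictors with a selection frequency well below 1 are the ones whose inclusion depends on which rows happened to be in the training fold.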
