Lasso/Elastic Net feature selection with kFold crossvalidation

Juliana Corlier on 18 Apr 2018
Commented: Tyson on 23 Jul 2018
I want to understand how Lasso/Elastic Net regression selects the final features when using k-fold cross-validation with the call [B, stats] = lasso(featData, classData, 'CV', 10) (from the Statistics and Machine Learning Toolbox).
As I understand it, if the model is trained 10 times on different subsets of the total sample, different features may be selected or penalized in each fold. However, the cross-validated model output gives no insight into the variability of those features across folds. Is the best model simply chosen among all folds and applied to the entire training set? Or are features averaged or weighted based on their stability across folds?
There was a related question previously, but nobody ever answered it:
https://www.mathworks.com/matlabcentral/answers/125357-understanding-k-fold-cross-validation
Thanks for your help!
  1 comment
Tyson on 23 Jul 2018
This is an important thread. We are also looking for clarification on this exact question. We cannot find any information about the beta values for the individual folds in FitInfo, only a single set of beta values for each lambda. Exactly how were these betas determined?


Answers (1)

Bernhard Suhm on 22 Apr 2018

Cross-validation applies only to assessing model performance. As described in the documentation, with k-fold the average error across the k different partitions is reported. The model itself is trained on the complete data set that you provide to the training function, in this case lasso.
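
To make the point above concrete, here is a minimal sketch of how the two outputs of lasso relate when 'CV' is set. The data are synthetic and every variable name (X, y, idxMin, idx1SE, selectedMin, selected1SE) is an assumption made for illustration, not something from the original question. As far as the documented outputs go, B holds one coefficient vector per lambda, and cross-validation only adds error estimates and the two recommended lambda indices to FitInfo.

```matlab
% Synthetic data for illustration only (not from the original question).
rng(0);                                  % reproducible random numbers
n = 200; p = 20;
X = randn(n, p);
y = X(:, 1:3) * [3; -2; 1.5] + 0.5 * randn(n, 1);   % only features 1-3 matter

% 10-fold cross-validated lasso; B has one column of coefficients per lambda.
[B, FitInfo] = lasso(X, y, 'CV', 10);

% Cross-validation contributes error estimates and lambda choices.
idxMin = FitInfo.IndexMinMSE;            % lambda with the lowest CV mean squared error
idx1SE = FitInfo.Index1SE;               % largest lambda within one SE of that minimum

selectedMin = find(B(:, idxMin) ~= 0)';  % features kept at the min-MSE lambda
selected1SE = find(B(:, idx1SE) ~= 0)';  % features kept at the 1-SE lambda

fprintf('lambda (min MSE) = %.4g, features: %s\n', FitInfo.LambdaMinMSE, mat2str(selectedMin));
fprintf('lambda (1 SE)    = %.4g, features: %s\n', FitInfo.Lambda1SE, mat2str(selected1SE));
```

Choosing the 1-SE column usually gives a sparser feature set than the min-MSE column, at the cost of a slightly higher estimated error.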

  3 comments
Bernhard Suhm on 30 Apr 2018
You are right, and I asked internally for additional clarification. If you use the kfold argument, you don't get a "final" model back with features weighted or averaged, but pointers to all k models, whose coefficients (or selected features) may differ slightly. If they do differ, that would be a sign that those features aren't very strong, so you wouldn't want them in your final model. You can get additional information on the various fitted models in the FitInfo field of the output object, but you have to analyze the variability across the different objects yourself. Alternatively, you can retrain the model without k-fold, which will give you the best features using the complete data set.
Juliana Corlier on 11 May 2018
Thanks for clarifying this! This is very helpful. I have a practical follow-up question:
I was looking for these pointers, but I can't seem to find them. In the FitInfo struct I only get coefficients for the 72 different Lambda values (which I also get if I don't run crossvalidation). I would have expected a multidimensional struct/object for different kFolds, but my FitInfo is a 1x1 struct. Any ideas on that? Many thanks!
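
Since the FitInfo struct returned by lasso does not appear to store per-fold coefficients, one way to examine variability across folds is to refit on each training fold yourself. The sketch below is only an illustration under stated assumptions: the data are synthetic stand-ins for featData and classData with a continuous response, and cvp, foldB and selectionFreq are hypothetical names. It passes a cvpartition object to 'CV' so that the manual refits use the same folds as the cross-validation.

```matlab
% Synthetic stand-ins for featData/classData (illustrative, not the poster's data).
rng(1);                                  % reproducible data and partition
n = 200; p = 20; k = 10;
featData  = randn(n, p);
classData = featData(:, 1:3) * [3; -2; 1.5] + 0.5 * randn(n, 1);

% Step 1: build the folds explicitly and let lasso cross-validate on them.
cvp = cvpartition(n, 'KFold', k);
[B, FitInfo] = lasso(featData, classData, 'CV', cvp);
lambda = FitInfo.Lambda1SE;              % lambda chosen by the 1-SE rule

% Step 2: refit at that fixed lambda on each training fold.
foldB = zeros(p, k);                     % one coefficient column per fold
for i = 1:k
    trainIdx = training(cvp, i);         % logical index of the i-th training set
    foldB(:, i) = lasso(featData(trainIdx, :), classData(trainIdx), 'Lambda', lambda);
end

% Step 3: fraction of folds in which each feature has a nonzero coefficient.
selectionFreq = mean(foldB ~= 0, 2);
disp(table((1:p)', selectionFreq, B(:, FitInfo.Index1SE), ...
    'VariableNames', {'Feature', 'SelectionFreq', 'FullDataCoef'}));
```

Features with a selection frequency close to 1 are chosen consistently, while those that appear in only a few folds are the unstable ones discussed above.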

