Would kfold loss values vary if cross validation is performed after model training?
Charles Bergen
on 9 May 2025
Edited: the cyclist on 10 May 2025
I am concerned about differences in cross-validated (CV) predictions (kfoldPredict) for regression bagged ensembles (fitrensemble) when CV is performed after the model has been trained. If I understand this correctly, a fitrensemble model trained without CV has access to all of the data, so its trees will have node split values different from those of the trees grown by fitrensemble with CV on, where each fold is trained on only a subset of the observations. Differences in these split values would then lead to an overall difference in the possible predictions of the trees in the two models.
I guess this boils down to: do crossval and the subsequent kfoldLoss or kfoldPredict calls (really, any of the CV prediction functions) account for these differences when supplied a model that did not perform initial cross-validation?
If there is an error in my thinking, please let me know.
I have tried to illustrate my question with the example below.
% No initial CV
Mdl = fitrensemble(looperValues(:,1:cherrios), allratios2, ...
    'Learners',t,'Weights',W1,'Method','Bag', ...
    'NumLearningCycles',numblearningcyc,'Options',statset('UseParallel',true));
Mdl_CV_After_Training = crossval(Mdl,'KFold',10);
Mdl_CV_After_Training_kfold_predictions = kfoldPredict(Mdl_CV_After_Training)
VS
% Yes initial CV
Mdl_With_CV = fitrensemble(looperValues(:,1:cherrios), allratios2, ...
    'Learners',t,'CrossVal','on','Weights',W1,'Method','Bag', ...
    'NumLearningCycles',numblearningcyc,'Options',statset('UseParallel',true));
Mdl_Yes_CV_kfold_predictions = kfoldPredict(Mdl_With_CV)
% Would Mdl_CV_After_Training_kfold_predictions == Mdl_Yes_CV_kfold_predictions?
0 comments
Accepted Answer
the cyclist
on 9 May 2025
The predictions will be identical, as long as you use the same fold assignments:
% Set seed, for reproducibility
rng default
% Simulate some data
N = 100;
X = randn(N,3);
y = sum(X+0.5*randn(N,1),2);
% Define a partition (which will be used for both models)
p = cvpartition(N,'KFold',10);
% Train one model using cross-validation during training
mdl_1 = fitrensemble(X,y,'CrossVal','on','CVPartition',p);
% Train a second model without using cross-validation during training, but apply it afterward
mdl_2 = fitrensemble(X,y);
mdl2_cv = crossval(mdl_2,'CVPartition',p);
% Make the k-fold predictions
y1 = kfoldPredict(mdl_1);
y2 = kfoldPredict(mdl2_cv);
% See if they are equal -- THEY ARE!
isequal(y1,y2)
If you do not make sure the two models use exactly the same fold assignments, the predictions will not be identical, but they will be statistically equivalent.
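As a minimal sketch of that second point (reusing X, y, and mdl_2 from the code above; pA, pB, cvA, and cvB are new names introduced here just for illustration): predictions from two different partitions disagree observation by observation, while the two k-fold losses land close together.
% Two different randomly drawn fold assignments over the same data
pA = cvpartition(N,'KFold',10);
pB = cvpartition(N,'KFold',10);
% Cross-validate the already-trained model with each partition
cvA = crossval(mdl_2,'CVPartition',pA);
cvB = crossval(mdl_2,'CVPartition',pB);
% The per-observation predictions differ (different fold assignments) ...
isequal(kfoldPredict(cvA),kfoldPredict(cvB))
% ... but the two k-fold losses (MSE) should agree to within sampling error
[kfoldLoss(cvA), kfoldLoss(cvB)]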
3 comments
the cyclist
on 9 May 2025
Edited: the cyclist on 10 May 2025
To make an analogy ...
If you used
N = 1000;
x1 = randn(N,1);
x2 = randn(N,1);
to draw two samples of (pseudo)randomly generated values from a normal distribution, you would not expect those samples to be identical unless you set the seed before each draw, so that you get the same sequence both times. However, you would expect the two samples to have the same statistical properties (the same to within sampling error): the same mean, standard deviation, etc.
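For instance, a quick sketch (using N from above): resetting the seed before each draw produces identical samples:
rng(0); x1 = randn(N,1); % fix the seed, then draw
rng(0); x2 = randn(N,1); % same seed, so the same sequence
isequal(x1,x2)           % returns true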
Similarly, I would not expect your predictions to be identical, but I would expect all of their statistical properties to be the same to within sampling error.
More Answers (0)