Finding optimal regression tree using hyperparameter optimization
3 views (last 30 days)
Nina Buchmann
on 24 May 2017
Commented: Don Mathis
on 25 May 2017
I am calculating propensity scores using fitrensemble. I am interested in finding the tree with the lowest test RMSE (as I am using the resulting model to predict outcomes in a very large second dataset). I am currently using hyperparameter optimization to find the optimal ensemble with the code below:
% Optimize for model
rng default
propensity_final = fitrensemble(X,Y,...
'Learner',templateTree('Surrogate','on'),...
'Weights',W,'OptimizeHyperparameters',{'Method','NumLearningCycles','MaxNumSplits','LearnRate'},...
'HyperparameterOptimizationOptions',struct('Repartition',true,...
'AcquisitionFunctionName','expected-improvement-plus'));
loss_final = kfoldLoss(crossval(propensity_final,'kfold',10));
However, I find that when I do not optimize over 'Method' and instead fit each ensemble method separately, as in either of the runs below, the cross-validation error is lower.
% Bagged
propensity1_bag = fitrensemble(X,Y,...
'Method','Bag',...
'Learner',templateTree('Surrogate','on'),...
'Weights',W,'OptimizeHyperparameters',{'NumLearningCycles','MaxNumSplits'},...
'HyperparameterOptimizationOptions',struct('Repartition',true,...
'AcquisitionFunctionName','expected-improvement-plus'));
loss1_bag = kfoldLoss(crossval(propensity1_bag,'kfold',10));
% LSBoost
propensity1_boost = fitrensemble(X,Y,...
'Method','LSBoost',...
'Learner',templateTree('Surrogate','on'),...
'Weights',W,'OptimizeHyperparameters',{'NumLearningCycles','MaxNumSplits','LearnRate'},...
'HyperparameterOptimizationOptions',struct('Repartition',true,...
'AcquisitionFunctionName','expected-improvement-plus'));
loss1_boost = kfoldLoss(crossval(propensity1_boost,'kfold',10));
What is the objective (best so far and estimated) that the function tries to minimize? And why are loss1_boost and loss1_bag lower than loss_final? How do I know which model to use?
Thank you!
0 comments
Accepted Answer
Don Mathis
on 24 May 2017
Edited: Don Mathis
on 24 May 2017
My guess is that your first run was worse because it was not run for enough iterations. The default MaxObjectiveEvaluations is 30 iterations, but since your first optimization searches a larger space (it includes the categorical 'Method' variable), you should probably multiply that several times over. You are also using 'Repartition',true, which calls for more iterations. Try running it for at least 100 iterations; the more the better, as time permits. You can pass MaxObjectiveEvaluations inside HyperparameterOptimizationOptions.
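For example, a sketch of your first call with the larger budget added (100 is a suggested starting point, not a documented optimum):

```matlab
% Re-run the full search with a larger evaluation budget via
% MaxObjectiveEvaluations inside HyperparameterOptimizationOptions.
rng default
propensity_final = fitrensemble(X,Y,...
    'Learner',templateTree('Surrogate','on'),...
    'Weights',W,...
    'OptimizeHyperparameters',{'Method','NumLearningCycles','MaxNumSplits','LearnRate'},...
    'HyperparameterOptimizationOptions',struct(...
        'Repartition',true,...
        'AcquisitionFunctionName','expected-improvement-plus',...
        'MaxObjectiveEvaluations',100));
```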
The objective being minimized for regression is log(1 + MSE) computed on the validation set; by default that is 5-fold cross-validation. That is mentioned near the bottom of the OptimizeHyperparameters section on this doc page: http://www.mathworks.com/help/stats/fitrensemble.html#input_argument_d0e360201 Your final calls to kfoldLoss return plain MSE, which will differ from the objective function values.
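To compare your kfoldLoss numbers with the values printed during optimization, you can apply the same log(1 + MSE) transform yourself (a sketch; variable names are illustrative):

```matlab
% kfoldLoss returns plain cross-validated MSE for regression.
mse_final = kfoldLoss(crossval(propensity_final,'kfold',10));

% Put it on the optimizer's scale: the objective is log(1 + MSE).
obj_scale = log(1 + mse_final);  % comparable to the "Best observed" value
```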
In any case, you should use the model that has the lowest cross-validated MSE no matter how you found it.
2 comments
Don Mathis
on 25 May 2017
That's the minimum of the Gaussian Process model of the objective function that bayesopt fits under the hood. Noise is estimated and taken into account, so the minimum of the model is usually higher than the best observed value. It's a better estimate of the true minimum than the observed minimum is.
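Both minima can be read off the BayesianOptimization object stored on the fitted model (a sketch; the property names below are as documented for HyperparameterOptimizationResults):

```matlab
% Inspect the Bayesian optimization results attached to the returned model.
results = propensity_final.HyperparameterOptimizationResults;

results.MinObjective              % best observed objective, log(1 + CV MSE)
results.MinEstimatedObjective     % minimum of the fitted GP model of the objective
results.XAtMinEstimatedObjective  % hyperparameters at the estimated minimum
```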
More Answers (0)