Monte Carlo repetitions with customized partitions

k = 5; % number of partitions
c = cvpartition(Labels{2},"KFold",k,"Stratify",true);
test_idx = test(c,"all");
for ii = 1:k
%%%% Divide into train and test set via logical indexing (k columns = k
%%%% partitions). Labels 1 & 3 are always used for testing
testIndices(:,ii) = logical([ones(numel(Labels{1}),1); test_idx(:,ii); ones(numel(Labels{3}),1)]);
end
end
c = cvpartition("CustomPartition",testIndices);
I want to customize partitions for cross-validation, but with some of the samples used for testing in every partition. Is there a way to do this?
I tried using cvpartition, but I can either customize the partitions, in which case I get the error "Each observation must be present in one test set.",
or use Monte Carlo repetitions, which allow samples to be used more than once as a test set, but then I can't customize the sets anymore.
I'm thankful for any hint.

9 comments

Harald on 1 Apr 2024
Hi,
I am not sure if I understand the question correctly: do you want to use some samples always for testing, but never for training? In that case, I'd use cvpartition on all samples except those and would manually add the samples back in that are always supposed to be used for testing. The setdiff function may be helpful for the "all except" part.
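As a rough sketch of that idea (using the Labels cell array from your code; the fold handling and variable names here are hypothetical):

```matlab
% Sketch only: partition just the samples that may be used for training,
% then add the fixed always-test samples back into each fold's test set.
n          = numel(Labels{1}) + numel(Labels{2}) + numel(Labels{3});
alwaysTest = [1:numel(Labels{1}), ...
              numel(Labels{1})+numel(Labels{2})+(1:numel(Labels{3}))];
partIdx    = setdiff(1:n, alwaysTest);   % "all except" the fixed test samples

c = cvpartition(numel(partIdx), "KFold", 5);
for ii = 1:c.NumTestSets
    % Map each fold's test indices back to the full data set and
    % append the always-test samples.
    foldTest = [partIdx(test(c, ii)), alwaysTest];
end
```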
If you get error messages, please always also include the specific code that generated the error message.
Best wishes,
Harald
Correct. I want to use some samples always for testing & another SET of samples which is randomly partitioned into training & testing samples. I think I am already doing what you suggested (see code). But when I want to add everything back into a second cvpartition object, it gives the error message. I ultimately want to use the cvpartition object for the "Sequential Feature Selection" function; there I need to pass the cvpartition object as described.
I am assuming that by the "Sequential Feature Selection" function, you mean sequentialfs.
Not sure if that will work and behave as you intend, but consider creating the cvpartition object only from Labels{2} and then customizing it the way you want and adding the additional test samples in when defining the function for criterion selection. Assuming your current function is called myFun, you could then use an anonymous function handle to include the samples that need to always be included:
fun = @(XTrain,yTrain,XTest,yTest) myFun(XTrain,yTrain, [XTest; XAlwaysIn], [yTest; yAlwaysIn])
Best wishes,
Harald
Tobias Rieker on 3 Apr 2024
Edited: Tobias Rieker on 3 Apr 2024
Hm, that makes sense. But I don't know where to add the anonymous function handle. This is my function:
[toKeep, ranking] = sequentialfs(@errorFun,XTrain,yTrain,"cv",c,"nfeatures",nfeatures,"options",opts);
function error = errorFun(xtrain,ytrain,xtest,ytest)
% Create the model with the learning method of choice
classifier = fitcdiscr(xtrain,ytrain);
% Calculate the number of test observations misclassified
ypred = predict(classifier,xtest);
error = nnz(ypred ~= ytest);
end
Suggestion based on this:
fun = @(XTrain,yTrain,XTest,yTest) errorFun(XTrain,yTrain, [XTest; XAlwaysIn], [yTest; yAlwaysIn]);
[toKeep, ranking] = sequentialfs(fun,X,y,"cv",c,"nfeatures",nfeatures,"options",opts);
Basically, the anonymous function handle serves to modify the inputs to your function.
In the call to sequentialfs, I believe you need to pass all X and y values, not just the training data. Otherwise, sequentialfs will not have access to the test data. It might be that you are already doing this and that I was just misled by the choice of variable names XTrain and yTrain in the call to sequentialfs.
Best wishes,
Harald
Tobias Rieker on 3 Apr 2024
Edited: Tobias Rieker on 3 Apr 2024
I tried it out, but it still doesn't work:
SFS_xtrain is the data to be partitioned (KFold = 5). Then, in every fold, the data that is only to be tested is concatenated (= SFS_xAlwaysIn...).
I have no idea why the arrays are not consistent.
k = 5;
c = cvpartition(SFS_ytrain,"KFold", k ,"Stratify",true);
opts = statset("UseParallel",true);
fun = @(XTrain,yTrain,XTest,yTest) errorFun(XTrain,yTrain, [XTest; SFS_xAlwaysIn], [yTest; SFS_yAlwaysIn]); %%%% FOR CV
[toKeep, ranking] = sequentialfs(fun,SFS_xtrain,SFS_ytrain,"cv",c,"nfeatures",nfeatures,"options",opts);
function error = errorFun(XTrain,yTrain,XTest,yTest)
% Create the model with the learning method of your choice
classifier = fitcdiscr(XTrain,yTrain);
% Calculate the number of test observations misclassified
ypred = predict(classifier,XTest);
error = nnz(ypred ~= yTest);
end
______
Error using crossval>evalFun
The function '@(XTrain,yTrain,XTest,yTest)errorFun(XTrain,yTrain,[XTest;SFS_xAlwaysIn],[yTest;SFS_yAlwaysIn])' generated
the following error:
Dimensions of arrays being concatenated are not consistent.
Error in crossval>getFuncVal (line 509)
funResult = evalFun(funorStr,arg(:));
Error in crossval (line 355)
funResult = getFuncVal(1, nData, cvp, data, funorStr, []);
Error in sequentialfs>callfun (line 500)
funResult = crossval(fun,x,other_data{:},...
Error in sequentialfs (line 368)
crit(k) = callfun(fun,x,other_data,cv,mcreps,ParOptions);
Harald on 3 Apr 2024
Edited: Harald on 3 Apr 2024
Without access to your complete code and data, it is hard for me to tell what is going on.
I would think that XTest and SFS_xAlwaysIn have different numbers of columns, or that the same happens for yTest and SFS_yAlwaysIn.
For easier debugging, set UseParallel to false for the moment. Then set a breakpoint in the anonymous function to view the dimensions of the variables.
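As a sketch, you could also route the call through a small wrapper (name hypothetical) that prints the sizes crossval passes in, before the concatenation fails:

```matlab
% Hypothetical debugging wrapper: print the dimensions before concatenating,
% so a column-count mismatch is visible before the error is thrown.
fun = @(XTrain,yTrain,XTest,yTest) checkedErrorFun(XTrain,yTrain,XTest,yTest, ...
                                                   SFS_xAlwaysIn,SFS_yAlwaysIn);

function err = checkedErrorFun(XTrain,yTrain,XTest,yTest,xAlwaysIn,yAlwaysIn)
    fprintf("XTest: %d x %d, SFS_xAlwaysIn: %d x %d\n", ...
        size(XTest,1),size(XTest,2),size(xAlwaysIn,1),size(xAlwaysIn,2));
    err = errorFun(XTrain,yTrain,[XTest; xAlwaysIn],[yTest; yAlwaysIn]);
end
```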
For further assistance, please provide a fully reproducible example, including sample data.
Best wishes,
Harald
Indeed, SFS_xAlwaysIn has more columns than XTest. The reason is that sequentialfs chooses 1,2,3...n columns (features) of the data to be tested against the testing data. This feature selection doesn't happen with the concatenated testing data SFS_xAlwaysIn, though, as it is "externally added". Is there a way to automatically choose the same features for SFS_xAlwaysIn?
Thank you for your big help so far. It is highly appreciated
Harald
Harald el 3 de Abr. de 2024
Duh... that makes sense. I suppose it will take some fiddling to address this.
I would try this strategy:
  • To be able to determine which columns were chosen, add fake data 1:numColumns on top of the x-values, and some nonsense value that does not appear in your y-values on top of the y-values that you supply to sequentialfs.
  • Identify which of the y-values passed to the function (either yTrain or yTest) contains the nonsense value. Extract the corresponding row of x-values from xTrain or xTest. This will tell you which columns were sent into the function.
  • Extract the corresponding columns from SFS_xAlwaysIn and add it to the test data. Be sure to remove the fake data of the first step.
I expect this to be somewhat tricky and would be happy to try to help, but would really need some sample data for SFS_xtrain and SFS_ytrain to play with. Perhaps I should be able to infer this, but I am not even sure of the data type of SFS_ytrain.
Best wishes,
Harald


 Accepted Answer

Harald on 4 Apr 2024
I have now tried the approach discussed in the comments with sample data based on fisheriris.mat.
%% Sample data
load fisheriris.mat
species = categorical(species);
% Shuffle data
order = randperm(length(species));
meas = meas(order,:);
species = species(order,:);
SFS_xtrain = meas(1:130,:);
SFS_ytrain = species(1:130);
SFS_xAlwaysIn = meas(131:end,:);
SFS_yAlwaysIn = species(131:end);
%% Add fake data
SFS_xtrain = [1:size(SFS_xtrain, 2); SFS_xtrain];
SFS_ytrain = ["nonsense"; SFS_ytrain];
%% Your code (for now without setting "nfeatures" and "options")
k = 5;
c = cvpartition(SFS_ytrain,"KFold", k ,"Stratify",true);
% opts = statset("UseParallel",true);
fun = @(XTrain,yTrain,XTest,yTest) callErrorFun(XTrain,yTrain, XTest, yTest, SFS_xAlwaysIn, SFS_yAlwaysIn);
[toKeep, ranking] = sequentialfs(fun,SFS_xtrain,SFS_ytrain,"cv",c);
%% A helper function
function err = callErrorFun(XTrain,yTrain, XTest, yTest, SFS_xAlwaysIn, SFS_yAlwaysIn)
if sum(yTrain == "nonsense") == 1
idx = yTrain == "nonsense";
columns = XTrain(idx, :);
XTrain(idx,:) = [];
yTrain(idx) = [];
elseif sum(yTest == "nonsense") == 1
idx = yTest == "nonsense";
columns = XTest(idx, :);
XTest(idx,:) = [];
yTest(idx) = [];
else
error("Something unexpected happened. Revisit the approach...")
end
XTrain = [XTrain; SFS_xAlwaysIn(:, columns)];
yTrain = [yTrain; SFS_yAlwaysIn];
err = errorFun(XTrain,yTrain,XTest,yTest);
end
%% Your function
function error = errorFun(XTrain,yTrain,XTest,yTest)
% Create the model with the learning method of your choice
classifier = fitcdiscr(XTrain,yTrain);
% Calculate the number of test observations misclassified
ypred = predict(classifier,XTest);
error = nnz(ypred ~= yTest);
end
I hope you'll find this to be helpful.
Best wishes,
Harald

2 comments

Tobias Rieker on 4 Apr 2024
Edited: Tobias Rieker on 4 Apr 2024
Thanks to your above-mentioned idea, I have now figured it out. Thank you!
This is my approach:
% add fake data on top of the columns
num_col = 1:numel(EMG_chanels_remaining); %EMG_chanels_remaining = number of features
SFS_xtrain = [num_col;SFS_xtrain];
SFS_ytrain = [categorical(1);SFS_ytrain];
fun = @(XTrain,yTrain,XTest,yTest) errorFun(XTrain,yTrain, XTest, SFS_xAlwaysIn, yTest,SFS_yAlwaysIn); %%%% FOR CV
% CVpartition Object
k = 10;
c = cvpartition(SFS_ytrain,"KFold", k ,"Stratify",true);
[toKeep, ranking] = sequentialfs(fun,SFS_xtrain,SFS_ytrain,"cv",c,"nfeatures",nfeatures,"options",opts);
function error = errorFun(XTrain,yTrain, XTest, SFS_xAlwaysIn, yTest,SFS_yAlwaysIn)
%Find where fake data is & extract columns of first row = the features
%included in SFS
if ismember(yTrain(1,:),categorical(1:256))
Ch_count = XTrain(1,:);
XTrain(1,:) = [];
yTrain(1,:) = [];
else
Ch_count = XTest(1,:);
XTest(1,:) = [];
yTest(1,:) = [];
end
% add the data that is only to be tested, with the corresponding columns (features)
XTrain = [XTrain;SFS_xAlwaysIn(:,Ch_count)];
yTrain = [yTrain;SFS_yAlwaysIn];
classifier = fitcdiscr(XTrain,yTrain);
% Calculate the number of test observations misclassified
ypred = predict(classifier,XTest);
error = nnz(ypred ~= yTest);
end
Harald on 4 Apr 2024
Glad it's working for you! If you found the answer to be helpful, please consider "accept"-ing it.
Best wishes,
Harald


More Answers (0)
