Using kmedoids with custom distance function with several input variables

I have a matrix X = [XCat XNum] where:
XCat is a matrix made of dummy variables resulting from encoding categorical variables
XNum is a matrix of continuous variables.
I want to apply a clustering algorithm that takes into account the categorical nature of some of the features in X. So I created a custom distance function that uses the Hamming distance for the encoded categorical variables (dummies) and the L1 (cityblock) distance for the continuous variables. This is the function:
function D = MixDistance(XCat,XNum)
% Mixed categorical/numerical distance
% INPUT:
% XCat = matrix nObsCat x nFeatures of categorical features
% XNum = matrix nObsNum x nFeatures of numerical features
% OUTPUT:
% D = matrix of distances (nObsCat+nObsNum) x (nObsCat+nObsNum)
% Number of categorical and numerical features
nCat = size(XCat,2);
nNum = size(XNum,2);
% Compute distances, separately
DCat = pdist2(XCat, XCat, 'hamming');
DNum = pdist2(XNum, XNum, 'cityblock');
% Compute relative weight based on the number of categorical variables
wCat = nCat/(nCat + nNum);
D = wCat*DCat + (1 - wCat)*DNum;
Now, one might be tempted to call kmedoids like this:
[IDX, C, SUMD, D, MIDX, INFO] = kmedoids(X,3,'distance', @MixDistance,'replicates',3);
but of course this doesn't work, because MixDistance needs XCat and XNum as inputs, not just X.
Passing the result of a call to MixDistance instead of a function handle doesn't work either:
[IDX, C, SUMD, D, MIDX, INFO] = kmedoids(X,3,'distance', MixDistance(XCat, XNum),'replicates',3);
Any ideas?
Alternatively, any suggestions for clustering mixed data, that is, data that are BOTH categorical AND continuous?
2 Comments
the cyclist on 5 Feb 2021
Can you upload a sample of the X data in a MAT file, to make it easier for folks to investigate?
Raffaele Zenti on 5 Feb 2021
Yes, sure. I have already uploaded a sample of this matrix. The first 16 columns are dummies (i.e., XCat), the others are continuous (i.e., XNum).


Accepted Answer

the cyclist on 6 Feb 2021
I think you need to do something like this. First define your MixDistance function as
function D = MixDistance(X,Y)
% Mixed categorical/numerical distance for use with kmedoids
% INPUT:
% X, Y = matrices of observations with the same columns:
%        the first nCat columns are dummies, the rest are numerical
% OUTPUT:
% D = size(X,1) x size(Y,1) matrix of distances between rows of X and rows of Y
% Number of categorical and numerical features (hard-coded for this data set)
nCat = 16;
nNum = 12;
% Compute distances, separately
DCat = pdist2(X(:,1:nCat), Y(:,1:nCat), 'hamming');
DNum = pdist2(X(:,nCat+1:end), Y(:,nCat+1:end), 'cityblock');
% Compute relative weight based on the number of categorical variables
wCat = nCat/(nCat + nNum);
D = wCat*DCat + (1 - wCat)*DNum;
end
I did two things here. First, I changed the function to accept two arguments, as a distance function needs to.
Second, I explicitly defined the numbers of categorical and numerical columns inside the function. If you don't know those ahead of the function call, you could write some code to figure them out, based on which columns contain only the (0,1) dummy values.
This function will work when called as
[IDX, C, SUMD, D, MIDX, INFO] = kmedoids(X,3,'Distance', @MixDistance,'Replicates',3);
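If you would rather not hard-code the column counts, one possible approach is to detect the dummy columns from the data and capture that information in an anonymous function, so that kmedoids still sees a two-argument distance function. The following is only a minimal sketch under some assumptions: the data X is made up for illustration (2 dummy columns and 2 continuous columns), the names isDummy and distFun are hypothetical, and a column is treated as categorical if it contains only 0s and 1s. It also assumes the data has at least one column of each kind. The handle returns an m-by-1 column, which is the output shape the documentation describes for custom distance functions.

```matlab
% Hypothetical sample data: 2 dummy columns followed by 2 continuous columns
rng(0);                                    % make the random data reproducible
X = [randi([0 1], 30, 2), rand(30, 2)];

% Treat a column as a dummy if every entry is 0 or 1 (assumption)
isDummy = all(X == 0 | X == 1, 1);

% Relative weight based on the number of categorical columns
wCat = nnz(isDummy) / numel(isDummy);

% Two-argument distance function; isDummy and wCat are captured when the
% handle is created. ZI is a 1-by-n row, ZJ is an m-by-n matrix, and the
% result is an m-by-1 column of distances from ZI to each row of ZJ.
distFun = @(ZI, ZJ) ...
    wCat * pdist2(ZJ(:, isDummy), ZI(:, isDummy), 'hamming') + ...
    (1 - wCat) * pdist2(ZJ(:, ~isDummy), ZI(:, ~isDummy), 'cityblock');

% Cluster with the captured distance function
[idx, C] = kmedoids(X, 3, 'Distance', distFun, 'Replicates', 3);
```

Because the anonymous function captures isDummy and wCat at creation time, you need to recreate distFun if the column layout of X changes.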
