Dividing grouped data into equally sized subgroups
24 visualizaciones (últimos 30 días)
Mostrar comentarios más antiguos
Maximilian Hauschel
el 5 de Ag. de 2021
Comentada: Maximilian Hauschel
el 7 de Ag. de 2021
Hello community,
for my current project i need to divide the values of a table for each group (parent-group) into 10 eqaully sized subgroups. The values within a parent-group are arranged hierarchically. By creating the subgroups, the hierarchical order has to remain, so that the first subgroup of a parent-group contains the lowest 10% of the values, the second subgroup of a parent group contains the next larger 10%-stack (>10% - 20%) of the values and so on. In the following, i will give an example:
Data = array2table(rand(1000,1));
Data.Properties.VariableNames{1} = 'Value';
ParentGroups = [1:1:10]';
Data.ParentGroups = repelem(ParentGroups,100);
Data = sortrows(Data,{'ParentGroups','Value'});
My first approach was to create the deciles of each group, which separate each parent-group into 10 equally sized parts. The problem here is, that the function i used simply outputs the values of the deciles, but there is no reference to their location in the original data. For creating the deciles, I used the following code:
decileFnc = @(x) prctile(x,[0:10:100]);
deciles = varfun(decileFnc,Data,'GroupingVariables','ParentGroups','InputVariables',1);
deciles.Fun_Value = string(deciles.Fun_Value);
deciles = [deciles(:,1:2),regexp(deciles.Fun_Value, '\s+', 'split')];
doubleFnc = @(x) double(x);
deciles = [deciles(:,1:2),varfun(doubleFnc,deciles,'InputVariables',3:13)];
The last block of code is used to unstack the output of varfun, i just shared it for the sake of completeness.
I'm really hoping for some help here and appreciate every answer.
2 comentarios
Peter Perkins
el 5 de Ag. de 2021
Maximillian, it's not clear what you are trying to do. Are you trying to tag each of the 1000 rows in the original data with the within-group decile that its value falls in, but preserve the original order of the data?
By the way, see the splitvars function. I thnk it replaces your last chunk of code.
Respuesta aceptada
Más respuestas (1)
Peter Perkins
el 6 de Ag. de 2021
>> x = rand(16,1);
>> g = repmat((1:2)',8,1);
>> t = table(g,x)
t =
16×2 table
g x
_ _______
1 0.81472
2 0.90579
1 0.12699
2 0.91338
1 0.63236
2 0.09754
1 0.2785
2 0.54688
1 0.95751
2 0.96489
1 0.15761
2 0.97059
1 0.95717
2 0.48538
1 0.80028
2 0.14189
Here's a function that takes a vector and returns the quartile bin in which each value falls.
function qbin = fun(x)
q = quantile(x,[0 .25 .5 .75 1]);
qbin = discretize(x,q);
end
Apply that function to each group of rows in the table.
>> tq = rowfun(@fun, t, "GroupingVariables","g")
tq =
16×3 table
g GroupCount Var3
_ __________ ____
1 8 3
1 8 1
1 8 2
1 8 2
1 8 4
1 8 1
1 8 4
1 8 3
2 8 3
2 8 3
2 8 1
2 8 2
2 8 4
2 8 4
2 8 2
2 8 1
rowfun has returned its output table in group order. To attach the quartile bin numbers to the correct rows of the original data, use the second output of sort.
>> [~,ord] = sort(t.g);
>> t.qbin(ord) = tq.Var3
t =
16×3 table
g x qbin
_ _______ ____
1 0.81472 3
2 0.90579 3
1 0.12699 1
2 0.91338 3
1 0.63236 2
2 0.09754 1
1 0.2785 2
2 0.54688 2
1 0.95751 4
2 0.96489 4
1 0.15761 1
2 0.97059 4
1 0.95717 4
2 0.48538 2
1 0.80028 3
2 0.14189 1
To make the result easier to confirm, sort by x within group; notice that the quartile bin numbers are as expected.
sortrows(t,["g" "qbin"])
ans =
16×3 table
g x qbin
_ _______ ____
1 0.12699 1
1 0.15761 1
1 0.63236 2
1 0.2785 2
1 0.81472 3
1 0.80028 3
1 0.95751 4
1 0.95717 4
2 0.09754 1
2 0.14189 1
2 0.54688 2
2 0.48538 2
2 0.90579 3
2 0.91338 3
2 0.96489 4
2 0.97059 4
Ver también
Categorías
Más información sobre Dates and Time en Help Center y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!