I don't understand the behavior of discretize

Vittorio Picco on 1 Sep 2022
Commented: Vittorio Picco on 1 Sep 2022
I can't understand what discretize does.
Example 1:
[a1, e1] = discretize(1:100,2);
I expect this to create 2 uniform bins, so the edges would be 0, 50, 100. Because each bin includes its left edge but not its right edge (the rule uses < rather than <= on the right), the value 50 goes into bin 2: I get the first 49 points in bin 1 and the last 51 points in bin 2. That's what I see in a1 and e1, and it makes sense to me. (Although I understand the logic, uniform bins to me would mean that both bins should have 50 elements, but OK.)
Example 2:
[a2, e2] = discretize(1:101,2);
The edges returned are 0, 60, 120. The first 59 points end up in bin 1, the remaining 42 in bin 2. This makes no sense to me. The calculated edges make no sense, and the output is clearly bins of non-uniform width. The same output is returned in R2021a.
I must have some fundamental misunderstanding of what is happening.
Torsten on 1 Sep 2022
Edited: Torsten on 1 Sep 2022
"Yes, if I define the edges manually it makes sense, but why does the syntax discretize(1:101,2) produce edges 0, 60 and 120 in the first place?"
Why not? You are able to control the edges, so just do it.
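A minimal sketch of setting the edges by hand, assuming the goal is the 50/51 split the question was hoping for. The edge values [1 51 101] are chosen here for illustration and do not come from the thread; any edges that cover the data would work with the same syntax.

```matlab
x = 1:101;

% Hand-picked edges (an assumption for illustration, not from the thread).
edges = [1 51 101];

% discretize(x, edges) puts x into bin k when edges(k) <= x < edges(k+1);
% the last bin also includes its right edge.
bins = discretize(x, edges);

% Count how many points landed in each bin.
counts = accumarray(bins(:), 1).';

fprintf('counts per bin: %s\n', mat2str(counts));
```

With these edges, the values 1 to 50 fall in bin 1 and 51 to 101 fall in bin 2.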

Walter Roberson on 1 Sep 2022
discretize() invokes matlab.internal.math.binpicker()
... which places the bins at "nice" locations, involving multiples of 10.
This is not documented.

Bruno Luong on 1 Sep 2022
Edited: Bruno Luong on 1 Sep 2022
The doc says:
"discretize divides the data into N bins of uniform width, choosing the bin edges to be "nice" numbers that overlap the range of the data."
Good luck finding an exact specification of "nice". I guess the purpose is that, when you bin the data and then plot it with bar, the bars line up with the round numbers and tick marks on the x-axis.

Steven Lord on 1 Sep 2022
"The calculated edges make no sense, and the output is clearly bins of non-uniform width."
No, in that example the bins are uniformly 60 units wide. Non-uniform bins would be a case like the following:
h = histogram(1:101, [0 50 101]);
h.BinWidth
ans = 'nonuniform'
E = h.BinEdges
E = 1×3
0 50 101
theBinWidths = diff(E) % Different widths
theBinWidths = 1×2
50 51
It seems that your expectation of what is "uniform" is related to the number of points in the bin, and that is not the definition of "uniform" used by histogram, histcounts, or discretize. Their definition of "uniform" uses the distance between edges.
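As a rough check of the two notions of "uniform", the sketch below compares the bin widths with the number of points per bin for the example from the question. The variable names are illustrative; the values in the comments are the ones reported earlier in this thread (edges 0, 60, 120 and a 59/42 split).

```matlab
x = 1:101;

% Ask for 2 bins and let discretize pick the edges.
[bins, edges] = discretize(x, 2);   % thread reports edges 0, 60, 120

% Width-based view: distance between consecutive edges.
widths = diff(edges);               % thread's edges give [60 60]

% Count-based view: number of points that fall in each bin.
counts = accumarray(bins(:), 1).';  % thread reports [59 42]

fprintf('widths: %s\n', mat2str(widths));
fprintf('counts: %s\n', mat2str(counts));
```

The widths are equal even though the counts are not, which is the sense of "uniform" that discretize uses.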
By your definition of "uniform" you could easily encounter a situation where it's impossible to create uniform bins. The obvious case is where the number of bins does not divide the number of points (for example dividing 101 points between 2 bins) but another simple case involves binning 4 points into 2 uniform bins.
[counts, edges] = histcounts([1 1 1 2], 2)
counts = 1×2
3 1
edges = 1×3
1.0000 1.5000 2.0000
Vittorio Picco on 1 Sep 2022
Yes, you are right, I used the word "uniform" a bit freely. It's just so unintuitive: split 101 in 2, who would pick 59 and 42? If MATLAB had picked 50 and 51 I don't think I would have ever asked the question...


R2022a
