**How can inflection points be robustly identified on a cumulative histogram (CDF) curve in MATLAB when only discrete, noisy CDF data is available?**
I have a cumulative distribution curve (CDF) obtained from an image histogram. I only have the CDF data (x = intensity, y = cumulative frequency). I need to determine two inflection points on this curve (one in the early “knee” region and one in the tail/asymptotic region).
What is the correct and robust MATLAB approach to detect inflection points on such a cumulative curve using numerical derivatives (e.g., first/second derivative, curvature), especially when the data is discrete and noisy?
Any example or recommended method would be appreciated.
7 comments
dpb
on 26 Jan 2026
"the correct and robust MATLAB approach to detect inflection points ... when the data is discrete and noisy?"
There probably isn't one that will uniformly be able to do so; it depends heavily on just how noisy the data are and how much data is available.
Attaching at least a couple of representative data sets would certainly be useful for anybody who might be interested in looking at the problem.
My first inclination would probably be to forget about the numerical derivatives and do distribution fitting; hopefully there is at least a class of distributions that the data would be from, not just any random collection of numbers.
Torsten
on 26 Jan 2026
I'm surprised CDF data are still noisy. After all, they are the result of an integration.
dpb
on 27 Jan 2026
How fine the CDF resolution is would play into how much underlying noise the integration smoothing can cover up. Of course, as John points out, finding inflection points in the tails needs more resolution, too.
I'm not at all surprised a CDF is noisy, ESPECIALLY so in the tails. It is where you have little data that the CDF will be bumpy. And in the tails, by definition, you have little data! To give an example...
%relatively little data
X = randn(100,1);
histogram(X,25,'Normalization','cumcount')
Honestly, I don't think you can find anything intelligent from that CDF in terms of where an inflection point might lie. And that is from a distribution as simple and pure as you can get, a STANDARD NORMAL!
% considerably more data
X = randn(1000,1);
histogram(X,100,'Normalization','cumcount')
Again, a purely standard normal distribution. You cannot get anything better behaved than that. But even with 10x more data, there are still considerable bumps and jiggles, flat spots, etc., in the CDF. And the tails are expectedly worse, because that is exactly where you will see rare things happening at random places. One extra point more or less in the +/- 3 or 4 sigma region, and it will strongly impact your ability to know the curve shape there.
The good news is, IF you accept that the distribution is truly normal, then you can use MLE to estimate the distribution parameters, and then use the fitted distribution to approximate everything you want. HOWEVER, what if it is only approximately normal? Out in the tails is where those deviations will show up most strongly. And that is where the OP is asking for information!
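The parametric route described above can be sketched in base MATLAB, without any toolbox. This is only an illustration under the strong assumption that the data are normal, which the thread doubts for image histograms; the variable names `muHat`, `sigmaHat`, `Fhat`, and `xInflection` are made up for the example. For a normal distribution the maximum-likelihood estimates are the sample mean and the standard deviation normalized by N, and the CDF has a single inflection point at the mean.

```matlab
% Sketch only: assumes the data really are normal (a strong assumption).
rng(0);                        % fixed seed so the run is repeatable
X = randn(1000,1);             % stand-in for real data

muHat    = mean(X);            % MLE of the mean
sigmaHat = std(X,1);           % MLE of the standard deviation (divides by N)

% Fitted normal CDF written with erf, which is part of base MATLAB.
Fhat = @(x) 0.5*(1 + erf((x - muHat)./(sigmaHat*sqrt(2))));

% A normal CDF has exactly one inflection point, located at its mean.
xInflection = muHat;
fprintf('Estimated inflection point: %.4f (fitted CDF there = %.2f)\n', ...
        xInflection, Fhat(xInflection));
```

With a single normal component this yields one inflection point, not two, so it only helps if the data justify such a simple model.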
I wouldn't call the CDFs from above "noisy" because noise is usually integrated away. In this case, there are simply not sufficient data to resolve the function.
Further, a CDF is monotonic - it's hard for me to imagine noisy, but monotonic data.
But all in all, it's a matter of definition what "noisy" means.
John D'Errico
on 28 Jan 2026
Edited: John D'Errico
on 28 Jan 2026
As I said, the above data is about as good as it can possibly be, a purely Gaussian data set. Yes, more data would better resolve the CDF.
A CDF will ALWAYS be monotonic. However, what if you insert some outliers in the curve? So insert a bunch of points off center from the Gaussian mode? The CDF will still be perfectly monotonic no matter what. But now you have a mixture CDF of some sort, where you may have multiple components in that mixture.
A problem arises because of the source of the data, that is, it sounds like @Rezwanullah has an image. There is no good reason why the pixel data from an image is Gaussian at all, and certainly it is likely there are spurious outliers, random pixel crap. It is also certain it will be composed of multiple components, this is probably why they are doing this in the first place.
dpb
on 28 Jan 2026
While there undoubtedly is no silver bullet, it would at least be interesting to see a couple representative datasets from OP, who seems to have disappeared.
Answers (1)
John D'Errico
on 26 Jan 2026
Edited: John D'Errico
on 26 Jan 2026
1 vote
I'm sorry, but you are looking for magic that does not exist.
An inflection point is a point where the SECOND derivative changes sign. What you need to understand is numerical differentiation is a noise amplification process. But you don't want to find just the derivative, you want to know (ROBUSTLY) where the derivative of the derivative changes sign. And that is going to be terribly difficult to do in any way that is even remotely robust.
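To make the noise-amplification point concrete, here is a small base-MATLAB sketch (not taken from the original answer) that differentiates a histogram CDF twice and counts how often the second derivative changes sign. The sample size, bin count, and variable names are arbitrary choices for illustration. A true standard normal CDF has exactly one such sign change.

```matlab
% Sketch: count sign changes in a raw finite-difference second derivative.
rng(0);                                    % repeatable example
X = randn(1000,1);                         % clean standard normal sample

% Empirical CDF on 100 equal-width bins; F(i) is the CDF at edges(i+1).
[F, edges] = histcounts(X, 100, 'Normalization', 'cdf');
h = edges(2) - edges(1);                   % uniform bin width

% Differentiate twice with central differences.
d2 = gradient(gradient(F, h), h);

% Count sign changes, ignoring exact zeros (flat stretches of the CDF).
s = sign(d2);
s = s(s ~= 0);
nSignChanges = nnz(diff(s) ~= 0);

fprintf('Sign changes in the raw second derivative: %d (ideal: 1)\n', ...
        nSignChanges);
```

Every one of those sign changes is a candidate "inflection point" as far as a naive algorithm is concerned, which is why some form of smoothing or model fitting has to come first.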
Of course, we don't see your data, so we don't know just how nasty things might get. And we don't know what you mean by robust. Just how robust? Adequate in my eyes might be complete and utter crap to you. Very often the eye can do things that an algorithm will find difficult. You can decide that a certain bump in a curve does not really exist, that it is just noise you are willing to ignore. A computer has a far more difficult time of doing exactly that (despite what the movies and TV shows would suggest).
I would consider two options, both of which will have serious flaws.
- Choose some distribution that will fit your data reasonably well. Fit it to the data CDF using a standard distribution fitting tool, typically using MLE methods. Pick off the inflection points from that fitted curve. Unfortunately, if your data is at all as noisy as you say it is, this will run like a dog with two legs - that is to say, pretty darn poorly. It will very poorly estimate those inflection points on many of your curves, if they have something at all strange happening in the tails. One idea would be to use a tool that can fit a member from the Johnson or Pearson family of distributions, as that would be far more general than forcing the curve to be perhaps strictly Gaussian.
- I might try using a least squares spline fitting tool, like the SLM toolbox I have posted on the File Exchange. Constrain the left end point to have a value of zero and a zero slope, and the right end to have a value of 1 and, again, a zero slope. You could even force the second derivative to be zero at both ends too. Require the curve to be monotone increasing over the entire domain. Now look for points of inflection from that spline fit. The problem is, there may be multiple points of inflection it identifies. Again, noisy data will make your life pure hell.
At least the second option will be fully non-parametric, and it might handle some degree of crapola. It will not pin you down to a specific distribution shape. How much, I can't tell, since we don't see any data! If you do post some examples of the data you have, we can at least offer some better help.
Again, there is no real best practice in this matter. There certainly is no "correct" solution, and the word robust is far too subjective to even make a stab at it.
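For readers without the SLM toolbox, a rough non-parametric stand-in can be sketched with base MATLAB only. To be clear, this is not the shape-constrained spline fit described in the answer: it swaps in plain Gaussian-window smoothing via `smoothdata`, enforces no monotonicity or end conditions, and the window length `win` is a tuning assumption that would need adjusting for real data. The test data is a made-up two-component Gaussian mixture.

```matlab
% Sketch: smooth the CDF, differentiate, and locate sign changes of F''.
rng(1);                                    % repeatable example
X = [randn(1,300), 4 + 2*randn(1,500)];    % hypothetical bimodal sample

[F, edges] = histcounts(X, 100, 'Normalization', 'cdf');
x = edges(2:end);                          % F(i) is the CDF at the right edge
h = x(2) - x(1);                           % uniform spacing

win = 15;                                  % smoothing window in bins (assumed)
Fs  = smoothdata(F, 'gaussian', win);      % smoothed CDF
d1  = smoothdata(gradient(Fs, h), 'gaussian', win);  % smoothed PDF estimate
d2  = gradient(d1, h);                     % second derivative of the CDF

% Indices where d2 changes sign between neighbouring samples.
k = find(d2(1:end-1) .* d2(2:end) < 0);

% Linear interpolation for the zero crossing inside each interval.
xInfl = x(k) - d2(k) .* h ./ (d2(k+1) - d2(k));

disp(xInfl)
```

A wider window suppresses spurious crossings but also shifts and merges real ones, so the list in `xInfl` should be read as candidates to inspect, not as final answers.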
1 comment
John D'Errico
on 27 Jan 2026
Edited: John D'Errico
on 28 Jan 2026
So, I've not seen any response from @Rezwanullah to our requests for data. And without that, there is very little I can do to be truly helpful. Worse, the request is for a PAIR of inflection points, and a typical CDF will have only a single point of inflection, typically around the mode of the PDF. That makes the request harder still. I must presume the PDF is a multi-modal one. And of course, that will make things more difficult yet.
I'll give an example of a simple multi-modal PDF, sample from it, then show how I might solve the problem. I've done the computations offline, so that I could use my SLM toolbox on it.
% X = [randn(1,300) , 4 + 2*randn(1,500)];X = sort(X);
load('bimodaldata.mat')
histogram(X,100, normalization = 'pdf')
As you can see, the data is pretty noisy looking, but it is clearly bi-modal.
histogram(X,100, normalization = 'cdf')
Next, I'll use a spline fit to approximate the CDF. I'm using my SLM toolbox. The constraints applied to the curve were:
A monotone increasing function, equal to zero at the left end, 1 at the right end. The slopes at each end were constrained to be 0. Next, solve for the points of inflection. They are the points where the second derivative is zero. The tool employed was SLMSOLVE, again from my SLM toolbox.
slm = slmengine(edges,counts,knots = 12,plot = 'on', increasing = 'on', leftvalue = 0,rightvalue = 1,leftslope = 0,rightslope = 0);
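The post does not show how `edges` and `counts` were built before the `slmengine` call. One plausible construction, which is purely an assumption and not confirmed by the author, uses `histcounts` with CDF normalization and prepends a zero so that the two vectors have the same length and match the `leftvalue = 0` and `rightvalue = 1` constraints. The sample `X` below is regenerated from the commented-out recipe rather than loaded from `bimodaldata.mat`.

```matlab
% Assumed construction of the inputs; the original post does not show it.
rng(2);                                        % repeatable example
X = sort([randn(1,300), 4 + 2*randn(1,500)]);  % recipe from the commented line

% histcounts returns N cumulative values and N+1 bin edges.
[counts, edges] = histcounts(X, 100, 'Normalization', 'cdf');

% The CDF is 0 at the left-most edge, so prepend it to align with edges.
counts = [0, counts];

fprintf('%d edges, %d CDF values, range [%g, %g]\n', ...
        numel(edges), numel(counts), counts(1), counts(end));
```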
Now, the use of slmsolve identifies all points where the second derivative is zero. The 2 in the call tells it to look at that second derivative. The 0 tells it to find places where that derivative is zero.
slmsolve(slm,0,2)
ans =
0.154558020961149 1.87398530626896 3.97273467783392
The middle point of inflection arises at the cross-over point between the two modes, but I would expect to find one inflection point near 0, and the other near 4, based on the two modes I chose initially.
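As a sanity check on those expectations, the inflection points of the underlying mixture CDF can be computed directly, since the second derivative of a CDF is the first derivative of the PDF. The sketch below assumes the data came from the commented-out generator (300 draws from N(0,1) and 500 from N(4, 2^2)); the weights, the bracketing intervals, and the names `dpdf` and `xInfl` are choices made for this illustration.

```matlab
% Sketch: exact inflection points of the assumed two-Gaussian mixture CDF.
w  = [300 500] / 800;          % mixture weights implied by the sample sizes
mu = [0 4];                    % component means
sg = [1 2];                    % component standard deviations

phi = @(z) exp(-z.^2/2) / sqrt(2*pi);      % standard normal PDF

% Derivative of the mixture PDF, i.e. the second derivative of the CDF.
dpdf = @(x) -w(1)*(x - mu(1))/sg(1)^2 .* phi((x - mu(1))/sg(1))/sg(1) ...
            -w(2)*(x - mu(2))/sg(2)^2 .* phi((x - mu(2))/sg(2))/sg(2);

% Intervals chosen by hand so that dpdf changes sign across each one.
brackets = [0 1; 1 3; 3 5];
xInfl = zeros(1, 3);
for k = 1:3
    xInfl(k) = fzero(dpdf, brackets(k,:));
end

disp(xInfl)
```

Comparing `xInfl` with the three values returned by the spline gives a feel for how much of the discrepancy is due to sampling and smoothing rather than to the method itself.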
Is this a robust solution? I am quite certain it will fail miserably on some data, but it did work reasonably well here, on data that is actually remarkably good, since it was a simple mixture distribution of two purely Gaussian modes. As you can see, the CDF is quite smooth looking. Had the data been contaminated with noise (i.e., outliers), we could have had a problem.