Fit a statistical distribution to truncated data

Sim on 19 Jun 2023
Edited: Sim on 20 Jun 2023
I have a truncated dataset and need to infer the distribution that most likely generated it. Although I only observe the truncated sample rather than the full dataset, I think the best-fitting distribution should be the one that describes the full dataset. It would look something like the blue line in this plot:
Do you have any comments, suggestions, or ideas on how to obtain that blue line?
When I tried to reproduce the blue line from the figure above with the fitdist function, i.e. the best-fitting distribution as if I had the full dataset, I was not successful. Below is a comparison of fitdist applied to the full dataset and to the truncated dataset, both drawn from the same source distribution, i.e. makedist('Normal','mu',3).
% (1) from a normal probability distribution, i.e. "makedist('Normal','mu',3)",
% create:
% (i) a "full dataset" and
% (ii) a set of "truncated data"
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data_full = random(pd,10000,1);
data_trunc = random(t,10000,1);
% (2) fit the normal distribution to
% (i) the "full dataset"
% (ii) the set of "truncated data"
pd_fit_full = fitdist(data_full,'normal');
pd_fit_trunc = fitdist(data_trunc,'normal');
% (3) plot
% (i.a) the "histogram of the full dataset" (from the "full dataset")
% (i.b) the density function corresponding to the distribution that fits the "full dataset"
% (ii.a) the "truncated histogram" (from the "truncated data")
% (ii.b) the density function corresponding to the distribution that fits the "truncated histogram"
xgrid = linspace(0,100,1000)';
hold on
histogram(data_full,100,'Normalization','pdf','facecolor','red')
line(xgrid,pdf(pd_fit_full,xgrid),'Linewidth',2,'color','red')
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,pdf(pd_fit_trunc,xgrid),'Linewidth',2,'color','blue')
hold off
xlim([0 10])

Accepted Answer

Jeff Miller on 20 Jun 2023
If you would like to fit a variety of truncated distributions in addition to the normal, you might find Cupid helpful. For instance, here's an example with a 2-parameter Weibull:
pd = makedist('Weibull','a',3,'b',5);
t = truncate(pd,3,inf);
data_trunc = random(t,10000,1);
% Lower cutoff of 3 is known. Start with
% any reasonable guesses for the Weibull parameters--here, 2 & 2.
fittedDist = TruncatedXlow(Weibull2(2,2),3);
% Now estimate the Weibull parameters by maximum likelihood,
% allowing for the truncation.
fittedDist.EstML(data_trunc);
xgrid = linspace(0,100,1000)';
figure
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
xlim([2.5 6])
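The phrase "maximum likelihood, allowing for the truncation" can be made concrete with the standard textbook likelihood for left-truncated data. The notation below is an assumption introduced for illustration (base density f, base CDF F, parameters θ, known lower cutoff a, sample x_1, ..., x_n); it is not taken from the Cupid source, which may compute the same quantity differently.

```latex
% Density of X conditional on X >= a (left truncation at a known cutoff a)
f_T(x;\theta) = \frac{f(x;\theta)}{1 - F(a;\theta)}, \qquad x \ge a

% Log-likelihood maximized over \theta for a truncated sample x_1,\dots,x_n
\ell(\theta) = \sum_{i=1}^{n} \log f(x_i;\theta) \;-\; n \,\log\bigl(1 - F(a;\theta)\bigr)
```

The second term is what a plain fit ignores: without it, the estimate describes the truncated sample itself rather than the underlying untruncated distribution.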
  1 comment
Sim on 20 Jun 2023
Edited: Sim on 20 Jun 2023
Thanks a lot to everyone, @Jeff Miller, @the cyclist, @Torsten, for your replies and suggestions! They are all great solutions, and I found @Jeff Miller's to be probably the closest to my needs. Let me explain why. Although I asked about fitting a normal distribution to a truncated, normally distributed dataset (which I created artificially), my real cases involve a variety of distributions (often unknown to me) that can be "detected" by the Cupid tool. Therefore, to my understanding, Cupid could be seen as an extension or generalisation of the solutions proposed by @the cyclist (i.e. Fitting a truncated normal (Gaussian) distribution) and @Torsten.
After downloading the Cupid zip folder, I ran @Jeff Miller's solution, which works well:
addpath('.../Cupid-master')
pd = makedist('Weibull','a',3,'b',5);
t = truncate(pd,3,inf);
data_trunc = random(t,10000,1);
% Lower cutoff of 3 is known. Start with
% any reasonable guesses for the Weibull parameters--here, 2 & 2.
fittedDist = TruncatedXlow(Weibull2(2,2),3);
% Now estimate the Weibull parameters by maximum likelihood,
% allowing for the truncation.
fittedDist.EstML(data_trunc);
xgrid = linspace(0,100,1000)';
figure
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
xlim([2.5 6])


More Answers (2)

Torsten on 19 Jun 2023
Edited: Torsten on 19 Jun 2023
Why would it be justified to fit data drawn from a truncated normal with an (untruncated) normal distribution?
pd_fit_trunc = fitdist(data_trunc,'normal');
First complete the dataset "data_trunc" by reflecting it about x = 3, so that it becomes normally distributed. Then you can fit it with a normal distribution:
% (1) from a normal probability distribution, i.e. "makedist('Normal','mu',3)",
% create:
% (i) a "full dataset" and
% (ii) a set of "truncated data"
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data_full = random(pd,10000,1);
data_trunc = random(t,10000,1);
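% Mirror the truncated sample about x = 3; this restores a normal sample
% only because the truncation point coincides with the mean (mu = 3).
data_trunc = [data_trunc;-(data_trunc-3)+3];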
% (2) fit the normal distribution to
% (i) the "full dataset"
% (ii) the set of "truncated data"
pd_fit_full = fitdist(data_full,'normal');
pd_fit_trunc = fitdist(data_trunc,'normal');
% (3) plot
% (i.a) the "histogram of the full dataset" (from the "full dataset")
% (i.b) the density function corresponding to the distribution that fits the "full dataset"
% (ii.a) the "truncated histogram" (from the "truncated data")
% (ii.b) the density function corresponding to the distribution that fits the "truncated histogram"
xgrid = linspace(0,100,1000)';
hold on
histogram(data_full,100,'Normalization','pdf','facecolor','red')
line(xgrid,pdf(pd_fit_full,xgrid),'Linewidth',2,'color','red')
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,pdf(pd_fit_trunc,xgrid),'Linewidth',2,'color','blue')
hold off
xlim([0 10])
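The reflection trick relies on the cut sitting exactly at the mean. A more general route, sketched here under assumptions, is to maximize the likelihood of the truncated density directly with mle and a custom pdf. The names xTrunc, pdf_truncnorm, start, and phat are illustrative, the seed is arbitrary, and sigma = 1 is written out explicitly (it is the makedist default used in the thread). The cutoff xTrunc is assumed known.

```matlab
rng(0);                                  % arbitrary seed, for reproducibility only
xTrunc = 3;                              % known lower truncation point (assumed)
pd = makedist('Normal','mu',3,'sigma',1);
data_trunc = random(truncate(pd,xTrunc,inf),10000,1);

% Density of a normal truncated below at xTrunc: the normal pdf rescaled
% by the probability mass that lies above the cutoff.
pdf_truncnorm = @(x,mu,sigma) normpdf(x,mu,sigma) ./ (1 - normcdf(xTrunc,mu,sigma));

% Starting values from the (biased) moments of the truncated sample.
start = [mean(data_trunc), std(data_trunc)];

% Maximum likelihood estimate of [mu sigma] of the underlying normal;
% sigma is constrained to be nonnegative.
phat = mle(data_trunc,'pdf',pdf_truncnorm,'Start',start,'LowerBound',[-Inf 0]);
```

To overlay the recovered full distribution on a plot, something like line(xgrid,normpdf(xgrid,phat(1),phat(2))) should work; using pdf_truncnorm instead gives the truncated density, which is only meaningful for x >= xTrunc.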
  1 comment
Sim on 20 Jun 2023
Edited: Sim on 20 Jun 2023
@Torsten thanks a lot! It is a very good answer, but it made me wonder: how can we really know what the distribution looks like below the last known value, i.e. 3?
I mean, yes, it works if you already know that the underlying distribution is normal, but what if you have no prior knowledge?
For example, in my real case (not this toy one), I have no such knowledge, and I use a best-fitting-distribution tool to understand approximately how the data could be distributed, at least for the known part, i.e. above the value 3.



the cyclist on 19 Jun 2023
I haven't used it much, but this File Exchange submission seems to do a pretty good job.
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data_trunc = random(t,10000,1);
[norm_trunc, phat, phat_ci] = fitdist_ntrunc(data_trunc, [3, Inf]);
xgrid = linspace(0,100,1000)';
figure
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,norm_trunc(xgrid,phat(1),phat(2)),'Linewidth',2,'color','red')
xlim([0 10])
  1 comment
Sim on 20 Jun 2023
@the cyclist Thanks a lot! But does that also work for other distributions? (Yes, in my question I asked about normally distributed data, but that was just to keep it simple; in practice I might have a variety of distributions.)
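One possible way to extend the same idea to other families using only Statistics and Machine Learning Toolbox functions is to build a truncated pdf for each candidate and pass it to mle. The sketch below is an illustration under assumptions, not part of the File Exchange submission: the cutoff xTrunc is taken as known, the two candidates (Weibull and lognormal) and all variable names are chosen for the example, and the test data reuse the truncated Weibull from the accepted answer.

```matlab
rng(1);                                  % arbitrary seed, for reproducibility only
xTrunc = 3;                              % known lower truncation point (assumed)
data_trunc = random(truncate(makedist('Weibull','a',3,'b',5),xTrunc,inf),10000,1);

% Truncated densities for two candidate families: base pdf divided by the
% upper-tail probability at the cutoff.
pdf_twbl  = @(x,a,b) wblpdf(x,a,b)  ./ wblcdf(xTrunc,a,b,'upper');
pdf_tlogn = @(x,m,s) lognpdf(x,m,s) ./ logncdf(xTrunc,m,s,'upper');

% Allow the simplex search more iterations than the default.
opts = statset('MaxIter',2000,'MaxFunEvals',4000);

% Maximum likelihood fits; starting values are rough guesses.
p_wbl  = mle(data_trunc,'pdf',pdf_twbl,'Start',[mean(data_trunc) 2], ...
             'LowerBound',[0 0],'Options',opts);
p_logn = mle(data_trunc,'pdf',pdf_tlogn, ...
             'Start',[mean(log(data_trunc)) std(log(data_trunc))], ...
             'LowerBound',[-Inf 0],'Options',opts);

% Negative log-likelihood of each fitted truncated model.
nll_wbl  = -sum(log(pdf_twbl(data_trunc,p_wbl(1),p_wbl(2))));
nll_logn = -sum(log(pdf_tlogn(data_trunc,p_logn(1),p_logn(2))));
```

Because both candidates have two parameters, comparing nll_wbl and nll_logn is equivalent to comparing their AIC values; the smaller value indicates the better fit to the observed part of the data. Neither value says anything certain about the shape below the cutoff, which remains an extrapolation from the assumed family.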

