Main Content

ecdf

Empirical cumulative distribution function

Description

example

[f,x] = ecdf(y) returns the empirical cumulative distribution function f, evaluated at x, using the data in y.

example

[f,x] = ecdf(y,Name,Value) specifies additional options using one or more name-value arguments. For example, 'Function','survivor' specifies the type of function for f as a survivor function.

example

[f,x,flo,fup] = ecdf(___) also returns the lower and upper confidence bounds for the evaluated function values, using any of the input argument combinations in the previous syntaxes. This syntax is not valid for interval-censored data.

example

ecdf(___) produces a stairstep graph of the evaluated function. The function visualizes interval estimates for interval-censored data using shaded rectangles. You can specify 'Bounds','on' to include the confidence bounds in the graph for fully observed, left-censored, right-censored, and double-censored data.

ecdf(ax,___) plots on the axes specified by ax instead of the current axes (gca).

Examples

collapse all

Compute the Kaplan-Meier estimate of the empirical cumulative distribution function (cdf) for simulated survival data.

Generate survival data from a Weibull distribution with parameters 3 and 1.

rng('default')  % For reproducibility
failuretime = random('wbl',3,1,15,1);

Compute the Kaplan-Meier estimate of the empirical cdf for survival data.

[f,x] = ecdf(failuretime);
[f,x]
ans = 16×2

         0    0.0895
    0.0667    0.0895
    0.1333    0.1072
    0.2000    0.1303
    0.2667    0.1313
    0.3333    0.2718
    0.4000    0.2968
    0.4667    0.6147
    0.5333    0.6684
    0.6000    1.3749
      ⋮

Plot the estimated empirical cdf.

ecdf(failuretime)

Figure contains an axes object. The axes object contains an object of type stair.

Generate right-censored survival data and compare the empirical cumulative distribution function (cdf) with the known cdf.

Generate failure times from an exponential distribution with a mean failure time of 15.

rng('default')  % For reproducibility
y = exprnd(15,75,1);

Generate drop-out times from an exponential distribution with a mean failure time of 30.

d = exprnd(30,75,1);

Generate the observed failure times, that is, the minimum of the generated failure times and the drop-out times.

t = min(y,d);

Create a logical array containing generated failure times that are larger than the drop-out times. The data for which this condition is true is censored.

censored = (y>d);

Compute the empirical cdf and confidence bounds.

[f,x,flo,fup] = ecdf(t,'Censoring',censored);

Plot the empirical cdf and confidence bounds.

ecdf(t,'Censoring',censored,'Bounds','on')
hold on

Figure contains an axes object. The axes object contains 3 objects of type stair.

Superimpose a plot of the known population cdf.

xx = 0:.1:max(t);
yy = 1-exp(-xx/15);
plot(xx,yy,'g-','LineWidth',2)
axis([0 max(t) 0 1])
legend('Empirical cdf','Lower confidence bound', ...
    'Upper confidence bound','Known population cdf', ...
    'Location','southeast')
hold off

Figure contains an axes object. The axes object contains 4 objects of type stair, line. These objects represent Empirical cdf, Lower confidence bound, Upper confidence bound, Known population cdf.

Generate survival data and plot the empirical survivor function with 99% confidence bounds.

Generate lifetime data from a Weibull distribution with parameters 100 and 2.

rng('default')  % For reproducibility
R = wblrnd(100,2,100,1);

Plot the empirical survivor function for the data with 99% confidence bounds.

ecdf(R,'Function','survivor','Alpha',0.01,'Bounds','on')
hold on

Figure contains an axes object. The axes object contains 3 objects of type stair.

Superimpose a plot of the Weibull survivor function.

x = 1:1:250;
wblsurv = 1-cdf('weibull',x,100,2);
plot(x,wblsurv,'g-','LineWidth',2)
legend('Empirical survivor function','Lower confidence bound', ...
    'Upper confidence bound','Weibull survivor function', ...
    'Location','northeast')

Figure contains an axes object. The axes object contains 4 objects of type stair, line. These objects represent Empirical survivor function, Lower confidence bound, Upper confidence bound, Weibull survivor function.

The Weibull survivor function based on the actual distribution is within the confidence bounds.

Compute and plot the cumulative hazard function of simulated double-censored survival data.

Generate failure times from a Birnbaum-Saunders distribution.

rng('default')  % For reproducibility
failuretime = random('BirnbaumSaunders',0.3,1,[100,1]);

Assume that the study starts at time 0.1 and the ends at time 0.9. The assumption implies that failure times less than 0.1 are left censored, and failure times greater than 0.9 are right censored.

Create a vector in which each element indicates the censorship status of the corresponding observation in failuretime. Use –1, 1, and 0 to indicate left-censored, right-censored, and fully observed observations, respectively.

L = 0.1;
U = 0.9;
left_censored = (failuretime<L);
right_censored = (failuretime>U);
c = right_censored - left_censored;

Plot the empirical cumulative hazard function for the data with 95% confidence bounds.

ecdf(failuretime,'Function','cumulative hazard', ...
    'Censoring',c,'Bounds','on')

Figure contains an axes object. The axes object contains 3 objects of type stair.

Compute and plot the empirical cdf of interval-censored data.

Load the cities data set. The data includes ratings for nine different indicators of the quality of life in 329 US cities: climate, housing, health, crime, transportation, education, arts, recreation, and economics. For each indicator, a higher rating is better.

load cities

Select the first indicator (climate) as sample data.

Y = ratings(:,1);

Assume that the indicators in Y are the values rounded to the nearest integer. Then, you can treat values in Y as interval-censored observations. An observation y in Y indicates that the actual rating is between y–0.5 and y+0.5.

Create a matrix in which each row represents the interval surrounding each integer in Y.

intervalY = [Y-0.5, Y+0.5];

Compute the empirical cdf values.

[f,x] = ecdf(intervalY);

Plot the empirical cdf values.

figure
ecdf(intervalY)

Figure contains an axes object. The axes object contains 2 objects of type line, patch.

Zoom into a smaller region to see the interval estimates.

idx_roi = 21:30;
xlim([x(idx_roi(1),1) x(idx_roi(end),2)])

Figure contains an axes object. The axes object contains 2 objects of type line, patch.

Display the corresponding x and f values.

table(idx_roi',x(idx_roi,:),f(idx_roi,:), ...
    'VariableNames',{'Index','x','Empirical cdf F(x)'})
ans=10×3 table
    Index          x           Empirical cdf F(x)
    _____    ______________    __________________

     21      377.5    378.5         0.069909     
     22      382.5    383.5         0.075988     
     23      384.5    385.5         0.079027     
     24      390.5    391.5         0.082067     
     25      395.5    396.5         0.085106     
     26      397.5    398.5         0.091185     
     27      400.5    401.5         0.094225     
     28      401.5    402.5         0.097264     
     29      403.5    404.5          0.10334     
     30      409.5    410.5          0.10638     

The shaded rectangles indicate the change of empirical cdf values F(x) within the corresponding intervals. For example, the second shaded rectangle from the left in the zoomed plot corresponds to the interval (382.5,383.5]. F(382.5) is 0.075988, F(383.5) is 0.079027, and the change from 0.075988 to 0.079027 occurs in the interval (382.5,383.5]. The exact timing of the change is uncertain.

You can plot the interval estimates in different ways. If you assume that the probability change occurs at the start of each interval, you can plot the F(x) values using the first column of x.

figure
stairs(x(:,1),f)
title('Probability changes at the start')
xlabel('x')
ylabel('F(x)')
xlim([x(idx_roi(1),1) x(idx_roi(end),2)])

Figure contains an axes object. The axes object with title Probability changes at the start contains an object of type stair.

Alternatively, you can plot the F(x) values using the second column of x with the assumption that the probability change occurs at the end of each interval.

figure
stairs(x(:,2),f)
title('Probability changes at the end')
xlabel('x')
ylabel('F(x)')
xlim([x(idx_roi(1),1) x(idx_roi(end),2)])

Figure contains an axes object. The axes object with title Probability changes at the end contains an object of type stair.

Combine the previous two plots to visualize the intervals.

figure
stairs(x(:,1),f)
hold on
stairs(x(:,2),f)
title('Probability changes in the interval')
xlabel('x')
ylabel('F(x)')
xlim([x(idx_roi(1),1) x(idx_roi(end),2)])
hold off

Figure contains an axes object. The axes object with title Probability changes in the interval contains 2 objects of type stair.

Compute the empirical cumulative distribution function (cdf) for data, and create a piecewise linear distribution object using an approximation to the empirical cdf.

Load the sample data. Visualize the patient weight data using a histogram.

load patients
histogram(Weight(strcmp(Gender,'Female')))
hold on
histogram(Weight(strcmp(Gender,'Male')))
legend('Female','Male')

Figure contains an axes object. The axes object contains 2 objects of type histogram. These objects represent Female, Male.

The histogram shows that the data has two modes, one for female patients and one for male patients.

Compute the empirical cdf for the data.

[f,x] = ecdf(Weight);

Construct a piecewise linear approximation to the empirical cdf by taking a value every five points.

f = f(1:5:end);
x = x(1:5:end);

Plot the empirical cdf and the approximation.

figure
ecdf(Weight)
hold on
plot(x,f,'ko-','MarkerFace','r') 
legend('Empirical cdf','Piecewise linear approximation', ...
    'Location','best')

Figure contains an axes object. The axes object contains 2 objects of type stair, line. These objects represent Empirical cdf, Piecewise linear approximation.

Create a piecewise linear probability distribution object using the piecewise approximation of the empirical cdf.

pd = makedist('PiecewiseLinear','x',x,'Fx',f)
pd = 
  PiecewiseLinearDistribution

F(111) = 0
F(118) = 0.05
F(124) = 0.13
F(130) = 0.25
F(135) = 0.37
F(142) = 0.5
F(163) = 0.55
F(171) = 0.61
F(178) = 0.7
F(183) = 0.82
F(189) = 0.94
F(202) = 1

Generate 100 random numbers from the distribution.

rng('default') % For reproducibility
rw = random(pd,[100,1]);

Plot the random numbers to visually compare their distribution to the original data.

figure
histogram(Weight)
hold on
histogram(rw)
legend('Original data','Generated data')

Figure contains an axes object. The axes object contains 2 objects of type histogram. These objects represent Original data, Generated data.

The random numbers generated from the piecewise linear distribution have the same bimodal distribution as the original data.

Input Arguments

collapse all

Sample data and censorship information, specified as a vector of sample data or a two-column matrix of sample data and censorship information.

You can specify the censorship information for the sample data by using either the y argument or the Censoring name-value argument. ecdf ignores the Censoring argument value if y is a two-column matrix.

Specify y as a vector or a two-column matrix depending on the censorship types of the observations in y.

  • Fully observed data — Specify y as a vector of sample data.

  • Data that contains fully observed, left-censored, or right-censored observations — Specify y as a vector of sample data, and specify the Censoring name-value argument as a vector that contains the censorship information for each observation. The Censoring vector can contain 0, –1, and 1, which refer to fully observed, left-censored, and right-censored observations, respectively.

  • Data that includes interval-censored observations — Specify y as a two-column matrix of sample data and censorship information. Each row of y specifies the range of possible survival or failure times for each observation, and can have one of these values.

    • [t,t] — Fully observed at t

    • [–Inf,t] — Left-censored at t

    • [t,Inf] — Right-censored at t

    • [t1,t2] — Interval-censored between [t1,t2], where t1 < t2

ecdf ignores NaN values in y. Additionally, any NaN values in the censoring vector (Censoring) or frequency vector (Frequency) cause ecdf to ignore the corresponding rows in y.

Data Types: single | double

Target axes for the figure to which ecdf plots, specified as an Axes object.

For instance, if h is a target Axes object for a figure, then ecdf can plot to that figure as shown in the following example.

Example: ecdf(h,x)

Name-Value Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Censoring',c,'Function','cumulative hazard','Alpha',0.025,'Bounds','on' instructs ecdf to return the cumulative hazard function and the 97.5% confidence bounds, accounting for the censored data specified by vector c.

Type of function returned by ecdf, specified as one of these values.

ValueDescription
'cdf' (default)Cumulative distribution function (cdf)
'survivor'Survivor Function
'cumulative hazard'Cumulative Hazard Function

Example: 'Function','cumulative hazard'

Indicator of censored data, specified as a vector consisting of 0, –1, and 1, which indicate fully observed, left-censored, and right-censored observations, respectively. Each element of the Censoring value indicates the censorship status of the corresponding observation in y. The Censoring value must have the same size as y. The default is a vector of 0s, indicating all observations are fully observed.

You cannot specify interval-censored observations using this argument. If the sample data includes interval-censored observations, specify y using a two-column matrix. ecdf ignores the Censoring value if y is a two-column matrix.

ecdf ignores any NaN values in the censoring vector. Additionally, any NaN values in y or the frequency vector (Frequency) cause ecdf to ignore the corresponding values in the censoring vector.

Example: 'Censoring',censored, where censored is a vector that contains censorship information.

Data Types: logical | single | double

Frequency of observations, specified as a vector of nonnegative integer counts that has the same number of rows as y. The jth element of the Frequency value gives the number of times the jth row of y was observed. The default is a vector of 1s, indicating one observation per row of y.

ecdf ignores any NaN values in this frequency vector. Additionally, any NaN values in y or the censoring vector (Censoring) cause ecdf to ignore the corresponding values in the frequency vector.

Example: 'Frequency',freq, where freq is a vector that contains the observation frequencies.

Data Types: single | double

Maximum number of iterations, specified as a positive integer. This argument is valid only for double-censored data and interval-censored data.

Example: 'IterationLimit',1e5

Data Types: single | double

Termination tolerance on the function value f, specified as a positive scalar. This argument is valid only for double-censored data and interval-censored data.

Example: 'Tolerance',1e-5

Data Types: single | double

Frequency of the iterative convex minorant (ICM) step, specified as a positive integer. This argument is valid only for interval-censored data.

ecdf uses the expectation-maximization iterative convex minorant (EMICM) algorithm [5] to compute the output f for interval-censored data. The EMICM algorithm uses either the EM algorithm or the ICM algorithm at each iteration. ecdf runs the ICM step every specified number of iterations. For example, by default, ecdf iterates the EM step nine times, runs the ICM step once, and then goes back to the EM step.

Example: 'ICMFrequency',1

Data Types: single | double

Significance level for the confidence interval of the evaluated function, specified as a scalar in the range (0,1). The default is 0.05 for 95% confidence. For a given value alpha, the confidence level is 100(1 – Alpha)%.

This argument is not valid for interval-censored data.

Example: 'Alpha',0.01 specifies the confidence level as 99%.

Data Types: single | double

Indicator for including the confidence bounds in the plot, specified as one of these values.

ValueDescription
'off' (default)Omit the confidence bounds.
'on' Include the confidence bounds.

This argument is not valid for interval-censored data.

Note

This argument is valid only for plotting.

Example: 'Bounds','on'

Output Arguments

collapse all

Function values evaluated at the points or intervals in x, returned as a column vector.

  • The point estimate indicates that the function value at x(i) is f(i).

  • The interval estimate indicates that the function value changes from f(i–1) to f(i) within the interval (x(i,1),x(i,2)]. The exact timing of the change is uncertain. For an example, see Empirical cdf of Interval-Censored Data.

The function type of f can be the cdf (default), Survivor Function, or Cumulative Hazard Function, as specified by the Function name-value argument.

Evaluation points or intervals, specified as a column vector or a two-column matrix, respectively.

  • ecdf returns a column vector for fully observed, left-censored, right-censored, and double-censored data.

    • For fully observed, left-censored, and right-censored data, ecdf removes values for censored observations from y, sorts the remaining values, removes duplicate values in the sorted values, and saves the results to the output x.

    • For double-censored data, ecdf determines which values of y correspond to the event times, sorts the values, removes duplicate values in the sorted values, and saves the results to the output x.

    The output x includes the minimum value of y as its first two values. These two values are useful for plotting the outputs of ecdf using the stairs function.

  • ecdf returns a two-column matrix for interval-censored data. ecdf evaluates the function values f at intervals called Turnbull intervals. For details, see Algorithms.

Lower confidence bound for the evaluated function, returned as a column vector. ecdf computes the bound for each observation. flo is not a simultaneous bound for the curve.

This argument is not valid for interval-censored data.

Upper confidence bound for the evaluated function, returned as a column vector. ecdf computes the bound for each observation. fup is not a simultaneous bound for the curve.

This argument is not valid for interval-censored data.

More About

collapse all

Censorship Types

ecdf supports left-censored, right-censored, and interval-censored observations.

  • Left-censored observation at time t — The event occurred before time t, and the exact event time is unknown.

  • Right-censored observation at time t — The event occurred after time t, and the exact event time is unknown.

  • Interval-censored observation within the interval [t1,t2] — The event occurred after time t1 and before time t2, and the exact event time is unknown.

Double-censored data includes both left-censored and right-censored observations.

Survivor Function

The survival function is the probability of survival as a function of time. It is also called the survivor function.

The survival function gives the probability that the survival time of an individual exceeds a certain value. Because the cumulative distribution function F(t) is the probability that the survival time is less than or equal to a given point t in time, the survival function for a continuous distribution S(t) is the complement of the cumulative distribution function: S(t) = 1 – F(t).

Cumulative Hazard Function

The hazard function h(t) is the instantaneous failure rate of an individual conditioned on the fact that the individual survived until a given time. The cumulative hazard function H(t) is the cumulative hazard up to time t.

h(t)=limΔt0P(tT<t+Δt|Tt)Δt,

H(t)=0th(u)du.

The hazard function always takes a positive value. However, these values do not correspond to probabilities and can be greater than 1.

You can obtain the cumulative hazard function values from the Survivor Function S(t) using the relation S(t) = exp(–H(t)).

Algorithms

ecdf computes the function values (f) and the confidence bounds (flo and fup) using different algorithms, depending on the censorship information. The function type of f can be the cdf (default), Survivor Function, or Cumulative Hazard Function, as specified by the Function name-value argument.

Censorship TypeAlgorithm for fAlgorithm for flo and fup
Right-censored data, which contains fully observed or right-censored observations
  • Use the Kaplan-Meier estimator for the cdf and survivor function values.

    The Kaplan-Meier estimator S^(t) is given by

    S^(t)=ti<tridiri,

    where ri is the number of observations at risk at time ti, and di is the number of failures at time ti. For more details, see Kaplan-Meier Method.

  • Use the Nelson-Aalen estimator for the cumulative hazard function values.

    The Nelson-Aalen estimator is given by

    H^(t)=ti<tdiri.

Use Greenwood’s formula, which is an approximation for the variance of the Kaplan-Meier estimator.

The variance estimate is given by

V(S^(t))=S^2(t)ti<tdiri(ridi).

Left-censored data, which contains fully observed or left-censored observations

Use the Kaplan-Meier estimator.

Use Greenwood's formula.

Double-censored data, which includes both right-censored and left-censored observations

Use Turnbull's algorithm [3][4]. You can specify the maximum number of iterations (IterationLimit) and the termination tolerance on the function value (Tolerance) for the algorithm.

Use the Fisher information matrix.

Interval-censored data, which includes interval-censored observations
  • Use the expectation-maximization iterative convex minorant (EMICM) algorithm [5]. The EMICM algorithm uses either the EM algorithm or the ICM algorithm at each iteration. The ICMFrequency name-value argument determines the frequency of the ICM algorithm. ecdf runs the ICM step every specified number of iterations. By default, ecdf iterates the EM step nine times, runs the ICM step once, and then goes back to the EM step. You can specify the maximum number of iterations (IterationLimit) and the termination tolerance on the function value (Tolerance) for the algorithm.

  • ecdf constructs mutually disjoint intervals, called Turnbull intervals, from the two-column matrix data y, and returns the Turnbull intervals (x) and the estimates (f) at the intervals. The left bounds of the intervals are from the first column of y, and the right bounds of the intervals are from the second column of y. For a fully observed observation (for the row with two of the same values [t t]), the function converts [t t] to [t–eps(t) t] to create an interval with nonzero length before constructing the Turnbull intervals.

Not supported

References

[1] Cox, D. R., and D. Oakes. Analysis of Survival Data. London: Chapman & Hall, 1984.

[2] Lawless, J. F. Statistical Models and Methods for Lifetime Data. 2nd ed., Hoboken, NJ: John Wiley & Sons, Inc., 2003.

[3] Klein, John P., and Melvin L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. 2nd ed. Statistics for Biology and Health. New York: Springer, 2003.

[4] Turnbull, Bruce W. "Nonparametric Estimation of a Survivorship Function with Doubly Censored Data." Journal of the American Statistical Association 69, No. 345 (1974): 169–73.

[5] Anderson-Bergman, Clifford. "An Efficient Implementation of the EMICM Algorithm for the Interval Censored NPMLE." Journal of Computational and Graphical Statistics 26, no. 2 (April 3, 2017): 463–67.

[6] Ware, James H., and David L. Demets. "Reanalysis of Some Baboon Descent Data." Biometrics 32, no. 2 (June 1976): 459–63.

Extended Capabilities

Introduced before R2006a