ppca

Probabilistic principal component analysis

Syntax

[coeff,score,pcvar] = ppca(Y,K)

[coeff,score,pcvar] = ppca(Y,K,Name,Value)

[coeff,score,pcvar,mu] = ppca(___)

[coeff,score,pcvar,mu,v,S] = ppca(___)

Description

[coeff,score,pcvar] = ppca(Y,K) returns the principal component coefficients for the n-by-p data matrix Y based on a probabilistic principal component analysis (PPCA). It also returns the principal component scores, which are the representations of Y in the principal component space, and the principal component variances, which are the eigenvalues of the covariance matrix of Y, in pcvar.

Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. Rows of score correspond to observations, and columns correspond to components. Rows of Y correspond to observations and columns correspond to variables.

Probabilistic principal component analysis might be preferable to other algorithms that handle missing data, such as the alternating least squares algorithm, when any data vector has one or more missing values. It assumes that the values are missing at random through the data set. ppca uses an expectation-maximization algorithm for both complete and missing data.

[coeff,score,pcvar] = ppca(Y,K,Name,Value) returns the principal component coefficients, scores, and variances using additional options for computation and handling of special data types, specified by one or more Name,Value pair arguments.

For example, you can specify an initial value for the residual variance, v, or change the termination criteria.

[coeff,score,pcvar,mu] = ppca(___) also returns the estimated mean of each variable in Y. You can use any of the input arguments in the previous syntaxes.

[coeff,score,pcvar,mu,v,S] = ppca(___) also returns the isotropic residual variance in v and the final results at convergence in structure S.

Examples

Perform Probabilistic Principal Component Analysis

Load the sample data.

load fisheriris

The double matrix meas contains four measurements for each flower: sepal length, sepal width, petal length, and petal width.

Introduce missing values randomly.

y = meas;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.20;
y(ix) = NaN;

Now, approximately 20% of the data is missing, indicated by NaN.

Perform probabilistic principal component analysis and request the component coefficients and variances.

[coeff,score,pcvar,mu] = ppca(y,3);
coeff

coeff = 4×3

    0.3562    0.6709   -0.5518
   -0.0765    0.7120    0.6332
    0.8592   -0.1597    0.0596
    0.3592   -0.1318    0.5395

pcvar

pcvar = 3×1

    4.0914
    0.2125
    0.0617

Perform principal component analysis using the alternating least squares algorithm and request the component coefficients and variances.

[coeff2,score2,pcvar2,mu2] = pca(y,'algorithm','als',...
'NumComponents',3);
coeff2

coeff2 = 4×3

    0.3376    0.4952    0.7406
   -0.0731    0.8609   -0.4476
    0.8657   -0.1168   -0.1233
    0.3623   -0.0086   -0.4857

pcvar2

pcvar2 = 3×1

    4.0733
    0.2652
    0.1222

The coefficients and the variances of the first two principal components are similar.

Another way to compare the results is to find the angle between the two spaces spanned by the coefficient vectors.

subspace(coeff,coeff2)

ans = 0.0884

The angle between the two subspaces is small (about 0.09 radians), which indicates that the two results are close to each other.
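
Because the missing values in this example were introduced artificially, the original entries of meas are still available for comparison. The following is a minimal sketch, not part of the original example, that measures how closely the PPCA reconstruction recovers the hidden entries. The variable names rmseHidden, colMean, meanFill, and rmseMeanFill are illustrative, and the column-mean baseline is an assumption added only to give the error a sense of scale.

```matlab
load fisheriris
y = meas;
rng('default');                          % same seed as the example above
ix = random('unif',0,1,size(y)) < 0.20;  % logical mask of entries to hide
y(ix) = NaN;

% The sixth output S holds the results at convergence, including S.Recon.
[~,~,~,~,~,S] = ppca(y,3);

% Root-mean-square error of the PPCA reconstruction on the hidden entries only.
rmseHidden = sqrt(mean((S.Recon(ix) - meas(ix)).^2))

% Baseline for scale: fill each hidden entry with the mean of its column.
colMean = mean(y,1,'omitnan');
meanFill = repmat(colMean,size(y,1),1);
rmseMeanFill = sqrt(mean((meanFill(ix) - meas(ix)).^2))
```

A value of rmseHidden well below rmseMeanFill would suggest that the three components capture structure that a simple column-mean fill misses.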

Change the Termination Criteria for Probabilistic Principal Component Analysis

Load the sample data set.

load imports-85

Data matrix X has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg. The variables bore and stroke are missing four values in rows 56 to 59, and the variables horsepower and peak-rpm are missing two values in rows 131 and 132.

Perform probabilistic principal component analysis and request the first three principal components.

[coeff,score,pcvar] = ppca(X(:,3:15),3);

Warning: Maximum number of iterations 1000 reached.

Change the termination tolerance for the cost function to 0.01.

opt = statset('ppca');
opt.TolFun = 0.01;

Perform probabilistic principal component analysis.

[coeff,score,pcvar] = ppca(X(:,3:15),3,'Options',opt);

ppca now terminates before the maximum number of iterations is reached because it meets the tolerance for the cost function, so it issues no warning.
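
One way to confirm how the run ended is to request the sixth output and inspect its NumIter field. The sketch below assumes the same data and tolerance as above. Setting Display to 'final' is an optional addition that is not part of the original example.

```matlab
load imports-85

opt = statset('ppca');
opt.TolFun = 0.01;        % looser termination tolerance for the cost function
opt.Display = 'final';    % print a summary when the iterations stop

[coeff,score,pcvar,mu,v,S] = ppca(X(:,3:15),3,'Options',opt);

% Number of iterations actually used; compare it with opt.MaxIter.
S.NumIter
```

If S.NumIter is smaller than opt.MaxIter, the algorithm stopped because a tolerance was met rather than because it ran out of iterations.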

Reconstruct Observations

Load the sample data.

load hald
y = ingredients;

The ingredients data has 13 observations for 4 variables.

Introduce missing values to the data.

y(2:16:end) = NaN;

Every 16th value is NaN. This corresponds to 7.69% of the data.

Find the first three principal components of data using PPCA and display the reconstructed observations.

[coeff,score,pcvar,mu,v,S] = ppca(y,3);

Warning: Maximum number of iterations 1000 reached.

S.Recon

ans = 13×4

    6.8536   25.8700    5.8389   59.8730
    1.0433   28.9710   14.9654   51.9738
   11.5770   56.5067    8.6352   20.5076
   11.0835   31.0722    8.0920   47.0748
    7.0679   52.2556    6.0748   33.0598
   11.0486   55.0430    9.0534   22.0423
    2.8493   70.8691   16.8339    5.8656
    1.0333   31.0281   19.6907   44.0306
    2.0400   54.0354   18.0440   22.0349
   20.7822   46.8091    3.7603   25.8081
    0.9539   39.9591   22.9493   31.1542
   10.8250   65.8475    8.8072   11.8445
    9.9172   67.9285    7.9088   11.9257

You can also reconstruct the observations using the principal components and the estimated mean.

t = score*coeff' + repmat(mu,13,1);
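
A common use of the reconstruction is to fill in only the missing entries while leaving the observed values untouched. The following is a minimal sketch under that assumption. The names maxDiff, isMissing, and yFilled are illustrative and do not appear in the original example.

```matlab
load hald
y = ingredients;
y(2:16:end) = NaN;

[coeff,score,pcvar,mu,v,S] = ppca(y,3);

% Manual reconstruction from the scores, coefficients, and estimated mean.
t = score*coeff' + repmat(mu,size(y,1),1);

% Largest absolute difference between the manual reconstruction and S.Recon.
maxDiff = max(abs(t(:) - S.Recon(:)))

% Replace only the missing entries with their reconstructed values.
isMissing = isnan(y);
yFilled = y;
yFilled(isMissing) = S.Recon(isMissing);
```

The value of maxDiff is expected to be close to zero, since S.Recon is described as equal to mu + score*coeff'.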

Results at Convergence

Load the data.

load hald

Here, ingredients is a real-valued matrix of predictor variables.

Perform probabilistic principal component analysis and display the coefficients.

[coeff,score,pcvariance,mu,v,S] = ppca(ingredients,3);

Warning: Maximum number of iterations 1000 reached.

coeff

coeff = 4×3

   -0.0693   -0.6459    0.5673
   -0.6786   -0.0184   -0.5440
    0.0308    0.7552    0.4036
    0.7306   -0.1102   -0.4684

Display the algorithm results at convergence of the PPCA.

S

S = struct with fields:
           W: [4×3 double]
        Xexp: [13×3 double]
       Recon: [13×4 double]
           v: 0.2372
     NumIter: 1000
    RMSResid: 0.2340
      nloglk: 149.3388

Display the matrix W.

S.W

ans = 4×3

    0.5624    2.0279    5.4075
    4.8320  -10.3894    5.9202
   -3.7521   -3.0555   -4.1552
   -1.5144   11.7122   -7.2564

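Orthogonalizing W recovers the coefficients, up to the sign of each column. Here, the first column of orth(S.W) has the opposite sign of the first column of coeff.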

orth(S.W)

ans = 4×3

    0.0693   -0.6459    0.5673
    0.6786   -0.0184   -0.5440
   -0.0308    0.7552    0.4036
   -0.7306   -0.1102   -0.4684
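
Because individual columns can differ in sign, comparing the two matrices element by element is not a robust check. The sketch below shows one sign-independent alternative: it verifies that coeff has orthonormal columns and that coeff and S.W span the same subspace. The variable names orthErr and angleCW are illustrative.

```matlab
load hald
[coeff,~,~,~,~,S] = ppca(ingredients,3);

% Columns of coeff should be orthonormal, so coeff'*coeff is close to eye(3).
orthErr = norm(coeff'*coeff - eye(3))

% Angle in radians between span(coeff) and span(S.W).
% A value near zero indicates that both matrices span the same subspace.
angleCW = subspace(coeff,S.W)
```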

Input Arguments

`Y` — Input data
n-by-p matrix

Input data for which to compute the principal components, specified as an n-by-p matrix. Rows of Y correspond to observations and columns correspond to variables.

Data Types: single | double

`K` — Number of principal components
positive integer value less than rank

Number of principal components to return, specified as a positive integer less than the rank of the data. The maximum possible rank is min(n,p), where n is the number of observations and p is the number of variables. However, if the data is correlated, the rank might be smaller than min(n,p).

ppca orders the components based on their variance.

If K is min(n,p), ppca sets K equal to min(n,p) – 1, and 'W0' is truncated to min(p,n) – 1 columns if you specify a p-by-p W0 matrix.

For example, you can request only the first three components, based on the component variance as follows.

Example: coeff = ppca(Y,3)

Data Types: single | double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'W0',init,'Options',opt specifies that the initial values for 'W0' are in matrix init and ppca uses the options defined by opt.

`W0` — Initial value of W
matrix of random values (default) | p-by-k matrix

Initial value of W in the probabilistic principal component analysis algorithm, specified as a comma-separated pair consisting of 'W0' and a p-by-k matrix.

Data Types: single | double

`v0` — Initial value of residual variance
random number (default) | positive scalar value

Initial value of residual variance, specified as the comma-separated pair consisting of 'v0' and a positive scalar value.

Data Types: single | double
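
Because the default starting values of W and v are random, two calls to ppca on the same data can follow different iteration paths. The following is a minimal sketch of supplying both starting values explicitly so that repeated calls start from the same point. The names Winit, vinit, coeffA, and coeffB are illustrative, and the particular starting values are arbitrary choices, not recommendations.

```matlab
load hald
[n,p] = size(ingredients);
k = 2;

rng(1);                  % fix the seed used to draw the starting matrix
Winit = randn(p,k);      % p-by-k starting value for W
vinit = 1;               % positive scalar starting value for the residual variance

[coeffA,~,pcvarA] = ppca(ingredients,k,'W0',Winit,'v0',vinit);
[coeffB,~,pcvarB] = ppca(ingredients,k,'W0',Winit,'v0',vinit);

% Largest absolute difference between the coefficients of the two runs.
maxCoeffDiff = max(abs(coeffA(:) - coeffB(:)))
```

With identical starting values, the two runs are expected to return the same coefficients.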

`Options` — Options for iterations
structure

Options for the iterations, specified as a comma-separated pair consisting of 'Options' and a structure created by the statset function. ppca uses the following fields in the options structure.

`'Display'`	Level of display output. Choices are `'off'`, `'final'`, and `'iter'`.
`'MaxIter'`	Maximum number of steps allowed. The default is 1000. Unlike in optimization settings, reaching the `MaxIter` value is regarded as convergence.
`'TolFun'`	Positive scalar specifying the termination tolerance for the cost function. The default is 1e-6.
`'TolX'`	Positive scalar specifying the convergence threshold for the relative change in the elements of W. The default is 1e-6.

You can change the values of these fields and specify the new structure in ppca using the 'Options' name-value pair argument.

Example: opt = statset('ppca'); opt.MaxIter = 2000; coeff = ppca(Y,3,'Options',opt);

Data Types: struct

Output Arguments

`coeff` — Principal component coefficients
p-by-k matrix

Principal component coefficients, returned as a p-by-k matrix. Each column of coeff contains coefficients for one principal component. The columns are in the order of descending component variance, pcvar.

`score` — Principal component scores
n-by-k matrix

Principal component scores, returned as an n-by-k matrix. Rows of score correspond to observations, and columns correspond to components.

`pcvar` — Principal component variances
column vector

Principal component variances, which are the eigenvalues of the covariance matrix of Y, returned as a column vector.

`mu` — Estimated mean
row vector

Estimated mean of each variable in Y, returned as a row vector.

`v` — Isotropic residual variance
scalar value

Isotropic residual variance, returned as a scalar value.

`S` — Final results at convergence
structure

Final results at convergence, returned as a structure containing the following fields.

`W`	W at convergence.
`Xexp`	Conditional expectation of the estimated latent variable x.
`Recon`	Reconstructed observations using k principal components. This is a low dimension approximation of the input data `Y`, and is equal to `mu` + `score`*`coeff'`.
`v`	Residual variance.
`RMSResid`	Root mean square of residuals.
`NumIter`	Number of iterations performed.
`nloglk`	Value of the negative loglikelihood function.

More About

Probabilistic Principal Component Analysis

Probabilistic principal component analysis (PPCA) is a method to estimate the principal axes when any data vector has one or more missing values.

PPCA is based on an isotropic error model. It seeks to relate a p-dimensional observation vector y to a corresponding k-dimensional vector of latent (or unobserved) variables x, which is normal with mean zero and covariance I(k). The relationship is

$y^{T} = W x^{T} + \mu + \varepsilon,$

where y is the row vector of observed variables, x is the row vector of latent variables, and ε is the isotropic error term. ε is Gaussian with mean zero and covariance v*I(p), where v is the residual variance and I(p) is the p-by-p identity matrix. Here, k needs to be smaller than the rank for the residual variance to be greater than 0 (v>0). Standard principal component analysis, where the residual variance is zero, is the limiting case of PPCA.

The observed variables, y, are conditionally independent given the values of the latent variables, x. So, the latent variables explain the correlations between the observation variables, and the error explains the variability unique to a particular y_i. The p-by-k matrix W relates the latent and observation variables, and the vector μ permits the model to have a nonzero mean.

PPCA assumes that the values are missing at random through the data set. This means that whether a data value is missing or not does not depend on the latent variable given the observed data values.

Under this model,

$y \sim N(\mu, W W^{T} + v I(p)).$

There is no closed-form analytical solution for W and v, so their estimates are determined by iterative maximization of the corresponding loglikelihood using an expectation-maximization (EM) algorithm. This EM algorithm handles missing values by treating them as additional latent variables. At convergence, the columns of W span the principal subspace, but they are not orthonormal. ppca obtains the orthonormal coefficients, coeff, for the components by orthogonalization of W.
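
The model can be illustrated by simulating from it. The following is a minimal sketch, assuming arbitrary illustrative values for the loading matrix, mean, and residual variance (Wtrue, muTrue, and vTrue are hypothetical names, not ppca outputs). It draws latent variables and isotropic noise, forms the observations, and compares the sample covariance with the covariance implied by the model, which is the loading matrix times its transpose plus the residual variance times the p-by-p identity.

```matlab
rng(0);                          % for reproducibility
n = 5000; p = 4; k = 2;

Wtrue  = randn(p,k);             % hypothetical p-by-k loading matrix
muTrue = [1 2 3 4];              % hypothetical 1-by-p mean (row vector)
vTrue  = 0.1;                    % hypothetical residual variance

x     = randn(n,k);              % latent variables, one row per observation
noise = sqrt(vTrue)*randn(n,p);  % isotropic Gaussian error with variance vTrue

% Row-wise form of the model: each row is x*Wtrue' + muTrue + noise.
Y = x*Wtrue' + repmat(muTrue,n,1) + noise;

% Covariance implied by the model and its sample estimate.
Cmodel  = Wtrue*Wtrue' + vTrue*eye(p);
Csample = cov(Y);
relErr  = norm(Csample - Cmodel,'fro')/norm(Cmodel,'fro')

% Fit PPCA with k components and compare the estimated residual variance with vTrue.
[~,~,~,~,vHat] = ppca(Y,k);
vHat
```

With a large number of observations, relErr is expected to be small and vHat is expected to be near vTrue, although the exact values depend on the random draw.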

References

[1] Tipping, M. E., and C. M. Bishop. “Probabilistic Principal Component Analysis.” Journal of the Royal Statistical Society. Series B (Statistical Methodology). Vol. 61, No. 3, 1999, pp. 611–622.

[2] Roweis, S. “EM Algorithms for PCA and SPCA.” In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems. Vol. 10 (NIPS 1997), Cambridge, MA, USA: MIT Press, 1998, pp. 626–632.

[3] Ilin, A., and T. Raiko. “Practical Approaches to Principal Component Analysis in the Presence of Missing Values.” J. Mach. Learn. Res. Vol. 11, August, 2010, pp. 1957–2000.

Version History

Introduced in R2013a
