seqpdist
Calculate pairwise distance between sequences
Syntax
D = seqpdist(Seqs)
D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...)
D = seqpdist(Seqs, ...'Method', MethodValue, ...)
D = seqpdist(Seqs, ...'Indels', IndelsValue, ...)
D = seqpdist(Seqs, ...'OptArgs', OptArgsValue, ...)
D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...)
D = seqpdist(Seqs, ...'UseParallel', UseParallelValue, ...)
D = seqpdist(Seqs, ...'SquareForm', SquareFormValue, ...)
D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...)
D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...)
D = seqpdist(Seqs, ...'Scale', ScaleValue, ...)
D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...)
D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...)
Input Arguments
Seqs | Any of the following: a cell array of sequences, a vector of structures, or a matrix of sequences. |
MethodValue | Character vector or string that specifies the method to calculate pairwise distances. Default is 'Jukes-Cantor'. |
IndelsValue | Character vector or string that specifies how to treat sites with gaps. Default is 'score'. |
OptArgsValue | Character vector or cell array that specifies one or more input arguments required or accepted by the distance method specified by the Method property. |
PairwiseAlignmentValue | Controls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Choices are true or false. Default is true when the input sequences do not all have the same length, and false when they all have the same length. Tip: If your input sequences are the same length, seqpdist assumes they are aligned. If they are not aligned, either align them before passing them to seqpdist (for example, using the multialign function) or set PairwiseAlignment to true. |
UseParallelValue | Controls the calculation of the pairwise distances using parfor-loops. When true, and Parallel Computing Toolbox™ is installed and a parpool is open, computation occurs in parallel. If no parpool is open, but automatic creation is enabled in the Parallel Preferences, the default pool opens automatically and computation occurs in parallel. If Parallel Computing Toolbox is installed, but no parpool is open and automatic creation is disabled, then computation uses parfor-loops in serial mode. If Parallel Computing Toolbox is not installed, then computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode. |
SquareFormValue | Controls the conversion of the output into a square matrix. Choices are true or false (default). |
AlphabetValue | Character vector or string specifying the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default). |
ScoringMatrixValue | Either of the following:
Note If you need to compile
|
ScaleValue | Positive value that specifies the scale factor used to return the score in arbitrary units. If the scoring matrix information also provides a scale factor, then both are used. |
GapOpenValue | Positive integer that specifies the penalty for opening a gap in the alignment. Default is 8. |
ExtendGapValue | Positive integer that specifies the penalty for extending a gap. Default is equal to GapOpenValue. |
Output Arguments
D | Vector that contains biological distances between each pair of sequences stored in the M elements of Seqs. |
Description
D = seqpdist(Seqs) returns D, a vector containing biological distances between each pair of sequences stored in the M sequences of Seqs, a cell array of sequences, a vector of structures, or a matrix of sequences.
D is a 1-by-(M*(M-1)/2) row vector corresponding to the M*(M-1)/2 pairs of sequences in Seqs. The output D is arranged in the order ((2,1),(3,1),...,(M,1),(3,2),...,(M,2),...,(M,M-1)). This is the lower-left triangle of the full M-by-M distance matrix. To get the distance between the Ith and the Jth sequences for I > J, use the formula D((J-1)*(M-J/2)+I-J).
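The indexing formula can be checked with a few lines of base MATLAB. The sketch below is only an illustration: the value of M and the helper name pairIndex are hypothetical, and no toolbox function is called. It walks the pairs in the documented order and confirms that the formula gives each pair's position in the row vector.

```matlab
M = 5;                                           % hypothetical number of sequences
pairIndex = @(I, J, M) (J-1)*(M - J/2) + I - J;  % formula from the text, valid for I > J

k = 0;                                           % running position in the output vector
for J = 1:M-1
    for I = J+1:M
        k = k + 1;                               % pairs visited as (2,1),(3,1),...,(M,M-1)
        assert(pairIndex(I, J, M) == k);
    end
end

idx42 = pairIndex(4, 2, M);                      % position of the pair (4,2) when M = 5
```

With M = 5, the pair (4,2) is the sixth entry, after (2,1), (3,1), (4,1), (5,1), and (3,2).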
D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...) calls seqpdist with optional properties that use property name/property value pairs. Specify one or more properties in any order. Enclose each PropertyName in single quotation marks. Each PropertyName is case insensitive. These property name/property value pairs are as follows:
D = seqpdist(Seqs, ...'Method', MethodValue, ...) specifies a method to compute distances between each sequence pair. Choices are shown in the following tables.
Methods for Nucleotides and Amino Acids
Method | Description |
---|---|
p-distance | Proportion of sites at which the two sequences are different. p is close to 1 for poorly related sequences, and p is close to 0 for similar sequences. d = p |
Jukes-Cantor (default) | Maximum likelihood estimate of the number of substitutions between two sequences. For nucleotides: d = -3/4 * log(1 - p * 4/3). For amino acids: d = -19/20 * log(1 - p * 20/19). |
alignment-score | Distance (d) between two sequences (1, 2) is computed from the pairwise alignment score between the two sequences (score12), and the pairwise alignment score between each sequence and itself (score11, score22) as follows: d = (1-score12/score11)*(1-score12/score22) d = 0 |
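As a worked illustration of the two simplest entries in the table, the following base-MATLAB sketch computes the p-distance for two hypothetical aligned nucleotide sequences and then applies the standard Jukes-Cantor nucleotide correction, d = -3/4*log(1 - 4p/3). The sequences are made up for the example, and the sketch does not attempt to reproduce how seqpdist handles gaps or ambiguous symbols.

```matlab
% Two hypothetical aligned nucleotide sequences of equal length, with no gaps.
s1 = 'ACGTACGTAC';
s2 = 'ACGTTCGTAA';

% p-distance: proportion of sites at which the two sequences differ.
p = mean(s1 ~= s2);                 % 2 of 10 sites differ, so p = 0.2

% Standard Jukes-Cantor correction for nucleotides.
dJC = -3/4 * log(1 - 4*p/3);        % approximately 0.2326
```

The corrected distance is slightly larger than p because the model accounts for multiple substitutions at the same site.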
Methods with No Scoring of Gaps (Nucleotides Only)
Method | Description |
---|---|
Tajima-Nei | Maximum likelihood estimate considering the background nucleotide frequencies. The frequencies can be computed from the input sequences or given by setting OptArgs to [gA gC gG gT]. gA, gC, gG, gT are scalar values for the nucleotide frequencies. |
Kimura | Considers separately the transitional nucleotide substitution and the transversional nucleotide substitution. |
Tamura | Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content. GC content can be computed from the input sequences or given by setting OptArgs to the proportion of GC content (scalar value from 0 to 1). |
Hasegawa | Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the OptArgs property to [gA gC gG gT]. |
Nei-Tamura | Considers separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the OptArgs property to [gA gC gG gT]. |
Methods with No Scoring of Gaps (Amino Acids Only)
Method | Description |
---|---|
Poisson | Assumes that the number of amino acid substitutions at each site has a Poisson distribution. |
Gamma | Assumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter a. Set a using the OptArgs property. Default is 2. |
You can also specify a user-defined distance function using @, for example, @distfun. The distance function must have the form:
function D = distfun(S1, S2, OptArgsValue)
The distfun function takes the following arguments:
S1, S2: Two sequences of the same length (nucleotide or amino acid).
OptArgsValue: Optional problem-dependent arguments.
The distfun function returns a scalar that represents the distance between S1 and S2.
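A minimal sketch of a user-defined distance follows. The handle name hammingFrac and the sequences are hypothetical, and the sketch assumes that an anonymous function handle with the (S1, S2, optional arguments) shape is accepted in place of a named function file. The handle itself uses only base MATLAB; the seqpdist call requires Bioinformatics Toolbox.

```matlab
% Hypothetical distance function: fraction of sites at which S1 and S2 differ.
% varargin absorbs any optional arguments passed through 'OptArgs'.
hammingFrac = @(S1, S2, varargin) mean(S1 ~= S2);

% The handle can be tried on its own before passing it to seqpdist.
d12 = hammingFrac('ACGTAC', 'ACGTTC');   % 1 of 6 sites differ

% Hypothetical equal-length sequences, which seqpdist treats as already aligned.
seqs = {'ACGTAC', 'ACGTTC', 'ACCTTC'};
D = seqpdist(seqs, 'Method', hammingFrac, 'Alphabet', 'NT');
```

Testing the handle directly on two sequences first makes it easier to separate errors in the distance function from errors in the seqpdist call.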
D = seqpdist(Seqs, ...'Indels', IndelsValue, ...) specifies how to treat sites with gaps. Choices are:
score (default): Scores these sites either as a point mutation or with the alignment parameters, depending on the method selected.
pairwise-del: For every pairwise comparison, it ignores the sites with gaps.
complete-del: Ignores all the columns in the multiple alignment that contain a gap. This option is available only if you provided a multiple alignment as the input Seqs.
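The difference between pairwise and complete deletion can be seen on a small hypothetical alignment. The sketch below uses only base MATLAB and a plain p-distance to show which sites each option keeps. It illustrates the site selection described above and is not a reproduction of seqpdist's internal code.

```matlab
% Hypothetical multiple alignment, one sequence per row, '-' marks a gap.
aln = ['AC-TAC';
       'ACGTTC';
       'A-GTTC'];

% Pairwise deletion for the pair (1,2): drop only sites gapped in row 1 or row 2.
keepPair = aln(1,:) ~= '-' & aln(2,:) ~= '-';          % keeps columns 1,2,4,5,6
pPairwise = mean(aln(1,keepPair) ~= aln(2,keepPair));  % 1 difference in 5 sites

% Complete deletion: drop every column that has a gap in any sequence.
keepAll = all(aln ~= '-', 1);                          % keeps columns 1,4,5,6
pComplete = mean(aln(1,keepAll) ~= aln(2,keepAll));    % 1 difference in 4 sites
```

Complete deletion discards column 2 for the pair (1,2) even though neither of those two sequences has a gap there, which is why the two proportions differ.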
D = seqpdist(Seqs, ...'OptArgs', OptArgsValue, ...) passes one or more arguments required or accepted by the distance method specified by the Method property. Use a character vector or cell array to pass one or more input arguments. For example, provide the nucleotide frequencies for the Tajima-Nei distance method, instead of computing them from the input sequences.
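The Tajima-Nei case mentioned above might look like the following sketch. The sequences and the frequency values are hypothetical, and wrapping the [gA gC gG gT] vector in a cell array is an assumption based on the statement that OptArgs takes a cell array. Requires Bioinformatics Toolbox.

```matlab
% Hypothetical aligned nucleotide sequences (equal length, no gaps).
seqs = {'ACGTACGTAC', 'ACGTTCGTAA', 'ACCTTCGTAA'};

% Assumed background frequencies in the order [gA gC gG gT]; replace with
% values appropriate for your data.
freqs = [0.25 0.25 0.25 0.25];

% Pass the frequencies instead of letting them be computed from seqs.
D = seqpdist(seqs, 'Method', 'Tajima-Nei', ...
             'Alphabet', 'NT', ...
             'OptArgs', {freqs});
```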
D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...) controls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Default is:
true: When the input sequences do not all have the same length.
false: When all input sequences have the same length.
Tip
If your input sequences have the same length, seqpdist assumes they are aligned. If they are not aligned, do one of the following:
Align the sequences before passing them to seqpdist, for example, using the multialign function.
Set PairwiseAlignment to true when using seqpdist.
D = seqpdist(Seqs, ...'UseParallel', UseParallelValue, ...) specifies whether to use parfor-loops when calculating the pairwise distances. When true, and Parallel Computing Toolbox is installed and a parpool is open, computation occurs in parallel. If no parpool is open, but automatic creation is enabled in the Parallel Preferences, the default pool opens automatically and computation occurs in parallel. If Parallel Computing Toolbox is installed, but no parpool is open and automatic creation is disabled, then computation uses parfor-loops in serial mode. If Parallel Computing Toolbox is not installed, then computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode.
D = seqpdist(Seqs, ...'SquareForm', SquareFormValue, ...) controls the conversion of the output into a square matrix such that D(I,J) denotes the distance between the Ith and Jth sequences. The square matrix is symmetric and has a zero diagonal. Choices are true or false (default). Setting SquareForm to true is the same as using the squareform function in Statistics and Machine Learning Toolbox™.
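To make the relationship between the two output shapes concrete, the sketch below rebuilds the square matrix from a row vector using only base MATLAB. The vector D here is a hypothetical stand-in for seqpdist output with M = 4, filled with the values 1 to 6 so that each entry's position is easy to follow.

```matlab
M = 4;                        % hypothetical number of sequences
D = 1:6;                      % stand-in for a 1-by-(M*(M-1)/2) distance vector

Dsq = zeros(M);               % zero diagonal
lowerMask = tril(true(M), -1);
Dsq(lowerMask) = D;           % column-major fill: (2,1),(3,1),(4,1),(3,2),(4,2),(4,3)
Dsq = Dsq + Dsq.';            % mirror the lower triangle to make the matrix symmetric
```

Because logical indexing fills the lower triangle column by column, the entries land in the same order in which the row vector is arranged.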
D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...) specifies the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default).
The remaining input properties are available when the Method property equals 'alignment-score' or the PairwiseAlignment property equals true.
D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...) specifies the scoring matrix to use for the global pairwise alignment. Default is:
'NUC44': When AlphabetValue equals 'NT'.
'BLOSUM50': When AlphabetValue equals 'AA'.
D = seqpdist(Seqs, ...'Scale', ScaleValue, ...) specifies the scale factor used to return the score in arbitrary units. Choices are any positive value. If the scoring matrix information also provides a scale factor, then both are used.
D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...) specifies the penalty for opening a gap in the alignment. Choices are any positive integer. Default is 8.
D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...) specifies the penalty for extending a gap in the alignment. Choices are any positive integer. Default is equal to GapOpenValue.
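The alignment-related properties can be combined in one call, as in the sketch below. The amino acid sequences and the penalty values are hypothetical and chosen only for illustration. Because the sequences differ in length, PairwiseAlignment defaults to true, so each pair is aligned before scoring. Requires Bioinformatics Toolbox.

```matlab
% Hypothetical unaligned amino acid sequences of different lengths.
seqs = {'MKTAYIAKQR', 'MKTAYIAKQRQISFVK', 'MKSAYLAKQR'};

% Score each pairwise alignment with BLOSUM62 and separate gap-opening
% and gap-extension penalties (illustrative values, not recommendations).
D = seqpdist(seqs, 'Method', 'alignment-score', ...
             'Alphabet', 'AA', ...
             'ScoringMatrix', 'BLOSUM62', ...
             'GapOpen', 10, ...
             'ExtendGap', 1);
```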
Examples
Read amino acid alignment data into a MATLAB structure.
seqs = fastaread('pf00002.fa');
For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.
dist = seqpdist(seqs,'Method','alignment-score',...
                'Indels','pairwise-delete',...
                'ScoringMatrix','pam250');
Force the realignment of each sequence pair, ignoring the provided multiple alignment.
dist = seqpdist(seqs,'Method','alignment-score',...
                'Indels','pairwise-delete',...
                'ScoringMatrix','pam250',...
                'PairwiseAlignment',true);
Measure the Jukes-Cantor pairwise distances after realigning each sequence pair, counting the gaps as point mutations.
dist = seqpdist(seqs,'Method','jukes-cantor',...
                'Indels','score',...
                'ScoringMatrix','pam250',...
                'PairwiseAlignment',true);
Version History
Introduced before R2006a