cuffcompare

Compare assembled transcripts across multiple experiments

Syntax

statsFile = cuffcompare(gtfFiles)

statsFile = cuffcompare(gtfFiles,compareOptions)

statsFile = cuffcompare(gtfFiles,Name,Value)

[statsFile,combinedGTF,lociFile,trackingFile] = cuffcompare(___)

Description

statsFile = cuffcompare(gtfFiles) compares the assembled transcripts in gtfFiles and returns summary statistics in the output file statsFile [1].

cuffcompare requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download link. For details, see Bioinformatics Toolbox Software Support Packages.

statsFile = cuffcompare(gtfFiles,compareOptions) uses additional options specified by compareOptions.

statsFile = cuffcompare(gtfFiles,Name,Value) uses additional options specified by one or more name-value pair arguments. For example, statsFile = cuffcompare(gtfFiles,'OutputPrefix',"cuffComp") uses "cuffComp" as the prefix of the output file names.

[statsFile,combinedGTF,lociFile,trackingFile] = cuffcompare(___) returns the names of the output files using any of the input argument combinations in the previous syntaxes. By default, the function saves all files to the current directory.
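
As a minimal sketch of the four-output syntax, the following code captures all of the output file names and checks that each file exists in the current directory. The GTF file names are placeholders taken from the gtfFiles example on this page, and the code assumes that the Cufflinks support package is installed.

```matlab
% Placeholder GTF files (assumed to have been produced by cufflinks).
gtfs = ["Myco_1_1.transcripts.gtf","Myco_2_1.transcripts.gtf"];

% Request all four output file names and set a custom prefix.
[statsFile,combinedGTF,lociFile,trackingFile] = ...
    cuffcompare(gtfs,'OutputPrefix',"cuffComp");

% Report whether each returned file is present on disk.
outputs = [statsFile,combinedGTF,lociFile,trackingFile];
for k = 1:numel(outputs)
    fprintf("%s exists: %d\n",outputs(k),isfile(outputs(k)));
end
```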

Examples

Assemble Transcriptome and Perform Differential Expression Testing

Create a CufflinksOptions object to define cufflinks options, such as the number of parallel threads and the output directory to store the results.

cflOpt = CufflinksOptions;
cflOpt.NumThreads = 8;
cflOpt.OutputDirectory = "./cufflinksOut";

The SAM files provided for this example contain aligned reads for Mycoplasma pneumoniae from two samples with three replicates each. The reads are simulated 100-bp reads for two genes (gyrA and gyrB) located next to each other on the genome. All the reads are sorted by reference position, as required by cufflinks.

sams = ["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam",...
        "Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"];

Assemble the transcriptome from the aligned reads.

[gtfs,isofpkm,genes,skipped] = cufflinks(sams,cflOpt);

gtfs is a list of GTF files that contain assembled isoforms.

Compare the assembled isoforms using cuffcompare.

stats = cuffcompare(gtfs);

Merge the assembled transcripts using cuffmerge.

mergedGTF = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput');

mergedGTF reports only one transcript. This is because the two genes of interest are located next to each other, and cuffmerge cannot distinguish two distinct genes. To guide cuffmerge, use a reference GTF (gyrAB.gtf) containing information about these two genes. If the file is not located in the same directory that you run cuffmerge from, you must also specify the file path.

gyrAB = which('gyrAB.gtf');
mergedGTF2 = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput2',...
                       'ReferenceGTF',gyrAB);

Calculate abundances (expression levels) from aligned reads for each sample.

abundances1 = cuffquant(mergedGTF2,["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
                        'OutputDirectory','./cuffquantOutput1');
abundances2 = cuffquant(mergedGTF2,["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"],...
                        'OutputDirectory','./cuffquantOutput2');

Assess the significance of changes in expression for genes and transcripts between conditions by performing differential testing with cuffdiff. The cuffdiff function operates in two distinct steps: the function first estimates abundances from aligned reads, and then performs the statistical analysis. In some cases (for example, when distributing the computing load across multiple workers), performing the two steps separately is desirable. After performing the first step with cuffquant, you can use the binary CXB output file as an input to cuffdiff to perform the statistical analysis. Because cuffdiff returns several files, specifying the output directory is recommended.

isoformDiff = cuffdiff(mergedGTF2,[abundances1,abundances2],...
                      'OutputDirectory','./cuffdiffOutput');

Display a table containing the differential expression test results for the two genes gyrB and gyrA.

readtable(isoformDiff,'FileType','text')

ans =

  2×14 table

        test_id            gene_id        gene              locus             sample_1    sample_2    status     value_1       value_2      log2_fold_change_    test_stat    p_value    q_value    significant
    ________________    _____________    ______    _______________________    ________    ________    ______    __________    __________    _________________    _________    _______    _______    ___________

    'TCONS_00000001'    'XLOC_000001'    'gyrB'    'NC_000912.1:2868-7340'      'q1'        'q2'       'OK'     1.0913e+05    4.2228e+05          1.9522           7.8886      5e-05      5e-05        'yes'   
    'TCONS_00000002'    'XLOC_000001'    'gyrA'    'NC_000912.1:2868-7340'      'q1'        'q2'       'OK'     3.5158e+05    1.1546e+05         -1.6064          -7.3811      5e-05      5e-05        'yes'
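
If you want to work with the results programmatically rather than only display them, one possible follow-up is to filter the table by q-value. This sketch reuses the isoformDiff variable from the previous step and the column names shown in the displayed table. The 0.05 cutoff is an illustrative assumption, not a value prescribed by cuffdiff.

```matlab
% Read the cuffdiff results into a table.
results = readtable(isoformDiff,'FileType','text');

% Keep rows whose q-value is below an assumed cutoff of 0.05.
isSig = results.q_value < 0.05;

% Show the gene name, fold change, and q-value for the selected rows.
results(isSig,{'gene','log2_fold_change_','q_value'})
```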

You can use cuffnorm to generate normalized expression tables for further analyses. cuffnorm results are useful when you have many samples and you want to cluster them or plot expression levels for genes that are important in your study. Note that you cannot perform differential expression analysis using cuffnorm.

Specify a cell array, where each element is a string vector containing file names for a single sample with replicates.

alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
                  ["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"]}
isoformNorm = cuffnorm(mergedGTF2, alignmentFiles,...
                      'OutputDirectory', './cuffnormOutput');

Display a table containing the normalized expression levels for each transcript.

readtable(isoformNorm,'FileType','text')

ans =

  2×7 table

      tracking_id          q1_0          q1_2          q1_1          q2_1          q2_0          q2_2   
    ________________    __________    __________    __________    __________    __________    __________

    'TCONS_00000001'    1.0913e+05         78628    1.2132e+05    4.3639e+05    4.2228e+05    4.2814e+05
    'TCONS_00000002'    3.5158e+05    3.7458e+05    3.4238e+05    1.0483e+05    1.1546e+05    1.1105e+05

Column names starting with q have the format: conditionX_N, indicating that the column contains values for replicate N of conditionX.
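
The naming convention can be unpacked in code. The following sketch splits the column names from the table above into a condition label and a replicate number. The variable names are illustrative only.

```matlab
% Column names copied from the cuffnorm table above.
colNames = ["q1_0";"q1_2";"q1_1";"q2_1";"q2_0";"q2_2"];

% Split each name at the underscore: column 1 is the condition,
% column 2 is the replicate index.
parts = split(colNames,"_");
condition = parts(:,1);
replicate = str2double(parts(:,2));

% Collect the pieces in a table for inspection.
summary = table(colNames,condition,replicate)
```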

Input Arguments

`gtfFiles` — Names of GTF files
string array | cell array of character vectors

Names of GTF files, specified as a string vector or cell array of character vectors. Each GTF file corresponds to a sample produced by cufflinks.

Example: ["Myco_1_1.transcripts.gtf","Myco_2_1.transcripts.gtf"]

Data Types: string | cell

`compareOptions` — `cuffcompare` options
`CuffCompareOptions` object | character vector | string

cuffcompare options, specified as a CuffCompareOptions object, character vector, or string. The character vector or string must be in the original cuffcompare option syntax (prefixed by one or two dashes), such as '-d 100 -e 80' [1].
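
The two ways of supplying compareOptions might look as follows. This is a sketch that assumes the CuffCompareOptions object exposes properties with the same names as the name-value arguments listed on this page, and that the native flags -d and -e correspond to MaxGroupingRange and MaxAccuracyRange, respectively. The gtfs variable is assumed to hold GTF file names produced by cufflinks.

```matlab
% Option 1: configure a CuffCompareOptions object.
opt = CuffCompareOptions;
opt.MaxGroupingRange = 100;   % assumed equivalent of the native -d flag
opt.MaxAccuracyRange = 80;    % assumed equivalent of the native -e flag
statsFile = cuffcompare(gtfs,opt);

% Option 2: pass the same settings in the native option syntax.
statsFile = cuffcompare(gtfs,'-d 100 -e 80');
```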

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: statsFile = cuffcompare(gtfFiles,'OutputPrefix',"cuffComp",'MaxGroupingRange',90)
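
To illustrate the two calling conventions side by side, the following sketch passes the same options in both forms. It assumes gtfs holds GTF file names produced by cufflinks. Both calls write to the same output files, so in practice you would use only one of them.

```matlab
% R2021a and later: Name=Value syntax.
statsFile = cuffcompare(gtfs,OutputPrefix="cuffComp",MaxGroupingRange=90);

% Before R2021a: comma-separated pairs with quoted names.
statsFile = cuffcompare(gtfs,'OutputPrefix',"cuffComp",'MaxGroupingRange',90);
```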

`ConsensusPrefix` — Prefix for consensus transcript names
`"TCONS"` (default) | string | character vector

Prefix for consensus transcript names in the output combined.gtf file, specified as a string or character vector. This option must be a string or character vector with a non-zero length.

Example: 'ConsensusPrefix',"consensusTs"

Data Types: char | string

`DiscardIntronRedundant` — Flag to ignore intron-redundant transfrags
`false` (default) | `true`

Flag to ignore intron-redundant transfrags if they have the same 5' end but different 3' ends, specified as true or false.

Example: 'DiscardIntronRedundant',true

Data Types: logical

`DiscardSingleExonAll` — Flag to discard single-exon transfrags and reference transcripts
`false` (default) | `true`

Flag to discard single-exon transfrags and reference transcripts, specified as true or false.

Example: 'DiscardSingleExonAll',true

Data Types: logical

`DiscardSingleExonReference` — Flag to discard single-exon reference transcripts
`false` (default) | `true`

Flag to discard single-exon reference transcripts, specified as true or false.

Example: 'DiscardSingleExonReference',true

Data Types: logical

`ExtraCommand` — Additional commands
`""` (default) | character vector | string

Additional commands, specified as a character vector or string. The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB® properties.

Example: 'ExtraCommand',"--library-type fr-secondstrand"

Data Types: char | string

`GTFManifest` — Name of text file containing list of GTF files to process
string | character vector

Name of the text file containing a list of GTF files to process, specified as a string or character vector. The file must contain one GTF file path per line. You can use this option as an alternative to passing an array of file names to cuffcompare.

Example: 'GTFManifest',"gtfManifestFile.txt"

Data Types: char | string
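
A manifest file in the required one-path-per-line layout can be written with standard file I/O. In this sketch, the GTF file names are placeholders and the manifest name matches the example above. Only the creation of the file is shown. You would then pass its name as the GTFManifest value.

```matlab
% Placeholder GTF file names (assumed to exist on disk).
gtfs = ["Myco_1_1.transcripts.gtf";"Myco_2_1.transcripts.gtf"];
manifest = "gtfManifestFile.txt";

% Write one GTF file path per line.
fid = fopen(manifest,'w');
assert(fid ~= -1,"Could not open %s for writing.",manifest);
for k = 1:numel(gtfs)
    fprintf(fid,"%s\n",gtfs(k));
end
fclose(fid);

% Display the manifest contents to confirm the layout.
type(manifest)
```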

`GenericGFF` — Flag to treat input GTF files as GFF
`false` (default) | `true`

Flag to treat input GTF files as GFF files, specified as true or false. Use this option when the input GFF or GTF files are not produced by cufflinks.

Example: 'GenericGFF',true

Data Types: logical

`IncludeAll` — Flag to include all available options
`false` (default) | `true`

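Flag to include all available options, with their default values, when converting to the original option syntax, specified as true or false. The original (native) syntax is prefixed by one or two dashes. By default, the function converts only the specified options. If the value is true, the software converts all available options, with default values for unspecified options, to the original syntax.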

Note

If you set IncludeAll to true, the software converts all available properties, using default values for unspecified properties. The only exception is when the default value of a property is NaN, Inf, [], '', or "". In this case, the software does not translate the corresponding property.

Example: 'IncludeAll',true

Data Types: logical

`IncludeContained` — Flag to include transfrags contained by other transfrags
`false` (default) | `true`

Flag to include transfrags contained by other transfrags in the same locus in the output combined.gtf, specified as true or false. By default, cuffcompare does not include these contained transfrags. If the value is true, the contained transfrags include a contained_in attribute indicating the first container transfrag found.

Example: 'IncludeContained',true

Data Types: logical

`MaxAccuracyRange` — Number of bases from terminal exons to use when assessing exon accuracy
`100` (default) | positive integer

Number of bases from the free ends of terminal exons to use when assessing exon accuracy, specified as a positive integer.

Example: 'MaxAccuracyRange',80

Data Types: double

`MaxGroupingRange` — Number of bases to use for grouping transcript start sites
`100` (default) | positive integer

Number of bases to use for grouping transcript start sites, specified as a positive integer.

Example: 'MaxGroupingRange',90

Data Types: double

`OutputPrefix` — Prefix for `cuffcompare` output files
`"cuffcmp"` (default) | string | character vector

Prefix for cuffcompare output files, specified as a string or character vector. This option must be a string or character vector with a non-zero length.

Example: 'OutputPrefix',"cuffcompareOut"

Data Types: char | string

`ReferenceGTF` — Name of GTF or GFF file containing reference transcripts
string | character vector

Name of the GTF or GFF file containing reference transcripts to compare to each sample, specified as a string or character vector. If you provide a file, the function compares each sample to the references in the file and marks isoforms as overlapping, matching, or novel. The function stores these tags in the output .refmap and .tmap files.

Example: 'ReferenceGTF',"references.gtf"

Data Types: char | string
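
As a sketch of a reference-guided comparison, the following code reuses the gyrAB.gtf reference file from the example on this page. It assumes gtfs holds GTF file names produced by cufflinks. The output prefix "cuffRef" is an arbitrary illustrative choice.

```matlab
% Locate the reference annotation used in the example on this page.
gyrAB = which('gyrAB.gtf');

% Compare each sample against the reference transcripts.
statsFile = cuffcompare(gtfs,'ReferenceGTF',gyrAB,'OutputPrefix',"cuffRef");

% Print the summary statistics file.
type(statsFile)
```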

`SequenceDirectory` — Name of directory containing FASTA sequences to classify input transcripts as repeats
string | character vector

Name of directory containing FASTA sequences to classify input transcripts as repeats, specified as a string or character vector. The directory must contain FASTA-format files with the underlying genomic sequences and contain one FASTA file per reference. Name each FASTA file after the chromosome with the extension .fa or .fasta.

Example: 'SequenceDirectory',"./SequenceDirectory/"

Data Types: char | string

`SnCorrection` — Flag to consider only reference transcripts that overlap with input transfrags
`false` (default) | `true`

Flag to consider only reference transcripts that overlap with any of the input transfrags, specified as true or false. If the value is true, the function ignores any reference transcripts that do not overlap with any of the input transfrags. In this case, you must also specify the ReferenceGTF option.

Example: 'SnCorrection',true

Data Types: logical

`SpCorrection` — Flag to consider only input transcripts that overlap with reference transcripts
`false` (default) | `true`

Flag to consider only input transcripts that overlap with any of the reference transcripts, specified as true or false. If the value is true, the function ignores any input transcripts that do not overlap with any of the reference transcripts and reports no novel loci. In this case, you must also specify the ReferenceGTF option.

Example: 'SpCorrection',true

Data Types: logical

`SuppressMapFiles` — Flag to prevent creation of `.tmap` and `.refmap` files
`false` (default) | `true`

Flag to prevent the creation of .tmap and .refmap files, specified as true or false. Set the value to true to prevent the function from generating the files.

Example: 'SuppressMapFiles',true

Data Types: logical

Output Arguments

`statsFile` — Name of text file containing statistics
`"cuffcmp.stats"`

Name of the text file containing statistics related to the accuracy of the transcripts in each sample, returned as a string. The function performs the tests for sensitivity (Sn) and specificity (Sp) at various levels, including the nucleotide, exon, and intron levels, and reports the results in this file.

The default file name is "cuffcmp.stats". If you specify OutputPrefix, the function uses it instead of "cuffcmp".

`combinedGTF` — Name of file containing union of all transfrags in each sample
`"cuffcmp.combined.gtf"`

Name of the file containing the union of all transfrags in each sample, returned as a string.

The default file name is "cuffcmp.combined.gtf". If you specify OutputPrefix, the function uses it instead of "cuffcmp".

`lociFile` — Name of file with all processed loci
`"cuffcmp.loci"`

Name of file with all processed loci across all transcripts, returned as a string.

The default file name is "cuffcmp.loci". If you specify OutputPrefix, the function uses it instead of "cuffcmp".

`trackingFile` — Name of file containing transcripts with identical coordinates
`"cuffcmp.tracking"`

Name of the file containing transcripts with identical coordinates, introns, and strands, returned as a string.

The default file name is "cuffcmp.tracking". If you specify OutputPrefix, the function uses it instead of "cuffcmp".
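
The naming rule for the four output files can be written out explicitly. This sketch only builds the expected names from a prefix and does not call cuffcompare. The prefix value is an illustrative assumption.

```matlab
% Prefix that would be passed as the OutputPrefix value.
prefix = "cuffComp";

% The four outputs replace the default "cuffcmp" with the prefix.
expected = prefix + [".stats",".combined.gtf",".loci",".tracking"];
disp(expected')
```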

References

[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.

Version History

Introduced in R2019a
