Comparing Whole Genomes
This example shows how to compare whole genomes for organisms, which allows you to compare the organisms at a very different resolution relative to single gene comparisons. Instead of just focusing on the differences between homologous genes you can gain insight into the large-scale features of genomic evolution.
This example uses two strains of Chlamydia, Chlamydia trachomatis and Chlamydophila pneumoniae. These are closely related bacteria that cause different, though both very common, diseases in humans. Whole genomes are available in the GenBank® database for both organisms.
Retrieving the Genomes
You can download these genomes using the getgenbank
function. First, download the Chlamydia trachomatis genome. Notice that the genome is circular and just over one million bp in length. These sequences are quite large so may take a while to download.
seqtrachomatis = getgenbank('NC_000117');
Next, download Chlamydophila pneumoniae. This genome is also circular and a little longer at 1.2 Mbp.
seqpneumoniae = getgenbank('NC_002179');
For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in public repositories is frequently curated and updated. Hence, the results of this example might be slightly different when you use up-to-date datasets.
load('chlamydia.mat','seqtrachomatis','seqpneumoniae')
A very simple approach for comparing the two genomes is to perform pairwise alignment between all genes in the genomes. Given that these are bacterial genomes, a simple approach would be to compare all ORFs in the two genomes. However, the GenBank data includes more information about the genes in the sequences. This is stored in the CDS field of the data structure. Chlamydia trachomatis has 895 coding regions, while Chlamydophila pneumoniae has 1112.
M = numel(seqtrachomatis.CDS) N = numel(seqpneumoniae.CDS)
M = 895 N = 1112
Most of the CDS records contain the translation to amino acid sequences. The first CDS record in the Chlamydia trachomatis data is a hypothetical protein of length 591 residues.
seqtrachomatis.CDS(1)
ans = struct with fields: location: 'join(1041920..1042519,1..1176)' gene: [] product: 'hypothetical protein' codon_start: '1' indices: [1041920 1042519 1 1176] protein_id: 'NP_219502.1' db_xref: 'GeneID:884145' note: [] translation: 'MSIRGVGGNGNSRIPSHNGDGSNRRSQNTKGNNKVEDRVCSLYSSRSNENRESPYAVVDVSSMIESTPTSGETTRASRGVFSRFQRGLVRVADKVRRAVQCAWSSVSTRRSSATRAAESGSSSRTARGASSGYREYSPSAARGLRLMFTDFWRTRVLRQTSPMAGVFGNLDVNEARLMAAYTSECADHLEANKLAGPDGVAAAREIAKRWEQRVRDLQDKGAARKLLNDPLGRRTPNYQSKNPGEYTVGNSMFYDGPQVANLQNVDTGFWLDMSNLSDVVLSREIQTGLRARATLEESMPMLENLEERFRRLQETCDAARTEIEESGWTRESASRMEGDEAQGPSRAQQAFQSFVNECNSIEFSFGSFGEHVRVLCARVSRGLAAAGEAIRRCFSCCKGSTHRYAPRDDLSPEGASLAETLARFADDMGIERGADGTYDIPLVDDWRRGVPSIEGEGSDSIYEIMMPIYEVMDMDLETRRSFAVQQGHYQDPRASDYDLPRASDYDLPRSPYPTPPLPPRYQLQNMDVEAGFREAVYASFVAGMYNYVVTQPQERIPNSQQVEGILRDMLTNGSQTFRDLMRRWNREVDRE' text: [19x58 char]
The fourth CDS record is for the gatA gene, which has product glutamyl-tRNA amidotransferase subunit A. The length of the product sequence is 491 residues.
seqtrachomatis.CDS(4)
ans = struct with fields: location: '2108..3583' gene: 'gatA' product: [2x47 char] codon_start: '1' indices: [2108 3583] protein_id: 'NP_219505.1' db_xref: 'GeneID:884087' note: [7x58 char] translation: 'MYRKSALELRDAVVNRELSVTAITEYFYHRIESHDEQIGAFLSLCKERALLRASRIDDKLAKGDPIGLLAGIPIGVKDNIHITGVKTTCASKMLENFVAPFDSTVVRRIEMEDGILLGKLNMDEFAMGSTTRYSAFHPTNNPWDLERVPGGSSGGSAAAVSARFCPIALGSDTGGSIRQPAAFCGVVGFKPSYGAVSRYGLVAFGSSLDQIGPLTTVVEDVALAMDAFAGRDPKDSTTRDFFKGTFSQALSLEVPKLIGVPRGFLDGLQEDCKENFFEALAVMEREGSRIIDVDLSVLKHAVPVYYIVASAEAATNLARFDGVRYGHRCAQADNMHEMYARSRKEGFGKEVTRRILLGNYVLSAERQNIFYKKGMAVRARLIDAFQAAFERCDVIAMPVCATPAIRDQDVLDPVSLYLQDVYTVAVNLAYLPAISVPSGLSKEGLPLGVQFIGERGSDQQICQVGYSFQEHSQIKQLYPKAVNGLFDGGIE' text: [26x58 char]
A few of the Chlamydophila pneumoniae CDS have empty translations. Fill them in as follows. First, find all empty translations, then display the first empty translation.
missingPn = find(cellfun(@isempty,{seqpneumoniae.CDS.translation})); seqpneumoniae.CDS(missingPn(1))
ans = struct with fields: location: 'complement(73364..73477)' gene: [] product: 'hypothetical protein' codon_start: '1' indices: [73477 73364] protein_id: 'NP_444613.1' db_xref: 'GeneID:963699' note: 'hypothetical protein; identified by Glimmer2' translation: [] text: [10x52 char]
The function featureparse
extracts features, such as the CDS, from the sequence structure. You can then use cellfun
to apply nt2aa
to the sequences with missing translations.
allCDS = featureparse(seqpneumoniae,'Feature','CDS','Sequence',true); missingSeqs = cellfun(@nt2aa,{allCDS(missingPn).Sequence},'uniform',false); [seqpneumoniae.CDS(missingPn).translation] = deal(missingSeqs{:}); seqpneumoniae.CDS(missingPn(1))
ans = struct with fields: location: 'complement(73364..73477)' gene: [] product: 'hypothetical protein' codon_start: '1' indices: [73477 73364] protein_id: 'NP_444613.1' db_xref: 'GeneID:963699' note: 'hypothetical protein; identified by Glimmer2' translation: 'MLTDQRKHIQMLHKHNSIEIFLSNMVVEVKLFFKTLK*' text: [10x52 char]
Performing Gene Comparisons
To compare the gatA gene in Chlamydia trachomatis with all the CDS genes in Chlamydophila pneumoniae, put a for
loop around the nwalign
function. You could alternatively use local alignment (swalign
).
tic gatAScores = zeros(1,N); for inner = 1:N gatAScores(inner) = nwalign(seqtrachomatis.CDS(4).translation,... seqpneumoniae.CDS(inner).translation); end toc % |tic| and |toc| are used to report how long the calculation takes.
Elapsed time is 2.058605 seconds.
A histogram of the scores shows a large number of negative scores and one very high positive score.
histogram(gatAScores,100) title(sprintf(['Alignment Scores for Chlamydia trachomatis %s\n',... 'with all CDS in Chlamydophila pneumoniae'],seqtrachomatis.CDS(4).gene))
As expected, the high scoring match is with the gatA gene in Chlamydophila pneumoniae.
[gatABest, gatABestIdx] = max(gatAScores); seqpneumoniae.CDS(gatABestIdx)
ans = struct with fields: location: 'complement(838828..840306)' gene: 'gatA' product: [2x47 char] codon_start: '1' indices: [840306 838828] protein_id: 'NP_445311.1' db_xref: 'GeneID:963139' note: [7x58 char] translation: 'MYRYSALELAKAVTLGELTATGVTQHFFHRIEEAEGQVGAFISLCKEQALEQAELIDKKRSRGEPLGKLAGVPVGIKDNIHVTGLKTTCASRVLENYQPPFDATVVERIKKEDGIILGKLNMDEFAMGSTTLYSAFHPTHNPWDLSRVPGGSSGGSAAAVSARFCPVALGSDTGGSIRQPAAFCGVVGFKPSYGAVSRYGLVAFASSLDQIGPLANTVEDVALMMDVFSGRDPKDATSREFFRDSFMSKLSTEVPKVIGVPRTFLEGLRDDIRENFFSSLAIFEGEGTHLVDVELDILSHAVSIYYILASAEAATNLARFDGVRYGYRSPQAHTISQLYDLSRGEGFGKEVMRRILLGNYVLSAERQNVYYKKATAVRAKIVKAFRTAFEKCEILAMPVCSSPAFEIGEILDPVTLYLQDIYTVAMNLAYLPAIAVPSGFSKEGLPLGLQIIGQQGQDQQVCQVGYSFQEHAQIKQLFSKRYAKSVVLGGQS' text: [26x58 char]
The pairwise alignment of one gene from Chlamydia trachomatis with all genes from Chlamydophila pneumoniae takes just under a minute on an Intel® Pentium 4, 2.0 GHz machine running Windows® XP. To do this calculation for all 895 CDS in Chlamydia trachomatis would take about 12 hours on the same machine. Uncomment the following code if you want to run the whole calculation.
scores = zeros(M,N); parfor outer = 1:M theScore = zeros(1,outer); theSeq = seqtrachomatis.CDS(outer).translation; for inner = 1:N theScore(inner) = ... nwalign(theSeq,... seqpneumoniae.CDS(inner).translation); end scores(outer,:) = theScore; end
Note the command parfor
is used in the outer loop. If your machine is configured to run multiple labs then the outer loop will be executed in parallel. For a full understanding of this construct, see doc parfor
.
Investigating the Meaning of the Scores
The distributions of the scores for several genes show a pattern. The CDS(3) of Chlamydia trachomatis is the gatC gene. This has a relatively short product,aspartyl/glutamyl-tRNA amidotransferase subunit C, with only 100 residues.
gatCScores = zeros(1,N); for inner = 1:N gatCScores(inner) = nwalign(seqtrachomatis.CDS(3).translation,... seqpneumoniae.CDS(inner).translation); end figure histogram(gatCScores,100) title(sprintf(['Alignment Scores for Chlamydia trachomatis %s\n',... 'with all CDS in Chlamydophila pneumoniae'],seqtrachomatis.CDS(3).gene)) xlabel('Score');ylabel('Number of genes');
The best score again corresponds to the same gene in the Chlamydophila pneumoniae.
[gatCBest, gatCBestIdx] = max(gatCScores); seqpneumoniae.CDS(gatCBestIdx).product
ans = 2x47 char array 'aspartyl/glutamyl-tRNA amidotransferase subunit' 'C '
CDS(339) of Chlamydia trachomatis is the uvrA gene. This has a very long product, excinuclease ABC subunit A, of length 1786.
uvrAScores = zeros(1,N); for inner = 1:N uvrAScores(inner) = nwalign(seqtrachomatis.CDS(339).translation,... seqpneumoniae.CDS(inner).translation); end figure histogram(uvrAScores,100) title(sprintf(['Alignment Scores for Chlamydia trachomatis %s\n',... 'with all CDS in Chlamydophila pneumoniae'],seqtrachomatis.CDS(339).gene)) xlabel('Score');ylabel('Number of genes'); [uvrABest, uvrABestIdx] = max(uvrAScores); seqpneumoniae.CDS(uvrABestIdx)
ans = struct with fields: location: '716887..722367' gene: [] product: 'excinuclease ABC subunit A' codon_start: '1' indices: [716887 722367] protein_id: 'NP_445220.1' db_xref: 'GeneID:963214' note: [6x58 char] translation: 'MKSLPVYVSGIKVRNLKNVSIHFNSEEIVLLTGVSGSGKSSIAFDTLYAAGRKRYISTLPTFFATTITTLPNPKVEEIHGLSPTIAIKQNHFSHYSHATVGSTTELFSHLALLFTLEGQARDPKTKEVLDLYSKEKVLSTIMELSEGVQISILAPLLRKDIAAIHEYAQQGFTKVRCNGTIHPIYSFLTSGIPEDCSVDIVIDTLIKSENNIARLKVSLFTALEFGEGHCSVLSDEELMTFSTKQQIDDVTYTPLTQQLFSPHALESRCSLCQGSGIFISIDNPLLIDENLSIKENCCSFAGNCSSYLYHTIYQALADALNFNLETPWKDLSPEIQNIFLRGKNNLVLPVRLFDQTLGKKNLTYKVWRGVLNDIGDKVRYTTKPSRYLSKGMSAHSCSLCKGTGLGDYASVATWEGKTFTEFQQMSLNNWHVFFSKVKSPSLSIQEILQGLKQRLSFLIDLGLGYLTPNRALATLSGGEQERTAIAKHLGGELFGITYILDEPSIGLHPQDTEKLIGVIKKLRDQGNTVILVEHEERMISLADRIIDIGPGAGIFGGEVLFNGKPEDFLMNSSSLTAKYLRQELTIPIPESREAPTSWLLLTEATIHNLKNLSIRLPLARLIGVTGVSGSGKSSLINNTLVPAIESFLKQENPKNLHFEWGCIGRLIHITRDLPGRSQRSIPLTYIKAFDDIRELFASQPRSLRQGLTKAHFSFNQPQGACIQCQGLGTMTISDDDTPIPCSECQGKRYHSEVLEILYEGKNIADILDMTAYEAEKFFISHPKIHEKIHALCSLRLDYLPLGRPLSTLSGGEIQRLKLAHELLFASPKQTLYVLDEPTTGLHTHDIQALIEVLLSLTYLGHTVLVIEHNMHVVKVCDYVLELGPEGGDLGGYLLASCTPKDLIQLNTPTAKALAPYIEGSLDIPVVKSEPPSSPKSCDILIKDAYQNNLKHIDLALPRNSLIAIAGPGASGKHSLVFDILYASGNIAYAELFPPYIRQGLLKETPLPSVGEVKGLSPVISVRKCSSSNRSYHTIASALGLSNGLEKLFAILGEPFSPLTEEKLSKTTPQTIIDSLLKSYKDDYVTITSPIPLGSDLEIFLQEKQKEGFIKLYSEGNLYDLDERLPLNLIEPAIVIQHTKVSPKNSSSLLSAISVAFSLSSEIWIYISQKKQRKLSYSLGWKDKKGRLYPEITHQLLSSDHPEGRCLTCGGRGEILKISLEEHKEKIAHYTPLEFFSLFFPKSYMKPVQKLLKDENASQPLKLLTTKEFLNFCRGSSEFPGMNALLMEQLDTESDSPLIKPLLALTSCPACKGSGLNDYANYVRINNTSLLDIYQEDATFLESFLNTIGTDDTRSIIQDLMNRLTFISKVGLSYITLGQRQDTLSDGENYRLHLAKKISINLTNIVYLFEEPLSGLHPQDLPTIVQLLKELVANNNTVIATDRSCSLIPHADHAIFLGPGSGPQGGFLMDSDTEVCPSVDLHANVPQTEVCPKAPLSISKANHTRGSDRTLKVNLSIHHIQNLKVSAPLHALVAIGGVSGSGKTSLLLEGFKKQAELLIAKGTTTFSDLVVIDSHPIASSQRSDISTYFDIAPSLRAFYASLTQAKALNISSTMFSTNTKQGQCSDCQGLGYQWIDRAFYALEKRPCPTCSGFRIQPLAQEVLYEGKHFGELLHTPIETVALRFPFIKKIQKPLKALLDIGLGYLPIGQKLSSLSVSEKTALKTAYFLYQTPETPTLFLIDELFSSLDPIKKQHLPEKLRSLINSGHSVIYIDHDVKLLKSADYLIEIGPGSGKQGGKLLFSGSPKDIYASKDSLLKKYICNEELDS' text: [46x58 char]
The distribution of the scores is affected by the length of the sequences, with very long sequences potentially having much higher or lower scores than shorter sequences. You can normalize for this in a number of ways. One way is to divide by the length of the sequences.
lnormgatABest = gatABest./length(seqtrachomatis.CDS(4).product) lnormgatCBest = gatCBest./length(seqtrachomatis.CDS(3).product) lnormuvrABest = uvrABest./length(seqtrachomatis.CDS(339).product)
lnormgatABest = 16.8794 lnormgatCBest = 2.2695 lnormuvrABest = 78.9615
An alternative normalization method is to use the self alignment score, that is the score from aligning the sequence with itself.
gatASelf = nwalign(seqtrachomatis.CDS(4).translation,... seqtrachomatis.CDS(4).translation); gatCSelf = nwalign(seqtrachomatis.CDS(3).translation,... seqtrachomatis.CDS(3).translation); uvrASelf = nwalign(seqtrachomatis.CDS(339).translation,... seqtrachomatis.CDS(339).translation); normgatABest = gatABest./gatASelf normgatCBest = gatCBest./gatCSelf normuvrABest = uvrABest./uvrASelf
normgatABest = 0.7380 normgatCBest = 0.5212 normuvrABest = 0.5253
Using Sparse Matrices to Reduce Memory Usage
The all-against-all alignment calculation not only takes a lot of time, it also generates a large matrix of scores. If you are looking for similar genes across species, then the scores that are interesting are the positive scores that indicate good alignment. However, most of these scores are negative, and the actual values are not particularly useful for this type of study. Sparse matrices allow you to store the interesting values in a more efficient way.
The sparse matrix, spScores
, in the MAT-file chlamydia.mat
contains the positive values from the all against all pairwise alignment calculation normalized by self-alignment score.
load('chlamydia.mat','spScores')
With the matrix of scores you can look at the distribution of scores of Chlamydophila pneumoniae genes aligned with Chlamydia trachomatis and the converse of this, Chlamydia trachomatis genes aligned with Chlamydophila pneumoniae genes
figure subplot(2,1,1) histogram(max(spScores),100) title('Highest Alignment Scores for Chlamydophila pneumoniae Genes') xlabel('Score');ylabel('Number of genes'); subplot(2,1,2) histogram(max(spScores,[],2),100) title('Highest Alignment Scores for Chlamydia trachomatis Genes') xlabel('Score');ylabel('Number of genes');
Remember that there are 1112 CDS in Chlamydophila pneumoniae and only 895 in Chlamydia trachomatis. The high number of zero scores in the top histogram indicates that many of the extra CDS in Chlamydophila pneumoniae do not have good matches in Chlamydia trachomatis.
Another way to visualize the data is to look at the positions of points in the scores matrix that are positive. The sparse function spy
is an easy way to quickly view dotplots of matrices. This shows some interesting structure in the positions of the high scoring matches.
figure
spy(spScores > 0)
title(sprintf('Dot Plot of High-Scoring Alignments.\nNormalized Threshold = 0'))
Raise the threshold a little higher to see clear diagonal lines in the plot.
spy(spScores >.1)
title(sprintf('Dot Plot of High-Scoring Alignments.\nNormalized Threshold = 0.1'))
Remember that these are circular genomes, and it seems that the starting points in GenBank are arbitrary. Permute the scores matrix so that the best match of the first CDS in Chlamydophila pneumoniae is in the first row to see a clear diagonal plot. This shows the synteny between the two organisms.
[bestScore bestMatch] = max(spScores(:,1));
spy(spScores([bestMatch:end 1:bestMatch-1],:)>.1);
title('Synteny Plot of Chlamydophila pneumoniae and Chlamydia trachomatis')
Looking for Homologous Genes
Genes in different genomes that are related to each other are said to be homologous. Similarity can be by speciation (orthologous genes) or by replication (paralogous genes). Having the scoring matrix lets you look for both types of relationships.
The most obvious way to find orthologs is to look for the highest scoring pairing for each gene. If the score is significant then these best reciprocal pairs are very likely to be orthologous.
[bestScores, bestIndices] = max(spScores);
The variable bestIndices contains the index of the best reciprocal pairs for the genes in Chlamydophila pneumoniae. Sort the best scores and create a table to compare the description of the best reciprocal pairs and discover very high similarity between the highest scoring best reciprocal pairs.
[orderedScores, permScores] = sort(full(bestScores),'descend'); matches = [num2cell(orderedScores)',num2cell(bestIndices(permScores))',... num2cell((permScores))',... {seqtrachomatis.CDS(bestIndices(permScores)).product;... seqpneumoniae.CDS((permScores)).product; }']; for count = 1:7 fprintf(['Score %f\nChlamydia trachomatis Gene : %s\n',... 'Chlamydophila pneumoniae Gene : %s\n\n'],... matches{count,1}, matches{count,4}, matches{count,5}) end
Score 0.982993 Chlamydia trachomatis Gene : 50S ribosomal protein L36 Chlamydophila pneumoniae Gene : 50S ribosomal protein L36 Score 0.981818 Chlamydia trachomatis Gene : 30S ribosomal protein S15 Chlamydophila pneumoniae Gene : 30S ribosomal protein S15 Score 0.975422 Chlamydia trachomatis Gene : integration host factor alpha-subunit Chlamydophila pneumoniae Gene : integration host factor beta-subunit Score 0.971647 Chlamydia trachomatis Gene : 50S ribosomal protein L16 Chlamydophila pneumoniae Gene : 50S ribosomal protein L16 Score 0.970105 Chlamydia trachomatis Gene : 30S ribosomal protein S10 Chlamydophila pneumoniae Gene : 30S ribosomal protein S10 Score 0.969554 Chlamydia trachomatis Gene : rod shape-determining protein MreB Chlamydophila pneumoniae Gene : rod shape-determining protein MreB Score 0.953654 Chlamydia trachomatis Gene : hypothetical protein Chlamydophila pneumoniae Gene : hypothetical protein
You can use the Variable Editor to look at the data in a spreadsheet format.
open('matches')
Compare the descriptions to see that the majority of the best reciprocal pairs have identical descriptions.
exactMatches = strcmpi(matches(:,4),matches(:,5)); sum(exactMatches)
ans = 808
Perhaps more interesting are the best reciprocal pairs where the descriptions are not identical. Some are simply differences in how the same gene is described, but others show quite different descriptions.
mismatches = matches(~exactMatches,:); for count = 1:7 fprintf(['Score %f\nChlamydia trachomatis Gene : %s\n',... 'Chlamydophila pneumoniae Gene : %s\n\n'],... mismatches{count,1}, mismatches{count,4}, mismatches{count,5}) end
Score 0.975422 Chlamydia trachomatis Gene : integration host factor alpha-subunit Chlamydophila pneumoniae Gene : integration host factor beta-subunit Score 0.929565 Chlamydia trachomatis Gene : low calcium response D Chlamydophila pneumoniae Gene : type III secretion inner membrane protein SctV Score 0.905000 Chlamydia trachomatis Gene : NrdR family transcriptional regulator Chlamydophila pneumoniae Gene : transcriptional regulator NrdR Score 0.903226 Chlamydia trachomatis Gene : Yop proteins translocation protein S Chlamydophila pneumoniae Gene : type III secretion inner membrane protein SctS Score 0.896212 Chlamydia trachomatis Gene : ATP-dependent protease ATP-binding subunit ClpX Chlamydophila pneumoniae Gene : ATP-dependent protease ATP-binding protein ClpX Score 0.890705 Chlamydia trachomatis Gene : ribonuclease E Chlamydophila pneumoniae Gene : ribonuclease G Score 0.884234 Chlamydia trachomatis Gene : ClpC protease ATPase Chlamydophila pneumoniae Gene : ATP-dependent Clp protease ATP-binding protein
View data for mismatches.
open('mismatches')
Once you have the scoring matrix this opens up many possibilities for further investigation. For example, you could look for CDS where there are multiple high scoring reciprocal CDS. See Cristianini and Hahn [1] for further ideas.
References
[1] Cristianini, N. and Hahn, M.W., "Introduction to Computational Genomics: A Case Studies Approach", Cambridge University Press, 2007.