pca

Dear members

Can you please help me understand why the BLOSUM score in the PCA analysis is not reciprocal?

These are the output values from a PCA of 4 protein sequences.

          seq1      seq2      seq3      seq4
seq1   1292.00    930.00    643.00    631.00
seq2    931.00   1289.00    589.00    633.00
seq3    622.00    567.00   1338.00    768.00
seq4    629.00    630.00    785.00   1303.00

Why is the seq1-seq3 score 643 while the seq3-seq1 score is 622?

Regards,

Hadas Ner-Gaon

Hadas Ner-Gaon , Ph.D.
The Rubin lab
The Shraga Segal Dept. of Microbiology and Immunology & NIBN
Ben Gurion University

Building 39, room -113
POB 653, Beer Sheva 84105, Israel
Phone: 08-6477180
email: nergaon@bgu.ac.il

Dear Hadas, thanks for your email.

On 15/02/2010 14:04, Hadas Ner Gaon wrote:

Why is the seq1-seq3 score 643 while the seq3-seq1 score is 622?

A good question! In the matrix used for the PCA calculation, each element e(i,j) is the sum of the substitution scores for mutating each symbol in the i’th sequence into the corresponding (aligned) symbol in the j’th sequence. For proteins, the substitution matrix used is the BLOSUM62 matrix - and because this is not symmetric (i.e. the score for mutating an R to a G is different to the score for mutating a G to an R), there are often differences between the upper and lower triangles of the similarity matrix. It’s simplest to consider each triangle as representing the ‘forward’ or ‘backward’ mutation cost for each pair of sequences in the alignment.
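To make the e(i,j) sum concrete, here is a minimal sketch in Python. It is not Jalview's code: the score table `TOY_SCORES` holds made-up, deliberately asymmetric values (not BLOSUM62), and the names `pair_score` and `similarity_matrix` are hypothetical. It only illustrates that if the score for a -> b differs from the score for b -> a, the two triangles of the resulting matrix can differ.

```python
# Toy substitution scores, deliberately asymmetric.
# Hypothetical values for illustration only; these are NOT BLOSUM62 scores.
TOY_SCORES = {
    ("R", "R"): 5, ("G", "G"): 6, ("A", "A"): 4,
    ("R", "G"): -2, ("G", "R"): -3,
    ("R", "A"): -1, ("A", "R"): -1,
    ("G", "A"): 0, ("A", "G"): 1,
}


def pair_score(seq_i, seq_j, scores=TOY_SCORES):
    """Sum the score of mutating each symbol of seq_i into the aligned symbol of seq_j."""
    if len(seq_i) != len(seq_j):
        raise ValueError("aligned sequences must have equal length")
    return sum(scores[(a, b)] for a, b in zip(seq_i, seq_j))


def similarity_matrix(seqs, scores=TOY_SCORES):
    """Build the full matrix of e(i,j) values, row i = 'from', column j = 'to'."""
    return [[pair_score(si, sj, scores) for sj in seqs] for si in seqs]


alignment = ["RGAR", "GGAA", "RAGR"]  # made-up gap-free alignment
matrix = similarity_matrix(alignment)

for name, row in zip(("seq1", "seq2", "seq3"), matrix):
    print(name, row)
```

With these toy values the seq1-seq2 entry and the seq2-seq1 entry come out different, while seq1-seq3 and seq3-seq1 happen to agree, so asymmetry shows up only for pairs whose aligned columns involve asymmetric scores.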

As you may be aware, the matrix that I just described differs slightly from the one given in the ‘SeqSpace’ paper cited in the Jalview PCA documentation (http://www.jalview.org/help/html/calculations/pca.html). In the original paper (Casari, Sander and Valencia 1995: http://novacripta.cbm.uam.es/bioweb/courses/MasterBiofis0708/tema03/Casari_NatStructBiol_95.pdf ), the matrix used for PCA analysis is called the comparison matrix, and is defined as the product of a matrix representation of the alignment with its transpose:

C = F x T(F)

Here, C is a symmetric n by n matrix (n being the number of sequences), because each element is the number of alignment positions at which the corresponding pair of sequences carry identical symbols. Jalview’s slightly different comparison matrix calculation should, in theory, reflect favourable mutations between sequences in addition to conservation. However, in my limited tests, the resulting PCA plot often resembles the projection produced by the original algorithm, so this refinement probably doesn’t improve greatly on the SeqSpace approach.
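As an illustration of C = F x T(F), the sketch below assumes F is a 0/1 ("one-hot") encoding with one slot per (alignment position, symbol) pair; that encoding, the three-letter alphabet and the helper names `one_hot_rows` and `comparison_matrix` are assumptions made for this example rather than details taken from the paper or from Jalview.

```python
def one_hot_rows(seqs, alphabet):
    """Encode each aligned sequence as a flat 0/1 vector: one slot per (position, symbol)."""
    index = {sym: k for k, sym in enumerate(alphabet)}
    rows = []
    for seq in seqs:
        row = [0] * (len(seq) * len(alphabet))
        for pos, sym in enumerate(seq):
            row[pos * len(alphabet) + index[sym]] = 1
        rows.append(row)
    return rows


def comparison_matrix(F):
    """C = F x T(F): the dot product of every pair of rows of F."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in F] for ri in F]


alignment = ["RGAR", "GGAA", "RAGR"]  # made-up gap-free alignment
F = one_hot_rows(alignment, alphabet="AGR")
C = comparison_matrix(F)

for row in C:
    print(row)
```

Under this encoding each entry of C counts the positions where two sequences agree, so C(i,j) and C(j,i) are the same dot product and the matrix is symmetric by construction.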

thanks for the question, and happy Jalviewing!
Jim.

PS: this difference between SeqSpace and Jalview is not made clear in the documentation. This will be rectified in a future release.

-- 
-------------------------------------------------------------------
J. B. Procter  (JALVIEW/ENFIN)  Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764  http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.