Hello Joel
I'd like to be able to calculate the percent identity for two
sequences in an alignment. The attached alignment (with several empty
columns) contains two sequences that were pulled from a larger
structure-based alignment generated by Dali. In Jalview, when I select
the two sequences and perform a pairwise alignment calculation
(Calculate -> Pairwise Alignments...) the output (attached) only
includes an alignment that contains only 7 columns, but the two
sequences are 204 and 224 aa in length and the structures are highly
conserved throughout.
Confirmed.
Why isn't Jalview comparing the sequences along their full length, and
can I force it to do so?
I suspect you may not realise that the 'Pairwise alignment' option
actually computes a Needleman and Wunsch pairwise alignment for each
pair of sequences in the selected set, using a BLOSUM 62 matrix and
nominal gap parameters (120 for opening, 20 for widening). Whilst these
parameters give a reasonable alignment for sequences with high sequence
homology, they can fail for less homologous pairs. In your case,
you're trying to align a pair of structurally homologous protein
sequences which have quite a low sequence identity - and the algorithm
just returns a stretch of 7 aa that align well, without any of the other
regions of the two sequences, because the gaps introduced into the
alignment make them far less optimal.
If Jalview won't compare full length sequences, is there another
program that will?
There are plenty out there (check out EMBOSS, for instance:
EMBOSS Servers), but I get the impression
that what you actually want is the percentage identity of the pair of
sequences as aligned by DALI. Apart from looking in the DALI report
(where, if I remember correctly, you will always find a percent identity
score in addition to Dali's own Z-score), the quickest way to do this
in the current version of Jalview is to copy one or both of the sequences
into the same alignment, and then calculate a percent identity tree.
The branches will be labelled with the %age difference between the
sequences, *under current alignment length*. The reason I stress this is
because if I do this with your DALI alignment as you sent it, I get a
value of 9.3 - ie the sequences are 90.7% identical - however, if I
exclude the gapped columns in the alignment (using Edit->Remove empty
columns), I get 37.5 - ie 62.5% identical. This number is probably still
not reliable, because there are a fair few 'X' symbols in both sequences
that do not align to other Xes, and Jalview will count these as a
mismatch, rather than a match (also now reported as a bug).
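
To illustrate why the alignment length matters, here is a minimal Python
sketch (not Jalview's code). It assumes that the inflated figure comes from
columns where both sequences have a gap being counted as matches; that is
only a guess at the cause. The function name percent_identity, the flag
count_gap_gap_as_match and the two toy sequences are made up for the example.

```python
GAP_CHARS = set("-.")


def percent_identity(seq_a, seq_b, count_gap_gap_as_match=False):
    """Percent identity of two rows taken from the same alignment.

    Illustrative only: the handling of gap/gap columns is an assumption,
    controlled by count_gap_gap_as_match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    columns = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS and b in GAP_CHARS:
            # Column that is empty for this pair of sequences.
            if count_gap_gap_as_match:
                matches += 1
                columns += 1
            continue
        columns += 1
        # A residue opposite a gap, or opposite a different residue,
        # is counted as a mismatch.
        if a == b:
            matches += 1
    return 100.0 * matches / columns if columns else 0.0


# Two toy rows with two shared empty columns.
row_a = "AC--DE-F"
row_b = "AC--DQGF"
with_empty = percent_identity(row_a, row_b, count_gap_gap_as_match=True)
without_empty = percent_identity(row_a, row_b)
print(f"{with_empty:.1f} {without_empty:.1f}")  # 75.0 66.7
```

The same pair scores 75.0% when the empty columns are included and 66.7%
when they are left out, which is the same kind of shift as described above.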
I will schedule for implementation a new function allowing a pairwise
%age identity matrix (or flat report) to be generated, enabling you to
do these calculations more easily.
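
In the meantime, a flat report of this kind can be approximated outside
Jalview. The following is a rough Python sketch under stated assumptions, not
the planned Jalview function: the names pid and flat_report, the
tab-separated output and the three toy sequences are all invented for the
example, and columns where both sequences have a gap are simply skipped.

```python
from itertools import combinations


def pid(a, b, gaps="-."):
    """Percent identity of two equal-length aligned rows.

    Assumption: columns where both rows have a gap are ignored, and a
    residue opposite a gap counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    cols = [(x, y) for x, y in zip(a, b) if not (x in gaps and y in gaps)]
    same = sum(1 for x, y in cols if x == y)
    return 100.0 * same / len(cols) if cols else 0.0


def flat_report(alignment):
    """Return one 'name1<TAB>name2<TAB>identity' line per sequence pair."""
    lines = []
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        lines.append(f"{n1}\t{n2}\t{pid(s1, s2):.1f}")
    return lines


# Toy alignment: names and sequences are placeholders.
toy = {"seqA": "AC--DE-F", "seqB": "AC--DQGF", "seqC": "AC--DEGF"}
for line in flat_report(toy):
    print(line)
```

Each line of the output names a pair of sequences and gives their percent
identity over the columns that are not empty for that pair.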
Hope this clears things up - thanks for the email!
Jim.
ps. if you find the last comment about gaps/non gaps confusing, you
might want to check out Geoff Barton's paper about percentage identity,
and this wiki page :
http://openwetware.org/wiki/Wikiomics:Percentage_identity
On 25/02/2011 21:08, Joel Guenther wrote:
--
-------------------------------------------------------------------
J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.