Calculating the percent identity between two sequences

guenthej · 25 February 2011 21:08

Hi, all.

I’d like to be able to calculate the percent identity for two sequences in an alignment. The attached alignment (with several empty columns) contains two sequences that were pulled from a larger structure-based alignment generated by Dali. In Jalview, when I select the two sequences and perform a pairwise alignment calculation (Calculate —> Pairwise Alignments…) the output (attached) only includes an alignment that contains only 7 columns, but the two sequences are 204 and 224 aa in length and the structures are highly conserved throughout.

Why isn’t Jalview comparing the sequences along their full length, and can I force it to do so?

If Jalview won’t compare full length sequences, is there another program that will?

Thanks for you help!

-Joel

Percent identity fail.txt (287 Bytes)

Percent identity test.fasta (1.9 KB)

···

=================================================
Joel M. Guenther, PhD
Department of Chemistry
Kuriyan Laboratory
http://jkweb.berkeley.edu

University of California, Berkeley
176 Stanley Hall, QB3
Berkeley, CA 94720-3220
tel: (510) 643 0166
fax: (510) 643 2352

jalviewcrowdadmin · 27 February 2011 14:32

Hello Joel

I'd like to be able to calculate the percent identity for two
sequences in an alignment. The attached alignment (with several empty
columns) contains two sequences that were pulled from a larger
structure-based alignment generated by Dali. In Jalview, when I select
the two sequences and perform a pairwise alignment calculation
(Calculate —> Pairwise Alignments...) the output (attached) only
includes an alignment that contains only 7 columns, but the two
sequences are 204 and 224 aa in length and the structures are highly
conserved throughout.

Confirmed.

Why isn't Jalview comparing the sequences along their full length, and
can I force it to do so?

I suspect you may not realise that the 'Pairwise alignment' option
actually computes a Needleman and Wunsch pairwise alignment for each
pair of sequences in the selected set, using a BLOSUM 62 matrix and
nominal gap parameters (120 for opening, 20 for widening). Whilst these
parameters give a reasonable alignment for sequences with high sequence
homology, it they can fail for less homologous pairs. In your case,
you're trying to align a pair of structurally homologous protein
sequences which have quite a low sequence identity - and the algorithm
just returns a stretch of 7 aa that align well, without any of the other
regions of the two sequences, because the gaps introduced into the
alignment make them far less optimal.

If Jalview won't compare full length sequences, is there another
program that will?

There are plenty out there (checkout EMBOSS, for instance:
EMBOSS Servers), but I get the impression
that what you actually want is the percentage identity of the pair of
sequences as aligned by DALI. Apart from looking in the DALI report
(where,if I remember correctly, you will always find a percent identity
score in addition to Dali's own Z-score), the quickest way to do this
in the current version of Jalview is to copy one or both of sequences
into the same alignment, and then calculating a percent identity tree.
The branches will be labelled with the %age difference between the
sequences, *under current alignment length*. The reason I stress this is
because If I do this with your DALI alignment as you sent it, I get a
value of 9.3 - ie the sequences are 90.7% identical - however, if I
exclude the gapped columns in the alignment (using Edit->Remove empty
columns), I get 37.5 - ie 63.5% identical. This number is probably still
not reliable, because there are a fair few 'X' symbols in both sequences
that do not align to ther Xes, and Jalview will count these as a
mismatch, rather than a match (also now reported as a bug).

I will schedule for implementation a new function allowing a pairwise
%age identity matrix (or flat report) to be generated, enabling you to
do these calculations more easily.

Hope this clears things up - thanks for the email!
Jim.

ps. if you find the last comment about gaps/non gaps confusing, you
might want to check out Geoff Barton's paper about percentage identity,
and this wiki page :
http://openwetware.org/wiki/Wikiomics:Percentage_identity

···

On 25/02/2011 21:08, Joel Guenther wrote:

--
-------------------------------------------------------------------
J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

guenthej · 27 February 2011 22:16

Hi, Jim.

Thanks for the reply. Following your advice, I was able to calculate a percent identity between two sequences (with empty columns removed) using:
Calculate —> Calculate Tree —> Neighbor joining using % Identity

If you have time, adding an percentage identity matrix out to Jalview would be nice, but not essential.

Thanks again!

-Joel

···

On Sun, Feb 27, 2011 at 6:32 AM, Jim Procter <jprocter@compbio.dundee.ac.uk> wrote:

Hello Joel

On 25/02/2011 21:08, Joel Guenther wrote:

I’d like to be able to calculate the percent identity for two
sequences in an alignment. The attached alignment (with several empty
columns) contains two sequences that were pulled from a larger
structure-based alignment generated by Dali. In Jalview, when I select
the two sequences and perform a pairwise alignment calculation
(Calculate —> Pairwise Alignments…) the output (attached) only
includes an alignment that contains only 7 columns, but the two
sequences are 204 and 224 aa in length and the structures are highly
conserved throughout.

Confirmed.

Why isn’t Jalview comparing the sequences along their full length, and
can I force it to do so?

I suspect you may not realise that the ‘Pairwise alignment’ option
actually computes a Needleman and Wunsch pairwise alignment for each
pair of sequences in the selected set, using a BLOSUM 62 matrix and
nominal gap parameters (120 for opening, 20 for widening). Whilst these
parameters give a reasonable alignment for sequences with high sequence
homology, it they can fail for less homologous pairs. In your case,
you’re trying to align a pair of structurally homologous protein
sequences which have quite a low sequence identity - and the algorithm
just returns a stretch of 7 aa that align well, without any of the other
regions of the two sequences, because the gaps introduced into the
alignment make them far less optimal.

If Jalview won’t compare full length sequences, is there another
program that will?

There are plenty out there (checkout EMBOSS, for instance:
http://emboss.sourceforge.net/servers/#pise), but I get the impression
that what you actually want is the percentage identity of the pair of
sequences as aligned by DALI. Apart from looking in the DALI report
(where,if I remember correctly, you will always find a percent identity
score in addition to Dali’s own Z-score), the quickest way to do this
in the current version of Jalview is to copy one or both of sequences
into the same alignment, and then calculating a percent identity tree.
The branches will be labelled with the %age difference between the
sequences, under current alignment length. The reason I stress this is
because If I do this with your DALI alignment as you sent it, I get a
value of 9.3 - ie the sequences are 90.7% identical - however, if I
exclude the gapped columns in the alignment (using Edit->Remove empty
columns), I get 37.5 - ie 63.5% identical. This number is probably still
not reliable, because there are a fair few ‘X’ symbols in both sequences
that do not align to ther Xes, and Jalview will count these as a
mismatch, rather than a match (also now reported as a bug).

I will schedule for implementation a new function allowing a pairwise
%age identity matrix (or flat report) to be generated, enabling you to
do these calculations more easily.

Hope this clears things up - thanks for the email!
Jim.

ps. if you find the last comment about gaps/non gaps confusing, you
might want to check out Geoff Barton’s paper about percentage identity,
and this wiki page :
http://openwetware.org/wiki/Wikiomics:Percentage_identity

–

J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

Jalview-discuss mailing list
Jalview-discuss@jalview.org
http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss

engin_ozkan1 · 28 February 2011 07:01

Dear Jim,

Let me second a request for adding a percent id matrix output, especially straight from the multiple sequence alignment. MSAs can be better than pairwise alignments especially in the case of distant pairs, and reporting pairwise identities based on a new N-W alignment, while the better alignment (hopefully) exists within the MSA, seems unwanted. Come to think of it, the option to calculate “new” pairwise alignments should still exist, in case the MSA is of lower quality.

Thanks for the great work,

Engin

···

-- 
Engin Özkan 
Post-doctoral Scholar 
Laboratory of K. Christopher Garcia 
Howard Hughes Medical Institute 
Dept of Molecular and Cellular Physiology 
279 Campus Drive, Beckman Center B173 
Stanford School of Medicine 
Stanford, CA 94305 
ph: (650)-498-7111

guenthej · 25 May 2011 06:52

Hi, Jim.

A noticed an odd behavior when calculating the percent identity between two sequences using the Calculate Tree features. The values that are returned appear to be too high because, when at least one of the two sequences contains a gap, then the column is scored as an identity. My guess is that the gap character is being scored as a wildcard that matches all other characters. I could be wrong, though, because I didn’t investigate very thoroughly.

-Joel

···

On Sun, Feb 27, 2011 at 11:01 PM, Engin Özkan <eozkan@stanford.edu> wrote:

Dear Jim,

Let me second a request for adding a percent id matrix output, especially straight from the multiple sequence alignment. MSAs can be better than pairwise alignments especially in the case of distant pairs, and reporting pairwise identities based on a new N-W alignment, while the better alignment (hopefully) exists within the MSA, seems unwanted. Come to think of it, the option to calculate “new” pairwise alignments should still exist, in case the MSA is of lower quality.

Thanks for the great work,

Engin

On 2/27/11 2:16 PM, Joel Guenther wrote:
Hi, Jim.

Thanks for the reply. Following your advice, I was able to calculate a percent identity between two sequences (with empty columns removed) using:
Calculate —> Calculate Tree —> Neighbor joining using % Identity

If you have time, adding an percentage identity matrix out to Jalview would be nice, but not essential.

Thanks again!

-Joel

On Sun, Feb 27, 2011 at 6:32 AM, Jim Procter <jprocter@compbio.dundee.ac.uk> wrote:

Hello Joel

On 25/02/2011 21:08, Joel Guenther wrote:

I’d like to be able to calculate the percent identity for two
sequences in an alignment. The attached alignment (with several empty
columns) contains two sequences that were pulled from a larger
structure-based alignment generated by Dali. In Jalview, when I select
the two sequences and perform a pairwise alignment calculation
(Calculate —> Pairwise Alignments…) the output (attached) only
includes an alignment that contains only 7 columns, but the two
sequences are 204 and 224 aa in length and the structures are highly
conserved throughout.

Confirmed.

Why isn’t Jalview comparing the sequences along their full length, and
can I force it to do so?

I suspect you may not realise that the ‘Pairwise alignment’ option
actually computes a Needleman and Wunsch pairwise alignment for each
pair of sequences in the selected set, using a BLOSUM 62 matrix and
nominal gap parameters (120 for opening, 20 for widening). Whilst these
parameters give a reasonable alignment for sequences with high sequence
homology, it they can fail for less homologous pairs. In your case,
you’re trying to align a pair of structurally homologous protein
sequences which have quite a low sequence identity - and the algorithm
just returns a stretch of 7 aa that align well, without any of the other
regions of the two sequences, because the gaps introduced into the
alignment make them far less optimal.

If Jalview won’t compare full length sequences, is there another
program that will?

There are plenty out there (checkout EMBOSS, for instance:
http://emboss.sourceforge.net/servers/#pise), but I get the impression
that what you actually want is the percentage identity of the pair of
sequences as aligned by DALI. Apart from looking in the DALI report
(where,if I remember correctly, you will always find a percent identity
score in addition to Dali’s own Z-score), the quickest way to do this
in the current version of Jalview is to copy one or both of sequences
into the same alignment, and then calculating a percent identity tree.
The branches will be labelled with the %age difference between the
sequences, under current alignment length. The reason I stress this is
because If I do this with your DALI alignment as you sent it, I get a
value of 9.3 - ie the sequences are 90.7% identical - however, if I
exclude the gapped columns in the alignment (using Edit->Remove empty
columns), I get 37.5 - ie 63.5% identical. This number is probably still
not reliable, because there are a fair few ‘X’ symbols in both sequences
that do not align to ther Xes, and Jalview will count these as a
mismatch, rather than a match (also now reported as a bug).

I will schedule for implementation a new function allowing a pairwise
%age identity matrix (or flat report) to be generated, enabling you to
do these calculations more easily.

Hope this clears things up - thanks for the email!
Jim.

ps. if you find the last comment about gaps/non gaps confusing, you
might want to check out Geoff Barton’s paper about percentage identity,
and this wiki page :
http://openwetware.org/wiki/Wikiomics:Percentage_identity

–

J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

Jalview-discuss mailing list
Jalview-discuss@jalview.org
http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss
_______________________________________________
Jalview-discuss mailing list
[Jalview-discuss@jalview.org](mailto:Jalview-discuss@jalview.org)
[http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss](http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss)
-- 
Engin Özkan 
Post-doctoral Scholar 
Laboratory of K. Christopher Garcia 
Howard Hughes Medical Institute 
Dept of Molecular and Cellular Physiology 
279 Campus Drive, Beckman Center B173 
Stanford School of Medicine 
Stanford, CA 94305 
ph: (650)-498-7111
Jalview-discuss mailing list
Jalview-discuss@jalview.org
http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss

jalviewcrowdadmin · 30 May 2011 08:11

Hi Joel

A noticed an odd behavior when calculating the percent identity between two sequences using the Calculate Tree features. The values that are returned appear to be too high because, when at least one of the two sequences contains a gap, then the column is scored as an identity. My guess is that the gap character is being scored as a wildcard that matches all other characters. I could be wrong, though, because I didn't investigate very thoroughly.

You inferred the situation correctly. Jalview's PID calculation treats a gap as a wildcard, which is somewhat counter intuitive, but not as crazy as it first sounds. The same PID routine is used for redundancy filtering, sort by PID and for the pairwise matrix input to tree calculations. For the two former cases, treating a gap as a wildcard means that Jalview culls or orders sequences intuitively - ie - by calculating PID relative to the aligned regions of a particular sequence. However, it does cause problems for the tree calculation.

I've lodged a bug about this - since its high-time that the PID functions in jalview had an overhaul. Check out http://issues.jalview.org/browse/JAL-837 for the gory details. At the very least, some additional documentation is needed to clarify the situation w.r.t. PID and gaps in sequences.

Jim.

···

On 25/05/2011 16:52, Joel Guenther wrote:

guenthej · 30 May 2011 19:07

Thanks, Jim!

-Joel

···

On Mon, May 30, 2011 at 1:11 AM, Jim Procter <jprocter@compbio.dundee.ac.uk> wrote:

Hi Joel

On 25/05/2011 16:52, Joel Guenther wrote:

A noticed an odd behavior when calculating the percent identity between two sequences using the Calculate Tree features. The values that are returned appear to be too high because, when at least one of the two sequences contains a gap, then the column is scored as an identity. My guess is that the gap character is being scored as a wildcard that matches all other characters. I could be wrong, though, because I didn’t investigate very thoroughly.

You inferred the situation correctly. Jalview’s PID calculation treats a gap as a wildcard, which is somewhat counter intuitive, but not as crazy as it first sounds. The same PID routine is used for redundancy filtering, sort by PID and for the pairwise matrix input to tree calculations. For the two former cases, treating a gap as a wildcard means that Jalview culls or orders sequences intuitively - ie - by calculating PID relative to the aligned regions of a particular sequence. However, it does cause problems for the tree calculation.

I’ve lodged a bug about this - since its high-time that the PID functions in jalview had an overhaul. Check out http://issues.jalview.org/browse/JAL-837 for the gory details. At the very least, some additional documentation is needed to clarify the situation w.r.t. PID and gaps in sequences.

Jim.

ajperry · 31 May 2011 02:28

I’ve noticed that calculating PID is something many users, or potential users, of Jalview request quite often (including wanting it myself, from time to time). Even though there might be better measures of similarity out there, PID is the one that all biologists understand and will always request when a discussion about sequence similarity arises.

You already beat me to mentioning http://openwetware.org/wiki/Wikiomics:Percentage_identity (and the associated publication http://www.biomedcentral.com/1471-2105/7/415 ) which does a good job of highlighting that there is more than one way to calculate PID, and there is probably no ‘right way’.

Given this, could I request that we have:
a) a simple way to calculate PID, from the existing alignment (without realignment) between pairs of sequence (or an all-against-all matrix)
b) that the interface is clear on the equation used to calculate the PID, maybe even also citing Raghava & Barton and note the equation that is used (eg PID1, PID4 etc). In fact, allowing an option to use any one of the equations in Raghava & Barton would allow others to repeat this analysis and possibly function as an easy way to replicate PID values quoted in the literature.

I’ve just made a bug tracker account - I see you are addressing some of this already.

Andrew Perry

Postdoctoral Fellow
Whisstock Lab
Department of Biochemistry and Molecular Biology
Monash University, Clayton Campus, PO Box 13d, VIC, 3800, Australia.

···

On Mon, May 30, 2011 at 6:11 PM, Jim Procter <jprocter@compbio.dundee.ac.uk> wrote:

Hi Joel

On 25/05/2011 16:52, Joel Guenther wrote:

A noticed an odd behavior when calculating the percent identity
between two sequences using the Calculate Tree features. The values
that are returned appear to be too high because, when at least one of
the two sequences contains a gap, then the column is scored as an
identity. My guess is that the gap character is being scored as a
wildcard that matches all other characters. I could be wrong, though,
because I didn’t investigate very thoroughly.

You inferred the situation correctly. Jalview’s PID calculation treats a
gap as a wildcard, which is somewhat counter intuitive, but not as crazy
as it first sounds. The same PID routine is used for redundancy
filtering, sort by PID and for the pairwise matrix input to tree
calculations. For the two former cases, treating a gap as a wildcard
means that Jalview culls or orders sequences intuitively - ie - by
calculating PID relative to the aligned regions of a particular
sequence. However, it does cause problems for the tree calculation.

I’ve lodged a bug about this - since its high-time that the PID
functions in jalview had an overhaul. Check out
http://issues.jalview.org/browse/JAL-837 for the gory details. At the
very least, some additional documentation is needed to clarify the
situation w.r.t. PID and gaps in sequences.

Jim.

geoff.barton · 31 May 2011 08:09

Of course, there are many ways to calculate PID, so often it is misunderstood and abused. Jalview currently uses one of the possible methods which suits some purposes but not others. We probably should add a full set of PID calculation options to Jalview so that those who know can choose the most appropriate method for their needs.

See: http://www.biomedcentral.com/1471-2105/7/415

for a brief discussion of the problems of PID.

Geoff

···

-- 
Geoff Barton, Professor of Bioinformatics,  College of Life Sciences
University of Dundee, Scotland, UK.          [g.j.barton@dundee.ac.uk](mailto:g.j.barton@dundee.ac.uk)
Tel:+44 1382 385860/388731 (Fax:385764)     [www.compbio.dundee.ac.uk](http://www.compbio.dundee.ac.uk)

The University of Dundee is registered Scottish charity: No.SC015096

Calculating the percent identity between two sequences

University of California, Berkeley 176 Stanley Hall, QB3 Berkeley, CA 94720-3220 tel: (510) 643 0166 fax: (510) 643 2352

–

–

University of California, Berkeley
176 Stanley Hall, QB3
Berkeley, CA 94720-3220
tel: (510) 643 0166
fax: (510) 643 2352