[Question - redundancy]

geoff.barton · 15 April 2010 11:56

Dear Colleague,

I used Remove Redundancy (Edit menu) in Jalview because I wanted to remove
identical sequences. I have 1409 sequences. When I loaded the sequences in
FASTA format I ended up with more than 500 sequences, but when I loaded an
alignment of these sequences (from ClustalX) I ended up with more than 600.
What is the source of this difference?

Damian Mielecki
Institute of Biochemistry and Biophysics PAS
Molecular Biology Department
Pawińskiego 5a
02-106 Warszawa
Poland

-------- Original Message --------
Subject: Question - redundancy
Date: Thu, 15 Apr 2010 13:51:38 +0200
From: Damian Mielecki <damian@ibb.waw.pl>
To: geoff@compbio.dundee.ac.uk

jalviewcrowdadmin · 15 April 2010 12:30

Redundancy is calculated on the sequences as they are currently aligned, not on the unaligned sequences. So normally you'd expect to remove more sequences for a given threshold if they are already aligned. This explains the difference you observe, but not its sign: oddly, you've observed the opposite of what I'd expect (if I understand your email correctly). Do you really find that the alignment is less redundant than the original sequence set?

Furthermore, having thought about this a little, it would probably be a good idea to add an 'unaligned similarity' switch to the redundancy slider so you can toggle between aligned and unaligned %id.
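
To illustrate why the same pair of sequences can score differently before and after alignment, here is a minimal Python sketch. It is not Jalview's code: the function name `percent_identity`, the choice to skip any column containing a gap, and the toy sequences are all assumptions made for the example. It simply compares residues column by column, which is roughly what happens when unaligned sequences are treated as if they were already lined up from the first residue.

```python
def percent_identity(a, b, gaps="-."):
    """Percent identity over columns where both sequences have a residue.

    Illustrative only: columns with a gap in either sequence are skipped
    (an assumption, not necessarily how Jalview counts them).
    """
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x in gaps or y in gaps:
            continue
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# The same two toy sequences, first left-justified (unaligned), then aligned.
raw_a, raw_b = "MKTAYIAK", "KTAYIAK"
aln_a, aln_b = "MKTAYIAK", "-KTAYIAK"

unaligned_pid = percent_identity(raw_a, raw_b)
aligned_pid = percent_identity(aln_a, aln_b)
print(unaligned_pid, aligned_pid)
```

With these toy sequences the one-residue offset means no column matches before alignment, while every compared column matches once the gap is in place, so a threshold would only flag the pair as redundant in the aligned case.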

Jim.

--
-------------------------------------------------------------------
J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

jalviewcrowdadmin · 15 April 2010 13:21

> Dear Jim,
>
> Thank you very much for the reply. It is quite possible that I mixed up
> the two results.

Ah, I did wonder.

> Do you think it is a good way to remove identical sequences
> (since they could be present in several copies downloaded from the NCBI
> database)?

It is a reasonable algorithm for removing redundant sequences. There are approaches that try to be cleverer, but for the identical-sequence case this is as good as any (i.e. it picks the longer of two sequences when one contains the other).
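
For the identical-sequence case, the "keep the longer sequence when one contains the other" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Jalview's implementation: the function name `remove_contained` is hypothetical, gaps are assumed to be `-` or `.`, and the comparison is done on ungapped, upper-cased residues.

```python
def remove_contained(seqs):
    """Drop sequences that are identical to, or contained in, a longer one.

    seqs maps a sequence name to its (possibly gapped) sequence string.
    Returns a dict of the kept entries in their original order.
    """
    # Compare on ungapped, upper-cased residues (an assumption of this sketch).
    ungapped = {
        name: seq.replace("-", "").replace(".", "").upper()
        for name, seq in seqs.items()
    }
    # Visit longest first, so a kept sequence is never shorter than a later one.
    by_length = sorted(ungapped, key=lambda n: len(ungapped[n]), reverse=True)
    kept = []
    for name in by_length:
        if not any(ungapped[name] in ungapped[k] for k in kept):
            kept.append(name)
    keep = set(kept)
    return {name: seq for name, seq in seqs.items() if name in keep}


# Toy input with made-up names and sequences.
toy = {
    "seq1": "MKTAYIAK",
    "seq2": "MKTAYIAK",   # exact duplicate of seq1
    "seq3": "-KTAYI--",   # fragment contained in seq1
    "seq4": "MSLLQ",      # unrelated, should survive
}
filtered = remove_contained(toy)
print(list(filtered))
```

The pairwise check is quadratic in the number of sequences, which should still be manageable for a set of around 1400.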

> Additionally, I would like to produce a tree. I have a lot of sequences.

OK, using the alignment? I'd suggest you try TOPALi, MEGA, or SplitsTree (google for these plus 'evolution' to disambiguate). All of these should be able to make good trees for 1000 sequences or so, and TOPALi and MEGA both produce very accurate trees (but this can take some time!).

> Is there any software that would assign particular sequences (having, for
> instance, only a GI number) to a particular taxon (organism name, a group
> of organisms, etc.)?

Ah, good question. I've not found an easy-to-use tool that does this reliably, and I will, at some point, have to implement a service in Jalview to do this. If you can cope with some Perl coding, then you could try this script posted on the bioperl list:
http://bioperl.org/pipermail/bioperl-l/2009-April/029743.html
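
As a rough stopgap, if the sequences came from NCBI as protein FASTA, the description line often already ends with the organism name in square brackets, and that can be pulled out without any database lookup. The Python sketch below assumes that defline layout; the function name `gi_and_organism` and the example defline (including its GI number and accession) are made up for illustration. It is not a proper taxonomy assignment and will return nothing for deflines that lack the bracketed name.

```python
import re

# Assumed layout: ">gi|<number>|<db>|<accession>| description [Organism name]"
_GI = re.compile(r"^>?gi\|(\d+)\|")
_ORGANISM = re.compile(r"\[([^\[\]]+)\]\s*$")


def gi_and_organism(defline):
    """Return (gi, organism) parsed from one FASTA description line.

    Either element is None when the corresponding part is missing.
    """
    gi = _GI.match(defline)
    organism = _ORGANISM.search(defline)
    return (
        gi.group(1) if gi else None,
        organism.group(1) if organism else None,
    )


# Made-up defline, for illustration only.
example = ">gi|0000001|ref|XX_000001.1| hypothetical protein [Escherichia coli]"
parsed = gi_and_organism(example)
print(parsed)
```

Grouping the parsed organism names into higher taxa would still need a real taxonomy source, which is where the script linked above comes in.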

Best of luck,
Jim.
