I am happy to report that the Alignment -> Mafft with Defaults (prior to the redundancy removal) works exactly as expected. Thank you. This is what I am trying to achieve.
What I needed to see next was whether this pipeline can handle a FASTA file containing both similar AND dissimilar sequences. Unfortunately, Jalview tried to align all the sequences in it, which (naturally) introduced gaps in the middle and threw off the redundancy removal as well. I was able to partially remedy this by creating groups of similar sequences, but then I had to run the alignment and redundancy removal for each group separately.
This would be impractical for the large FASTA file I have, which contains 430,000+ sequences and is about 250 MB in size.
What I have come to realize is that there is probably no single program that can do what I am trying to achieve: remove redundancies from a large FASTA file. So I shall use Jalview differently, as a quick check for short chunks of sequences, provided the memory issue is resolved (see my last email to you, titled "memory issue").
Thank you for your most kind and patient help.
Best regards,
Kausik
-----Original Message-----
From: Jim Procter [mailto:foreveremain@gmail.com] On Behalf Of James Procter
Sent: Tuesday, October 18, 2016 1:09 PM
To: Kausik Datta <kdatta1@jhmi.edu>
Subject: Re: [Jalview-discuss] Problems installing and then running Jalview on Windows 10
Hi Kausik - I'm very glad that you've found a way that works! (Sleep is hard for me when I know there are Jalview users out there with problems!)
I'll again take your questions in turn. As I guess you'd prefer, I've not cc'ed the discussion list.
On 18/10/2016 17:41, Kausik Datta wrote:
>gi|152212369|gb|ABS31340.1| beta-tubulin, partial [Aspergillus acanthosporus]
<snip>
The last four sequences, under different accessions, are 100%
identical peptides. The first one, a larger peptide, contains the
entire sequence of the last four in it. I want to refine my FASTA file
to eliminate the replicate sequences post hoc.
OK. Before I go into your questions, let me suggest this workflow:
1. Use one of the alignment services on your imported sequence set (Webservice->Alignment->Mafft With Defaults should work fine).
2. Use Remove Redundancies as before. It should work as expected.
Now to explain why this works... your question:
(a) The sequence alignment for all five sequences start at position
1. Which is why Jalview might be missing (at least the graphical
representation) that all 5 of these proteins are identical. It thinks
the first sequence is off by an amino acid (an ‘S’ stands out)
Jalview only does what you tell it, except for a few defaults. In this case, you've given Jalview a set of unaligned sequences in a FASTA file, with no start position specified for any sequence. Jalview has therefore assigned position 1 to the first residue in each sequence and shown them without any gaps, since none were present in the initial file.
(b) I pressed Control+D to remove the redundancies. It removed
sequences 3-5, leaving 1 and 2 – which, clearly, it considers separate
sequences. QUESTION: Is it possible to have Jalview recognize that the
smaller peptide is actually a part of the larger peptide?
See my suggested workflow. The Redundancy dialog computes percent-identity between sequences based on the current alignment, rather than the unaligned pair. We've got an outstanding enhancement about this (see http://issues.jalview.org/browse/JAL-514), but it has not yet been implemented.
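To illustrate why aligning first matters, here is a minimal Python sketch of a column-wise percent-identity calculation. It is not Jalview's actual implementation; the function name, the scoring rule (columns containing a gap are skipped) and the two short peptides are all assumptions made up for the example.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity over the columns two rows share.

    Assumption for this sketch: columns where either row has a gap
    are ignored; Jalview's own rule may differ.
    """
    matches = compared = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical peptides: the shorter one is contained in the longer one.
longer = "SMKEVDEQMLNVQ"
shorter = "MKEVDEQMLNVQ"

# Unaligned: both rows start at column 1, so every column is off by one.
unaligned_score = percent_identity(longer, shorter)

# Aligned: a leading gap puts the shared residues in the same columns.
aligned_score = percent_identity(longer, "-" + shorter)

# A plain substring test detects containment without any alignment.
is_contained = shorter in longer
```

With these made-up inputs the unaligned comparison finds no matching columns, while the aligned rows match in every column they share, which is why running Mafft before Remove Redundancies gives the expected result.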
(c) When I try to export the output to FASTA (attached temp2.fasta
file), it seems to retain the trailing gap marks (----) which will
likely cause issues if I try to use this FASTA file for any downstream
search process. QUESTION: Is it possible to eliminate these trailing
‘-’ character gap markers from the generated FASTA file?
It looks like you have 'Pad Gaps' enabled by default. If this is a one-off procedure, then first select 'Edit->Pad Gaps' to untick it, and then select 'Edit->Remove all gaps' to remove all the '-' symbols. You should then be able to export the file.
If you want to disable pad-gaps for all new alignments, you can disable 'Pad Gaps' via the 'Editing' panel in your Jalview's Preferences (Tools->Preferences).
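If the file has already been exported with gap characters in it, they can also be stripped outside Jalview. The following is a minimal Python sketch, not a Jalview feature; the function name and the choice of '-' and '.' as gap characters are assumptions for illustration. Header lines (starting with '>') are left untouched because names such as "beta-tubulin" contain hyphens.

```python
def strip_gaps(fasta_text: str, gap_chars: str = "-.") -> str:
    """Remove gap characters from sequence lines of FASTA text.

    Header lines are copied unchanged; sequence lines that become
    empty after gap removal are dropped.
    """
    remove_table = str.maketrans("", "", gap_chars)
    cleaned_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned_lines.append(line)  # keep the header as-is
        else:
            cleaned = line.translate(remove_table)
            if cleaned:
                cleaned_lines.append(cleaned)
    return "\n".join(cleaned_lines) + "\n"


# Hypothetical record with trailing gap padding.
example = ">seq1 beta-tubulin, partial\nMKEVDEQ--\n-----\n"
result = strip_gaps(example)
```

For a real file, the text would be read from disk and the result written to a new file rather than overwriting the original export.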
Hope this helps! Let me know if you have any more questions.
Jim.
PS. May I send a modified version of this email to Jalview-discuss for the benefit of other people on the list?