Cleaning up protein sequences

Hi,

Suppose I've created a protein alignment based on a few hundred
sequences from a PSI-BLAST query. Many of the sequences in this
alignment will contain extra junk regions or, conversely, omitted
regions. I want to clean up these sequences so I can re-align them and
analyze them further.

When I'm reviewing my alignment in Jalview, I'd like to change the
junk regions into gaps. I don't want to just delete the junk with the
backspace key, because that would mess up the alignment and make my
review harder.

As far as I can tell, there's no single command that would allow me to
change a selection to all gaps in Jalview. I've thought of a few
different workarounds, but they are pretty inefficient.

Am I missing something? Or is there better software than Jalview to
use for this kind of thing?

Actually, the ideal general solution to my problem would be for
Jalview to automatically run tblastn on specified sequences. Then it
could either find a missing region and insert it or verify that a long
run of junk-like stuff is in fact junk and delete it. This is what I
do manually for particularly valuable sequences. But I realize that
this might be hard to automate.

Thanks,
Jeremy

Hi Geoff,

Hiding is often useful, but I don't think it would help in this case.
The reason is that often, the junk I want to get rid of is just in one
sequence. Often the regions of other sequences aligned to the junk
region are OK, so I don't want to get rid of those regions. As far as
I can tell, if I want to hide just a subset of my columns, I have to
hide them in every sequence of the alignment.

Thanks,
Jeremy

On Mon, Jul 26, 2010 at 3:39 AM, Geoff Barton <g.j.barton@dundee.ac.uk> wrote:

I'm not sure if this is exactly what you want, but in Jalview you can "hide"
columns or rows much like you can in a spreadsheet. Any subsequent
operation (e.g. alignment) is then done only on what is visible. I use this
feature to help clean up big alignments without losing the full sequences,
or to keep a big alignment while only working on a small part. This feature
is also useful for selecting a subset of the alignment for display or
subsequent tree building etc.

Geoff.

_______________________________________________
Jalview-discuss mailing list
Jalview-discuss@jalview.org
http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss

--
Geoff Barton, Professor of Bioinformatics, College of Life Sciences
University of Dundee, Scotland, UK. g.j.barton@dundee.ac.uk
Tel:+44 1382 385860/388731 (Fax:385764) www.compbio.dundee.ac.uk

The University of Dundee is registered Scottish charity: No.SC015096

On Tue, Jul 27, 2010 at 5:30 AM, Geoff Barton <g.j.barton@dundee.ac.uk> wrote:

Hi Jeremy,

No, you are right that hiding will not (currently) do what you need.
However, you can select a region, right click on it and then choose
"edit sequence". If your region has more than one sequence in it, then
the edits are applied to every sequence in the region. So, if you want
to change everything in the region to a gap character, then just replace
the sequence in the box with gap characters. This should do what you
want without too much pain.

I hope this helps!

Geoff.

Geoff,

Yes, this editing trick is one of the workarounds I've found. However,
using only Jalview it's pretty painful because I either have to
manually count characters or paste in a gap region of the desired
length from another sequence. The latter method would not be so
painful if the FASTA title didn't automatically paste in with the
gaps. I saw in the list archives some discussion about removing the
auto-paste-FASTA-title feature, but as far as I can tell this has not
yet been implemented.

If there's a better way to use this trick, please let me know.

Actually, the ideal way to use the paste-sequence method would
probably just be to copy the gaps, highlight the junk, and hit
"Ctrl-V". Then Jalview would automatically paste just the sequence in
the clipboard, sans FASTA header, replacing the contents of the
highlighted junk region. The current text-box method involves too much
right-clicking of context menus, which is inefficient, unintuitive,
and nonstandard.

- Jeremy
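
For comparison, the gap replacement discussed in this thread can also be
scripted outside Jalview on an exported alignment, which avoids counting
characters by hand. What follows is only a minimal sketch in Python, not
a Jalview feature: it assumes the alignment is held as a dict of
equal-length strings, that "-" is the gap character, and that columns are
numbered from 1 as in the Jalview ruler. The names mask_region and
mask_in_alignment are hypothetical.

```python
def mask_region(aligned_seq, start, end, gap="-"):
    """Replace columns start..end (1-based, inclusive) with gap characters.

    The result has the same length as the input, so the other columns of
    the alignment stay in register.
    """
    if not 1 <= start <= end <= len(aligned_seq):
        raise ValueError("region must lie within the sequence")
    # The gap run is sized from the region, so nothing is counted by hand.
    return aligned_seq[:start - 1] + gap * (end - start + 1) + aligned_seq[end:]


def mask_in_alignment(alignment, name, start, end, gap="-"):
    """Mask a region in one named sequence; all other sequences are untouched.

    `alignment` is assumed to map sequence names to aligned strings.
    A new dict is returned and the input is not modified.
    """
    masked = dict(alignment)
    masked[name] = mask_region(alignment[name], start, end, gap)
    return masked


if __name__ == "__main__":
    # Toy alignment: seqB carries a junk run in columns 4-7.
    alignment = {
        "seqA": "MKT-LLVAGG",
        "seqB": "MKTXXXXAGG",
        "seqC": "MKTALLVAGG",
    }
    cleaned = mask_in_alignment(alignment, "seqB", 4, 7)
    for name, seq in cleaned.items():
        print(f">{name}\n{seq}")
```

Only the named sequence changes, which matches the requirement above that
the regions of other sequences aligned to the junk are left alone. Reading
and writing the alignment file itself is left out of the sketch.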
