Input Files and Multiple Alignment

Greetings.

I am attempting to use Jalview for the first time. I have watched some of the Jalview videos and read some of the Jalview manual. I would like to do multiple sequence alignment for the sole purpose of making trees and performing some PCA (PCoA) analysis.

…But I am having some issues getting started. I have 26 E. coli genomes that I would like to align to a reference genome (EDL933) – however the 26 E. coli genomes are not complete. The 26 genomes are draft assemblies that are in fasta (nucleotide) format, and have many contigs in each fasta file.

I have tried to drag and drop the fasta files in one of the windows. Assuming I do not try to drag too many files, I am able to look at each file in its own separate window. However, each contig is listed as what looks like a new sequence that can be aligned rather than as a single fasta file (containing multiple contigs). I did do a multi-fasta file alignment (to the reference) using Mauve, but when I try to look at the alignment file in Jalview, I get the memory error, and all of the contigs are still there.

How should I address the contig/fasta file issue?

I am using the desktop version of Jalview on a CentOS computer.

Any assistance is greatly appreciated. - cer

···

On Mon, Sep 19, 2016 at 4:01 PM, cricket <errcricket@gmail.com> wrote:

Greetings.

I am attempting to use Jalview for the first time. I have watched some of the Jalview videos and read some of the Jalview manual. I would like to do multiple sequence alignment for the sole purpose of making trees and performing some PCA (PCoA) analysis.

…But I am having some issues getting started. I have 26 E. coli genomes that I would like to align to a reference genome – however the 26 E. coli genomes are not complete. The 26 genomes are draft assemblies that are in fasta (nucleotide) format, and have many contigs in each fasta file.

ISSUES:

  1. I generally run out of memory when trying to use more than one file (Out of memory when calculating consensus!!!).
    • I have opened the Jalview.lax file and put the following as the last two lines (but I am still getting the memory error):
    • lax.nl.java.option.java.heap.size.max=1000m
    • lax.nl.java.option.java.heap.size.initial=900m
  2. I have tried to drag and drop the fasta files in one of the windows. Assuming I do not try to drag too many files, I am able to look at each file in its own separate window. However, each contig is listed as what looks like a new sequence that can be aligned rather than as a single fasta file with multiple contigs. I did do a multi-fasta file alignment (to the reference) using Mauve, but when I try to look at the alignment, I get the memory error, and all of the contigs are still there.

QUESTIONS:

  1. Each fasta file (n=26) is around 5MB. How much should I increase my memory so that it is an appropriate amount?
  2. How should I address the contig/fasta file issue?

I am using the desktop version of Jalview on a CentOS computer.

[root@c588 jalview]# free -mh
total used free shared buffers cached
Mem: 126G 124G 1.3G 18M 468M 114G
-/+ buffers/cache: 9.4G 116G
Swap: 63G 62M 63G

Any assistance is greatly appreciated. - cer

Hello cricket,

For the memory issue, can you try higher max heap sizes e.g. 2000m?
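As a rough illustration of that change, the sketch below rewrites the two heap properties quoted earlier in the thread inside the text of a Jalview.lax file. The property names come from the original message; the helper name `set_lax_heap` and the default values are only assumptions for the example, and the sketch works on a string so that locating and saving the real file is left to the reader.

```python
def set_lax_heap(lax_text, max_heap="2000m", initial_heap="900m"):
    """Return lax_text with the two heap-size properties set.

    Existing lines for either property are replaced; a property that is
    missing is appended at the end. All other lines are kept unchanged.
    """
    wanted = {
        "lax.nl.java.option.java.heap.size.max": max_heap,
        "lax.nl.java.option.java.heap.size.initial": initial_heap,
    }
    out = []
    seen = set()
    for line in lax_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in wanted:
            # Keep only one line per property, carrying the new value.
            if key not in seen:
                out.append(f"{key}={wanted[key]}")
                seen.add(key)
        else:
            out.append(line)
    # Append any property that was not present in the file.
    for key, value in wanted.items():
        if key not in seen:
            out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "lax.nl.java.option.java.heap.size.max=1000m\n"
    print(set_lax_heap(sample), end="")
```

Jalview would need to be restarted after editing the file for a new value to be picked up.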

For the contig/fasta issue, is it possible to email one example file (by personal email if you prefer) so I can be sure I am understanding the issue correctly?

Thanks,

Mungo

···

Mungo Carstairs
Jalview Computational Scientist
The Barton Group
Division of Computational Biology
School of Life Sciences
University of Dundee, Dundee, Scotland, UK.
www.jalview.org
www.compbio.dundee.ac.uk


From: jalview-discuss-bounces@jalview.org jalview-discuss-bounces@jalview.org on behalf of cricket work.err29@gmail.com
Sent: 20 September 2016 12:24:16
To: jalview-discuss@jalview.org
Subject: Re: [Jalview-discuss] Input Files and Multiple Alignment


The University of Dundee is a registered Scottish Charity, No: SC015096

Hi Cricket.. and Mungo.

Interesting thread ! I just wanted to add a couple of hints:

For the memory issue, can you try higher max heap sizes e.g. 2000m?

If you have javaws on your path, then try this:
javaws 'http://www.jalview.org/services/launchApp?jvm-max-heap=15G&version=Develop'

Assuming you have openjdk 1.8 or later, then you should be able to
change '15G' to *just under* your machine's physical memory - i.e. 100G,
if I read your 'free -mh' output right.
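A small sketch of how the heap value and launch URL could be put together. The URL and its `jvm-max-heap` and `version` parameters are the ones given above; the helper names and the 0.8 headroom fraction are assumptions chosen only so that the 126G reported by 'free -mh' maps to roughly the 100G suggested here.

```python
from urllib.parse import urlencode

# Launch service URL as given in this thread.
LAUNCH_URL = "http://www.jalview.org/services/launchApp"


def suggest_heap_gb(total_gb, fraction=0.8):
    """Pick a heap size a little under physical memory.

    The fraction is an illustrative assumption, not a Jalview rule.
    """
    return int(total_gb * fraction)


def build_launch_url(heap_gb, version="Develop"):
    """Build the launch URL with the requested maximum heap in gigabytes."""
    query = urlencode({"jvm-max-heap": f"{heap_gb}G", "version": version})
    return f"{LAUNCH_URL}?{query}"


if __name__ == "__main__":
    heap = suggest_heap_gb(126)  # total memory reported by 'free -mh'
    # Quote the URL on the command line so the shell does not treat '&' specially.
    print(f"javaws '{build_launch_url(heap)}'")
```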

The 'version=Develop' means you'll be launching the development version,
which will basically be version 2.10 of Jalview when we release it in
the next week or so. I'd recommend using this version, since we've been
optimising Jalview so it works better with genomic data.

file in its own separate window. However, each contig is listed as what
looks like a new sequence that can be aligned rather than as a single
fasta file (containing multiple contigs).

Yes. Most draft assemblies look like this !
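To make the point concrete, here is a minimal sketch of how a multi-record fasta file is read: every '>' header line starts a new record, so a draft assembly with many contigs loads as many separate sequences. The function name and the sample data are made up for the illustration, and this is not Jalview's own parser.

```python
def fasta_records(text):
    """Split fasta text into (header, sequence) pairs, one per '>' line."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    draft = ">contig_1\nACGT\nACGT\n>contig_2\nGG\n"
    for header, seq in fasta_records(draft):
        print(header, len(seq))
```

Counting the records in each of the 26 files this way gives a quick idea of how many sequences a viewer would have to hold at once.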

I did do a multi-fasta file alignment (to the reference) using Mauve,
but when I try to look at the alignment file in Jalview, I get the
memory error, and all of the contigs are still there.

Presumably Mauve produced a single fasta or GFF3 output file ? It is
this file you should try to read in to Jalview.

Just a final word - Jalview really works best for analysing loci - so if
you have any putative CDS annotation, you probably want to import that, too.

Let us know how you get on !
Jim

···

On 20/09/2016 14:05, Mungo Carstairs (Staff) wrote: