Non - Delimited FASTA output

jalviewcrowdadmin · 26 January 2011 17:25

Hi Jared, Peter [and Peter!]

I’ve cc’ed this to Peter Cock, who maintains biopython, because, IMHO, it sounds like there is a problem with biopython’s parser (see e.g. http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml).

As for workarounds, Jalview doesn’t have a switch to prevent the pretty-printing of FASTA files, and I’m afraid I’m not convinced it’s worth adding one. However, if you are actually coding, you could follow Peter Troshin’s suggestion and use:

compbio.data.sequence.FastaSequence.getOnelineFasta() does that.

This Java class can be found in the min-jaba-client.jar in Jalview’s lib directory, or downloaded from the JABA web site: http://www.compbio.dundee.ac.uk/jabaw

However, I suspect you aren’t actually coding in Java, in which case the easiest options would be to:

1. Use a different output format from Jalview to pass the data to biopython.
2. Pass it through a tool like EMBOSS’s seqret (http://emboss.sourceforge.net/docs/themes/SequenceFormats.html#change) to normalise it.
3. Pipe the file through a script to remove the newlines.
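
The last option, removing the newlines with a script, might look something like the Python sketch below. It is only an illustration and not part of Jalview or biopython: the function name `unwrap_fasta` is made up here, and it assumes a plain FASTA file in which every header line starts with `>` and every other non-blank line is sequence data.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    `unwrap_fasta` is a hypothetical helper name, not a Jalview or
    biopython function. Assumes header lines start with '>'.
    """
    chunks = []  # sequence fragments collected for the current record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if chunks:
                # flush the previous record's sequence as a single line
                yield "".join(chunks)
                chunks = []
            yield line
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)  # flush the final record


# Small in-memory demonstration with a wrapped two-record file.
wrapped = """>seq1 example
ACGTACGT
TTGA
>seq2
MKV-LLA
"""
for out_line in unwrap_fasta(wrapped.splitlines()):
    print(out_line)
```

To use it as a pipe filter, the same function could be fed `sys.stdin` instead of the in-memory string and its output printed line by line.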

Hope that helps!
Jim.

···

Subject: [Jalview-discuss] Non - Delimited FASTA output
Date: Wed, 26 Jan 2011 11:10:52 -0500
From: Ackers, Jared (DEWN)
To: jalview-discuss@jalview.org

-- 
-------------------------------------------------------------------
J. B. Procter  (JALVIEW/ENFIN)  Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764  http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

ackers_jared_dewn · 26 January 2011 18:23

Jim,

Thanks for the advice. I don't think that BioPython has a problem; I've found it very useful to go back and forth between tab delimited and fasta formats when necessary. I should stress that I only use it as a parser; all other functions are performed elsewhere. The only minor inconvenience I've found is that it sometimes has problems with long path lengths.

Sadly, I don't code. I maintain a rather large flatfile database of annotated sequences, and it is necessary to have the sequence in a single field. My problem is that occasionally I want to re-upload manipulated sequences to that database. When I do this, I have to do it with multiple (still quite large) subsets of data, so I'm always looking to lose a step. If the fasta file contains line feeds, each line of a sequence becomes a new record, and I can't get that program to ignore line feeds.

I've just found that if you open the alignment in MEGA 5 and export as a FASTA (*.mas) file you can eliminate the line feeds.

Thanks,

Jared

···

________________________________
From: jalview-discuss-bounces@jalview.org [mailto:jalview-discuss-bounces@jalview.org] On Behalf Of Jim Procter
Sent: Wednesday, January 26, 2011 12:26 PM
To: jalview-discuss@jalview.org
Cc: Peter
Subject: Re: [Jalview-discuss] Fwd: Non - Delimited FASTA output

Hi Jared, Peter [and Peter!]

I've cc'ed this to Peter Cock, who maintains biopython, because - IMHO, it sounds like there is a problem with biopython's parser (see e.g. http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml).

As for workarounds, Jalview doesn't have a switch to prevent the pretty-printing of FASTA files, and I'm afraid I'm not convinced it's worth adding one. However, if you are actually coding, you could follow Peter Troshin's suggestion and use:

On 26/01/2011 16:23, Peter Troshin wrote:
compbio.data.sequence.FastaSequence.getOnelineFasta() does that.

This Java class can be found in the min-jaba-client.jar in Jalview's lib directory, or downloaded from the JABA web site: http://www.compbio.dundee.ac.uk/jabaw

However, I suspect you aren't actually coding in Java, in which case the easiest options would be to:
1. Use a different output format from Jalview to pass the data to biopython.
2. Pass it through a tool like EMBOSS's seqret (http://emboss.sourceforge.net/docs/themes/SequenceFormats.html#change) to normalise it.
3. Pipe the file through a script to remove the newlines

Hope that helps!
Jim.

-------- Original Message --------
Subject: [Jalview-discuss] Non - Delimited FASTA output
Date: Wed, 26 Jan 2011 11:10:52 -0500
From: Ackers, Jared (DEWN)
To: jalview-discuss@jalview.org

Is there a way to output FASTA files of an alignment that does not contain a line-feed character, i.e., one where the entire sequence is on one line? BioPython will parse Line-feed containing FASTA to tab delimited, but I was hoping to circumvent this.

Thanks,

Jared

______________________________________________________________________
CAUTION: This message was sent via the Public Internet and its authenticity cannot be guaranteed.

PROPRIETARY: This e-mail contains proprietary information some or all of which may be legally privileged. It is intended for the recipient only. If an addressing or transmission error has misdirected this e-mail, please notify the authority by replying to this e-mail. If you are not the intended recipient you must not use, disclose, distribute, copy, print, or rely on this e-mail.

jalviewcrowdadmin · 26 January 2011 20:31

understood - Peter’s comment helped clear up the situation.

Glad to hear the problem’s solved!
Jim.
ps. sounds like you actually wanted the pfam output, which is SequenceID<tab><sequence>.
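
For anyone who would rather script that layout than export it, here is a rough Python sketch that turns wrapped FASTA into one record per line, with the ID, a tab, and the sequence, as described in the postscript above. It is not Jalview's own exporter and the exact separator Jalview writes may differ. The function name `fasta_to_tab` is invented for illustration, and taking the ID to be the first whitespace-delimited token after `>` is an assumption.

```python
def fasta_to_tab(lines):
    """Return a list of 'ID<TAB>sequence' rows, one per FASTA record.

    `fasta_to_tab` is a hypothetical helper, not part of Jalview or
    biopython. Assumption: the ID is the first token after '>'.
    """
    rows = []
    seq_id = None   # ID of the record currently being read
    chunks = []     # sequence fragments for that record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq_id is not None:
                rows.append(seq_id + "\t" + "".join(chunks))
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)  # sequence data before any header is skipped
    if seq_id is not None:
        rows.append(seq_id + "\t" + "".join(chunks))  # final record
    return rows


# In-memory demonstration: each sequence ends up in a single field.
example = [">seqA/1-8 some description", "ACDE", "FGHI", ">seqB/1-4", "KLMN"]
for row in fasta_to_tab(example):
    print(row)
```

Because each record occupies exactly one line, a loader that treats every line feed as a new record sees one record per sequence.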

···

On 26/01/2011 18:23, Ackers, Jared (DEWN) wrote:

ackers_jared_dewn · 26 January 2011 21:16

PFAM works great, thanks!

···

________________________________
From: Jim Procter [mailto:foreveremain@gmail.com] On Behalf Of Jim Procter
Sent: Wednesday, January 26, 2011 3:32 PM
To: Ackers, Jared (DEWN)
Cc: jalview-discuss@jalview.org; Peter
Subject: Re: [Jalview-discuss] Fwd: Non - Delimited FASTA output

On 26/01/2011 18:23, Ackers, Jared (DEWN) wrote:
Jim,

Thanks for the advice. I don't think that BioPython has a problem; I've found it very useful to go back and forth between tab delimited and fasta formats when necessary. I should stress that I only use it as a parser; all other functions are performed elsewhere. The only minor inconvenience I've found is that it sometimes has problems with long path lengths.

understood - Peter's comment helped clear up the situation.

Sadly, I don't code. I maintain a rather large flatfile database of annotated sequences, and it is necessary to have the sequence in a single field. My problem is that occasionally I want to re-upload manipulated sequences to that database. When I do this, I have to do it with multiple (still quite large) subsets of data, so I'm always looking to lose a step. If the fasta file contains line feeds, each line of a sequence becomes a new record, and I can't get that program to ignore line feeds.

I've just found that if you open the alignment in MEGA 5 and export as a FASTA (*.mas) file you can eliminate the line feeds.

Glad to hear the problem's solved!
Jim.
ps. sounds like you actually wanted the pfam output, which is SequenceID<tab><sequence>.
