Test files in Stockholm format

Morning, Natasha.

I've written some test code and it seems to be working properly. The only
problem is "non-standard" (a.k.a. non-secondary-structure) sequence
annotations. Are those annotations important in Jalview? For example, I
have an original file like this:

O83071/192-246
MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA
999887756453524252..55152525....36463774777
O83071/259-312
MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS
CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE

The parsing code saves the alignment annotation for secondary structure only
(i.e. #=GR O83071/259-312 SS ...), but not this one: #=GR
O83071/192-246 SA

My output file looks like

O83071/192-246
MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
O83071/259-312
MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS
CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE

Is that OK? If not, I have an idea for how to change the Stockholm parsing
code so that I can save all alignment annotations.

I checked the parsing code and what you see is exactly right - although the 'SA' annotation is supported, the parser doesn't do anything with it.
It'd be great if you could fix that issue too - but since it's not quite related to Stockholm output, I've raised it as another enhancement:

http://issues.jalview.org/browse/JAL-1207
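
One way to keep every per-residue annotation, not just SS, is to store each #=GR feature under its own tag. The following is only a minimal sketch of that idea: GrAnnotationSketch and parseGr are hypothetical names, not Jalview classes, and it assumes each annotation sits on a single line of the form "#=GR <seqname> <feature> <annotation>".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Jalview): keeps every #=GR feature per sequence. */
final class GrAnnotationSketch {

    /** Returns sequence name -> (feature tag, e.g. SS or SA -> per-residue string). */
    static Map<String, Map<String, String>> parseGr(List<String> lines) {
        Map<String, Map<String, String>> bySequence = new LinkedHashMap<>();
        for (String line : lines) {
            if (!line.startsWith("#=GR ")) {
                continue; // sequence rows and other markup are ignored in this sketch
            }
            // At most four fields: "#=GR", sequence name, feature tag, annotation.
            String[] fields = line.trim().split("\\s+", 4);
            if (fields.length < 4) {
                continue; // malformed line: skip rather than guess
            }
            bySequence
                .computeIfAbsent(fields[1], k -> new LinkedHashMap<>())
                // Interleaved files repeat a feature in each block, so append.
                .merge(fields[2], fields[3], String::concat);
        }
        return bySequence;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "#=GR O83071/192-246 SA 999887756453524252..55152525....36463774777",
            "#=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE");
        // Both the SA and the SS annotation are retained.
        System.out.println(parseGr(lines));
    }
}
```

Keying on the feature tag means no annotation type is silently dropped; how each tag is then shown in Jalview is a separate decision.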

Remember, when you check the code in, to mention the appropriate JAL-XXX issue numbers in the commit message so everyone can see which commits relate to which issue at issues.jalview.org.

Looking forward to trying out (or getting others to try out..) the new code !
Jim.

···

On Mon Nov 19 10:55:47 2012, Nataliya Sherstneva wrote:

Hi Nataliya - still catching up on my email backlog after being sick and on holiday (both at the same time :( ).

I've finished my code and pushed it into the central repository (there is
a new branch, based on the recent Release_2_8).
I've changed parsing() as we agreed, developed print() in the
StockholmFile class and slightly changed the code in the
AppletFormatAdapter class.

great!

I haven't pushed my test code yet. If you'd like I can do it as well.

It's probably worth doing that - I'm trying to get into the habit of making test cases, and have found that it is useful for others to see how to use the code you develop. Don't worry if they are untidy! You (or someone else) can always improve them later :)

And about SequenceI.getDBRef(): in files in Stockholm format I've seen
lines like this:
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071

but the StockholmFile parsing method doesn't save them. Do you mean that another format can save this DBRef, and that I should check for it and print it?

Ah, yes. This is another issue. One of the problems with Stockholm format is that, for some Stockholm files, additional information is needed in order to identify which database a particular 'AC' annotation refers to. This is exactly the problem with Rfam and Pfam - see this bug: http://issues.jalview.org/browse/JAL-851

I took another look around at the bunch of tools/databases where stockholm is supported and I think there isn't going to be a perfect solution.

For parsing:
1. if there are records like:
#=GS DR Uniprot; O31698 - then we can create a full DBRef object immediately.
2. If there are only AC records, then there needs to be an 'assume database name' variable - say we call it defaultDB. This could be set by the code that constructs jalview.io.StockholmFile using a get/set method, or be set automatically if it looks like we are accessing an Xfam database, e.g.:

If the file contains an alignment database reference like:
#=GF AC PF......
then we can assume it's an alignment file originally from Pfam (all Pfam alignments have IDs like PF012345...), and the database accessions will most likely be Uniprot database accessions.

Alternately, if it has:
#=GF AC RF......

Then it's most likely to be an Rfam alignment. The default database here is more tricky - but in many cases, it will be an EMBL accession.
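
The defaultDB guess described above could be sketched roughly as follows. This is an illustration under assumptions, not the Jalview implementation: DefaultDbSketch and guessDefaultDb are hypothetical names, and the strings "UNIPROT" and "EMBL" are placeholders for whatever database names the real DBRef code expects.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper (not part of Jalview): guesses the database behind bare AC records. */
final class DefaultDbSketch {

    /** Looks for "#=GF AC PF..." or "#=GF AC RF..." and returns a best-guess database name. */
    static Optional<String> guessDefaultDb(List<String> lines) {
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length >= 3 && fields[0].equals("#=GF") && fields[1].equals("AC")) {
                if (fields[2].startsWith("PF")) {
                    return Optional.of("UNIPROT"); // Pfam: accessions are most likely Uniprot
                }
                if (fields[2].startsWith("RF")) {
                    return Optional.of("EMBL"); // Rfam: often, but not always, EMBL
                }
            }
        }
        return Optional.empty(); // no hint: the caller must set defaultDB explicitly
    }

    public static void main(String[] args) {
        // PF00001 is used here purely as an example accession.
        System.out.println(guessDefaultDb(List.of("# STOCKHOLM 1.0", "#=GF AC PF00001")));
    }
}
```

Returning an empty Optional rather than a fallback name keeps the uncertain case visible, so a value set through the get/set method can take precedence.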

Jim.

···

On Mon Nov 26 10:03:11 2012, Nataliya Sherstneva wrote:

oops - sorry folks - didn't mean to send this to jalview-discuss - but if anyone is interested in trying out Natasha's new stockholm writer, watch out for a new build of the jalview development version in the next few hours... hopefully!

Jim.
P.S. Natasha - could you reply on the jalview-dev list rather than here, or direct to me? :)

···

On 03/12/2012 11:23, Jim Procter wrote:

...

Hi Natasha.

I've done output for a case like this:

    #=GS DR Uniprot; O31698

ok. that should be fine.

Could you please clarify about Pfam AC numbers? If I recognise that a
file holds an alignment from Pfam, where should the protein IDs from the
file go: into a DBRefEntry or somewhere else?

All accessions should go into a DBRefEntry - Jalview does some normalisation to remove redundant accessions, but it needs looking at to make sure (basically, everything works on the version number in the DBRefEntry).

In some files in Stockholm format there is a line like this
#=GF AC PF...
so I know that this is the Pfam database. That's clear to me.

I've also noticed that when we generate a file from Pfam in Stockholm
format, the file doesn't have a line like
#=GF AC PF...

The reason for this is that the =GF tag is for *alignment wide* annotations. In the Pfam parser, you'll see that these annotations are added to the AlignmentI object via the 'setAlignmentProperty(..,..)' method. For the 'print()' method, you should first generate =GF type annotation from the key value pairs obtained from AlignmentI.getProperty().
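
The "generate =GF annotation from key value pairs" step might look something like the sketch below. It is an assumption-laden illustration: GfWriterSketch and printGf are hypothetical names, and a plain Map<String, String> stands in for whatever the alignment's property accessor actually returns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Jalview): writes alignment-wide properties as #=GF lines. */
final class GfWriterSketch {

    /** Emits one "#=GF <key> <value>" line per property, in the map's iteration order. */
    static String printGf(Map<String, String> properties) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            out.append("#=GF ")
               .append(entry.getKey())
               .append(' ')
               .append(entry.getValue())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example values only; a real alignment would supply its own properties.
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("AC", "PF00001");
        properties.put("ID", "Example");
        System.out.print(printGf(properties));
    }
}
```

An ordered map is used so that the =GF lines come out in a stable order, which also makes round-trip tests easier to write.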

so in this case, should I analyze AC from this:

#=GS A7IWM5_PBCVN/3-170 AC A7IWM5.1

by using regex like [A-Z][0-9][A-Z0-9]{4} .?

Yes. Except you'll also want to capture the '1' after the '.' as the version number (there's a field in DBRefEntry for this). However, that's only the case for Pfam. In the case of Rfam it will very definitely depend on where the accession has come from (sometimes, database accessions include '.1' or '.2' to indicate the first or second processed product, or to refer to a subpart of the parent accession).
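
For the Pfam case only, capturing both the accession and the optional version could be sketched as below. GsAccessionSketch and its Accession holder are hypothetical names (not Jalview's DBRefEntry), and the accession pattern is simply the one proposed in the question above with a version group added; it should not be applied to Rfam files, for the reasons just given.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Jalview): parses Pfam-style "#=GS <seq> AC <acc>[.<ver>]". */
final class GsAccessionSketch {

    /** Simple value holder standing in for a database reference entry. */
    static final class Accession {
        final String sequenceName;
        final String id;
        final String version; // empty when the record carries no ".N" suffix

        Accession(String sequenceName, String id, String version) {
            this.sequenceName = sequenceName;
            this.id = id;
            this.version = version;
        }
    }

    // Group 1: sequence name, group 2: accession, group 3: optional version digits.
    private static final Pattern GS_AC = Pattern.compile(
        "^#=GS\\s+(\\S+)\\s+AC\\s+([A-Z][0-9][A-Z0-9]{4})(?:\\.(\\d+))?\\s*$");

    static Optional<Accession> parse(String line) {
        Matcher m = GS_AC.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        String version = m.group(3) == null ? "" : m.group(3);
        return Optional.of(new Accession(m.group(1), m.group(2), version));
    }

    public static void main(String[] args) {
        parse("#=GS A7IWM5_PBCVN/3-170 AC A7IWM5.1").ifPresent(a ->
            System.out.println(a.sequenceName + " " + a.id + " " + a.version));
    }
}
```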

With regard to testing the generation/import of these GS AC records - I'm not sure how useful it is to regenerate these bare GS AC records. They are only really valid in the context of the parent alignment's own accession code (e.g. PF or RF), so jalview should only generate them if they came from an alignment with a PF/RF source.

something to ponder over the weekend...
Jim.

···

On Wed Dec 5 13:54:56 2012, Nataliya Sherstneva wrote: