Alignment Annotation File - SEQUENCE_REF format

steffen_schmidt · 15 January 2014 10:23

Hi,

In your manual about annotation files you describe:
http://www.jalview.org/help/oldhelp/html/features/annotationsFormat.html

…
You can associate an annotation with a sequence by preceding its definition with the line:

SEQUENCE_REF seq_name [startIndex]
…

I wonder what the exact format of seq_name is:

Imagine I get a FASTA file like this:

>db|183474|my_pet_protein

Do I have to put in the full id or are other variations ok?

SEQUENCE_REF db|183474|my_pet_protein 1
SEQUENCE_REF 183474 1
SEQUENCE_REF my_pet_protein 1

Background: Accession numbers usually don’t tell you the species name, so I would like to add the species info to the sequence name to quickly spot the organism, e.g. my_pet_protein|Escherichia_coli. But then I would need to change seq_name in the annotation file if I can’t use a shorthand…

Thanks
Steffen

jalviewcrowdadmin · 15 January 2014 11:33

Hi Steffen - thanks for your mail!

Steffen Schmidt wrote:

In your manual about annotation files you describe:
http://www.jalview.org/help/oldhelp/html/features/annotationsFormat.html

...
You can associate an annotation with a sequence by preceding its definition with the line:

SEQUENCE_REF seq_name [startIndex]
...

I wonder what the exact format of seq_name is:

Imagine I get a FASTA file like this:

>db|183474|my_pet_protein

Do I have to put in the full id or are other variations ok?

SEQUENCE_REF db|183474|my_pet_protein 1
SEQUENCE_REF 183474 1
SEQUENCE_REF my_pet_protein 1

Background: Accession numbers usually don’t tell you the species name, so I would like to add the species info to the sequence name to quickly spot the organism, e.g. my_pet_protein|Escherichia_coli. But then I would need to change seq_name in the annotation file if I can’t use a shorthand…

Jalview's annotation file format uses exact string matches to associate tracks with a sequence, so the seq_name in a SEQUENCE_REF line has to be the full sequence ID. We made that decision because the format was designed as a way for other programs to generate data for import into Jalview.
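
To make the consequence concrete, here is a minimal Python sketch of exact-match lookup applied to your three candidate lines. It is only an illustration of the behaviour described above, not Jalview's actual code; the function name find_sequence_exact is made up for this example.

```python
# Illustrative sketch only: not Jalview's implementation.
# find_sequence_exact is a hypothetical helper name.

def find_sequence_exact(seq_name, alignment_ids):
    """Return the alignment ID equal to seq_name, or None if there is none."""
    return seq_name if seq_name in alignment_ids else None


# The single sequence from the FASTA example in the question.
alignment_ids = ["db|183474|my_pet_protein"]

full_id = find_sequence_exact("db|183474|my_pet_protein", alignment_ids)
accession_only = find_sequence_exact("183474", alignment_ids)
name_only = find_sequence_exact("my_pet_protein", alignment_ids)

print(full_id, accession_only, name_only)
```

Under exact matching, only the full ID finds the sequence; the two shorthand forms find nothing.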

It is reasonably straightforward to allow substring-based matching like you suggest - Jalview does that for Newick tree import already, so the function is available - so I can create a patch right away, if you like. I've created a new feature request for this at http://issues.jalview.org/browse/JAL-1427

However, there might be some backwards compatibility problems where an alignment includes different sequences and one sequence's ID is wholly contained in another's, so I don't think I can make substring matching the default behaviour when parsing the SEQUENCE_REF tag in annotation files. Any thoughts?
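
The ambiguity can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, and the "exact match first, then a unique substring match" policy is just one option to discuss, not something Jalview currently does.

```python
# Illustrative sketch only; names are hypothetical, not Jalview's API.

def substring_matches(seq_name, alignment_ids):
    """Return every alignment ID that contains seq_name as a substring."""
    return [seq_id for seq_id in alignment_ids if seq_name in seq_id]


def resolve(seq_name, alignment_ids):
    """One possible policy: an exact match wins; otherwise accept a
    substring match only if exactly one ID contains seq_name."""
    if seq_name in alignment_ids:
        return seq_name
    candidates = substring_matches(seq_name, alignment_ids)
    return candidates[0] if len(candidates) == 1 else None


# Two different sequences, one ID wholly contained in the other.
ids = ["my_pet_protein", "my_pet_protein|Escherichia_coli"]

ambiguous = substring_matches("my_pet_protein", ids)  # matches both IDs
exact_first = resolve("my_pet_protein", ids)          # exact match wins
unique_sub = resolve("Escherichia_coli", ids)         # only one ID contains it
unresolved = resolve("my_pet", ids)                   # two candidates, so None
```

With plain substring matching, a file that used to attach its annotation to my_pet_protein alone would now match both sequences, which is the backwards compatibility risk.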

Jim.