Sequence naming conventions - what are they?

g_m_carstairs · 7 June 2016 10:59

Currently, if you fetch from Uniprot, EMBL or PDB, a compound sequence name is constructed, e.g.

UNIPROT>accession>accession>accession>…|name|name|…

PDB>pdbId>name>chain>id (?)

EMBL>accession

but if fetching from Pfam, Rfam or Ensembl the sequence name is just the accession id.

Is there a rationale to this?

I would like to know since SequenceIdMatcher depends on it.

It has to know to look for a sequence called “UNIPROT|P1560” to resolve a UNIPROT database reference, but not to include the source database if resolving an ENSEMBL reference, which seems ad hoc.
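To make the asymmetry concrete, here is a minimal sketch of the special-casing a matcher ends up needing. It is not Jalview's SequenceIdMatcher: the class name, method name and the set of prefixed sources are hypothetical, the '|' separator is taken from the “UNIPROT|P1560” example above, and the Ensembl id in the usage note is only an example value. PDB is left out because its exact name format is uncertain.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration only; not part of Jalview. */
final class NameLookupSketch {

    // Assumption: sources whose fetched sequences are named "SOURCE|accession...".
    private static final Set<String> PREFIXED_SOURCES =
            new HashSet<>(Arrays.asList("UNIPROT", "EMBL"));

    private NameLookupSketch() {
    }

    /**
     * Returns the start of the sequence name a matcher would have to look for
     * in order to resolve a database reference (source + accession).
     */
    static String expectedName(String source, String accession) {
        String db = source.toUpperCase(Locale.ROOT);
        if (PREFIXED_SOURCES.contains(db)) {
            // Compound name: the source database is part of the sequence name.
            return db + "|" + accession;
        }
        // Pfam, Rfam, Ensembl: the name is just the accession id.
        return accession;
    }
}
```

With this sketch, expectedName("UNIPROT", "P1560") gives "UNIPROT|P1560", while expectedName("ENSEMBL", "ENSG00000139618") gives the bare accession. The hard-coded list of sources is the ad hoc part.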

Or does this problem go away once ‘primary db reference’ (JAL-2106) is, well, resolved? I guess that would remove the overloading of the sequence name with this information.

Any thoughts?

thanks



Mungo Carstairs
Jalview Computational Scientist
The Barton Group
Division of Computational Biology
School of Life Sciences
University of Dundee, Dundee, Scotland, UK.
www.jalview.org
www.compbio.dundee.ac.uk