GenBank parser

I’m trying to figure out how to populate dbRefs. Seeing the file I guess that dbRefs are /db_xref items. Right?

Not necessarily - the semantics are not quite the same, even though the label sounds similar. A DBRefEntry in jalview is an accession Id to some external database that somehow relates to the sequence entry.

If this is the case, to build a DBRefEntry I need source, version, accessionId and mapping.

Take a look at the DBRefEntry constructor javadoc:

/**
 *
 * @param source
 *          canonical source (uppercase only)
 * @param version
 *          (source dependent version string)
 * @param accessionId
 *          (source dependent accession number string)
 * @param map
 *          (mapping from local sequence numbering to source accession
 *          numbering)
 */
public DBRefEntry(String source, String version, String accessionId,
        Mapping map)

source and accessionId must be non-null, but version and map may be null. ‘source’ is typically a canonical name for a database source - the jalview.datamodel.DBRefSource class includes some hardcoded constants for sources that have special meaning to Jalview (at some point the hardcoded strings will be replaced by a more rigorous canonical name lookup system).
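To make the null rules concrete, here is a minimal sketch. DbRefSketch is a hypothetical stand-in written for this mail, not the real jalview DBRefEntry class. The Object-typed map field only stands in for Mapping, and upper-casing the source is my assumption based on the "uppercase only" note in the javadoc.

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical stand-in for a database cross-reference; NOT jalview's DBRefEntry. */
final class DbRefSketch {
    final String source;      // database name, never null
    final String version;     // source dependent version string, may be null
    final String accessionId; // accession in the external database, never null
    final Object map;         // placeholder for a Mapping, may be null

    DbRefSketch(String source, String version, String accessionId, Object map) {
        // Reject the two fields that must be present.
        Objects.requireNonNull(source, "source");
        Objects.requireNonNull(accessionId, "accessionId");
        // Assumption: normalise the source name to upper case.
        this.source = source.toUpperCase(Locale.ROOT);
        this.version = version;
        this.accessionId = accessionId;
        this.map = map;
    }

    public static void main(String[] args) {
        // version and map are simply left null when they are unknown.
        DbRefSketch ref = new DbRefSketch("taxon", null, "4932", null);
        System.out.println(ref.source + ":" + ref.accessionId);
    }
}
```

With the real class you would pass one of the DBRefSource constants where it applies, rather than a free-form string.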

In something like this:
source 1..5028
/organism="Saccharomyces cerevisiae"

Is it possible to consider source=organism, accessionId=db_xref, version=xx and mapping=yy?

No. ‘organism’ isn’t a source name. The ‘source’ feature is an annotation describing the source of the sequence data in the record. The only DBRefEntry that might come out of this is a cross reference to the NCBI taxon database - so ‘source’ == ‘taxon’ and ‘accession’ == ‘4932’ for the DBRefEntry.
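As a rough sketch of that idea (not Jalview code): assuming a /db_xref value has the form database:identifier, for example taxon:4932, it can be split at the first colon into the two non-null DBRefEntry fields. The class and method names here are made up for illustration, and upper-casing the database name is an assumption taken from the "uppercase only" note in the constructor javadoc.

```java
import java.util.Locale;

/** Hypothetical helper: splits a db_xref qualifier value into source and accession. */
final class DbXrefSplit {

    /**
     * @param value qualifier value without the surrounding quotes, e.g. "taxon:4932"
     * @return {source, accession}, or null if the value is not of the form db:id
     */
    static String[] split(String value) {
        int colon = value.indexOf(':');
        if (colon < 0) {
            return null; // no database prefix at all
        }
        String source = value.substring(0, colon).trim();
        String accession = value.substring(colon + 1).trim();
        if (source.isEmpty() || accession.isEmpty()) {
            return null; // one half is missing
        }
        // Assumption: canonical source names are upper case.
        return new String[] { source.toUpperCase(Locale.ROOT), accession };
    }

    public static void main(String[] args) {
        String[] parts = split("taxon:4932");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

The two strings would then go into the DBRefEntry constructor as source and accessionId, with version and map left null.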

And if it is:
CDS <1..206

Is it possible to consider source=protein_id, accessionId=db_xref, version=product and mapping=xx? However, the explanation of the Sample Record says of protein_id: “A protein sequence identification number, similar to the Version number of a nucleotide sequence. Protein IDs consist of three letters followed by five digits, a dot, and a version number.” Should I parse protein_id and use the result to populate version (1 in this case)?

That seems right.
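A minimal sketch of that split, assuming the protein_id follows the format quoted above (three letters, five digits, a dot, a version number). The class name and the value AAA98665.1 are only illustrative, and falling back to the whole string with a null version when the pattern does not match is my own choice, not something Jalview prescribes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: separates a protein_id into accession and version. */
final class ProteinIdSplit {
    // Three letters, five digits, a dot, then a numeric version.
    private static final Pattern ID = Pattern.compile("([A-Z]{3}\\d{5})\\.(\\d+)");

    /**
     * @param proteinId qualifier value without quotes, e.g. "AAA98665.1"
     * @return {accession, version}; version is null if the id has another format
     */
    static String[] split(String proteinId) {
        String id = proteinId.trim();
        Matcher m = ID.matcher(id);
        if (!m.matches()) {
            return new String[] { id, null }; // keep the id, leave version unset
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("AAA98665.1");
        System.out.println("accession=" + parts[0] + " version=" + parts[1]);
    }
}
```

A null version is acceptable here because the DBRefEntry constructor allows version to be null.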

CDS annotations define the coding regions for proteins. The logic in this method:;a=blob;f=src/jalview/datamodel/xdb/embl/;h=0ae49b998d1d27a1f8dac69e6eb10a4a476d4944;hb=HEAD

/**
 * attempt to extract coding region and product from a feature and properly
 * decorate it with annotations.
 *
 * @param feature
 *          coding feature
 * @param sourceDb
 *          source database for the EMBLXML
 * @param seqs
 *          place where sequences go
 * @param dna
 *          parent dna sequence for this record
 * @param noPeptide
 *          flag for generation of Peptide sequence objects
 */
private void parseCodingFeature(EmblFeature feature, String sourceDb,
        Vector seqs, Sequence dna, boolean noPeptide)
{
  boolean isEmblCdna = sourceDb.equals(DBRefSource.EMBLCDS);

does the transformation for the XML version.

And, finally, I see some correspondence between gene and CDS entries. Each time there is a gene entry, you’ll find a CDS entry following the first, and you’ll be able to relate them using /gene and A..B fields.

gene 687..3158
CDS 687..3158
/note="plasma membrane glycoprotein"
/function="required for axial budding pattern of S.

Is it possible to consider source=gene, accessionId=db_xref, version=product and mapping=xx?

Again - db_xref is parsed into source=‘GI’ and accessionId=‘1293615’ here for one DBRefEntry.

Have a look at the logic in the parseCodingFeature method. The easiest way may be to adapt the code to work with the classes you populate from the GenBank record… but you might even consider going the other way, and adapt the GenBank ‘parse’ routine to create objects from the jalview.datamodel.xdb.embl package.

Sorry for the delay in replying… I’m just trying to get a Jalview release out of the door…


Hi David.

Answer below - I started to write this then got distracted by other things, as usual.

On 24/01/2014 21:21, David Roldán Martínez wrote: