Entrez sequence fetching in Jalview

jalviewadmin · 18 September 2013 09:47

Morning David

> Are there any plans to include fetching of sequences from NCBI, i.e. by
> GI or GenBank accession?

I've been thinking about it - since it's a blocker for just about
everyone who isn't EMBL-centric.

The reason that I didn't push to do it was I thought there was an NCBI
DAS source that took GI numbers, but that seems to have disappeared (or
perhaps I imagined it :q ).

Putting something together isn't particularly difficult, but I've not
got any well-tested Java lying around that I could hook up with the
Jalview sequence fetcher. Does anyone on the list have a piece of code
that can retrieve sequences and annotation from Entrez that they'd be
happy to donate to Jalview? Otherwise, I'll try and get around to
knocking a basic 'HTTP GET' based client together for the next but one
release.

The relevant feature request is here:
http://issues.jalview.org/browse/JAL-1375

Jim.
PS. I know BioJava has Entrez fetching capability, but I've not looked
to see how straightforward it would be to separate that code from the
rest of BioJava.

On Wed Sep 18 10:17:05 2013, David Martin wrote:

hamish_mcwilliam · 18 September 2013 10:47

Hi Jim,

For web services to access data available in NCBI Entrez, see E-utilities:

Entrez Programming Utilities Help - NCBI Bookshelf

The E-utilities provide both REST and SOAP interfaces to Entrez, with access to the databases and data formats available from the Entrez web interface.

Note that for Jalview you will want to register the application with NCBI (see "Usage Guidelines and Requirements" in "A General Introduction to the E-utilities", Entrez Programming Utilities Help, NCBI Bookshelf). Also note the usage guidelines described in that section: NCBI will automatically block users who abuse their services.
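
As a rough illustration of the basic 'HTTP GET' client discussed in this thread, here is a minimal sketch in Java (11 or later, for java.net.http) that builds an E-utilities efetch URL and retrieves FASTA text. The class name, method names, tool name and email address are hypothetical and are not Jalview code; the efetch endpoint and its db, id, rettype, retmode, tool and email parameters are the ones described in the NCBI E-utilities help.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jalview or any NCBI library.
final class EntrezFetchSketch {
    // Base URL of the E-utilities efetch service.
    static final String EFETCH =
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Builds a FASTA request; tool and email identify the calling application
    // to NCBI, as asked for in the usage guidelines.
    static String buildUrl(String db, String id, String tool, String email) {
        return EFETCH
            + "?db=" + enc(db)
            + "&id=" + enc(id)
            + "&rettype=fasta&retmode=text"
            + "&tool=" + enc(tool)
            + "&email=" + enc(email);
    }

    // Performs a plain HTTP GET and returns the response body as text.
    static String fetchFasta(String db, String id, String tool, String email)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
            .newBuilder(URI.create(buildUrl(db, id, tool, email)))
            .GET()
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("efetch returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A call such as `EntrezFetchSketch.fetchFasta("protein", "NP_000509", "jalview_sketch", "dev@example.org")` would then return the FASTA record as a string, with the tool and email values replaced by whatever is actually registered with NCBI.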

FWIW, E-utilities is used as one of the data sources for RefSeq data in dbfetch/WSDbfetch.

Please note that INSDC accessions (as used in DDBJ, EMBL-Bank and GenBank) are shared by the member databases as part of the collaboration agreement. So if you have GenBank accessions, these can be used with EMBL-Bank to retrieve the same sequence and features. The GenBank locus names (now rare) and GI numbers are not part of the agreement, and thus can only be used with other databases after being resolved to an accession.

On the protein side, the GenBank CDS translations (GenPept) use the INSDC protein_id accessions and are directly equivalent to data in UniProtKB; this is then supplemented with additional data from UniProtKB (direct protein submissions) and PDB. Again, the GI numbers are the only problem, since these are not part of sharing agreements with other data providers. However, UniParc (UniProt) does include GI numbers, along with many other identifiers, and thus provides a foundation for identifier mapping services such as:

- UniProt.org Database Identifier Mapping: UniProt
- PICR: http://www.ebi.ac.uk/Tools/picr/

These provide web services for mapping between identifiers from many sources, and thus allow the use of features from another data source for a sequence where the required features are not available in the source database.
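
To make the GI versus accession distinction concrete, here is a small Java sketch that decides whether an identifier would first need resolving through a mapping service. Treating an all-digit string (optionally prefixed with "gi|") as a GI number follows from the description above; the accession pattern is only a loose assumption for illustration, not the official INSDC format definition, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class IdKindSketch {
    enum Kind { GI_NUMBER, ACCESSION_LIKE, UNKNOWN }

    // A GI number is a plain integer, sometimes written with a "gi|" prefix.
    private static final Pattern GI = Pattern.compile("(?:gi\\|)?\\d+");

    // ASSUMPTION: a loose heuristic for INSDC-style accessions, i.e. a few
    // upper-case letters, at least five digits and an optional ".version".
    private static final Pattern ACCESSION =
        Pattern.compile("[A-Z]{1,6}\\d{5,}(?:\\.\\d+)?");

    static Kind classify(String rawId) {
        String id = rawId.trim();
        if (GI.matcher(id).matches()) {
            return Kind.GI_NUMBER;
        }
        if (ACCESSION.matcher(id).matches()) {
            return Kind.ACCESSION_LIKE;
        }
        return Kind.UNKNOWN;
    }

    // GI numbers are not shared between providers, so they must be mapped
    // to an accession before being used with another database.
    static boolean needsResolving(String rawId) {
        return classify(rawId) == Kind.GI_NUMBER;
    }
}
```

With this sketch, an identifier for which `needsResolving` returns true would be passed to one of the mapping services listed above, while accession-like identifiers could be sent to an INSDC member database directly.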

All the best,

Hamish

jalviewcrowdadmin · 18 September 2013 11:37

Thanks for following up, Hamish.

> For web services to access data available in NCBI Entrez, see E-utilities:
>
> Entrez Programming Utilities Help - NCBI Bookshelf
>
> The E-utilities provide both REST and SOAP interfaces to Entrez, with
> access to the databases and data formats available from the Entrez web
> interface.
>
> Note that for Jalview you will want to register the application with
> NCBI (see "Usage Guidelines and Requirements" in "A General
> Introduction to the E-utilities", Entrez Programming Utilities Help,
> NCBI Bookshelf). Also note the usage guidelines described in that
> section: NCBI will automatically block users who abuse their services.

Yes - it's the registration and fair usage management that stopped me in the past. It's ridiculously easy to spawn very large queries with Jalview.
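
One way a client might keep large queries within the usage guidelines is to group identifiers into comma-separated batches and space the requests out. The Java sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, the batch size is arbitrary, and the limit of three requests per second is the figure given in the NCBI usage guidelines for clients without an API key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class EntrezBatchSketch {
    // Joins identifiers into comma-separated groups of at most batchSize,
    // so that many sequences can be requested with a single id parameter.
    static List<String> batchIds(List<String> ids, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<String> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            batches.add(String.join(",", ids.subList(start, end)));
        }
        return batches;
    }

    // Smallest whole-millisecond pause between requests that keeps the
    // client at or below the given request rate (rounded up).
    static long minDelayMillis(int requestsPerSecond) {
        if (requestsPerSecond < 1) {
            throw new IllegalArgumentException("rate must be at least 1");
        }
        return (1000 + requestsPerSecond - 1) / requestsPerSecond;
    }
}
```

A fetch loop would then issue one request per batch and call `Thread.sleep(EntrezBatchSketch.minDelayMillis(3))` between them.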

> FWIW, E-utilities is used as one of the data sources for RefSeq data in
> dbfetch/WSDbfetch.

So - is there a way that dbfetch could act as a proxy for Jalview? That would make things ridiculously easy.
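
Whether dbfetch could be used this way is the open question here. Purely to show what such a request might look like, this Java sketch (10 or later) builds a dbfetch URL from its db, id, format and style parameters. The class name is hypothetical, and the database name "refseqn" used in the example afterwards is an assumption that should be checked against the current dbfetch database list.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class DbfetchUrlSketch {
    // Base URL of the EBI dbfetch service.
    static final String DBFETCH = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch";

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // style=raw asks for the plain record rather than an HTML page.
    static String buildUrl(String db, String id, String format) {
        return DBFETCH
            + "?db=" + enc(db)
            + "&id=" + enc(id)
            + "&format=" + enc(format)
            + "&style=raw";
    }
}
```

For example, `DbfetchUrlSketch.buildUrl("refseqn", "NM_000518", "fasta")` gives a URL that could be fetched with the same kind of plain HTTP GET as any other sequence source.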

> Please note that INSDC accessions (as used in DDBJ, EMBL-Bank and
> GenBank) are shared by the member databases as part of the collaboration
> agreement. So if you have GenBank accessions, these can be used with
> EMBL-Bank to retrieve the same sequence and features. The GenBank locus
> names (now rare) and GI numbers are not part of the agreement, and thus
> can only be used with other databases after being resolved to an
> accession.
>
> On the protein side, the GenBank CDS translations (GenPept) use the
> INSDC protein_id accessions and are directly equivalent to data in
> UniProtKB; this is then supplemented with additional data from
> UniProtKB (direct protein submissions) and PDB. Again, the GI numbers
> are the only problem, since these are not part of sharing agreements
> with other data providers.

Thanks for this useful information, Hamish!

> However, UniParc (UniProt) does include GI numbers, along with many
> other identifiers, and thus provides a foundation for identifier
> mapping services such as:
>
> - UniProt.org Database Identifier Mapping: UniProt
> - PICR: http://www.ebi.ac.uk/Tools/picr/
>
> These provide web services for mapping between identifiers from many
> sources, and thus allow the use of features from another data source for
> a sequence where the required features are not available in the source
> database.

My personal experience is that the mapping services are 'nearly' synchronized, but sometimes there are differences in the amount of information provided by NCBI and EMBL data sources. I'm hoping that the impending introduction of a direct Ensembl connection in Jalview will resolve many of these problems, but that isn't going to be available until Version 3, which will be mid-2014. Until then, we need to look at quick fixes.

Jim.

On 18/09/2013 11:47, Hamish McWilliam wrote: