Hi Jim,
Are there any plans to include fetching of sequences from NCBI, i.e. by
GI or GenBank accession?
I've been thinking about it - since it's a blocker for just about
everyone who isn't EMBL-centric.
The reason that I didn't push to do it was I thought there was an NCBI
DAS source that took GI numbers, but that seems to have disappeared (or
perhaps I imagined it :q ).
Putting something together isn't particularly difficult, but I've not
got any well tested Java lying around that I could hook up with the
Jalview sequence fetcher. Does anyone on the list have a piece of code
that can retrieve sequences and annotation from Entrez that they'd be
happy to donate to Jalview? Otherwise, I'll try and get around to
knocking a basic 'HTTP GET' based client together for the next but one
release.
The relevant feature request is here:
http://issues.jalview.org/browse/JAL-1375
Jim.
ps. I know BioJava has entrez fetching capability, but I've not looked
to see how straightforward it would be to separate that code from the
rest of BioJava.
For web services to access data available in NCBI Entrez, see the E-Utilities.
The E-Utilities provide both REST and SOAP interfaces to Entrez, and give access to the databases and data formats available from the Entrez web interface.
Note that for Jalview you will want to register the application with NCBI (see "Usage Guidelines and Requirements" in "A General Introduction to the E-utilities", Entrez Programming Utilities Help, NCBI Bookshelf). Also note the usage guidelines described in that section: NCBI will automatically block users who abuse its services.
FWIW E-Utilities is used as one of the data sources for RefSeq data in dbfetch/WSDbfetch.
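As a rough illustration of the basic 'HTTP GET' client discussed above, here is a minimal sketch against the E-Utilities efetch endpoint. The class and method names are hypothetical and are not part of Jalview or BioJava. The tool name and e-mail address are placeholders for whatever gets registered with NCBI. There is no rate limiting or retry handling here, which a real client would need in order to stay within the usage guidelines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only.
class EntrezFetchSketch {
    static final String EFETCH =
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

    // URL-encode a single query parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Build an efetch URL. db is e.g. "nuccore" or "protein"; rettype is
    // e.g. "fasta" or "gb". tool and email identify the calling application.
    static String buildEfetchUrl(String db, String id, String rettype,
                                 String tool, String email) {
        return EFETCH
            + "?db=" + enc(db)
            + "&id=" + enc(id)
            + "&rettype=" + enc(rettype)
            + "&retmode=text"
            + "&tool=" + enc(tool)
            + "&email=" + enc(email);
    }

    // Issue the GET request and return the response body as text.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(30000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("efetch returned HTTP " + status);
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```

For example, `EntrezFetchSketch.fetch(EntrezFetchSketch.buildEfetchUrl("nuccore", "U49845", "fasta", "jalview", "dev@example.org"))` should return a FASTA record if the service is reachable. Switching rettype to "gb" asks for the GenBank flat file, which carries the feature annotation.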
Please note that INSDC accessions (as used in DDBJ, EMBL-Bank and GenBank) are shared by the member databases as part of the collaboration agreement. So if you have GenBank accessions, these can be used with EMBL-Bank to retrieve the same sequence and features. The GenBank locus names (now rare) and GI numbers are not part of the agreement, and thus can only be used with other databases after being resolved to an accession. On the protein side, the GenBank CDS translations (GenPept) use the INSDC protein_id accessions and are directly equivalent to data in UniProtKB; this is then supplemented with additional data from UniProtKB (direct protein submissions) and PDB. Again, the GI numbers are the only problem, since these are not part of sharing agreements with other data providers. However, UniParc (UniProt) does include GI numbers, along with many other identifiers, and thus provides a foundation for identifier mapping services such as:
- UniProt.org Database Identifier Mapping (UniProt)
- PICR: http://www.ebi.ac.uk/Tools/picr/
These provide web services for mapping between identifiers from many sources, and thus allow the use of features from another data source for a sequence when the required features are not available in the source database.
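To illustrate the distinction between GI numbers and INSDC accessions, here is a small sketch of how a fetcher might decide whether an identifier can go straight to an INSDC database or needs resolving first. The class name and the patterns are assumptions for illustration only. The patterns cover just the common INSDC accession shapes (one letter plus five digits, or two letters plus six or eight digits, for nucleotide; three letters plus five or seven digits for protein_id) and ignore WGS, RefSeq and other formats.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical classifier, for illustration only; the patterns are simplified.
class NcbiIdKindSketch {
    enum Kind { GI_NUMBER, INSDC_NUCLEOTIDE, INSDC_PROTEIN, UNKNOWN }

    // GI numbers are plain integers.
    private static final Pattern GI = Pattern.compile("\\d+");
    // Common nucleotide accession shapes, with an optional ".version" suffix.
    private static final Pattern NUCLEOTIDE = Pattern.compile(
        "(?:[A-Z]\\d{5}|[A-Z]{2}\\d{6}|[A-Z]{2}\\d{8})(?:\\.\\d+)?");
    // Common protein_id shapes, with an optional ".version" suffix.
    private static final Pattern PROTEIN = Pattern.compile(
        "(?:[A-Z]{3}\\d{5}|[A-Z]{3}\\d{7})(?:\\.\\d+)?");

    static Kind classify(String raw) {
        String id = raw.trim().toUpperCase(Locale.ROOT);
        if (GI.matcher(id).matches()) {
            return Kind.GI_NUMBER;        // must be resolved to an accession first
        }
        if (NUCLEOTIDE.matcher(id).matches()) {
            return Kind.INSDC_NUCLEOTIDE; // usable with DDBJ, EMBL-Bank or GenBank
        }
        if (PROTEIN.matcher(id).matches()) {
            return Kind.INSDC_PROTEIN;    // INSDC protein_id
        }
        return Kind.UNKNOWN;              // e.g. RefSeq or locus names
    }
}
```

With a check like this, only identifiers classified as GI_NUMBER (or UNKNOWN) would need to go through NCBI or a mapping service; the rest could be passed to an existing EMBL-centric fetcher unchanged.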
All the best,
Hamish