dbSNP, GenBank, integration framework and i18n management

jalviewcrowdadmin · 13 January 2014 12:02

Hi David. I've cced this to the development list, so everyone can read about what you've been up to.

I've been working on this, taking a look at the Stockholm and EMBL
processes, but I cannot figure out what to do with the information
once I have loaded the SNP file and GenBank information into my own
class hierarchy.

hmm. Just a comment here: you really should avoid setting up a class hierarchy if you can avoid it - the parsing overhead from creating lots of objects is quite considerable. Jalview has a quite extensive annotation datamodel, which will work for non-hierarchical sequence features - but not for more complex compound/hierarchical features (http://issues.jalview.org/browse/JAL-1191). For the full gory details of this type of annotation, you need to read the documentation here:
International Nucleotide Sequence Database Collaboration

however - don't start hacking on this until we talk - there are some very good examples of how to implement complex/compound feature datamodels, and I'd prefer it if we first analyse those and work out which one fits Jalview's needs best.

_SNP loading_
I've been able to set up the Castor Maven plugin so that I can generate a
Java library just by customizing a pom.xml and including an XSD (or set
of XSDs). In this way, I think we'll be able to widen Jalview's data
loading to multiple sources quite easily. I can work on that (just tell me
the XSD) but I really need to fully understand the Jalview datamodel. An
E-R diagram or similar would be useful in this sense.

eek! We already have the castor source generation machinery bundled with Jalview. By using a maven plugin, you risk breaking compatibility with the bundled version of castor, which is NOT good. If you must use castor XSD->Java, then take a look at the 'castorbinding' task in build.xml - this already includes a set of XSDs that create java bindings for the Jalview project and colourscheme files which are critical for the jalview desktop.

You should also bear in mind that currently, file formats dependent on classes autogenerated with castor will not be available in the applet, since XML parsing libraries are considered too heavyweight to ship to the browser. This is the most significant reason for not using XSD->Java object mapping, but there are other reasons for avoiding it: e.g. when working with large XML files, stream XML processing avoids the memory and object creation overhead incurred by creating an object representation of elements in the document.
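To illustrate the stream processing point, here is a minimal sketch using the StAX pull parser (javax.xml.stream) from the standard Java library. It walks the document event by event and never builds an object tree. The class name, the method name and the `entry`/`feature` element names are invented for this example; they are not part of Jalview or of any real schema, and this is not how Jalview's own parsers are written.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamingFeatureScan {

    /**
     * Counts elements with the given local name by pulling parse events
     * one at a time. No per-element objects are created or retained.
     * (Hypothetical helper, for illustration only.)
     */
    public static int countElements(String xml, String localName) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            int count = 0;
            try {
                while (reader.hasNext()) {
                    // next() advances the cursor and returns the event type
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && localName.equals(reader.getLocalName())) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            // Wrap the checked exception to keep the sketch easy to call
            throw new IllegalStateException("Malformed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up document: element and attribute names are assumptions
        String xml = "<entry>"
                + "<feature type=\"variation\" start=\"10\" end=\"10\"/>"
                + "<feature type=\"gene\" start=\"1\" end=\"206\"/>"
                + "</entry>";
        System.out.println(countElements(xml, "feature"));
    }
}
```

In a real importer the loop body would read the attributes it needs (for example with reader.getAttributeValue(null, "start")) and hand them on straight away, rather than just counting.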

Re understanding Jalview's datamodel.. I know an ER diagram would help, but it will only get you so far, since you also need to think about how the data that you are trying to import into Jalview is structured (remember, the XML format may not necessarily correspond to the way that the data might most usefully be handled in Jalview).

_GenBank_
I have parsed the file to get sequences and features. In this version
of the patch (not the one attached in JIRA) I think I can translate
sequences from the file into Jalview sequences (please check) but I don't
know what happens with file headers and features. How can I inject
these into the Jalview datamodel? What is the correspondence between them?

We are going to talk through this on our next google hangout.

_Integration framework_
I've been thinking about how to integrate Jalview with other tools and
systems. In the e-learning domain there are several interesting
initiatives whose approaches are worth examining.
Take a look at these two: JISC E-learning Framework
(http://www.jisc.ac.uk/whatwedo/programmes/elearningframework.aspx)
and OKI (Open Knowledge Initiative - Wikipedia). Both
of them are based on the concept of services and service interfaces but
don't force the use of any particular implementation. This offers better
interoperability between platforms, which is a good chance to make
tool adoption grow. I'm going to work on this idea with a
colleague, trying to put these ideas in a paper to see if it's accepted
at RCIS (IEEE RCIS 2014). If you are interested in
participating with us, let me know and I'll tell him.
If you think this is a good idea, we can probably discuss this in
detail and even open the discussion to more people.

You are quite right in recognising that Jalview would benefit from being part of an information integration framework. In fact, Jalview already includes a couple. VAMSAS is a prototype data and application integration framework for bioinformatics data that I developed in collaboration with some other groups. DAS is a much more widespread data integration framework based on XML/REST services that has been around since 2001 (Main Page). It was developed for sequences and sequence features on genomes, but has been adapted to work with other types of data.

As you might imagine, I'm interested in integration frameworks, and would be interested in discussing ideas with your colleague, though I should say now that I already have enough deadlines for this year!

_i18n management_
I was wondering if it is possible to create two separate components in
JIRA, one for bugs/FR/etc. related to i18n and another for
translations. In this way, if the issue is, for example, a mistake in a
property bundle or a new language contribution, we'll mark it as
Translation related. On the contrary, if the issue is something that
doesn't work when you switch the language from English to French,
we'll mark it as Internationalization related. What do you think
about that?

Done. Translations component is here.
http://issues.jalview.org/browse/JAL/component/10780

Jim.


On Sat Jan 11 12:00:37 2014, David Roldán Martínez wrote: