[jalview 3] Extending Jalview's Sequence Feature Representation for ENSEMBL, et al.

Hi all.

David Roldán Martínez, who's currently working on a GenBank parser in Jalview, noticed some problems with the way Jalview represents Sequence Features (see below) - this prompted me to write down some thoughts on what we need for Jalview 3.

This object [jalview.datamodel.SequenceFeature] has begin and end positions referring to the positional nature of the feature. However, what happens when this location is not a range but a join or a complement of ranges, as can happen in GenBank (not sure what happens in other file formats)? Or is there any means to translate these kinds of ranges to a begin..end one?

This deficiency is mostly covered by http://issues.jalview.org/browse/JAL-1191 - support for hierarchies of features and feature groups, but there is a complication since Jalview has another way of representing coding region annotation.

For a series of coding regions in ENA records, Jalview creates jalview.datamodel.SequenceFeature objects to highlight the regions, and also constructs a jalview.datamodel.Mapping object which associates coding positions on the ENA dataset sequence with positions on the derived sequence (which is usually protein, but could be a transcript sequence). These Mapping objects are attached as DBRefEntry cross-reference objects on each sequence, and processed by the routines in jalview.analysis.CrossRef.
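To make the idea of a coding-region mapping concrete, here is a minimal sketch of how a set of disjoint coding ranges can be mapped onto positions in a protein product. The CodingMap class and its toProteinPosition method are hypothetical names made up for illustration; this is not the jalview.datamodel.Mapping implementation. It assumes forward-strand ranges only, given in transcript order with 1-based inclusive coordinates.

```java
/** Hypothetical illustration only; not part of Jalview. */
class CodingMap {
    // Coding ranges as 1-based inclusive {start, end} pairs, in transcript order.
    private final int[][] exons;

    CodingMap(int[][] exons) {
        this.exons = exons;
    }

    /**
     * Returns the 1-based residue position that a nucleotide position
     * contributes to, or -1 if the position lies outside every coding range.
     */
    int toProteinPosition(int dnaPos) {
        int codingBasesBefore = 0; // coding bases in the ranges already passed
        for (int[] exon : exons) {
            if (dnaPos >= exon[0] && dnaPos <= exon[1]) {
                // Three coding bases per residue; integer division picks the codon.
                return (codingBasesBefore + dnaPos - exon[0]) / 3 + 1;
            }
            codingBasesBefore += exon[1] - exon[0] + 1;
        }
        return -1;
    }
}
```

For example, with ranges 10..18 and 31..39, position 31 is the tenth coding base and so falls in residue 4, while position 25 (between the two ranges) maps to nothing.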

Additionally, there are some fields (e.g. otherDetails) that are publicly "available", when it is common practice to hide them and provide access through getters/setters.

This is something that needs to be looked at carefully. We originally avoided getters/setters for some fields in order to avoid access overheads. However, for OSGi, we will need to create an interface for sequence feature objects, so getters/setters will be unavoidable.




As far as I see it, there are two issues to focus on.

1. How to represent complex features

Most systems employ some kind of linking model (e.g. 'parent'/'children'/'edge' type links).

This fits well with GFF3: http://gmod.org/wiki/GFF3
To summarise:
* A 'Parent' field links a feature to a parent feature that has a matching 'ID' field.
* Several different features sharing the same ID field are 'siblings'.
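As a rough sketch of how the ID/Parent links could be resolved, the following builds a parent-to-children index from GFF3 attribute columns. Gff3Links, parseAttributes and childrenByParent are hypothetical names for illustration, not existing Jalview code. It assumes unescaped attribute values, and it skips features that lack an ID, which a real parser would still need to handle.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of Jalview. */
class Gff3Links {
    /** Splits an attribute column such as "ID=exon1;Parent=mRNA1" into key/value pairs. */
    static Map<String, String> parseAttributes(String attributeColumn) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (String pair : attributeColumn.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attrs.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return attrs;
    }

    /** Maps each parent ID to the IDs of the features that name it in 'Parent'. */
    static Map<String, List<String>> childrenByParent(List<String> attributeColumns) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (String column : attributeColumns) {
            Map<String, String> attrs = parseAttributes(column);
            String id = attrs.get("ID");
            String parent = attrs.get("Parent");
            if (id == null || parent == null) {
                continue; // simplification: only linked features with an ID are indexed
            }
            // GFF3 allows a comma-separated list of parents.
            for (String p : parent.split(",")) {
                children.computeIfAbsent(p.trim(), k -> new ArrayList<>()).add(id);
            }
        }
        return children;
    }
}
```

Given the attribute columns "ID=gene1", "ID=mRNA1;Parent=gene1", "ID=exon1;Parent=mRNA1" and "ID=exon2;Parent=mRNA1", the index lists mRNA1 under gene1, and exon1 and exon2 under mRNA1.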

The question is, how should these be managed in Jalview? Currently, Jalview holds features as simple lists, which aren't so efficient (see nested containment lists - http://www.jalview.org/pipermail/jalview-dev/2012-June/000220.html ). Complex relationships can be represented via the ID and parent fields stored in 'otherDetails' (which are parsed from GFF files) but nothing operates on them.

Ultimately, a family of SequenceFeatureI and FeatureCollectionI type interfaces is needed to interact with and manipulate simple and complex features. These need to be scalable, since the associated sequence may be a chromosome or genomic contig, and compatible with database/file backed array storage.
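To give a feel for what such interfaces might look like, here is one possible sketch. Only the names SequenceFeatureI and FeatureCollectionI come from the discussion above; every method, and the SimpleFeature and ListFeatureCollection classes, are assumptions for illustration rather than a proposed API. The list-backed collection is deliberately naive: a scalable implementation would swap in something like a nested containment list or a database-backed store behind the same interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Assumed shape of a feature interface; all methods are illustrative. */
interface SequenceFeatureI {
    String getType();
    int getBegin();                       // 1-based inclusive start of the overall extent
    int getEnd();                         // 1-based inclusive end of the overall extent
    SequenceFeatureI getParent();         // null for a top-level feature
    List<SequenceFeatureI> getChildren(); // empty for a simple feature
}

/** Assumed shape of a collection interface; all methods are illustrative. */
interface FeatureCollectionI {
    void add(SequenceFeatureI feature);
    List<SequenceFeatureI> findOverlapping(int from, int to);
}

/** Hypothetical in-memory feature with private fields behind getters. */
class SimpleFeature implements SequenceFeatureI {
    private final String type;
    private final int begin;
    private final int end;
    private SimpleFeature parent;
    private final List<SequenceFeatureI> children = new ArrayList<>();

    SimpleFeature(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    /** Links a child to this feature and returns this feature for chaining. */
    SimpleFeature addChild(SimpleFeature child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    @Override public String getType() { return type; }
    @Override public int getBegin() { return begin; }
    @Override public int getEnd() { return end; }
    @Override public SequenceFeatureI getParent() { return parent; }
    @Override public List<SequenceFeatureI> getChildren() {
        return Collections.unmodifiableList(children);
    }
}

/** Naive list-backed collection: a linear scan, not a scalable index. */
class ListFeatureCollection implements FeatureCollectionI {
    private final List<SequenceFeatureI> features = new ArrayList<>();

    @Override public void add(SequenceFeatureI feature) {
        features.add(feature);
    }

    @Override public List<SequenceFeatureI> findOverlapping(int from, int to) {
        List<SequenceFeatureI> hits = new ArrayList<>();
        for (SequenceFeatureI f : features) {
            if (f.getBegin() <= to && f.getEnd() >= from) {
                hits.add(f);
            }
        }
        return hits;
    }
}
```

The point of going through the interfaces is that rendering and editing code would only ever see SequenceFeatureI and FeatureCollectionI, so the storage behind them can change without touching callers.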

2. What needs to change in the GUI/rendering system to visualise and interact with complex features.

This is a more complex issue. There are a number of open questions around bulk editing of individual features, searching features, etc. Ideally, hierarchical features should be handled in a similar way. The first question to ask here is: what are the must-have feature display and editing capabilities?

OK - that's my braindump for the moment. I don't plan on introducing any changes to the datamodel for 2.8.1 or 2.8.2, but we do need changes for v3 that will support import/export of data from ENSEMBL and GFF3. Any thoughts?