Some suggestions to do with 'datasets'

Hi folks,

We have datamodel classes for Alignment and Sequence which may (respectively) contain an (Alignment or Sequence) dataset.

I’ve had long-standing confusion in understanding when an object is or isn’t (or has or hasn’t) a dataset.

A couple of suggestions that might help with this.

  1. I’d like to propose adding a flag / method in each class / interface

boolean isDataset

so we can identify explicitly when the object is a dataset.

This would be safer than the method currently used in some places

“if dataset is null then this doesn’t have a dataset, so it must be a dataset”

The problem is that that doesn’t distinguish between a dataset, and a ‘parent’ that hasn’t yet created a dataset.

This is giving a lot of confusion in trying to maintain invariants like:

  • sequence features, dbrefs and pdbids should be on the sequence dataset, not the sequence

  • mappings (AlignedCodonFrame) should be on the alignment dataset (currently some are, some aren’t!)

  • copy constructors should preserve invariants (whether invoked for a dataset or its parent) (a bit broken for Sequence)

  1. enforce invariants in the classes (encapsulated); for example

Sequence.addSequenceFeature(sf) {

if (dataset != null) {

dataset.addSequenceFeature(sf);

} // and similarly for getSequenceFeature()

This is currently done this way for getSequenceFeatures() and getDbRef() (differently) but not getPDBEntry(), and not for adders/setters.

In other words, make it the responsibility of the data model classes, not their clients, to use the dataset where available.

This would simplify a lot of calling code that currently checks for existence of a dataset.

  1. abolish

Alignment.setDataset(null)

and replace it with (a new interface method)

AlignmentI.createDataset()

This would not change behaviour, but would be consistent with Sequence, make the code intention clear, and make calls to this traceable.

  1. check for (and reject) circular references in Alignment.setDataset(), Sequence.setDataset()

That would mean that recursive calls (like at 2 above) are safe from infinite loops and we don’t need to check for this in these methods.

Mungo

The University of Dundee is a registered Scottish Charity, No: SC015096

···

Mungo Carstairs
Jalview Computational Scientist
The Barton Group
Division of Computational Biology
School of Life Sciences
University of Dundee, Dundee, Scotland, UK.
www.jalview.org
www.compbio.dundee.ac.uk

Hi Mungo - this all seems reasonable, but we should conduct a bit of
argumentation to try and get the interface/package architecture right.

We have datamodel classes for Alignment and Sequence which may
(respectively) contain an (Alignment or Sequence) dataset.

I've had long-standing confusion in understanding when an object is or
isn't (or has or hasn't) a dataset.

That's because JalviewLite doesn't have datasets (or at least, didn't
until 2.9 was released), but the Jalview Desktop does. To properly
resolve the architecture we need to decide if JalviewLite will have
dataset objects or not.

1) I'd like to propose adding a flag / method in each class / interface

   *boolean isDataset*

so we can identify explicitly when the object is a dataset.

This would be safer than the method currently used in some places

   "if dataset is null then this doesn't have a dataset, so it must
/be/ a dataset"

The problem is that that doesn't distinguish between a dataset, and a
'parent' that hasn't yet created a dataset.

that's actually context specific. Code handling external data sources by
definition know when they have created a dataset or have been given one.

This is giving a lot of confusion in trying to maintain invariants like:

- sequence features, dbrefs and pdbids should be on the sequence
dataset, not the sequence

Your previous work on this placed all of these on to the Dataset
sequence (with any derived sequence proxying to the dataset). Mostly,
that's fine, but read on...

- mappings (AlignedCodonFrame) should be on the alignment dataset
(currently some are, some aren't!)

This needs to be treated carefully. 'Reference' mappings either map
between defined coordinate spaces (as defined by a DBref, so by
definition live on the Dataset), but mappings can also exist on the
basis of some data transformation (e.g. a translation). We currently
treat manual translation as a dataset level operation, rather than an
alignment specific operation - but there are other mapping like
constructs - specifically a molecular model, which actually depend on
the alignment.

- copy constructors should preserve invariants (whether invoked for a
dataset or its parent) (a bit broken for Sequence)

This isn't right IMHO (and not the bit about being broken for Sequence)
* datasets are singletons. Making a copy of the dataset makes a new
separate dataset.
* making a copy of a derived instance has two cases
- 1) you make a copy preserving all 'invariants'... (I think references
are a better term here than invariants, but YMMV)
- 2) you derive a copy from a parent .. this is also multifaceted,
creating a sequence to be used in an alignment is one derivation,
another is copying a subsequence from one alignment and manipulating it.
The second is perhaps not tractable, but is one way of preserving the
context when some action (annotation, edit, alignment, analysis) was
performed. That was always the goal with the SequenceI nested dataset
referencing model (and the reason that annotation could in theory be
created on dataset and alignment sequences).

2) enforce invariants /in the classes (encapsulated)/; for example

*Sequence.addSequenceFeature(sf) {*

* if (dataset != null) {*

* dataset.addSequenceFeature(sf);*

*} * // and similarly for getSequenceFeature()

This is currently done this way for getSequenceFeatures() and getDbRef()
(differently) but not getPDBEntry(), and not for adders/setters.

again - this is a result of the dual-context behaviour required for
reusing the datamodel code in the applet and application.

In other words, make it the responsibility of the data model classes,
not their clients, to use the dataset where available.

This would simplify a lot of calling code that currently checks for
existence of a dataset.

hmm. not sure it would simplify anything in the gui or applet gui
packages - but I take your point.

I'd actually be more in favour of introducing an explicit
class/interface signature that distinguished dataset specific operations
from generic SequenceI and AlignmentI operations. For instance, it is
currently possible to insert gaps into a dataset sequence.. but that
should be forbidden, since a dataset sequence is by definition one
without any gaps - but that's only enforced at construction time.

3) abolish

* Alignment.setDataset(null)*

and replace it with (a new interface method)

* AlignmentI.createDataset()*

OK. There will be a couple of gotchas though - since a null dataset
reference is a semaphore relied on by the Vamsas and jalview XML import
routines to trigger dataset creation.

This would not change behaviour, but would be consistent with Sequence,
make the code intention clear, and make calls to this traceable.

Fair enough.

4) check for (and reject) circular references in Alignment.setDataset(),
Sequence.setDataset()

I feel that #2 and #3 are somehow in conflict with this slightly. I'd
really prefer it if we could design out the possibility of creating
circular references in the first place.

That would mean that recursive calls (like at 2 above) are safe from
infinite loops and we don't need to check for this in these methods.

I see the reason - but cyclic dataset references are an implementation
error anyhow, so we shouldn't have to ever check for them except when
debugging new code.

let the discussion continue !!!!
Jim..

···

On 27/11/2015 16:26, Mungo Carstairs (Staff) wrote:

--
-------------------------------------------------------------------
Dr JB Procter, Jalview Coordinator, The Barton Group
Division of Computational Biology, College of Life Sciences
University of Dundee, Dundee DD1 5EH, UK.
+44 1382 388734 | www.jalview.org | www.compbio.dundee.ac.uk