Rfam web site URL and refactoring advice

Hi Lauren.

The notes I made yesterday about Rfam retrieval URLs can be found at the end of the email. But first, the advice about refactoring the Pfam database fetcher.

Currently, jalview has a set of classes for Pfam retrieval that look like this:

jalview.ws.dbsources.Pfam - abstract class that contains the code to retrieve, parse and annotate the alignment retrieved from a URL source. Currently, it is hardwired to annotate each retrieved sequence in the alignment with the 'PFAM' database reference used to retrieve the alignment.

jalview.ws.dbsources.PfamFull : concrete class extending Pfam that provides the URL for retrieving the whole family
jalview.ws.dbsources.PfamSeed : concrete class extending Pfam for retrieving just the seed alignment

The PfamFull and PfamSeed classes are registered as database sources in the jalview.ws.SequenceFetcher() constructor - note that only a reference to the class is passed to the addDBRefSourceImpl function.
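
As a rough illustration of what "only a reference to the class is passed" means, the sketch below registers sources by class and lets the registry create the instances itself. All of the types here (DbSourceSketch, RegistrySketch, the two Pfam stand-ins and the names they return) are hypothetical and are not Jalview's actual code; only the method name addDBRefSourceImpl comes from the description above, and its real signature may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point. Every type in this file is a hypothetical stand-in.
class RegistryDemo {
    public static void main(String[] args) {
        RegistrySketch fetcher = new RegistrySketch();
        // Pass the class itself, not an instance.
        fetcher.addDBRefSourceImpl(PfamFullSketch.class);
        fetcher.addDBRefSourceImpl(PfamSeedSketch.class);
        System.out.println(fetcher.sourceNames());
    }
}

interface DbSourceSketch {
    String getDbName();
}

class PfamFullSketch implements DbSourceSketch {
    public String getDbName() { return "PFAM (Full)"; }
}

class PfamSeedSketch implements DbSourceSketch {
    public String getDbName() { return "PFAM (Seed)"; }
}

class RegistrySketch {
    private final List<DbSourceSketch> sources = new ArrayList<>();

    // The registry instantiates the source through its no-argument constructor.
    void addDBRefSourceImpl(Class<? extends DbSourceSketch> impl) {
        try {
            sources.add(impl.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + impl.getName(), e);
        }
    }

    List<String> sourceNames() {
        List<String> names = new ArrayList<>();
        for (DbSourceSketch source : sources) {
            names.add(source.getDbName());
        }
        return names;
    }
}
```

With this shape, adding a new source is one extra addDBRefSourceImpl call, provided the new class has a no-argument constructor.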

Adding an RfamSeed and RfamFull class:
The ideal route would be to re-use as much of the existing code in jalview.ws.dbsources.Pfam as possible. To do this, you need to generalise the jalview.ws.dbsources.Pfam class. You can do this in a couple of ways, but I'd suggest trying out eclipse's refactoring tool:

1. Select Pfam at the beginning of the class definition in the jalview.ws.dbsources.Pfam class.
2. In the refactoring menu, select 'Extract superclass'.
3. The dialog will let you enter a new class name (Xfam), and also allow you to select methods in the Pfam class that you want to either 'pull up' or define as abstract in the new superclass. All you need to pull up is the getSequenceRecords method, but you'll also need to define the getPFAMURL and getDbVersion() methods as abstract in Xfam - this is because they are mentioned in the getSequenceRecords method (try leaving either or both of them out and see what happens).
4. Once you've selected the methods that should be moved in to the new superclass, hit the next button, and you'll be shown various previews, and also told if any of the changes you make cause any errors. You can always go back to change the options you selected.
5. Finally, cross your fingers and hit finish to do the refactoring.

The code will all run as before, since refactoring a class hierarchy modifies the structure without affecting the actual run-time behaviour of the code that is executed.

The new class structure looks like this:

Xfam <- Pfam
Pfam <- PfamFull
Pfam <- PfamSeed
(this is UML-style notation. The <- means 'is the superclass of' when reading from left to right, or 'extends' when reading from right to left).
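
A simplified sketch of the hierarchy after 'Extract superclass' might look like the following. The class and method names follow the description above, but the method bodies, return types, URLs and version string are placeholders assumed for illustration (the real getSequenceRecords retrieves and parses an alignment rather than returning a String).

```java
// Demo entry point; prints what each concrete source would request.
class XfamHierarchyDemo {
    public static void main(String[] args) {
        System.out.println(new PfamFull().getSequenceRecords("EXAMPLE_ID"));
        System.out.println(new PfamSeed().getSequenceRecords("EXAMPLE_ID"));
    }
}

abstract class Xfam {
    // Declared abstract here because getSequenceRecords refers to them.
    protected abstract String getPFAMURL();

    public abstract String getDbVersion();

    // Pulled up from Pfam. Placeholder body: the real method fetches and parses.
    public String getSequenceRecords(String accession) {
        return getPFAMURL() + accession + " (version " + getDbVersion() + ")";
    }
}

abstract class Pfam extends Xfam {
    @Override
    public String getDbVersion() {
        return "unknown"; // placeholder value
    }
}

class PfamFull extends Pfam {
    @Override
    protected String getPFAMURL() {
        return "http://example.org/pfam/full/"; // placeholder URL
    }
}

class PfamSeed extends Pfam {
    @Override
    protected String getPFAMURL() {
        return "http://example.org/pfam/seed/"; // placeholder URL
    }
}
```

Leaving getPFAMURL or getDbVersion out of Xfam in this sketch makes getSequenceRecords fail to compile, which is the error the refactoring preview warns about.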

What you will then need to do is make three new classes:

Xfam <- Rfam
Rfam <- RfamFull
Rfam <- RfamSeed

However, this means that Xfam should contain only methods relevant to both Rfam and Pfam, and you'll notice that Xfam is still contaminated with a reference to the PFAM database accession label. The way to fix this is:

1. Introduce a new abstract method in Xfam, and replace the references to DBRefSource.PFAM in the getSequenceRecords method with calls to it:

abstract String getXfamSource();

2. Each specific Xfam source family should implement this, for example, in jalview.ws.dbsources.Pfam:

public String getXfamSource() { return jalview.datamodel.DBRefSource.PFAM; }

will define the correct parent reference for all the PFAM family sources.
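
Put together, the two steps above could be sketched as follows. DBRefSourceSketch stands in for jalview.datamodel.DBRefSource (its string values are assumed), dbRefFor is a hypothetical placeholder for the annotation step inside getSequenceRecords, and the Rfam side shows how the same hook would be filled in once an RFAM constant exists.

```java
// Demo entry point; every type in this file is an illustrative stand-in.
class XfamSourceDemo {
    public static void main(String[] args) {
        System.out.println(new PfamSeedDemo().dbRefFor("EXAMPLE_ID"));
        System.out.println(new RfamSeedDemo().dbRefFor("EXAMPLE_ID"));
    }
}

// Stand-in for jalview.datamodel.DBRefSource; the values are assumptions.
final class DBRefSourceSketch {
    static final String PFAM = "PFAM";
    static final String RFAM = "RFAM";

    private DBRefSourceSketch() {
    }
}

abstract class XfamBase {
    // Step 1: the superclass asks subclasses which database they belong to.
    abstract String getXfamSource();

    // Placeholder for the code that previously used DBRefSource.PFAM directly.
    String dbRefFor(String accession) {
        return getXfamSource() + ":" + accession;
    }
}

// Step 2: each family answers once, on behalf of its Full and Seed subclasses.
abstract class PfamBase extends XfamBase {
    @Override
    public String getXfamSource() {
        return DBRefSourceSketch.PFAM;
    }
}

abstract class RfamBase extends XfamBase {
    @Override
    public String getXfamSource() {
        return DBRefSourceSketch.RFAM;
    }
}

class PfamSeedDemo extends PfamBase {
}

class RfamSeedDemo extends RfamBase {
}
```

The concrete Seed and Full classes inherit the right label without any changes of their own.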

OK. It sounds long-winded, but that's because I've spelled out why you need to do each operation. It actually took me less than five minutes to refactor and add in the new abstract method, as opposed to a lot of manual copying and pasting, find/replaces and renaming of files. Note - I've not had to touch the PfamFull or PfamSeed classes at all. In principle, all you'll need to do is create a new abstract class Rfam, and then the concrete classes RfamFull and RfamSeed, using the eclipse 'New class' wizard, and then fill in the method code specific to each class.

==== Retrieving Rfam families via a REST web service

Base URL for HTML is: http://rfam.janelia.org/

This is the form from janelia that allows you to download the alignment:

<FORM method="POST" action="/cgi-bin/getalignment">
<INPUT type="radio" name="type" value="seed" CHECKED> Seed (5 sequences)<br />

<INPUT type="radio" name="type" value="full"> Full (45 sequences)
<br />
Format:
<SELECT name="fmt" size=1>
<OPTION value="stockholm" SELECTED>Stockholm
<OPTION value="text">Plain text
<OPTION value="jalview">Jalview java viewer
<OPTION value="msf">GCG MSF format
<OPTION value="afasta">Aligned FASTA format
</SELECT>
<br />
<INPUT type="hidden" name="name" value="DsrA">
<INPUT type="submit" value="Retrieve alignment">
</FORM>

The same request can be written as a URL by assembling the form fields into a QUERY_STRING, using the Common Gateway Interface 'GET' format (http://en.wikipedia.org/wiki/Common_Gateway_Interface).

The angle brackets denote places where the class needs to provide info on the query:
http://rfam.janelia.org/cgi-bin/getalignment?type=<seed|full>&fmt=stockholm&name=<familyname>
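
A minimal sketch of how a class might fill in that template is shown below. The class and method names (RfamUrlSketch, alignmentUrl) are hypothetical; the host, path and parameter names are copied from the form above, and the family name is URL-encoded as a precaution (this uses the URLEncoder.encode overload that takes a Charset, available from Java 10).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jalview.
class RfamUrlSketch {
    // Host and path as given in the form's action attribute.
    static final String BASE = "http://rfam.janelia.org/cgi-bin/getalignment";

    // Builds the GET URL for a seed or full alignment in Stockholm format.
    static String alignmentUrl(boolean full, String familyName) {
        String type = full ? "full" : "seed";
        String name = URLEncoder.encode(familyName, StandardCharsets.UTF_8);
        return BASE + "?type=" + type + "&fmt=stockholm&name=" + name;
    }

    public static void main(String[] args) {
        // DsrA is the family used in the form's hidden 'name' field.
        System.out.println(alignmentUrl(false, "DsrA"));
        System.out.println(alignmentUrl(true, "DsrA"));
    }
}
```

In the class structure described earlier, RfamSeed and RfamFull would differ only in the value passed for the type parameter.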

=====

OK. The groovy scripts will come in another email, and I'll talk to you Thursday am!
Jim.

--
-------------------------------------------------------------------
J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

Hi Jim,

Thanks for the instructions! I've finished adding Rfam retrieval and
figured out how to change the "Fetch Sequences" menu to include Rfam.
I was able to retrieve the example from Rfam through Jalview. I've
committed these changes.

In SequenceFetcher.java in package jalview.ws, I called
addDBRefSourceImpl() for RfamSeed and RfamFull to enable calling Rfam
sequence retrieval.

I also added RFAM to DOMAINDBS in DBRefSource.java, but this didn't
seem to be necessary. I think that I need to add Rfam in other places,
such as readable file formats. What do you think?

Best,

Lauren

···

On Wed, Aug 4, 2010 at 3:51 AM, Jim Procter <jprocter@compbio.dundee.ac.uk> wrote:

Hi Lauren.

> Thanks for the instructions! I've finished adding Rfam retrieval and
> figured out how to change the "Fetch Sequences" menu to include Rfam.
> I was able to retrieve the example from Rfam through Jalview. I've
> committed these changes.

That's great. I've just taken a look - and it all seems fine :)

> In SequenceFetcher.java in package jalview.ws, I called
> addDBRefSourceImpl() for RfamSeed and RfamFull to enable calling Rfam
> sequence retrieval.

That's what was needed.

> I also added RFAM to DOMAINDBS in DBRefSource.java, but this didn't
> seem to be necessary.

Ah, good point - I need to think about that a bit further. It does no harm to do this, so leave it in for completeness :)

> I think that I need to add Rfam in other places,
> such as readable file formats. What do you think?

Nope, not at all. Rfam isn't a file format, so it's not relevant to anything in the jalview.io package.

Still working on the groovy stuff - got distracted by 3 hours of teleconferences yesterday!
Jim.

···

On 05/08/2010 01:14, Lauren Lui wrote:
