Hi Karen.
We don’t allow attachments on the list, I’m afraid - but I did take a look at what you sent.
Hi Jim and Geoff… I’ve sent a message to the discussion list and it is waiting for approval because of the attachments.
Below is the message I’ve sent … just in case it does not arrive to the discussion list…
I think that there’s quite a lot of functionality in Jalview that I could apply to my problem and I’ve started but I’m not sure if it is ok so far and how to proceed from now on.
You seem to have done a fair amount of experimentation! I do have a couple of comments:
I’ve been arranging my dataset for processing it in Jalview, I attach it here (testdata.txt) in FASTA format (hopefully)
- Now I have 123 participants, each one is a sequence of 600 time points.
- At each time point a participant can be in one of 14 states. I used 14 protein letters (A,R,N,D,C,E,Q,G,H,L,K,M,F,P) to represent my states so that Jalview can read them. Of course I selected those 14 letters without following any biological criteria, I just picked a letter for each state.
- For my dataset the P letter represents a NULL state, meaning that there is no time point evaluation for the subject, so it is used when needed to pad every subject to the full 600 time points.
P is, unfortunately, a fairly bad choice. It doesn’t matter for percent identity, but generally, P is a ‘special’ amino acid, and rarely mutates. See below…
Is it in correct FASTA format?
If Jalview read it, and the symbols appear in the alignment window in the way you would expect, then it is correct!
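If you are generating the file from Java, a minimal sketch of a FASTA writer looks like the following. The class and method names are hypothetical, and the only format assumption is the usual one: a header line starting with '>' followed by the sequence, optionally wrapped at a fixed width.

```java
class FastaWriter {
    /**
     * Formats one record: a '>' header line, then the sequence
     * wrapped at 'width' characters per line.
     * (Hypothetical helper, not part of Jalview.)
     */
    static String toFasta(String id, String seq, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        StringBuilder sb = new StringBuilder(">").append(id).append('\n');
        for (int i = 0; i < seq.length(); i += width) {
            // Append one line of at most 'width' symbols.
            sb.append(seq, i, Math.min(i + width, seq.length())).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One participant with six time points, wrapped at four per line.
        System.out.print(toFasta("P1", "ANDPAN", 4));
    }
}
```

Each participant becomes one record, and all records should have the same total length (600 symbols in your case) so that the columns line up as time points.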
I’ve tried some trees and looks fine for average distance, neighbour distance looks strange…
NJ can often look strange if the sequences are not well-related. In this case, P112 has been chosen as the ‘outgroup’ - the one that appears furthest away from all other sequences. The rest of the individuals seem to fall into two more closely related groups.
What option did you use to calculate each of these? Percentage identity and BLOSUM62 will give different trees - you’ll need to use the latter if you want distances to account for similar states.
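To make the difference concrete, here is a rough sketch of what a percentage identity measure does: it only counts positions where the two symbols are exactly the same, so two different states score the same however closely related they are. This is a simplified illustration with hypothetical names, not Jalview's own PID code (which also has to decide how to treat gaps).

```java
class PercentIdentity {
    /**
     * Percentage of positions at which two equal-length sequences
     * carry exactly the same symbol. Similar-but-different symbols
     * count as mismatches. (Simplified sketch, not Jalview's code.)
     */
    static double pid(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        if (a.isEmpty()) {
            return 0.0;
        }
        int same = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return 100.0 * same / a.length();
    }

    public static void main(String[] args) {
        System.out.println(pid("ANDP", "ANGP"));
    }
}
```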
I don’t know if there’s some way to take advantage of the following information, to give some semantics to the clusters:
- A and N states represent situations in life that are related, for example both of them refer to EDUCATION
- R, C, E, Q, L states also represent similar life situations, for example all of them deal with FAMILY
- D, G, H, K, M represent EMPLOYMENT/JOBS
- P state represents NULL
- F represents a mixture of states that have no interpretation for the moment.
You will need to recode your states to map similar states to similar amino acids - then you’ll be able to take advantage of the intrinsic amino acid similarities that the conservation and blosum62 measures employ.
There is a Venn diagram here: http://www.jalview.org/help/html/misc/aaproperties.html which indicates the various properties shared by different amino acids. If you want to have a ‘NULL’ state, then G is probably the best one to choose - but you can also use ‘-’ - the gap character. Gaps are treated specially, and might actually be closest to what you intend the nulls to indicate.
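As a rough sketch of what such a recoding could look like in Java: the map below sends each of your current letters to a new one, so that states in the same life domain land on residues that are loosely similar. The particular target residues (aromatic for EDUCATION, hydrophobic for FAMILY, charged/polar for EMPLOYMENT, X for the mixed state) are purely illustrative assumptions, so check them against the Venn diagram before relying on them. NULL is mapped to the gap character.

```java
import java.util.Map;

class StateRecoder {
    // Hypothetical recoding table: current letter -> new letter.
    // The target residues are illustrative choices, not a recommendation.
    static final Map<Character, Character> RECODE = Map.ofEntries(
        Map.entry('A', 'F'), // EDUCATION -> aromatic residues
        Map.entry('N', 'Y'),
        Map.entry('R', 'I'), // FAMILY -> hydrophobic residues
        Map.entry('C', 'L'),
        Map.entry('E', 'V'),
        Map.entry('Q', 'M'),
        Map.entry('L', 'A'),
        Map.entry('D', 'D'), // EMPLOYMENT -> charged/polar residues
        Map.entry('G', 'E'),
        Map.entry('H', 'K'),
        Map.entry('K', 'R'),
        Map.entry('M', 'Q'),
        Map.entry('P', '-'), // NULL -> gap character
        Map.entry('F', 'X')  // mixed state -> X (assumption)
    );

    /** Recodes one sequence symbol by symbol; fails on unknown letters. */
    static String recode(String seq) {
        StringBuilder out = new StringBuilder(seq.length());
        for (char c : seq.toCharArray()) {
            Character mapped = RECODE.get(c);
            if (mapped == null) {
                throw new IllegalArgumentException("Unexpected state: " + c);
            }
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(recode("FFANRDPP"));
    }
}
```

Keeping the table in one place means you can try a different assignment, regenerate the FASTA file, and compare the resulting trees.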
I think that useful information that I should use comes from the Conservation, Quality and Consensus graphs shown below the sequences
If I understand it well, the Consensus Graph shows for each of my 600 time points the most frequent states and their %
I’m having most Fs at the beginning and Ps thereafter.
Could the Quality measure be useful for my sequences?
If you re-encode your states according to the amino acid groupings, then you’ll certainly get some information from the quality and conservation measures. Conservation measures the number of common properties for the amino acids in a column, and quality measures the average score for the mutations observed in a column - so unlikely transitions will result in a lower quality score.
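For comparison, the consensus as you describe it (the most frequent state at each time point and its percentage) is simple enough to compute yourself. The sketch below uses hypothetical names and counts every symbol, including the null state; it is not Jalview's implementation, and ties are broken by taking the alphabetically first symbol.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ColumnConsensus {
    /** Counts how often each symbol occurs in one column (time point). */
    static Map<Character, Integer> counts(List<String> seqs, int col) {
        Map<Character, Integer> c = new TreeMap<>();
        for (String s : seqs) {
            c.merge(s.charAt(col), 1, Integer::sum);
        }
        return c;
    }

    /** Most frequent symbol in the column; ties go to the smallest character. */
    static char modalSymbol(List<String> seqs, int col) {
        char best = 0;
        int bestCount = -1;
        for (Map.Entry<Character, Integer> e : counts(seqs, col).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Percentage of sequences carrying the most frequent symbol. */
    static double modalPercent(List<String> seqs, int col) {
        int max = Collections.max(counts(seqs, col).values());
        return 100.0 * max / seqs.size();
    }

    public static void main(String[] args) {
        List<String> seqs = List.of("FFP", "FPP", "FAP", "APP");
        for (int col = 0; col < 3; col++) {
            System.out.println(col + ": " + modalSymbol(seqs, col)
                    + " " + modalPercent(seqs, col) + "%");
        }
    }
}
```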
The Jmol visualization looks really nice!!! Could it make sense to assign some structure to my sequences and try to visualize them like that?
Almost certainly not, I’m afraid.
Would it be useful for finding some kind of regularity or relation?
Excuse the naive question, but what do you use the Jmol visualization for?
Jmol is a molecular structure viewer. The sequences Jalview normally handles are ‘shorthand’ for biological molecules. Similar sequences - particularly evolutionarily related ones - have a similar 3D molecular structure, and can often perform the same kinds of chemical interactions (because they have similar shapes).
I’ve been programming in Java and I think that it would be really nice if I could apply/adapt some of the functionalities to my problem.
OK. You would certainly be able to create a special set of parameters for your symbols. The matrix encoding the amino acid groupings is hard-coded into Jalview’s source - so it would be straightforward for you to change the groupings to better reflect the way you are using the symbols. Take a look at the various matrices in this file:
http://source.jalview.org/gitweb/?p=jalview.git;a=blob_plain;f=src/jalview/schemes/ResidueProperties.java;hb=refs/heads/master
The basic philosophy here is that Jalview maps the letter in each sequence to an amino acid or nucleotide index, which is used to index into the various score matrices and property vectors (the indexes are given by aaIndex and nucleotideIndex). There are two types of score models used here; the rest of the file contains hard-coded colour maps for the built-in colour schemes, and lookup tables for converting indexes into text for displaying the name of the amino acid or nucleotide to the user.
The two types of models are substitution matrices, which reflect the similarity between two symbols, and property matrices, which allow ‘conservation’ to be calculated in order to reflect the number of properties that are different for the symbols in a particular column. The conservation may be very useful to you - see the paper linked to in this post for more explanation: http://www.compbio.dundee.ac.uk/pipermail/jalview-discuss/2012-May/000811.html
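To give a flavour of the first kind of model, here is a standalone sketch of a substitution score built directly from your own groupings rather than from amino acid properties. Everything in it is hypothetical: the class is not part of Jalview, the letters are the ones in your current file, and the score values (4 for identical states, 1 for different states in the same group, -2 across groups, 0 whenever a NULL is involved) are arbitrary placeholders you would want to tune.

```java
class GroupScoreModel {
    enum Group { EDUCATION, FAMILY, EMPLOYMENT, MIXED, NULL }

    /** Maps a state letter (current encoding) to its life-domain group. */
    static Group groupOf(char state) {
        switch (state) {
            case 'A': case 'N':
                return Group.EDUCATION;
            case 'R': case 'C': case 'E': case 'Q': case 'L':
                return Group.FAMILY;
            case 'D': case 'G': case 'H': case 'K': case 'M':
                return Group.EMPLOYMENT;
            case 'F':
                return Group.MIXED;
            case 'P':
                return Group.NULL;
            default:
                throw new IllegalArgumentException("Unknown state: " + state);
        }
    }

    /** Substitution score for two states; the values are arbitrary placeholders. */
    static int score(char a, char b) {
        Group ga = groupOf(a);
        Group gb = groupOf(b);
        if (ga == Group.NULL || gb == Group.NULL) {
            return 0;   // no evaluation at this time point: neutral
        }
        if (a == b) {
            return 4;   // identical state
        }
        if (ga == gb) {
            return 1;   // different state, same life domain
        }
        return -2;      // different life domain
    }

    /** Sums the per-position scores of two equal-length sequences. */
    static int pairScore(String s1, String s2) {
        if (s1.length() != s2.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        int total = 0;
        for (int i = 0; i < s1.length(); i++) {
            total += score(s1.charAt(i), s2.charAt(i));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(pairScore("ANP", "ADP"));
    }
}
```

In Jalview itself the equivalent information lives in the hard-coded matrices in ResidueProperties.java, so adapting it there means filling in a full symbol-by-symbol table rather than computing scores from a group lookup as this sketch does.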
Jim.