Align general sequences

karen · 28 May 2012 19:14

Hi! I’ve just discovered Jalview. I have a social sciences background, and I’m studying life trajectories as sequences of states.

I’m wondering if I could use Jalview to align sequences of a fixed number of symbols (for example 50), each one representing a life state of one person (A = Birth, B = Start elementary school, C = Start Ballet, D = Get married, E = First child, F = Get divorced, G = Second marriage, …, & = Quit first job, * = Night courses, etc.).

So each sequence is one person’s life and I have for example 3000 persons.

Could I apply the tree-based grouping of sequences functionality (e.g. Neighbour joining or Average distance) to get similar life trajectories using Jalview?

I hope so!!!

Thanks in advance and my apologies for this naive question…

All the best.
Karen

jalviewcrowdadmin · 30 May 2012 14:43

Hello there Karen.

> Hi! I've just discovered Jalview, I have a social sciences background, I'm studying life trajectories as sequences of states.

OK. Sounds interesting!

> I'm wondering if I could apply Jalview for aligning sequences of a fixed number of symbols (for example 50) each one representing a life state of one person (A= Birth, B= Start elementary school, C= Start Ballet, D= Get married, E= First child, F= Get divorced, G= Second marriage, ..., &= Quit first job, *= Night courses,.. etc).
> So each sequence is one person's life and I have for example 3000 persons.
>
> Could I apply the tree-based grouping of sequences functionality (e.g. Neighbour joining or Average distance) to get similar life trajectories using Jalview?
> I hope so!!!

You certainly could, in principle. Neighbour joining and average distance are both generic tree algorithms that work on a matrix of distances, and you could even use Jalview's PCA function, which is another way of looking for groups of similar sequences. However, you would first need to align your sequences (if they are not simply based on time points), and you'd also need to be careful about how you encode life states if you used Jalview's built-in distance matrix calculations.
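
To make the "matrix of distances, then a tree" idea concrete, here is a minimal Python sketch of average-distance clustering on a percent-identity style distance. It is not Jalview's implementation; the function names and the toy data are made up for illustration, and it assumes the sequences already have equal length. It is also far too slow for thousands of sequences.

```python
from itertools import combinations

def pid_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must already have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def average_distance_tree(seqs):
    """Average-linkage (UPGMA-style) clustering of a dict {name: sequence}.

    Returns the tree as nested tuples of names.
    """
    # Pairwise distance matrix, keyed by the unordered pair of names.
    dist = {frozenset(pair): pid_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def linkage(pair):
        # Mean distance over all cross-cluster pairs of members.
        (_, left), (_, right) = pair
        total = sum(dist[frozenset((x, y))] for x in left for y in right)
        return total / (len(left) * len(right))

    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:
        first, second = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not first and c is not second]
        clusters.append(((first[0], second[0]), first[1] + second[1]))
    return clusters[0][0]

# Hypothetical toy data: four people, five time points each.
lives = {"p1": "AABBD", "p2": "AABBE", "p3": "CCFFG", "p4": "CCFFD"}
print(average_distance_tree(lives))
```

The only part that knows anything about the symbols is `pid_distance`; swapping in a different distance function changes the tree without touching the clustering step.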

Most bioinformatics analyses encode molecular sequences using a standard alphabet, and Jalview applies a filter before analysing the sequences to ensure the analysis doesn't fail because of an unknown symbol. Even worse, if you use the BLOSUM62 score model for tree building, then different symbol matches are scored in different ways because some mutations are more likely than others (kind of like someone starting modern dance rather than ballet, or some other pre-teen activity).

It wouldn't be too hard to adapt Jalview's filters to support a wider range of states, or to use a different similarity model, and in fact it's something I'd like Jalview to be able to do in the future. If you, or anyone you know, is a Java programmer, then I'd happily point out the parts that need to be modified for your needs (the colour schemes wouldn't work for your symbols either, if you have more than 20).

Another alternative, which is a bit more technical but ultimately more rigorous, would be to use one of the multiple alignment tools that have no built-in rules about sequence symbols. One or two of the programs Jalview uses can align generic sequences, and the Notredame group, which produced one of the most accurate alignment programs, developed a tool called SALTT (http://www.tcoffee.org/saltt/). This supports the analysis of generic symbol sequences, and might be just what you need.

> Thanks in advance and my apologies for this naive question..

You're welcome - and the question is not naive at all! The underlying algorithms and mathematical problems used in molecular sequence analysis are exactly what you might use for your life sequence analysis problem. The only differences are in how the symbols are interpreted, and what assumptions about the data have been made. The techniques most relevant to your analysis are hidden Markov models, which were originally developed for time series analysis: some hidden process generates a sequence of observable states, and one wants to model the hidden process in order, for example, to decide whether it is similar to the process that generated another series of states.
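
For readers who have not met hidden Markov models, the sketch below shows the standard forward algorithm, which scores how well a given model explains an observed sequence of states. The two hidden states, the three observable symbols and all the probabilities are invented for illustration; a real analysis would estimate them from data.

```python
import math

def forward_log_likelihood(obs, states, start_p, trans_p, emit_p):
    """Log-probability of an observed sequence under a discrete HMM.

    Uses the forward algorithm, rescaling at each step so that long
    sequences do not underflow.
    """
    # alpha[s] = P(observations so far, current hidden state is s), up to scaling
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    log_lik = 0.0
    for symbol in obs[1:]:
        scale = sum(alpha.values())
        log_lik += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
        alpha = {s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return log_lik + math.log(sum(alpha.values()))

# Hypothetical model: a hidden life phase emits observable states A, N or D.
STATES = ("studying", "working")
START_P = {"studying": 0.8, "working": 0.2}
TRANS_P = {"studying": {"studying": 0.7, "working": 0.3},
           "working": {"studying": 0.1, "working": 0.9}}
EMIT_P = {"studying": {"A": 0.6, "N": 0.3, "D": 0.1},
          "working": {"A": 0.1, "N": 0.1, "D": 0.8}}

for trajectory in ("AAND", "DDAA"):
    print(trajectory, forward_log_likelihood(trajectory, STATES, START_P, TRANS_P, EMIT_P))
```

Comparing the log-likelihood of one person's trajectory under two different models is one way to ask which hidden process it more plausibly came from.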

I hope that helps - there are piles of literature on the subject, and I'm sure that others on the list might be able to suggest some fruitful directions.
Jim.

···

On 28/05/2012 20:14, Karen Keight wrote:

geoff.barton · 30 May 2012 15:07

Aha! I was just about to answer this as well. To add to Jim's comments, as a starting point, you could just try reading your sequences into Jalview and seeing what it makes of them. You'll need to put your strings in a text file that has one of the formats supported by Jalview - I suggest FASTA as this is pretty simple. Although Jalview is meant for sequences with at most a 20-letter alphabet, it will not complain about the non-standard letters, and if you use the standard colouring schemes and PID tree calculation you may get something that helps you interpret your strings. I just tried this with some dummy data and the tree made sense even though it was ignoring some of the characters.
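
FASTA is simply a `>` header line per sequence followed by the sequence itself. As a small sketch (the function name and the example records are hypothetical), the text could be generated like this:

```python
def format_fasta(records, width=60):
    """Return FASTA text for a dict {name: sequence}, wrapping lines at `width`."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        # Long sequences are conventionally split over several lines.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Hypothetical example: one record per person, one letter per time point.
people = {"person_001": "AABBD", "person_002": "AABBE"}
print(format_fasta(people), end="")
```

Saving the returned text to a plain text file gives something Jalview should be able to open, provided the symbols used are ones it accepts.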

This is all predicated on two assumptions:

1. Your strings are all the same length
2. You do not need to align them optimally - i.e. position 1 always aligns with 1, 2 with 2 and so on.

If you need to align them, then see Jim's comments on tools that can cope with non-protein/DNA/RNA sequences. Likewise, if you want to get serious about the interpretation of the trees you will need to recode to make sure your scoring metric for comparing strings is sensible given the data.

If you don't need to do alignment (i.e. put in insertions/deletions) then there are 101 ways of comparing the strings, scoring the comparison and plotting the results as trees, PCA plots or networks. The statistics package "R" would be a useful tool for this.
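
One of those many ways, sketched in Python rather than R: compare the strings position by position with a pluggable cost function, so that the scoring can be made to reflect the data. The grouping of states below is an invented example based on the letters in the original question, not something taken from the data.

```python
def positional_distance(a, b, cost):
    """Mean per-position cost between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(cost(x, y) for x, y in zip(a, b)) / len(a)

def identity_cost(x, y):
    """Plain mismatch count: any two different states are equally far apart."""
    return 0.0 if x == y else 1.0

# Hypothetical grouping: D, F and G all concern marriage; B and C are childhood activities.
GROUPS = {"D": "marriage", "F": "marriage", "G": "marriage",
          "B": "childhood", "C": "childhood"}

def grouped_cost(x, y):
    """Mismatches within the same group cost half as much as other mismatches."""
    if x == y:
        return 0.0
    same_group = x in GROUPS and GROUPS.get(x) == GROUPS.get(y)
    return 0.5 if same_group else 1.0

def distance_matrix(seqs, cost):
    """All-against-all distances for a dict {name: string}."""
    return {(p, q): positional_distance(seqs[p], seqs[q], cost)
            for p in seqs for q in seqs}

example = {"p1": "ABD", "p2": "ABG", "p3": "ACE"}
print(distance_matrix(example, grouped_cost))
```

The resulting matrix can be passed to whatever tree, PCA or network method is preferred.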

As an aside, about 15 years ago someone asked me a similar question - they were working on the pre-Incan Quipu knot system and wanted to compare Quipu to each other by multiple alignment. In the end I think they used an adapted version of Clustal for this since in their case it was clear the Quipu would require insertions/deletions to find common regions.

Have fun!

Geoff.

···

On 30/05/2012 15:43, Jim Procter wrote:

_______________________________________________
Jalview-discuss mailing list
Jalview-discuss@jalview.org
http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss

--
Geoff Barton, Professor of Bioinformatics, College of Life Sciences
University of Dundee, Scotland, UK. g.j.barton@dundee.ac.uk
Tel:+44 1382 385860/388731 (Fax:385764) www.compbio.dundee.ac.uk

The University of Dundee is registered Scottish charity: No.SC015096

jalviewcrowdadmin · 3 June 2012 18:14

Hi Karen.

We don’t allow attachments on the list, I’m afraid - but I did take a look at what you sent.

> Hi Jim and Geoff… I’ve sent a message to the discussion list and it is waiting for approval because of the attachments.
> Below is the message I’ve sent … just in case it does not arrive to the discussion list…
>
> I think that there’s quite a lot of functionality in Jalview that I could apply to my problem and I’ve started but I’m not sure if it is ok so far and how to proceed from now on.

You seem to have done a fair amount of experimentation! I do have a couple of comments:

> I’ve been arranging my dataset for processing it in Jalview, I attach it here (testdata.txt) in FASTA format (hopefully)
>
> Now I have 123 participants, each one is a sequence of 600 time points.
>
> At each time point a participant can be in one of 14 states, I used 14 Protein letters (A,R,N,D,C,E,Q,G,H,L,K,M,F,P) to represent my states for Jalview to read it. Of course I selected those 14 letters without following any biological criteria, just picked a letter for each state.
>
> For my dataset the P letter represents a NULL state meaning that there is no time point evaluation for the subject, so it is used when needed for every subject to complete the 600 time points.

P is, unfortunately, a fairly bad choice. It doesn’t matter for percent identity, but P (proline) is a ‘special’ amino acid that rarely mutates, so score models such as BLOSUM62 penalise mismatches involving it heavily. See below…

> Is it in correct FASTA format?

If Jalview read it, and the symbols appear in the alignment window in the way you would expect, then it is correct!

> I’ve tried some trees and looks fine for average distance, neighbour joining looks strange…

NJ can often look strange if the sequences are not well-related. In this case, P112 has been chosen as the ‘outgroup’ - the one that appears furthest away from all other sequences - and the rest of the individuals seem to fall into two more closely related groups.

Which option did you use to calculate each of these? Percentage identity and BLOSUM62 will give different trees - you’ll need to use the latter if you want distances to account for similar states.

> I don’t know if there’s some way to take advantage of the following information, to give some semantics to the clusters:
>
> - A and N states represent situations in life that are related; for example, both of them refer to EDUCATION
> - R, C, E, Q, L states also represent similar life situations; for example, all of them deal with FAMILY
> - D, G, H, K, M represent EMPLOYMENT/JOBS
> - P state represents NULL
> - F represents a mixture of states that have no interpretation for the moment.

You will need to recode your states to map similar states to similar amino acids - then you’ll be able to take advantage of the intrinsic amino acid similarities that the conservation and BLOSUM62 measures employ.
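
A recoding of this kind is a simple symbol-for-symbol translation. The particular mapping below is only one hypothetical choice, made for illustration: education states go to the small hydroxyl residues S and T, family states to hydrophobic residues, employment states to charged or polar residues, NULL to the gap character, and the mixed state (arbitrarily) to G. Whether these groups give sensible distances would need checking against the score model actually used.

```python
# Hypothetical recoding table: original state letter -> replacement symbol.
RECODE = {
    # EDUCATION -> small hydroxyl residues
    "A": "S", "N": "T",
    # FAMILY -> hydrophobic residues
    "R": "I", "C": "V", "E": "M", "Q": "F", "L": "L",
    # EMPLOYMENT/JOBS -> charged or polar residues
    "D": "D", "G": "R", "H": "E", "K": "K", "M": "Q",
    # NULL -> gap character; mixed state -> G (an arbitrary choice)
    "P": "-", "F": "G",
}

def recode(seq):
    """Translate a state string; raises KeyError on any unexpected symbol."""
    return "".join(RECODE[symbol] for symbol in seq)

print(recode("ANRCDGPF"))
```

Each character is looked up in the original alphabet exactly once, so overlapping letters (for example Q becoming F while F becomes G) do not interfere with each other.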

There is a Venn diagram here: http://www.jalview.org/help/html/misc/aaproperties.html which indicates the various properties shared by different amino acids. If you want to have a ‘NULL’ state, then G is probably the best one to choose - but you can also use ‘-’, the gap character. Gaps are treated specially, and might actually be closest to what you intend the nulls to indicate.

> I think that useful information that I should use comes from the Conservation, Quality and Consensus graphs shown below the sequences
> If I understand it well, the Consensus Graph shows for each of my 600 time points the most frequent states and their %
> I’m having most Fs at the beginning and Ps thereafter.
> Could the Quality measure be useful for my sequences?

If you re-encode your states according to the amino acid groupings, then you’ll certainly get some information from the quality and conservation measures. Conservation measures the number of common properties for the amino acids in a column, and quality measures the average score for the mutations observed in a column - so unlikely transitions will result in a lower quality score.
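
The consensus idea (most frequent state and its percentage at each time point) is easy to reproduce outside Jalview. This is a minimal sketch with a made-up function name and toy data; Jalview's own calculation may treat gaps and ties differently.

```python
from collections import Counter

def column_consensus(seqs):
    """For each column, return (most frequent symbol, percentage of rows with it)."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same length")
    result = []
    for column in zip(*seqs):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, 100.0 * count / len(column)))
    return result

# Hypothetical toy data: four people, three time points.
rows = ["FFP", "FPP", "FFP", "AFP"]
for position, (symbol, percent) in enumerate(column_consensus(rows), start=1):
    print(position, symbol, percent)
```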

> The Jmol visualization looks really nice!!! Could it make sense to assign some structure to my sequences and try to visualize them like that?

Almost certainly not, I’m afraid.

> Would it be useful for finding some kind of regularity or relation?
> Excuse for the naive question but what do you use Jmol visualization for?

Jmol is a molecular structure viewer. The sequences Jalview normally handles are ‘shorthand’ for biological molecules. Similar sequences - particularly evolutionarily related ones - have a similar 3D molecular structure, and can often perform the same kinds of chemical interactions (because they have similar shapes).

> I’ve been programming in Java and I think that it would be really nice if I could apply/adapt some of the functionalities to my problem.

OK. You would certainly be able to create a special set of parameters for your symbols. The matrix encoding the amino acid groupings is hard-coded into Jalview’s source - so it would be straightforward for you to change the groupings to better reflect the way you are using the symbols. Take a look at the various matrices in this file:

http://source.jalview.org/gitweb/?p=jalview.git;a=blob_plain;f=src/jalview/schemes/ResidueProperties.java;hb=refs/heads/master

The basic philosophy here is that Jalview maps the letter in each sequence to an amino acid or nucleotide index, which is then used to look up entries in the various score matrices and property vectors (the indexes are given by aaIndex and nucleotideIndex). Two types of score model are defined here; the rest of the file contains hard-coded colour maps for the built-in colour schemes, and lookup tables for converting indexes into text for displaying the name of the amino acid or nucleotide to the user.
The two types of model are substitution matrices, which reflect the similarity between two symbols, and property matrices, which allow ‘conservation’ to be calculated in order to reflect the number of properties that differ among the symbols in a particular column. The conservation may be very useful to you - see the paper linked to in this post for more explanation: http://www.compbio.dundee.ac.uk/pipermail/jalview-discuss/2012-May/000811.html
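
The letter-to-index-to-matrix pattern can be sketched in a few lines. This is written in Python for brevity and is not Jalview's code: the three-symbol alphabet, the score values and the function names are all invented, and only the overall lookup scheme mirrors what is described above.

```python
# Hypothetical three-state alphabet and its index table (the role aaIndex plays).
SYMBOLS = "ABC"
INDEX = {symbol: i for i, symbol in enumerate(SYMBOLS)}

# Hypothetical symmetric substitution matrix: SCORES[i][j] is the similarity
# of symbol i and symbol j (A and B are treated as similar, C as distinct).
SCORES = [
    [4, 1, -2],
    [1, 4, -2],
    [-2, -2, 4],
]

def pair_score(x, y):
    """Look up the similarity of two symbols via their indexes."""
    return SCORES[INDEX[x]][INDEX[y]]

def sequence_score(a, b):
    """Sum of per-position scores for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))

print(sequence_score("AAB", "ABC"))
```

Adapting the scheme to a new alphabet then comes down to supplying a new index table and a matrix of matching size.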

Jim.

geoff.barton · 4 June 2012 14:04

Hi Karen,

I know Jim has replied in some detail about how you might use Jalview to help in your analysis, so I won’t add very much here. While Jalview could perhaps be adapted to help with your research, I think you really should talk to, or collaborate with, someone who has experience of multivariate data analysis, so that you can explain the question you are trying to address and devise the most appropriate analysis. Since your strings are all the same length and do not need alignment, I suggested in my last email that you look at R and the various multivariate techniques it implements. You also need to decide how best to represent the transitions between the states in your system. There are many possible ways to do this, but while my group (including Jim) has expertise in this sort of thing, your problem area is a long way from what we do and it would be hard for us to justify a collaboration.
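
One simple way to represent transitions, given here only as a sketch, is to count how often each state is followed by each other state and turn the counts into probabilities. The function names are hypothetical, and skipping the NULL symbol `P` is an assumption about how missing time points should be handled.

```python
from collections import Counter

def transition_counts(seqs, ignore="P"):
    """Count (previous state, next state) pairs across all sequences."""
    counts = Counter()
    for seq in seqs:
        for prev, nxt in zip(seq, seq[1:]):
            # Skip any step that touches an ignored (e.g. NULL) symbol.
            if prev in ignore or nxt in ignore:
                continue
            counts[(prev, nxt)] += 1
    return counts

def transition_probabilities(counts):
    """Convert counts into P(next state | previous state)."""
    totals = Counter()
    for (prev, _), n in counts.items():
        totals[prev] += n
    return {(prev, nxt): n / totals[prev] for (prev, nxt), n in counts.items()}

# Hypothetical toy data using three of the state letters.
toy = ["AAND", "ANDD"]
counts = transition_counts(toy)
print(transition_probabilities(counts))
```

A table like this can then be compared between groups of people, or used as input to the multivariate methods mentioned above.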

So, please use Jalview if you like to visualise your data, but do find someone in your institution who can sit down with you and discuss the best way to go about answering the questions you are interested in answering, particularly with respect to clustering.

I wish you all the best in your research - please let us know when you find a good solution and particularly where we should look to see the final publication of your work!

With every good wish,

Geoff.
