For large sequence alignments (tens of thousands of sequences), the PCA calculation seems to require a lot of memory and takes a number of days, although it is not particularly CPU intensive. Is there a practical limit to the number of sequences for which a PCA can be calculated in Jalview? And is there any way to estimate how long the calculation could take?
Adrian
Hi Adrian.
You are absolutely correct that Jalview’s PCA calculation is quite inefficient. We haven’t really focused on making it faster or more memory efficient because we always planned to replace the in-app calculation with a web service, which is something we can now do more easily with the launch of slivka.
It would certainly be possible to perform some benchmarking - I would expect Jalview’s calculation to be fairly predictable in its performance barring issues due to the JVM’s garbage collector. When @morellthomas added PaSiMap they also implemented a progress bar/estimator which we could look at improving.
Please get in contact directly via my Dundee email if you’d like to explore other options!
Jim
I got a chance to benchmark my problem (alignments of 650 columns with varying numbers of sequences) and model how long a basic PCA took on a Windows 11 computer. As the calculation is not CPU intensive, the elapsed times are probably a useful guide.
Time in seconds = 1.47668496473e-8 * sequences ^ 3.08800867902526
So time is roughly proportional to the cube of the number of sequences (the fitted exponent is about 3.09). This table gives an idea of how this plays out.
| Time | Sequences |
|---|---|
| 1 minute | 1300 |
| 1 hour | 4900 |
| 1/2 day | 10900 |
| 1 day | 13600 |
| 1 week | 25600 |
| 2 weeks | 32000 |
| 1 month (30 days) | 41000 |
| 2 months (60 days) | 51300 |
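For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the fitted formula and its inverse. The two constants are the ones quoted above; the function names are made up for illustration, and the fit only describes this one benchmark (650 columns, one Windows 11 machine), so treat the results as rough estimates.

```python
# Fitted constants quoted in the post above (650 columns, one Windows 11 machine).
COEFF = 1.47668496473e-8
EXPONENT = 3.08800867902526


def estimate_seconds(n_sequences: int) -> float:
    """Predicted PCA run time in seconds for an alignment of n_sequences."""
    return COEFF * n_sequences ** EXPONENT


def max_sequences(budget_seconds: float) -> int:
    """Approximate largest sequence count whose predicted time fits the budget."""
    # Invert t = COEFF * n ** EXPONENT to get n = (t / COEFF) ** (1 / EXPONENT).
    return int((budget_seconds / COEFF) ** (1.0 / EXPONENT))


if __name__ == "__main__":
    budgets = [("1 hour", 3600), ("1 day", 86400), ("1 week", 7 * 86400)]
    for label, seconds in budgets:
        print(f"{label}: about {max_sequences(seconds)} sequences")
```

The sequence counts this prints are unrounded, so they differ slightly from the rounded figures in the table.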
In terms of memory, for 30,000 sequences (the largest PCA tried) the calculation used up to 27 GB for most of the run, but about a week in it briefly used 38 GB, which would cause an out-of-memory error on a 32 GB machine.
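As a rough way to put those memory figures in context, the sketch below computes the size of a single dense sequences × sequences matrix of 8-byte doubles. That the PCA works on matrices of this shape and type is an assumption on my part, not something confirmed in this thread, and the function name is hypothetical. It only gives a scale to compare against and does not predict the peak.

```python
BYTES_PER_DOUBLE = 8  # IEEE 754 double precision, the size of a Java double


def dense_matrix_gb(n_sequences: int) -> float:
    """Size in GB (10^9 bytes) of one dense n x n matrix of doubles."""
    return n_sequences * n_sequences * BYTES_PER_DOUBLE / 1e9


if __name__ == "__main__":
    n = 30_000
    one_matrix = dense_matrix_gb(n)
    print(f"one {n} x {n} matrix of doubles: {one_matrix:.1f} GB")
    # 27 GB and 38 GB are the usage figures reported in the post above.
    for observed_gb in (27, 38):
        ratio = observed_gb / one_matrix
        print(f"{observed_gb} GB is about {ratio:.1f} such matrices")
```

Under that assumption the matrix size grows with the square of the number of sequences, so doubling the alignment would roughly quadruple this baseline.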
So, summing up, the limits on the size of a PCA calculation in Jalview are time (predictable) and peak memory usage (not easy to predict).