slow loading of 300000+ seq fasta file

kmourao · 31 May 2017 15:17

Hi –

I’ve just found the reason the file I was looking at was not loading (or actually is loading but extremely slowly). The bad news is it looks like it’s been in the code since Aug 2016 but the good news is it looks very fixable.

The initialisation is being held up in Alignment::resolveAndAddDatasetSeq which is in the call stack called by the AlignFrame initialisation code. The code here has a LinkedIdentityHashSet, seqs, containing all the sequences it has looked at so far. When it gets the next sequence it calls seqs.contains to see if the sequence is already in seqs; if not, it adds the sequence to seqs. The slow bit is the seqs.contains call: with 85,000 sequences in seqs it takes around 0.5 secs to do one check, so another 200,000+ is going to take a while…

The reason seqs.contains is slow is because, despite the name, LinkedIdentityHashSet::contains is doing a linear search. This rather echoes what I was saying earlier about checking our data structures are appropriate.
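
To illustrate the difference, here is a minimal Java sketch (not the Jalview code; the class and method names are made up for the example). It compares a linear identity search with an identity set built from the standard library's IdentityHashMap via Collections.newSetFromMap, which looks items up by hashing instead of scanning.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of Jalview.
class IdentityContainsSketch {

    /** Linear identity search: every call may scan the whole list. */
    static boolean linearContains(List<Object> items, Object candidate) {
        for (Object item : items) {
            if (item == candidate) {
                return true;
            }
        }
        return false;
    }

    /** Identity-based set backed by a hash table (compares with ==). */
    static Set<Object> newIdentitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
    }

    public static void main(String[] args) {
        int n = 20_000; // small enough that the linear version still finishes quickly
        Object[] seqs = new Object[n];
        for (int i = 0; i < n; i++) {
            seqs[i] = new Object();
        }

        // Check-then-add with a linear scan: total work grows roughly with n squared.
        List<Object> list = new ArrayList<>();
        long t0 = System.nanoTime();
        for (Object s : seqs) {
            if (!linearContains(list, s)) {
                list.add(s);
            }
        }
        long t1 = System.nanoTime();

        // Same logic with a hashed identity set: add() returns false for duplicates.
        Set<Object> set = newIdentitySet();
        for (Object s : seqs) {
            set.add(s);
        }
        long t2 = System.nanoTime();

        System.out.printf("linear: %d ms, hashed: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

The timings will vary by machine, but the linear version should slow down much faster as n grows. Note that a set built this way does not preserve insertion order.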

I’ll log a JIRA issue for this. It would be useful to know what the purpose of using LinkedIdentityHashSet here was though, as this is the only place it’s used in the code.

Cheers

Kira

Kira Mourão

Postdoctoral Researcher
The Barton Group
Division of Computational Biology
School of Life Sciences
University of Dundee, Dundee, Scotland, UK.

k.mourao@dundee.ac.uk

www.jalview.org
www.compbio.dundee.ac.uk

Twitter @kiramt

jalviewcrowdadmin · 31 May 2017 19:37

Well hunted, Kira.

> I’ve just found the reason the file I was looking at was not loading (or
> actually is loading but extremely slowly). The bad news is it looks like
> it’s been in the code since Aug 2016 but the good news is it looks very
> fixable.
>
> The initialisation is being held up in
> Alignment::resolveAndAddDatasetSeq which is in the call stack called by
> the AlignFrame initialisation code.

This was added to avoid duplicate sequence import when opening Ensembl
or ENA CDS, if I remember correctly (though Mungo may have a better story).

> The reason seqs.contains is slow is because, despite the name,
> LinkedIdentityHashSet::contains is doing a linear search. This rather
> echoes what I was saying earlier about checking our data structures are
> appropriate.

natch.

> I’ll log a JIRA issue for this. It would be useful to know what the
> purpose of using LinkedIdentityHashSet here was though, as this is the
> only place it’s used in the code.

The use of IdentityHash was to spot duplicates based on the Object
reference (i.e. equivalence based on == rather than .equals()). However,
I'd have hoped contains would not simply do a linear search. ISTR a
LinkedHashSet was chosen for order preservation, which made life easier
for the CDS/Splitframe logic.
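
One way to get both properties (== equivalence and insertion order) without a linear contains is to pair an IdentityHashMap with a list. This is only a sketch under that assumption; OrderedIdentitySet is a hypothetical name and is not the existing LinkedIdentityHashSet.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical class: identity-based membership plus insertion-order iteration.
class OrderedIdentitySet<T> implements Iterable<T> {
    // Membership test by reference (==), via hashing on the identity hash code.
    private final Map<T, Boolean> seen = new IdentityHashMap<>();
    // Items in the order they were first added.
    private final List<T> order = new ArrayList<>();

    /** Adds item if this exact reference is not present; returns true if added. */
    boolean add(T item) {
        if (seen.put(item, Boolean.TRUE) != null) {
            return false; // same reference already stored
        }
        order.add(item);
        return true;
    }

    /** Hash lookup by reference, no scan of the stored items. */
    boolean contains(Object item) {
        return seen.containsKey(item);
    }

    int size() {
        return order.size();
    }

    /** Read-only view in insertion order. */
    List<T> asList() {
        return Collections.unmodifiableList(order);
    }

    @Override
    public Iterator<T> iterator() {
        return asList().iterator();
    }

    public static void main(String[] args) {
        OrderedIdentitySet<String> set = new OrderedIdentitySet<>();
        String a = new String("ACGT");
        String b = new String("ACGT"); // equal content, different object
        set.add(a);
        set.add(b);
        set.add(a); // ignored: same reference as the first item
        System.out.println(set.size());
    }
}
```

The sketch deliberately leaves out removal, which would need a linear pass over the list (or a different structure) to keep the order in sync.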

Some relevant issues: JAL-2132, which may have been the original reason
for this bit of logic back in 2016. That issue is overshadowed by the
real requirement: full normalisation (JAL-407).

I was idly googling IdentityHashMap to see if there are any workarounds.
We could simply enforce primary keys and hash on those
(SequenceI.getVamsasId() would fit that), but I also found this library
https://bitbucket.org/trove4j/trove
via School - JA-VA Code.
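
For the primary-key idea, a rough sketch might look like the following. It assumes every sequence has a unique, non-null ID; the Seq class here is a hypothetical stand-in with a getVamsasId() accessor, and KeyedSequenceIndex is an invented name. A LinkedHashMap keeps insertion order while hashing on the ID string.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: de-duplicate by ID string rather than by object reference.
class KeyedSequenceIndex {

    /** Stand-in for a sequence type; only the ID accessor matters here. */
    static final class Seq {
        private final String vamsasId;
        private final String residues;

        Seq(String vamsasId, String residues) {
            this.vamsasId = vamsasId;
            this.residues = residues;
        }

        String getVamsasId() {
            return vamsasId;
        }

        String getResidues() {
            return residues;
        }
    }

    // LinkedHashMap iterates in insertion order and looks up keys by hash.
    private final Map<String, Seq> byId = new LinkedHashMap<>();

    /** Adds seq unless a sequence with the same ID is already stored. */
    boolean addIfAbsent(Seq seq) {
        return byId.putIfAbsent(seq.getVamsasId(), seq) == null;
    }

    boolean contains(Seq seq) {
        return byId.containsKey(seq.getVamsasId());
    }

    /** Stored sequences in the order they were first added. */
    Collection<Seq> inInsertionOrder() {
        return byId.values();
    }

    public static void main(String[] args) {
        KeyedSequenceIndex index = new KeyedSequenceIndex();
        index.addIfAbsent(new Seq("id1", "ACGT"));
        index.addIfAbsent(new Seq("id1", "ACGT")); // same ID: treated as a duplicate
        index.addIfAbsent(new Seq("id2", "TTTT"));
        System.out.println(index.inInsertionOrder().size());
    }
}
```

This changes the semantics compared with ==: two distinct objects sharing an ID count as the same sequence, so it only works if the IDs really are enforced as primary keys.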

..Jim.