When Genomics Aotearoa Postdoctoral Researcher Ann McCartney first shared her plans to run a new de novo sequencing application (Chromium10X linked-read technology) with staff at NeSI (New Zealand eScience Infrastructure), it was clear this would be quite a challenge.
NeSI provides Genomics Aotearoa with access to dedicated high-memory compute services. Although NeSI staff had no experience with Supernova, the only assembler available for the new technology, their combination of bioinformatics expertise and platform resources overcame the barriers. The result was the first stick insect genome assembly created using linked-read technology.
A supernova problem
Ann planned to use Chromium10X linked-read technology to assemble the genomes of four endemic species: the hihi bird and three stick insects (Niveaphasma, Clitarchus and Acanthoxyla).
Stick insects have large, highly repetitive genomes, which would normally call for extensive long-read sequencing to produce a high-quality assembly. However, the sheer size of these genomes meant long reads would have been too expensive, while short-read sequencing alone does not provide the accuracy required.
Linked reads were chosen as a compromise between short-read and long-read sequencing, as they provide pseudo-long reads for a fraction of the price. However, de novo assemblers that support linked reads can be very demanding in terms of I/O and memory, and often don’t fit well into traditional HPC modes of delivery because of characteristics such as very long run times and difficult-to-predict resource requirements.
Not only did the planned assemblies lack a reference genome, they also involved some of the most complex genomes to date, at a scale beyond that for which the software vendor provided support and guidance.
A massive computing solution is needed
NeSI threw a 4TB, 64-core node at the problem - NeSI calls these hugemem compute nodes. These are a particularly important platform capability, enabling scientists to tackle big problems by scaling up rather than out.
There were several issues to resolve, even after figuring out some of the nuances of how best to craft the parameters for each de novo assembly, taking memory and CPU requirements into account.
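As a rough illustration of the parameter crafting described above, the sketch below builds a Supernova command line sized to a single large node. The option names are those of the Supernova `run` command as we understand them; the 3 Gb genome size, the 56x target coverage, the 150 bp read length, the 4096 GB reading of "4TB" and the 10% memory headroom are all assumptions made for the example, not values taken from Ann's runs.

```python
import shlex


def supernova_command(run_id, fastq_dir, genome_size_bp,
                      node_cores=64, node_mem_gb=4096,
                      target_coverage=56, read_length_bp=150,
                      mem_fraction=0.9):
    """Build an illustrative `supernova run` argument list for one node.

    All default values are assumptions for this sketch, not settings
    confirmed by the article.
    """
    # Reads needed to reach the target raw coverage:
    # coverage * genome size / read length.
    max_reads = target_coverage * genome_size_bp // read_length_bp
    # Leave some memory for the operating system and filesystem client.
    local_mem_gb = int(node_mem_gb * mem_fraction)
    return [
        "supernova", "run",
        f"--id={run_id}",
        f"--fastqs={fastq_dir}",
        f"--maxreads={max_reads}",
        f"--localcores={node_cores}",
        f"--localmem={local_mem_gb}",
    ]


if __name__ == "__main__":
    # Hypothetical run name, path and genome size.
    cmd = supernova_command("stick_insect", "/path/to/fastqs", 3_000_000_000)
    print(shlex.join(cmd))
```

Capping memory and cores explicitly keeps the assembler within what the node can actually provide, which matters when a single job occupies the whole machine for weeks.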
One problem was that the application would stall part way through the pipeline, causing a whole node failure that forced a system reboot.
The stage where it stalled (ASSEMBLER_M2) involves forming a 16bp k-mer microassembly to polish scaffolds created through de Bruijn graph construction. After several attempts at tuning various aspects of the system, NeSI platform engineers identified that they were hitting a system-level deadlock, likely due to a bug somewhere in the IBM Spectrum Scale (formerly GPFS) filesystem client drivers.
Solving this new issue required an upgrade of Spectrum Scale, something that could only be done during a complete outage of the platforms – fortunately NeSI had already scheduled maintenance work.
After the upgrade, Ann’s assembly was able to pass the M2 stage of the pipeline that had been stalling, and the stick insect genome assembly was completed after 22 days of running - the longest job to be run on the NeSI platform to date.
The implications of using linked-read technology
How is having this stick insect genome better than any other? The large and highly repetitive stick insect genome was the perfect test of whether pseudo-long-read (linked-read) technology is a more affordable and appropriate sequencing platform for genomes of this nature.
Due to its success, other endemic New Zealand species are now being sequenced using this technology, including the blueberry, the hihi, the myna and the rewarewa.
The resulting pipeline is now enabling Genomics Aotearoa researchers to decide what other genomes from species of interest to conservation and primary production can be sequenced and assembled using such technology.
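For readers unfamiliar with the de Bruijn graph construction mentioned above, the toy sketch below shows the basic idea: each read is cut into overlapping k-mers, and each k-mer becomes an edge between its first and last k-1 bases. It is not Supernova's implementation; the function name, the tiny k and the example read are purely illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # A read with a repeated "ACG" makes the node "CG" branch two ways.
    for node, successors in sorted(de_bruijn_graph(["ACGTACGA"], k=3).items()):
        print(node, "->", sorted(successors))
```

Nodes with more than one successor are where an assembler has to choose a path. Repetitive genomes produce many such branches, which is one reason they are harder to assemble.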