University of Twente Student Theses

Robust and Scalable Selective Sweep Detection using Convolutional Neural Networks

Belt, S.P. van den (2024) Robust and Scalable Selective Sweep Detection using Convolutional Neural Networks.

Full text not available from this repository.

Full Text Status: Access to this publication is restricted
Embargo date: 2 July 2025
Abstract: Localizing DNA mutations that have led to positive natural selection is important for understanding viruses and diseases, for drug development, and for many other applications. Traditionally, statistical models that process single nucleotide polymorphisms (SNPs) have been used to detect a distinct pattern in SNP data that is indicative of positive natural selection, called a selective sweep. These models detect selective sweeps effectively and efficiently under simple evolutionary models but are prone to high false positive rates when confounding factors are present in the data. Recently, machine learning-based methods have been shown to be more robust to confounding factors. Convolutional neural networks (CNNs) improve the classification accuracy of selective sweeps in the presence of confounding factors. However, CNNs are computationally expensive compared to statistical methods, which limits their usefulness on large datasets. In this thesis, FAST-NN is presented, which uses 1D convolutions to process allele frequencies and pairwise SNP distances. On all tested datasets, FAST-NN achieves a selective sweep classification accuracy that outperforms or is on par with the state of the art, while decreasing execution time on both CPU and GPU. Previous work on CNNs for selective sweep detection primarily evaluates classification performance and does not explicitly evaluate the performance and precision of models when they are applied for detection. In addition, CNN-based selective sweep detection methods apply a sliding window or grid over a genome to sample windows for classification. When performing fine-grained detection, the sampled windows overlap, and if each window is processed separately, this leads to repeated, redundant computations. The FAST-NN model was designed with the classification of separate genomic segments in mind.
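The redundancy caused by overlapping windows can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes a single unpadded 1D convolution with stride 1, and the function names (`conv1d_valid`, `per_window`, `shared`) are hypothetical. It compares convolving every sliding window separately with convolving the whole sequence once and slicing the result, and counts multiply-accumulate operations (MACs) for both.

```python
def conv1d_valid(signal, kernel):
    """Unpadded ("valid") 1D convolution in the cross-correlation form used by CNNs."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def per_window(signal, kernel, width, stride):
    """Convolve each sliding window on its own; overlapping positions are recomputed."""
    outputs, macs = [], 0
    for start in range(0, len(signal) - width + 1, stride):
        out = conv1d_valid(signal[start:start + width], kernel)
        macs += len(out) * len(kernel)
        outputs.append(out)
    return outputs, macs


def shared(signal, kernel, width, stride):
    """Convolve the full sequence once, then slice out each window's outputs."""
    full = conv1d_valid(signal, kernel)
    macs = len(full) * len(kernel)
    out_width = width - len(kernel) + 1  # outputs that belong to one window
    outputs = [
        full[start:start + out_width]
        for start in range(0, len(signal) - width + 1, stride)
    ]
    return outputs, macs


if __name__ == "__main__":
    # Toy data standing in for per-SNP values; not real allele frequencies.
    signal = [float((i * 7) % 5) for i in range(64)]
    kernel = [1.0, -2.0, 1.0]
    a, a_macs = per_window(signal, kernel, width=16, stride=1)
    b, b_macs = shared(signal, kernel, width=16, stride=1)
    print(a == b, a_macs, b_macs)
```

For this toy input (64 positions, window width 16, stride 1) both approaches give identical outputs, but the per-window version performs 2058 MACs against 186 for the shared version. A full network with pooling, padding, or strided layers needs more care before intermediate results can be reused in this way.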
In this thesis, the FASTER-NN model extends the FAST-NN model and takes advantage of wide input windows through a larger receptive field. FASTER-NN is specifically designed to optimize detection performance and has an improved detection sensitivity compared to state-of-the-art models, while processing only allele frequencies and pairwise SNP distances. Moreover, by using dilated convolutions and optimizing data reuse, the execution time is nearly invariant to input width. FASTER-NN enables whole-genome scans at a considerably reduced execution time compared to other CNN-based detection methods, paving the way for accessible CNN-based selective sweep detection that does not require expensive hardware such as a GPU. FASTER-NN has also been extended to classify selective sweeps and recombination hotspots with a single model. This extended model demonstrates the effect that partially retaining information on linkage disequilibrium has on recombination hotspot classification. By using grouped allele frequencies, the execution time of CNN-based recombination hotspot classification can be reduced with a negligible effect on classification accuracy. Selective sweep detection through CNNs can be accelerated further by using reconfigurable hardware. FPGAs are constrained by limited I/O bandwidth and hardware resources but can implement customized on-chip parallel pipelines that sustain high data rates. Because allele frequencies are a compact data format and the model uses only 1D convolutions, FAST-NN requires limited I/O bandwidth and hardware resources. To exploit this, an 8-bit quantized version of the model is deployed on an FPGA in a fully pipelined architecture, running at 90 MHz with an initiation interval of one clock cycle. This solution can densely scan all 22 human autosomes in 135 milliseconds, classifying one window of 128 SNP positions per clock cycle.
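Two quantities mentioned in the abstract can be made concrete with a short sketch. The first function applies the standard receptive-field formula for stacked stride-1 dilated 1D convolutions; the kernel size and dilation schedule in the example are assumptions chosen for illustration, not the FASTER-NN architecture. The second derives how many windows the FPGA figures imply (90 MHz, one window per clock cycle, 135 ms), ignoring pipeline fill latency. Both function names are hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def windows_processed(clock_hz, duration_ms, initiation_interval=1):
    """Windows classified by a fully pipelined design in a given time.

    Assumes one window enters the pipeline every `initiation_interval`
    cycles and ignores the latency needed to fill the pipeline.
    Integer arithmetic avoids floating-point rounding.
    """
    cycles = clock_hz * duration_ms // 1000
    return cycles // initiation_interval


if __name__ == "__main__":
    # Assumed example schedule: four layers, kernel size 3.
    print(receptive_field(3, [1, 1, 1, 1]))  # undilated stack
    print(receptive_field(3, [1, 2, 4, 8]))  # exponentially dilated stack
    # Figures taken from the abstract: 90 MHz clock, 135 ms scan.
    print(windows_processed(90_000_000, 135))
```

With the assumed schedule, four undilated layers cover 9 input positions while the exponentially dilated stack covers 31 with the same number of layers and weights, which is the mechanism by which dilation widens the receptive field cheaply. Under the stated assumptions, the FPGA figures correspond to 12,150,000 windows classified in the 135 ms scan.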
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Electrical Engineering MSc (60353)
Link to this item: https://purl.utwente.nl/essays/100517