Efficient Storage and Regression Computation for Population-Scale Genome Sequencing Studies
With the growing availability of large-scale population biobanks, whole genome sequencing (WGS) data hold increasing promise for human health and disease research. However, the massive computational and storage demands of WGS data pose serious challenges for research institutions, especially those with limited funding and researchers in developing countries. This disparity in resources limits equitable access to cutting-edge genetic research. To address this issue, Manuel A. Rivas, Christopher Chang, and their team developed novel algorithms and regression methods that dramatically reduce the computational time and storage requirements of WGS studies, with a particular focus on the handling of rare variants.
Source of the Paper
The paper, titled "Efficient Storage and Regression Computation for Population-Scale Genome Sequencing Studies," was co-authored by Manuel A. Rivas of the Department of Biomedical Data Science at Stanford University and Christopher Chang of Grail Inc., and was published in Bioinformatics on February 11, 2025. It details how the authors substantially improved the efficiency of WGS studies through optimized algorithms and storage formats.
Research Process
1. Research Objectives
The primary goal of the research is to develop methods that drastically reduce the storage requirements and computational time of WGS data analysis, with a particular focus on rare variants. By integrating these methods into PLINK 2.0, the researchers aim to make large-scale genomic data analysis substantially more efficient without compromising analytical accuracy.
2. Research Methods
a) Data Compression and Storage Optimization
The researchers developed a novel data compression approach that sharply reduces the storage requirements of WGS data. It exploits patterns in genetic variants, particularly the extreme sparsity of rare variants, to achieve a compact representation. Specifically, PLINK 2.0 introduces the PGEN format, which uses a sparse representation for rare variants: in a dataset of 400,000 samples, a variant with a single alternate-allele observation requires 100,000 bytes in the PLINK 1 binary (BED) format, which stores 2 bits per genotype, but only 4 bytes in the file header plus 5 bytes in the file body in the PGEN format, as the sketch below illustrates.
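To make the arithmetic concrete, here is a minimal sketch comparing the dense PLINK 1 encoding with a hypothetical sparse record layout. The 5-bytes-per-entry figure is an assumption chosen to match the singleton example above; PGEN's actual on-disk format (record types, difference lists, and so on) is more involved.

```python
# Illustrative arithmetic only, not PGEN's real layout.
import math

def dense_bytes(n_samples: int) -> int:
    """PLINK 1 .bed stores 2 bits per genotype, i.e. 4 genotypes per byte."""
    return math.ceil(n_samples / 4)

def sparse_bytes(n_nonref: int, bytes_per_entry: int = 5) -> int:
    """Hypothetical sparse record: one (sample index, genotype) entry per
    non-reference genotype; 5 bytes/entry is an assumption matching the
    singleton figure quoted above."""
    return n_nonref * bytes_per_entry

n = 400_000
print(dense_bytes(n))    # 100000 bytes per variant, dense encoding
print(sparse_bytes(1))   # 5 bytes for a singleton variant
```

Because most variants in a population-scale call set are rare, this per-variant saving compounds into the large file-level reductions reported below.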
b) Regression Computation Optimization
The researchers also developed new regression computation methods to address the scale and complexity of WGS data. Traditional regression implementations iterate over every sample for every variant, which is wasteful when most genotypes are homozygous-reference, so the authors adopted sparse computation techniques. Specifically, PLINK 2.0's --glm command performs linear and logistic regression directly on the sparse genotype representation, substantially reducing computation time on large datasets; the sketch below illustrates the core idea.
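Here is a minimal sketch (not PLINK's actual implementation) of why sparse genotypes make per-variant regression cheap: in a covariate-free linear model, every genotype-dependent sum touches only the carriers of the rare allele, so the per-variant cost is proportional to the number of carriers rather than the number of samples.

```python
import numpy as np

def slope_sparse(idx, gval, y, y_sum):
    """Least-squares slope of y on g, where g is non-zero only at
    positions idx (with dosages gval). Cost: O(#carriers)."""
    n = y.size
    g_sum = gval.sum()
    g_sq = (gval ** 2).sum()
    gy = (gval * y[idx]).sum()
    sxx = g_sq - g_sum ** 2 / n          # centered sum of squares of g
    sxy = gy - g_sum * y_sum / n         # centered cross-product
    return sxy / sxx

rng = np.random.default_rng(0)
y = rng.normal(size=400_000)             # phenotype for 400k samples
y_sum = y.sum()                          # computed once, reused per variant
idx = np.array([123, 99_000, 250_000])   # three carriers of a rare allele
gval = np.array([1.0, 1.0, 2.0])         # their allele dosages
print(slope_sparse(idx, gval, y, y_sum))
```

Phenotype-only quantities such as y_sum are computed once and reused across all variants, so testing a rare variant costs almost nothing beyond touching its few carriers.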
3. Experimental Design
To validate these methods, the researchers conducted an exome-wide association analysis of 19.4 million variants against body mass index (BMI) phenotype data from 125,077 individuals in the All of Us project. With the new methods in PLINK 2.0, the computation time on a single machine dropped from 695.35 minutes (about 11.5 hours) to 1.57 minutes with 30 GB of memory and 50 threads, or 8.67 minutes with 4 threads.
4. Multi-Phenotype Analysis
The researchers extended the method to support multi-phenotype analysis. A genome-wide association analysis of 50 phenotypes completed in 52 minutes and 38 seconds on a single virtual machine (30 GB of memory, 50 threads). They also introduced the --pheno-svd flag, which preprocesses the phenotype matrix with a singular value decomposition (SVD) to further improve computational efficiency; a sketch of the underlying idea follows.
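The following sketch illustrates one plausible reading of the SVD idea, not the paper's exact --pheno-svd algorithm: because a centered phenotype matrix factors as Y = U S Vᵀ, per-variant regression coefficients computed against the orthonormal columns of U can be rotated back to all original phenotypes with a single small matrix product.

```python
# A minimal sketch of SVD-based phenotype preprocessing; illustrative
# linear algebra only, not PLINK 2.0's actual --pheno-svd implementation.
import numpy as np

rng = np.random.default_rng(1)
n, p = 10_000, 50
Y = rng.normal(size=(n, p))
Y -= Y.mean(axis=0)                      # center each phenotype

U, s, Vt = np.linalg.svd(Y, full_matrices=False)   # U: n x p, orthonormal

g = rng.integers(0, 3, size=n).astype(float)       # one variant's dosages
g -= g.mean()

# Slopes against the orthonormal pseudo-phenotypes, all at once.
beta_u = (U.T @ g) / (g @ g)

# Rotate back to per-phenotype slopes; matches direct regression on Y.
beta_pheno = (beta_u * s) @ Vt
beta_direct = (Y.T @ g) / (g @ g)
print(np.allclose(beta_pheno, beta_direct))        # True
```

Working in the rotated, orthogonal basis allows per-variant quantities to be shared across all phenotypes rather than recomputed for each one.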
Main Results
1. Data Compression Efficiency
The researchers compared the storage requirements of different file formats for exome sequencing data from the All of Us project. The PGEN format used by PLINK 2.0 requires only 39.0 GB of storage, a 98% reduction relative to the PLINK 1 BED file (2 TB), a 90% reduction relative to the VCF file (403 GB), and a 77% reduction relative to the BGEN file (165 GB).
2. Computational Efficiency Improvement
In the exome-wide association analysis, the new methods in PLINK 2.0 reduced the computation time on a single machine from 695.35 minutes to 1.57 minutes (50 threads) or 8.67 minutes (4 threads). When analyzing a type 2 diabetes case/control phenotype, the cc-residualize mode of --glm finished in 7.68 minutes (50 threads), versus 102.9 minutes for the firth-fallback mode; the sketch below outlines the kind of shortcut such a mode exploits.
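As a rough illustration of why residualizing covariates pays off, here is a minimal score-test sketch: the covariate-only logistic model is fitted once per phenotype, and each variant is then tested without refitting the model. PLINK 2.0's cc-residualize mode is related in spirit, but this is an assumed simplification, not its actual implementation.

```python
# Fit the covariate-only (null) logistic model once; test each variant
# with a score statistic that reuses the null fit. Illustrative only.
import numpy as np
from scipy.stats import chi2

def null_logistic(C, y, iters=25):
    """Newton-Raphson fit of logit(p) = C @ b; done once per phenotype."""
    b = np.zeros(C.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-C @ b))
        W = p * (1 - p)
        b += np.linalg.solve(C.T @ (C * W[:, None]), C.T @ (y - p))
    return p, W

def score_pvalue(g, C, y, p, W):
    """Per-variant score test: no per-variant model refit."""
    U = g @ (y - p)                                   # score
    CWC_inv = np.linalg.inv(C.T @ (C * W[:, None]))
    gWC = (g * W) @ C
    V = (g * W) @ g - gWC @ CWC_inv @ gWC             # variance of score
    return chi2.sf(U * U / V, df=1)

rng = np.random.default_rng(2)
n = 5_000
C = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # covariates
y = rng.integers(0, 2, size=n).astype(float)                # case/control
g = rng.binomial(2, 0.01, size=n).astype(float)             # rare variant
p, W = null_logistic(C, y)
print(score_pvalue(g, C, y, p, W))
```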
3. Multi-Phenotype Analysis Efficiency
In multi-phenotype analysis, preprocessing the phenotypes with the --pheno-svd flag reduced the computation time from 50 minutes to 2 minutes.
Conclusion
This study significantly reduced the storage requirements and computational time for WGS research by developing novel data compression and regression computation methods, particularly focusing on the handling of rare variants. These methods not only enhance the efficiency of large-scale genomic data analysis but also provide more equitable research opportunities for underfunded institutions and researchers in developing countries.
Research Highlights
- Significant Data Compression: The PGEN format reduces storage requirements by up to 98%, sharply cutting the storage costs of large-scale genomic data.
- Substantial Improvement in Computational Efficiency: By optimizing regression computation methods, the computation time was reduced from 11.5 hours to 1.57 minutes, greatly enhancing analysis efficiency.
- Support for Multi-Phenotype Analysis: The researchers extended the method to support multi-phenotype analysis, further improving the flexibility of large-scale genomic data analysis.
- Equitable Research Opportunities: These methods provide more equitable research opportunities for underfunded institutions and researchers in developing countries, promoting the democratization of genomic research.
Significance and Value
This study provides efficient tools for large-scale genomic data analysis and contributes to the democratization and fairness of genomic research. By sharply reducing storage requirements and computational time, these methods let researchers process and analyze population-scale genomic data far more efficiently, accelerating scientific discovery while putting cutting-edge analyses within reach of underfunded institutions and researchers in developing countries.