k-spaces: Mixtures of Gaussian latent variable models
Creators
Abstract
Principal component analysis (PCA) and k-means clustering are two seemingly different methods for dimension reduction and clustering, respectively, but can be understood as special cases of inference in a Gaussian latent variable model framework. We leverage this insight to develop a probabilistic framework and methods for simultaneous dimension reduction, clustering, and latent space learning that are efficient and interpretable, and that can replace current ad hoc combinations of PCA and clustering. The algorithm, k-spaces, has broad applicability, which we demonstrate in several distinct genomic settings. In particular, we show how k-spaces can be used to model gene expression in quantitative hybridization chain reaction (qHCR) images, for inference in epigenomics, and for dimension reduction of single-cell RNA-sequencing data.
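To illustrate the claim that PCA and k-means are two ends of one family, here is a minimal Python sketch of a hard-assignment "k-subspaces" loop. It is not the k-spaces algorithm or the API of the k-spaces package: the method described above is probabilistic, whereas this toy version uses deterministic nearest-subspace assignment. The function names (`fit_affine_subspace`, `residual_sq`, `k_subspaces_toy`) and all parameters are hypothetical and chosen only for illustration. With `k=1` the loop reduces to ordinary PCA, and with `d=0` each "subspace" is a single point, so the loop reduces to a k-means-style update.

```python
import numpy as np

def fit_affine_subspace(X, d):
    """Fit a d-dimensional affine subspace to the rows of X via PCA (SVD)."""
    mu = X.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:d]  # basis has shape (d, D); empty when d == 0

def residual_sq(X, mu, basis):
    """Squared distance from each row of X to the affine subspace."""
    Xc = X - mu
    proj = Xc @ basis.T @ basis  # projection onto the subspace directions
    return ((Xc - proj) ** 2).sum(axis=1)

def k_subspaces_toy(X, k, d, n_iter=50, seed=0):
    """Alternate between refitting k subspaces and reassigning points.

    Illustrative assumption: hard assignments and random initial labels.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        dists = np.empty((n, k))
        for j in range(k):
            members = X[labels == j]
            if members.shape[0] <= d:
                # Too few points to define the subspace: reseed at random.
                members = X[rng.choice(n, size=d + 1, replace=False)]
            mu, basis = fit_affine_subspace(members, d)
            dists[:, j] = residual_sq(X, mu, basis)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

Calling `k_subspaces_toy(X, k=3, d=0)` behaves like a basic k-means iteration on the rows of `X`, while `k_subspaces_toy(X, k=1, d=2)` fits a single two-dimensional PCA subspace to all points.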
Copyright and License
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Funding
National Institute of General Medical Sciences, https://ror.org/04q48ey07, 1F30GM156092-01
National Human Genome Research Institute, U24HG010859, R01HG012967, R01HG013736
Acknowledgement
We thank the Banff International Research Station for supporting and hosting the workshop on challenges and synergies in the analysis of large-scale population-based biomedical data in Oaxaca, Mexico in 2017. Discussions between BE and LP at that workshop led to this project. We thank Caleb Ghione, Ulrich Herget, and Peter Currie for their kind assistance in interpreting HCR data and discussions about zebrafish embryology and anatomy. This work was supported by NIH F30 1F30GM156092-01 to NM and by NIH-NHGRI grant U24HG010859 to PWS, a Bren Professor of Biology, and by the Beckman Institute at Caltech (PMTC). BEE was funded in part by grants from the Parker Institute for Cancer Immunology (PICI), the Chan-Zuckerberg Institute (CZI), the Biswas Family Foundation, NIH NHGRI R01 HG012967, and NIH NHGRI R01 HG013736. BEE is a CIFAR Fellow in the Multiscale Human Program.
Data Availability
The k-spaces software package is available at https://github.com/pachterlab/k-spaces.
The purified human blood cell reference epigenomics data was accessed on September 11, 2024 at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35069. The GALA II epigenomics data was accessed on September 17, 2024 at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77716. Code to preprocess data is included in our notebooks in our repository but sample and probe IDs for the full and the filtered datasets are additionally available at https://github.com/pachterlab/MEPSP_2024/data_paper.
HCR data was downloaded as part of the Readout/Read-in Software Package on November 15, 2024 from https://www.moleculartechnologies.org/info/software. Both the raw data and preprocessed matrices are available in our repository at https://github.com/pachterlab/MEPSP_2024/data_paper.
The C. elegans embryogenesis data was accessed on August 25, 2025 at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126954.
The human-mouse mixture, 10K PBMC, and 5K PBMC datasets were downloaded from the 10X Genomics website at https://www.10xgenomics.com/datasets/10k-hgmm-3p-gemx, https://www.10xgenomics.com/datasets/10k-human-pbmcs-3-v3-1-chromium-x-without-introns-3-1-high, and https://www.10xgenomics.com/datasets/5k_Human_Donor4_PBMC_3p_gem-x (filtered matrix), respectively.
We used anndata v0.9.2, MATLAB v24.1.0.2653294 (R2024a) Update 5, Numpy v1.23.5, Pandas v2.1.4, Python v3.10.9, ReadoutReadin v1.0, ReFACTor v1.0, scanpy v1.10.2, scipy v1.11.4, and sklearn v1.3.0. The full environment is available at https://github.com/pachterlab/MEPSP_2024. Code for reproducing the results and figures in the manuscript is available at https://github.com/pachterlab/MEPSP_2024 under Notebooks.
Supplemental Material
- Supplement [supplements/690254_file02.pdf]
Files
2025.11.24.690254v1.full.pdf
Total size: 23.9 MB
| md5 | Size |
|---|---|
| 96021ffc732f6ee12b7d4084acf24b8e | 3.9 MB |
| ba2f1cd8560929dda7a81f401f26177a | 20.0 MB |