Published November 28, 2025 | Version Submitted
Discussion Paper Open

k-spaces: Mixtures of Gaussian latent variable models

  • 1. California Institute of Technology
  • 2. University of Southern California
  • 3. Stanford University
  • 4. Gladstone Institutes

Abstract

Principal component analysis (PCA) and k-means clustering are two seemingly different methods for dimension reduction and clustering, respectively, but can be understood as special cases of inference in a Gaussian latent variable model framework. We leverage this insight to develop a probabilistic framework and methods for simultaneous dimension reduction, clustering, and latent space learning that are efficient and interpretable, and that can replace current ad hoc combinations of PCA and clustering. The algorithm, k-spaces, has broad applicability, which we demonstrate in several distinct genomic settings. In particular, we show how k-spaces can be used to model gene expression in quantitative hybridization chain reaction (qHCR) images, for inference in epigenomics, and for dimension reduction of single-cell RNA-sequencing data.
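
The relationship between PCA and k-means described above can be illustrated with a minimal hard-assignment sketch. This is not the k-spaces implementation and does not reproduce its probabilistic model; it is a simplified alternating procedure, with hypothetical function names, that fits k affine subspaces of dimension d. In this sketch, setting d = 0 reduces to Lloyd's k-means iterations and setting k = 1 reduces to ordinary PCA.

```python
import numpy as np

def fit_affine_subspace(X, d):
    """Best-fit d-dimensional affine subspace of the rows of X (mean + top-d PCA directions)."""
    mu = X.mean(axis=0)
    if d == 0:
        # A 0-dimensional subspace is just a point: the centroid.
        return mu, np.zeros((0, X.shape[1]))
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:d]

def residual_sq(X, mu, basis):
    """Squared distance from each row of X to the affine subspace (mu, basis)."""
    Z = X - mu
    proj = Z @ basis.T @ basis  # orthogonal projection onto the span of the basis rows
    return ((Z - proj) ** 2).sum(axis=1)

def hard_k_subspaces(X, k, d, n_iter=50, seed=0):
    """Alternate between refitting k subspaces and reassigning points (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=X.shape[0])  # random initial assignment
    for _ in range(n_iter):
        params = []
        for j in range(k):
            members = X[labels == j]
            if len(members) <= d:
                # Too few points to define the subspace: re-seed from random points.
                members = X[rng.choice(X.shape[0], size=d + 1, replace=False)]
            params.append(fit_affine_subspace(members, d))
        # Assign each point to the subspace with the smallest squared residual.
        dists = np.stack([residual_sq(X, mu, B) for mu, B in params], axis=1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: params were fit on the current labels
        labels = new_labels
    return labels, params
```

Calling `hard_k_subspaces(X, k=1, d=2)` returns the data mean and the first two principal directions, while `hard_k_subspaces(X, k=3, d=0)` returns three centroids and the corresponding cluster labels. A probabilistic treatment would replace the hard assignments with posterior responsibilities and add a noise model.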

Copyright and License

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Funding

National Institute of General Medical Sciences, https://ror.org/04q48ey07, 1F30GM156092-01

National Human Genome Research Institute, U24HG010859, R01HG012967, R01HG013736

Acknowledgement

We thank the Banff International Research Station for supporting and hosting the 2017 workshop on challenges and synergies in the analysis of large-scale population-based biomedical data in Oaxaca, Mexico. Discussions between BE and LP at the workshop led to this project. We thank Caleb Ghione, Ulrich Herget, and Peter Currie for their kind assistance in interpreting HCR data and for discussions about zebrafish embryology and anatomy. This work was supported by NIH F30 1F30GM156092-01 to NM, by NIH-NHGRI grant U24HG010859 to PWS, a Bren Professor of Biology, and by the Beckman Institute at Caltech (PMTC). BEE was funded in part by grants from the Parker Institute for Cancer Immunology (PICI), the Chan-Zuckerberg Institute (CZI), the Biswas Family Foundation, NIH NHGRI R01 HG012967, and NIH NHGRI R01 HG013736. BEE is a CIFAR Fellow in the Multiscale Human Program.

Data Availability

The k-spaces software package is available at https://github.com/pachterlab/k-spaces.

The purified human blood cell reference epigenomics data was accessed on September 11, 2024 at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35069. The GALA II epigenomics data was accessed on September 17, 2024 at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77716. Code to preprocess the data is included in the notebooks in our repository; sample and probe IDs for the full and filtered datasets are additionally available at https://github.com/pachterlab/MEPSP_2024/data_paper.

HCR data was downloaded as part of the Readout/Read-in Software Package on November 15, 2024 from https://www.moleculartechnologies.org/info/software. Both the raw data and preprocessed matrices are available in our repository at https://github.com/pachterlab/MEPSP_2024/data_paper.

The C. elegans embryogenesis data was accessed on August 25, 2025 at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126954.

The human-mouse mixture, 10K PBMC, and 5K PBMC datasets were downloaded from the 10X Genomics website at https://www.10xgenomics.com/datasets/10k-hgmm-3p-gemx, https://www.10xgenomics.com/datasets/10k-human-pbmcs-3-v3-1-chromium-x-without-introns-3-1-high, and https://www.10xgenomics.com/datasets/5k_Human_Donor4_PBMC_3p_gem-x (filtered matrix), respectively.

We used anndata v0.9.2, MATLAB v24.1.0.2653294 (R2024a) Update 5, Numpy v1.23.5, Pandas v2.1.4, Python v3.10.9, ReadoutReadin v1.0, ReFACTor v1.0, scanpy v1.10.2, scipy v1.11.4, and sklearn v1.3.0. The full environment is available at https://github.com/pachterlab/MEPSP_2024. Code for reproducing the results and figures in the manuscript is available at https://github.com/pachterlab/MEPSP_2024 under Notebooks.

Supplemental Material

Files

2025.11.24.690254v1.full.pdf

Additional details

Caltech Custom Metadata

Caltech groups
Division of Biology and Biological Engineering (BBE)
Publication Status
Submitted