Computational Medicine Center Faculty Papers

Machine Learning Approaches Identify Genes Containing Spatial Information From Single-Cell Transcriptomics Data.

Phillipe Loher, Computational Medicine Center, Thomas Jefferson University
Nestoras Karathanasis, Thomas Jefferson University; University of Crete

Document Type

Article

Publication Date

2-1-2021

Comments

This article has been peer reviewed. It was published in: Frontiers in Genetics.

Volume 11, 1 February 2021, Article number 612840.

The published version is available at DOI: https://doi.org/10.3389/fgene.2020.612840

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Abstract

The development of single-cell sequencing technologies has allowed researchers to gain important new knowledge about the expression profile of genes in thousands of individual cells of a model organism or tissue. A common disadvantage of this technology is the loss of the three-dimensional (3-D) structure of the cells. Consequently, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized the Single-Cell Transcriptomics Challenge, in which we participated, with the aim of addressing the following two problems: (a) to identify the top 60, 40, and 20 genes of the Drosophila melanogaster embryo that contain the most spatial information and (b) to reconstruct the 3-D arrangement of the embryo using information from those genes. We developed two independent techniques, leveraging machine learning models based on the least absolute shrinkage and selection operator (Lasso) and on deep neural networks (NNs), which are applied to high-dimensional single-cell sequencing data in order to accurately identify genes that contain spatial information. Our first technique, Lasso.TopX, utilizes the Lasso and ranking statistics and allows a user to define the specific number of features they are interested in. The NN approach utilizes weak supervision for linear regression to accommodate uncertain or probabilistic training labels. We show, individually for both techniques, that we are able to identify a user-defined number of important, stable genes containing the most spatial information. The results from both techniques achieve high performance when reconstructing spatial information in D. melanogaster and also generalize to zebrafish (Danio rerio). Furthermore, we identified novel D. melanogaster genes that carry important positional information and were not previously suspected to do so. We also show how the indirect use of the full datasets’ information can lead to data leakage and to an overestimate of the model’s performance. Lastly, we discuss the applicability of our approaches to other feature selection problems outside the realm of single-cell sequencing and the importance of being able to handle probabilistic training labels. Our source code and detailed documentation are available at https://github.com/TJU-CMC-Org/SingleCell-DREAM/.
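
The general idea of using the Lasso to select a user-defined number of features can be illustrated with a minimal sketch. This is not the authors' Lasso.TopX implementation (see the repository linked above for that): it omits the ranking statistics the abstract mentions, and it runs on synthetic data. The function names (`soft_threshold`, `lasso_coordinate_descent`, `top_x_features`), the penalty grid, and the data are all hypothetical and chosen only for illustration. The sketch walks the penalty from strong to weak and stops at the first value that leaves at least X non-zero coefficients.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def top_x_features(X, y, x, lambdas):
    """Return the indices of x features kept by the strongest sufficient penalty."""
    for lam in sorted(lambdas, reverse=True):
        w = lasso_coordinate_descent(X, y, lam)
        kept = [j for j, coef in enumerate(w) if coef != 0.0]
        if len(kept) >= x:
            # If more than x survive, keep the x largest coefficients in magnitude.
            kept.sort(key=lambda j: abs(w[j]), reverse=True)
            return sorted(kept[:x])
    raise ValueError("no penalty in the grid keeps at least x features")


# Synthetic stand-in for an expression matrix: 300 "cells" by 8 "genes".
# Only columns 0 and 3 carry signal about the target (an assumption of this toy).
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(300)]
y = [3.0 * row[0] - 2.0 * row[3] + random.gauss(0.0, 0.1) for row in X]

selected = top_x_features(X, y, x=2, lambdas=[2.5, 2.0, 1.5, 1.0, 0.5, 0.1])
print(selected)
```

In this toy setting the two informative columns are the ones that survive the strongest penalties. Real single-cell data has correlated genes and far more features, which is presumably where stability across resampled fits becomes important; that part is not reproduced here.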

Recommended Citation

Loher, Phillipe and Karathanasis, Nestoras, "Machine Learning Approaches Identify Genes Containing Spatial Information From Single-Cell Transcriptomics Data." (2021). Computational Medicine Center Faculty Papers. Paper 32.
https://jdc.jefferson.edu/tjucompmedctrfp/32

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

PubMed ID

33633771

Included in

Other Medical Sciences Commons
