A Caltech Library Service

A statistical model for improved membrane protein expression using sequence-derived features

Saladi, Shyam M. and Javed, Nauman and Müller, Axel and Clemons, William M. (2018) A statistical model for improved membrane protein expression using sequence-derived features. Journal of Biological Chemistry, 293 (13). pp. 4913-4927. ISSN 0021-9258.

Supplemental Material: 1 MS Word file, 5 PDF files, 1 MS Excel file. See Usage Policy.


The heterologous expression of integral membrane proteins (IMPs) remains a major bottleneck in the characterization of this important protein class. IMP expression levels are currently unpredictable, which renders the pursuit of IMPs for structural and biophysical characterization challenging and inefficient. Experimental evidence demonstrates that changes within the nucleotide or amino acid sequence for a given IMP can dramatically affect expression levels, yet these observations have not resulted in generalizable approaches to improve expression levels. Here, we develop a data-driven statistical predictor, named IMProve, that uses only sequence information to increase the likelihood of selecting an IMP that expresses in E. coli. The IMProve model, trained on experimental data, combines a set of sequence-derived features into an IMProve score, where higher values indicate a higher probability of successful expression. The model is rigorously validated against a variety of independent datasets that contain a wide range of experimental outcomes from various IMP expression trials. The results demonstrate that use of the model can more than double the number of successfully expressed targets at any experimental scale. IMProve can immediately be used to identify favorable targets for characterization. Most notably, IMProve demonstrates for the first time that IMP expression levels can be predicted directly from sequence.
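The general idea in the abstract, combining sequence-derived features into a single score and preferring higher-scoring targets, can be sketched as follows. This is a minimal illustration only and not the IMProve model: the three features, the `HYDROPHOBIC` residue set, the `ILLUSTRATIVE_WEIGHTS` values, and all function names are hypothetical choices made for this example, and the weights carry no biological meaning. The sketch also assumes that every sequence is non-empty.

```python
from typing import Dict, List, Tuple

# Hypothetical residue set and weights: illustrative only,
# NOT the features or trained parameters of IMProve.
HYDROPHOBIC = set("AILMFVW")
ILLUSTRATIVE_WEIGHTS: Dict[str, float] = {
    "gc_content": 1.0,
    "hydrophobic_fraction": -0.5,
    "length": -0.001,
}


def sequence_features(nt_seq: str, aa_seq: str) -> Dict[str, float]:
    """Compute a few simple features from nucleotide and amino acid sequences."""
    nt, aa = nt_seq.upper(), aa_seq.upper()
    return {
        # Fraction of G and C bases in the coding sequence.
        "gc_content": (nt.count("G") + nt.count("C")) / len(nt),
        # Fraction of residues in the (hypothetical) hydrophobic set.
        "hydrophobic_fraction": sum(r in HYDROPHOBIC for r in aa) / len(aa),
        # Protein length in residues.
        "length": float(len(aa)),
    }


def linear_score(feats: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine features into one number as a weighted sum."""
    return sum(w * feats[name] for name, w in weights.items())


def rank_candidates(
    candidates: List[Tuple[str, str, str]],
) -> List[Tuple[str, float]]:
    """Score (name, nt_seq, aa_seq) candidates and sort best-first."""
    scored = [
        (name, linear_score(sequence_features(nt, aa), ILLUSTRATIVE_WEIGHTS))
        for name, nt, aa in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy = [("a", "ATAT", "AILV"), ("b", "GGCC", "KKKK")]
    for name, value in rank_candidates(toy):
        print(f"{name}\t{value:.3f}")
```

In a setting like the one described, the ranked list would be used to decide which targets to attempt first; a real predictor would learn its weights from experimental expression data rather than fixing them by hand.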

Item Type: Article
ORCID: Clemons, William M., 0000-0002-0021-889X
Additional Information: © 2018 American Society for Biochemistry and Molecular Biology, Inc. Published under license by The American Society for Biochemistry and Molecular Biology, Inc. Received November 22, 2017. Accepted January 29, 2018. We thank Daniel Daley and Thomas Miller’s group for discussion, Yaser Abu-Mostafa and Yisong Yue for guidance regarding machine learning, Niles Pierce for providing NUPACK source code (33), Welison Floriano and Naveed Near-Ansari for maintaining local computing resources, and Samuel Schulte for suggesting the model’s name. We thank Michiel Niesen, Stephen Marshall, Thomas Miller, Reid van Lehn, James Bowie, and Tom Rapoport for comments on the manuscript. Models and analyses are possible thanks to raw experimental data provided by Daniel Daley and Mikaela Rapp (20); Nir Fluman (29); Edda Kloppmann, Brian Kloss, and Marco Punta from NYCOMPS (2, 3); Pikyee Ma (46); Renaud Wagner (49); Florent Bernaudat (53); and Constance Jeffrey (47). We acknowledge funding from an NIH Pioneer Award to WMC (5DP1GM105385); a Benjamin M. Rosen graduate fellowship, an NIH/NRSA training grant (5T32GM07616), and an NSF Graduate Research fellowship to SMS; and an Arthur A. Noyes Summer Undergraduate Research Fellowship to NJ. Computational time was provided by Stephen Mayo and Douglas Rees. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1144469. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575 (108). The authors declare that they have no conflicts of interest with the contents of this article. Author Contributions: S.M.S., A.M., and W.M.C. conceived the project. S.M.S. developed the approach. S.M.S., A.M., and N.J. compiled sequence and experimental data. N.J. created code to demonstrate feasibility. S.M.S. performed all published calculations. S.M.S. and W.M.C. wrote the manuscript.
Funding Agency | Grant Number
NIH Predoctoral Fellowship | 5T32GM07616
NSF Graduate Research Fellowship | UNSPECIFIED
Arthur A. Noyes Summer Undergraduate Research Fellowship | UNSPECIFIED
NSF Graduate Research Fellowship | DGE-1144469
Caltech Summer Undergraduate Research Fellowship (SURF) | UNSPECIFIED
Subject Keywords: machine-learning, prediction, protein expression, membrane protein, membrane biogenesis, structural biology, membrane biophysics, computational biology
Record Number: CaltechAUTHORS:20180205-111319761
Official Citation: Shyam M. Saladi, Nauman Javed, Axel Müller, and William M. Clemons Jr. A statistical model for improved membrane protein expression using sequence-derived features. J. Biol. Chem. 2018 293: 4913-. doi:10.1074/jbc.RA117.001052
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 84675
Deposited By: Ruth Sustaita
Deposited On: 05 Feb 2018 21:03
Last Modified: 04 Apr 2018 23:22
