Published March 30, 2018 | public
Journal Article Open

A statistical model for improved membrane protein expression using sequence-derived features


The heterologous expression of integral membrane proteins (IMPs) remains a major bottleneck in the characterization of this important protein class. IMP expression levels are currently unpredictable, which renders the pursuit of IMPs for structural and biophysical characterization challenging and inefficient. Experimental evidence demonstrates that changes within the nucleotide or amino-acid sequence for a given IMP can dramatically affect expression levels; yet these observations have not resulted in generalizable approaches to improve expression levels. Here, we develop a data-driven statistical predictor named IMProve, that, using only sequence information, increases the likelihood of selecting an IMP that expresses in E. coli. The IMProve model, trained on experimental data, combines a set of sequence-derived features resulting in an IMProve score, where higher values have a higher probability of success. The model is rigorously validated against a variety of independent datasets that contain a wide range of experimental outcomes from various IMP expression trials. The results demonstrate that use of the model can more than double the number of successfully expressed targets at any experimental scale. IMProve can immediately be used to identify favorable targets for characterization. Most notably, IMProve demonstrates for the first time that IMP expression levels can be predicted directly from sequence.
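The selection step described in the abstract can be illustrated with a minimal sketch: compute a few sequence-derived features for each candidate, combine them into one score, and rank candidates so that higher scores are tried first. Everything specific here is an assumption made for illustration. The two features, the `WEIGHTS` values, and the names `Candidate`, `features`, `score`, and `rank` are hypothetical and are not the published IMProve features, parameters, or code. In the actual model the weights are learned from experimental expression data.

```python
from dataclasses import dataclass

# Hypothetical residue set used for an illustrative hydrophobicity feature.
HYDROPHOBIC = set("AVLIMFWC")

# Hypothetical weights; a real model would learn these from training data.
WEIGHTS = [-1.0, 0.5]


@dataclass
class Candidate:
    name: str
    protein: str  # amino-acid sequence
    cds: str      # coding nucleotide sequence


def features(c: Candidate) -> list[float]:
    """Return two illustrative sequence-derived features."""
    hydrophobic_fraction = sum(aa in HYDROPHOBIC for aa in c.protein) / len(c.protein)
    gc_fraction = sum(nt in "GC" for nt in c.cds) / len(c.cds)
    return [hydrophobic_fraction, gc_fraction]


def score(c: Candidate) -> float:
    """Combine the features into a single score (weighted sum)."""
    return sum(w * x for w, x in zip(WEIGHTS, features(c)))


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)
```

Under these assumptions, an experimenter would attempt expression of the first entries returned by `rank` before the rest; only the relative order of the scores matters for that choice.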

Additional Information

© 2018 American Society for Biochemistry and Molecular Biology, Inc. Published under license by The American Society for Biochemistry and Molecular Biology, Inc. Received November 22, 2017. Accepted January 29, 2018.

We thank Daniel Daley and Thomas Miller's group for discussion, Yaser Abu-Mostafa and Yisong Yue for guidance regarding machine learning, Niles Pierce for providing NUPACK source code (33), Welison Floriano and Naveed Near-Ansari for maintaining local computing resources, and Samuel Schulte for suggesting the model's name. We thank Michiel Niesen, Stephen Marshall, Thomas Miller, Reid van Lehn, James Bowie, and Tom Rapoport for comments on the manuscript. Models and analyses are possible thanks to raw experimental data provided by Daniel Daley and Mikaela Rapp (20); Nir Fluman (29); Edda Kloppmann, Brian Kloss, and Marco Punta from NYCOMPS (2, 3); Pikyee Ma (46); Renaud Wagner (49); Florent Bernaudat (53); and Constance Jeffrey (47).

We acknowledge funding from an NIH Pioneer Award to WMC (5DP1GM105385); a Benjamin M. Rosen graduate fellowship, an NIH/NRSA training grant (5T32GM07616), and an NSF Graduate Research fellowship to SMS; and an Arthur A. Noyes Summer Undergraduate Research Fellowship to NJ. Computational time was provided by Stephen Mayo and Douglas Rees. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1144469. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575 (108).

The authors declare that they have no conflicts of interest with the contents of this article.

Author Contributions: S.M.S., A.M., and W.M.C. conceived the project. S.M.S. developed the approach. S.M.S., A.M., and N.J. compiled sequence and experimental data. N.J. created code to demonstrate feasibility. S.M.S. performed all published calculations. S.M.S. and W.M.C. wrote the manuscript.

Attached Files

Supplemental Material - 134046_1_supp_57854_p2qtl9__1_.xlsx

Supplemental Material - 134046_1_supp_57863_p2q6g6.docx

Supplemental Material - 134046_1_supp_57869_p2qggg.pdf

Supplemental Material - 134046_1_supp_57870_p2qcgb.pdf

Supplemental Material - 134046_1_supp_57871_p2qmgb.pdf

Supplemental Material - 134046_1_supp_57872_p2qmgb.pdf

Supplemental Material - 134046_1_supp_57873_p2qmgb.pdf

Published - J._Biol._Chem.-2018-Saladi-4913-27.pdf


Additional details

August 19, 2023