Predicting Membrane Protein Expression in Yeast from Sequence-Derived Features
Although they comprise one-quarter of most organisms' proteomes and are the targets of over half of all drugs, integral membrane proteins (IMPs) remain difficult to characterize. Poor expression in heterologous systems often hinders the study of IMPs, and large-scale efforts to express them have proven time-consuming, costly, and capricious. To address this, we recently used quantitative experimental expression studies to train a machine learning model capable of predicting membrane protein expression in Escherichia coli solely from sequence-derived features. Though our linear bacterial model generalizes well to eukaryotic membrane proteins expressed in E. coli, it predicts poorly for IMPs heterologously expressed in yeast, a host frequently chosen for its greater similarity to higher eukaryotes. We therefore report a new model capable of predicting IMP expression in Saccharomyces cerevisiae. To avoid overfitting to our limited training dataset, we reduce the number of sequence-derived features used to predict expression from the 89 used for the E. coli model to just eight. Strikingly, and in agreement with recent findings in the wet laboratory, disorder of the C-terminus is identified as the most predictive feature. We additionally incorporate new features into our algorithm, including predicted N- and O-glycosylation and disulfide bond formation. We are working to verify the model across a wide variety of small- and large-scale expression datasets from the literature. We will share our predictor with the broader community to help accelerate biochemical and biophysical study of membrane proteins.
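The abstract does not state how the feature set was reduced from 89 to eight. As a rough illustration only, the sketch below shows one common way such a reduction might be done: rank each feature by the absolute value of its Pearson correlation with measured expression and keep the top eight. The data are synthetic, and the function names (`pearson`, `select_top_features`) and dataset size are assumptions made for this example, not the authors' method or data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_top_features(rows, target, k):
    """Return the sorted indices of the k columns whose absolute
    correlation with the target is largest (hypothetical helper)."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:k])


# Synthetic stand-in data (assumption): 60 proteins, 89 features each,
# with expression driven by the first three features plus noise.
rng = random.Random(0)
n_proteins, n_features, k = 60, 89, 8
rows = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_proteins)]
target = [row[0] + row[1] + row[2] + rng.gauss(0, 0.1) for row in rows]

selected = select_top_features(rows, target, k)
print(selected)
```

A univariate filter like this is only one option; regularized regression or cross-validated stepwise selection are alternatives, and any of them should be carried out inside the cross-validation loop so that the selection step itself does not leak information from held-out proteins.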
© 2017 Elsevier B.V. Available online 3 February 2017. Meeting Abstract: 1746-Pos.