Data-Driven Computational Protein Design

Computational protein design can generate proteins not found in nature that adopt desired structures and perform new functions. Although proteins can in principle be designed from scratch, practical success depends on large amounts of data describing the sequence, structure, and function of existing proteins and their variants. Recent work has demonstrated creative uses of multiple sequence alignments, protein structures, and high-throughput functional assays in computational protein design. Methods range from using experimental data to enhance structure-based design, to building regression models, to training deep neural networks that generate new sequences. Looking ahead, deep learning is expected to become increasingly important for getting the most value out of protein design data.


The goal of computational protein design is to produce new proteins that fold into desired structures and perform useful or interesting functions. An early and compelling formulation treats the problem as a search for a combination of residues that fits a natural backbone, much like a stereochemical puzzle. There are many examples of using this approach to redesign natural proteins or complexes to improve or change their stability or function. This central design concept has been elaborated with great success, and it is now possible to construct proteins from scratch using fragments of secondary structure and loops, or recombined fragments of existing folds. To evaluate candidate design models at atomic resolution, the widely used Rosetta program quantifies sequence-backbone fit using pseudo-physical potentials that include contributions from torsional strain, residue desolvation, van der Waals, electrostatic, and hydrogen bonding interactions. To obtain the best performance, the Rosetta scoring function combines analytical and statistical terms that are parameterized against existing structures.

Computational protein design research is currently changing rapidly. The revolution in high-throughput DNA synthesis and sequencing, together with advances in machine learning, is reshaping the field. Soon, most design methods are likely to use data in ways that go far beyond what is done today. This article highlights ways to incorporate different types of data into computational protein design, sometimes in combination with structure-based modeling, but often as an alternative to it. The focus is on three types of data: multiple sequence alignments (MSAs) of evolutionarily related proteins, experimental protein structures, and data from high-throughput experiments.

MSA data

Homologs from different organisms can provide hundreds of thousands of examples of proteins with related structures and functions, and alignments of homologous sequences can be mined for statistically significant patterns. Quantitative analysis of residue covariation at pairs of positions in large MSAs can be used to score the effect of mutations on stability and to predict protein structure and protein-protein interactions. The ability to predict residue-residue distance maps from large alignments is behind the surprising success of DeepMind's protein structure prediction method, AlphaFold.
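To make the idea of pairwise covariation concrete, the following minimal sketch computes the mutual information between two alignment columns. It is an illustration only: raw mutual information is a simple stand-in for the more sophisticated coupling models used in practice, the function names are hypothetical, and the toy alignment is invented for the example rather than taken from any real protein family.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences (an assumption of this sketch).
    """
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), n_ab in counts_ij.items():
        p_ab = n_ab / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def top_covarying_pairs(msa, k=3):
    """Rank all column pairs by mutual information, highest first."""
    length = len(msa[0])
    scored = [((i, j), mutual_information(msa, i, j))
              for i, j in combinations(range(length), 2)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy alignment (invented): columns 1 and 3 always change together (K with E, R with D).
toy_msa = ["AKLE", "AKLE", "ARLD", "ARLD", "GKIE", "GRID"]

for (i, j), score in top_covarying_pairs(toy_msa):
    print(f"columns {i} and {j}: MI = {score:.3f} bits")
```

On real alignments, raw scores like this are confounded by phylogeny and by indirect correlations, which is why large, diverse MSAs and more careful statistical models are needed before covariation can be read as a structural or functional signal.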

Protein structure data

The scoring functions in design methods such as Rosetta, FoldX, and EvoDesign/EvoEF2 use protein structures to parameterize terms that quantify contributions to protein stability. Several groups have explored more direct ways to evaluate sequence-structure compatibility, using information from experimental structures without building all-atom models or decomposing the energy into physical terms that are difficult to approximate accurately.

Experimental data

Sequence-based design can generate new proteins similar to known family members, but a deep MSA is not always available. Structure-based methods have made significant progress in protein design, but the success rate on difficult problems is very low. Introducing experimental data into the design process can improve the chances of success. An increasingly common approach to difficult design problems is to conduct a first round of structure-based design, test many candidates, and then perform a deep mutational scan of the most promising molecules.
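As a rough illustration of how deep mutational scanning data might feed back into design, the sketch below converts sequencing read counts before and after selection into per-variant enrichment scores relative to wild type. The log-ratio score is a commonly used convention, not a method described in this article; the function name, the pseudocount value, and the read counts are all hypothetical.

```python
import math


def enrichment_scores(pre_counts, post_counts, wild_type="WT", pseudocount=0.5):
    """Return the log2 enrichment of each variant relative to wild type.

    pre_counts / post_counts map variant names to read counts before and
    after selection. The pseudocount (an assumed value) avoids log(0) for
    variants that drop out after selection.
    """
    def log_ratio(variant):
        post = post_counts.get(variant, 0) + pseudocount
        pre = pre_counts.get(variant, 0) + pseudocount
        return math.log2(post / pre)

    # Subtracting the wild-type ratio cancels differences in sequencing depth.
    wt_ratio = log_ratio(wild_type)
    return {v: log_ratio(v) - wt_ratio for v in pre_counts if v != wild_type}


# Hypothetical read counts, invented for illustration.
pre_selection = {"WT": 1000, "A10G": 100, "L25P": 100}
post_selection = {"WT": 1000, "A10G": 400, "L25P": 10}

scores = enrichment_scores(pre_selection, post_selection)
for variant, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{variant}: {score:+.2f}")
```

Variants with positive scores were enriched by selection relative to wild type and are candidates to carry into the next design round, while negative scores flag substitutions to avoid. Scores of this kind can also serve as training labels for regression models.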


  1. Frappier, V., & Keating, A. E. Data-driven computational protein design. Current Opinion in Structural Biology, 2021, 69, 63-69.
  2. Shin, J.-E., Riesselman, A. J., Kollasch, A. W., et al. Protein design and variant prediction using autoregressive generative models. Nature Communications, 2021, 12, 2403.