Information in a protein flows from sequence to structure to function, with each step determined by the one before it. Protein design inverts this process: specify a desired function, design a structure capable of performing that function, and then find a sequence that folds into that structure. This inversion of the "central dogma" underlies nearly all de novo protein design efforts. Our ability to carry out these steps depends on our understanding of protein folding and function, and on how well that understanding is captured in computational methods. In recent years, deep learning-based methods, by providing efficient and accurate structural modeling and enriching for successful designs, have enabled the field to move beyond designing protein structures toward designing functional proteins.
Designing a functional protein from scratch begins with identifying the features required to perform the desired function. Once a functional motif has been defined, designing a protein structure that satisfies its constraints is among the most challenging steps in protein design. Traditional backbone-centered approaches remain the most interpretable way to model protein structures; for example, designs incorporating key structural insights have improved our ability to control β-barrel formation (Fig. 3a), which is important for enzyme and membrane-protein applications. Deep learning has transformed protein design by dramatically expanding our ability to shape protein structure in response to functional constraints. In a design strategy analogous to the original energy-landscape approach, learned statistical potentials can replace physics-based potentials to guide structural search, giving a similar ability to generate new structures and topologies to those found in nature. The arrival of highly accurate protein structure prediction with the AlphaFold system, alongside the development of trRosetta and RoseTTAFold, opened up new ways to generate proteins.
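As an illustration of this idea, the sketch below shows a hallucination-style Monte Carlo search in which a learned score stands in for a physics-based potential. The `predict_confidence` function is a hypothetical placeholder, mocked here with a toy objective so the loop runs end to end; in practice it would query a trRosetta- or RoseTTAFold-style network and return a confidence score for the sequence's predicted structure.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def predict_confidence(seq: str) -> float:
    """Placeholder for a learned potential, e.g. the confidence of a
    trRosetta/RoseTTAFold-style predictor on this sequence. Mocked with
    a toy objective (fraction of hydrophobic residues) for runnability."""
    return sum(seq.count(a) for a in "AILV") / len(seq)

def hallucinate(length: int = 100, steps: int = 2000, temp: float = 0.02) -> str:
    """Monte Carlo search: mutate one position at a time and accept moves
    with a Metropolis criterion on the learned score."""
    seq = "".join(random.choice(AMINO_ACIDS) for _ in range(length))
    score = predict_confidence(seq)
    for _ in range(steps):
        pos = random.randrange(length)
        mutant = seq[:pos] + random.choice(AMINO_ACIDS) + seq[pos + 1:]
        new_score = predict_confidence(mutant)
        # Accept improvements; occasionally accept worse moves to escape
        # local optima in the learned landscape.
        if new_score >= score or random.random() < math.exp((new_score - score) / temp):
            seq, score = mutant, new_score
    return seq

print(predict_confidence(hallucinate()))
```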
In protein design, the rise of diffusion-based generative models (Figure 4) marks an important advance: these models offer more stable training and better sample diversity than other classes of generative models while maintaining high sample quality. Rather than attempting to synthesize a complete atomic structure in one shot, they start from random noise and denoise coarse features first, filling in details later. This inductive bias matches the hierarchical nature of protein structure well, decomposing structure generation into global tertiary organization first, then local secondary structure, and finally chemical detail.
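The toy sketch below illustrates this reverse (denoising) process under stated assumptions: the `denoiser` function is a stub standing in for a trained network, the coordinates are simplified pseudo-CA points rather than full residue frames, and the linear noise schedule is purely illustrative rather than that of any particular model.

```python
import numpy as np

def denoiser(x: np.ndarray, t: int, n_steps: int) -> np.ndarray:
    """Placeholder for a trained denoising network that predicts a cleaner
    structure from noisy coordinates at step t. Stubbed as a mild shrink
    toward the centroid so the loop is runnable."""
    return x - 0.1 * (x - x.mean(axis=0))

def sample_backbone(n_residues: int = 100, n_steps: int = 50) -> np.ndarray:
    rng = np.random.default_rng(0)
    # Start from pure Gaussian noise; early (high-noise) steps fix coarse,
    # global features, later steps fill in local detail.
    x = rng.normal(size=(n_residues, 3))
    for t in reversed(range(1, n_steps + 1)):
        x = denoiser(x, t, n_steps)
        sigma = t / n_steps  # illustrative linear noise schedule
        # Re-inject noise that shrinks as t decreases.
        x = x + 0.1 * sigma * rng.normal(size=x.shape)
    return x

coords = sample_backbone()
print(coords.shape)  # (100, 3): one pseudo-CA coordinate per residue
```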
With the advent of accurate structure prediction methods such as AlphaFold, it has become possible to compare the predicted fold of a designed sequence against the original design structure. Because these predictions are relatively fast to compute, a designed sequence's predicted fold can be obtained along with confidence measures (e.g., pLDDT or pAE). A sequence predicted to fold back into the design structure with high confidence ("self-consistent" or "designable") is expected to be more likely to fold correctly in the wet lab. Such in silico evaluation substantially accelerates method development, since models and designed sequences can be assessed computationally without waiting for slower, more laborious wet-lab validation feedback (Figure 5).
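A minimal sketch of such a self-consistency filter, under stated assumptions, is shown below. The helpers `predict_structure` and `ca_rmsd` are hypothetical placeholders for a structure predictor (e.g., AlphaFold or ESMFold) and a superposition routine; the 2 Å scRMSD and mean-pLDDT 80 cutoffs are common choices in the literature, not values prescribed by any single method.

```python
import numpy as np

def predict_structure(sequence: str):
    """Placeholder for a structure predictor (e.g., AlphaFold or ESMFold)
    returning predicted CA coordinates and per-residue pLDDT. Stubbed here
    so the filter is runnable."""
    n = len(sequence)
    return np.zeros((n, 3)), np.full(n, 90.0)

def ca_rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """Placeholder RMSD; a real implementation would first superpose the
    two structures (e.g., with the Kabsch algorithm)."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

def is_self_consistent(sequence: str, design_ca: np.ndarray,
                       rmsd_cutoff: float = 2.0,
                       plddt_cutoff: float = 80.0) -> bool:
    """A design passes if the predicted fold matches the designed backbone
    (scRMSD below cutoff) and the predictor is confident in its own
    prediction (mean pLDDT above cutoff)."""
    pred_ca, plddt = predict_structure(sequence)
    return ca_rmsd(pred_ca, design_ca) < rmsd_cutoff and plddt.mean() > plddt_cutoff
```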