deep research: folding algos

Introduction

Computational modeling of biomolecular folding and design has evolved dramatically over the past 20 years. This report reviews two major categories of approaches – traditional physics/knowledge-based methods and modern deep learning-based models – used for DNA/RNA secondary structure prediction, protein folding, de novo design, and drug discovery. We first introduce essential background concepts (thermodynamics of folding, sequence alignment, RNA structure basics, etc.) to set the stage. We then group the methods by methodology (traditional vs. deep learning) and chronologically within each group. For each notable model or algorithm, we outline its architecture, training data, design rationale, comparisons to similar methods, and key application areas (e.g. protein structure prediction, RNA folding, protein design in synthetic biology, or drug discovery).

Background Concepts

Thermodynamics of Biomolecular Folding

Biomolecular folding is governed by thermodynamics: under given conditions, a molecule adopts the conformation that minimizes its free energy.

Free Energy Landscapes: Folding can be visualized as a funnel-shaped energy landscape, in which the unfolded ensemble has high free energy and high conformational entropy and the native folded state lies at a deep free-energy minimum. Despite the enormous number of possible conformations, proteins fold efficiently because the landscape biases the search toward the native basin (the “funnel” hypothesis).

Entropy-Enthalpy Trade-off: Folding is entropically unfavorable for the chain (loss of conformational freedom) but enthalpically favorable (gain of stabilizing interactions). The native state’s stability comes from a balance of forces (hydrophobic burial, hydrogen bonds, electrostatic interactions) that collectively outweigh the entropy loss.

Levinthal’s Paradox: Cyrus Levinthal noted that a protein could not find its native state by randomly sampling all conformations (by one common estimate, ~$10^{143}$ possibilities) within any biologically relevant time. The paradox is resolved by the funnel-like energy landscape: folding is not a random search but a directed process, with metastable intermediates and pathways leading downhill in free energy.

In summary, proteins (and structured RNAs) fold via a biased search on a rough energy funnel, minimizing the Gibbs free energy (ΔG = ΔH – TΔS): favorable enthalpic interactions and the hydrophobic effect drive the chain into a well-defined low-energy structure despite the entropic cost.

Schematic of a protein folding funnel. The vertical axis represents free energy, and the width of the funnel at a given height represents conformational entropy. Unfolded conformations (top) are high in energy and highly entropic. As folding progresses (downward), the available conformations narrow toward the native structure at the bottom, a deep energy minimum. This illustrates how a protein avoids a random search of all possibilities by following energetically downhill routes (arrows) toward the native fold.
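The relation ΔG = ΔH – TΔS can be made concrete with a small two-state (folded/unfolded) calculation. The sketch below is illustrative only: the enthalpy and entropy values are hypothetical round numbers, not measurements for any real protein, and the function names are ours.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol*K)


def delta_g(delta_h, delta_s, temperature):
    """Gibbs free energy change of folding, dG = dH - T*dS (kJ/mol)."""
    return delta_h - temperature * delta_s


def fraction_folded(dg, temperature):
    """Folded-state population in a two-state model with folding free energy dg."""
    k_fold = math.exp(-dg / (R * temperature))  # equilibrium constant [folded]/[unfolded]
    return k_fold / (1.0 + k_fold)


# Hypothetical values: folding releases heat (dH < 0) but costs entropy (dS < 0).
dH = -200.0  # kJ/mol
dS = -0.60   # kJ/(mol*K)

for T in (298.15, 333.15, 353.15):
    dG = delta_g(dH, dS, T)
    print(f"T = {T:.2f} K  dG = {dG:7.2f} kJ/mol  folded = {fraction_folded(dG, T):.3f}")
```

With these assumed numbers the folded state dominates at room temperature, and the population shifts toward the unfolded state once T exceeds ΔH/ΔS, where the entropic term outweighs the enthalpic one.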

Sequence Alignment and Evolutionary Information

Modern bioinformatics relies on sequence alignments to deduce structural and functional clues.

Pairwise Alignment: Algorithms like Needleman–Wunsch (global alignment) and Smith–Waterman (local alignment) use dynamic programming to find optimal residue-residue alignments between two sequences. They are guaranteed to find the highest-scoring alignment under a given scoring scheme (substitution matrix and gap penalties), at the cost of $O(mn)$ time for sequences of lengths $m$ and $n$, i.e. quadratic for sequences of similar length. Such pairwise alignments reveal conserved residues (often functionally or structurally important) and regions of divergence.

Multiple Sequence Alignment (MSA): An MSA extends this idea to align many homologous sequences simultaneously. MSAs, often built with tools like ClustalΩ or MUSCLE, highlight conserved motifs across a family and are foundational for many structure prediction methods. Position-specific scoring matrices or profiles derived from MSAs capture evolutionary constraints at each position.

Evolutionary Couplings: By analyzing MSAs, one can detect correlated mutations, i.e. pairs of residues that co-evolve (if one mutates, the other tends to mutate to compensate). Such covariation signals often indicate physical contacts in the 3D structure. Direct coupling analysis (DCA) uses statistical models (maximum entropy / Potts models) to infer which residue pairs are in contact by distinguishing direct correlations from transitive effects. This was a breakthrough in the early 2010s: the top predicted contacts from sequence covariation were accurate enough to fold proteins de novo in some cases. These concepts will reappear when discussing MSA-based structure prediction and co-evolution methods.

Profile HMMs: Another alignment approach uses hidden Markov models to represent the profile of a sequence family, capturing position-specific insertion/deletion probabilities and residue frequencies. Tools like HMMER build profile HMMs from MSAs; scoring a sequence against such a profile is a sensitive way to detect distant homologs.

Overall, sequence alignment techniques provide the evolutionary context that many modern algorithms (both “traditional” and deep learning) leverage to predict structure or design new sequences.
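To make the dynamic-programming idea behind pairwise alignment concrete, here is a minimal Needleman–Wunsch sketch. It assumes a simple match/mismatch score and a linear gap penalty (values chosen arbitrarily) rather than a real substitution matrix such as BLOSUM62 or affine gaps, so it illustrates the recurrence and traceback only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring parameters are
    illustrative defaults, not values from any published scheme.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the (n+1) x (m+1) table: this is the O(mn) step.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

Smith–Waterman differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell instead of the corner.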

Example of a multiple sequence alignment (MSA) for a protein across different species. Identical or highly conserved residues are highlighted in the same color. Conserved blocks (solid blue/purple regions) often indicate structural or functional motifs that persist through evolution, while variable positions (mixed colors) suggest tolerance to mutations. Co-evolution analysis of such alignments can reveal which residues likely contact each other in the folded structure.
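The covariation signal between two alignment columns can be illustrated with mutual information (MI). Note that this is a simpler statistic than the DCA/Potts models discussed above: MI does not separate direct from transitive correlations. The alignment below is a made-up toy example, and the function name is ours.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA.

    msa is a list of equal-length aligned sequences. No sequence
    reweighting or pseudocounts are applied in this sketch.
    """
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (x, y), count in pairs.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((col_i[x] / n) * (col_j[y] / n)))
    return mi


# Toy alignment (hypothetical): columns 1 and 3 change together, column 0 is invariant.
toy_msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
]

print(mutual_information(toy_msa, 1, 3))  # covarying pair
print(mutual_information(toy_msa, 0, 1))  # invariant column carries no signal
```

A fully conserved column yields zero MI with every other column, which is one reason covariation methods need alignments that are both deep and diverse.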

RNA Folding and Structural Biology Basics

RNA Folding Rules: Unlike proteins, RNA folds primarily into predictable secondary structures (hairpins, stems, loops) driven by Watson–Crick base pairing (A·U, G·C) and wobble pairing (G·U). Dynamic programming algorithms (e.g. Zuker’s algorithm) compute the minimum free energy (MFE) secondary structure by efficiently considering all nested (non-crossing) pairings. Nearest-neighbor thermodynamic parameters (experimentally measured stacking energies and loop penalties) are used to calculate the free energy of candidate structures, and the lowest-energy fold is reported as the prediction. For example, tools like RNAfold (ViennaRNA) or mfold apply these principles to predict RNA secondary structure from sequence. Pseudoknots (where base pairs cross one another) are more challenging and are typically excluded by these standard DP algorithms. Many RNAs also fold with the help of cations or chaperones in vivo. Nonetheless, the general principle is that RNA self-assembles into structures that maximize base pairing and favorable stacking (enthalpy) while minimizing the entropic cost of loops, analogous to protein folding but with simpler base-pairing rules.
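The dynamic-programming structure of secondary-structure prediction can be sketched with the Nussinov algorithm, which maximizes the number of nested base pairs. This is deliberately a simpler stand-in for Zuker-style MFE folding: it uses no nearest-neighbor energies, so it is not what RNAfold or mfold compute. The minimum hairpin loop length of 3 is an assumption of this sketch.

```python
# Allowed pairs: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max_pairs, dot_bracket) for nested pairings of an RNA sequence."""
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j]; zero for short spans.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback to recover one optimal structure in dot-bracket notation.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), "".join(structure)


pairs, fold = nussinov("GGGAAAUCC")
print(pairs, fold)
```

Because the recurrence only combines sub-intervals that do not overlap, crossing pairs (pseudoknots) can never appear in the result, which mirrors the limitation noted above for simple DP algorithms.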

Protein Motifs and Structure Determination: Proteins have hierarchical structure: local secondary structures (α-helices, β-sheets, turns) assemble into tertiary structures (3D folds, often made of recurring motifs or domains). Examples of motifs include the helix-turn-helix (common in DNA-binding proteins), the Rossmann fold (binds nucleotides), and the beta-alpha-beta unit. Conserved sequence motifs often correspond to active sites or binding regions (e.g. the Cys₂His₂ pattern of classical zinc fingers, in which two cysteines and two histidines coordinate Zn²⁺). Understanding these recurring patterns helps guide protein design and prediction algorithms; for instance, homology modeling assumes a new protein will adopt the known fold of a similar sequence (thereby inheriting its motifs and topology).

Structural Biology Fundamentals: The experimental determination of macromolecular structures provides the ground-truth data for computational models. X-ray crystallography and cryo-EM reveal atomic-level protein structures (at resolutions of ~1–3 Å) by diffracting X-rays from crystals and by imaging frozen samples, respectively, while NMR spectroscopy determines structures in solution for smaller proteins. The Protein Data Bank (PDB) stores over 190,000 experimentally solved structures. Databases like this serve as training and benchmarking resources for folding algorithms. Historically, structure prediction was framed as the “protein folding problem”: how to computationally predict a protein’s 3D structure from its amino acid sequence alone, following the premise (Anfinsen’s dogma) that the sequence determines the structure. Over decades, methods evolved from comparative modeling using known structures, to physics-based folding simulations, to the recent deep learning revolution that has essentially solved single-protein structure prediction. The following sections trace that evolution, starting with the traditional methods that laid the groundwork.

Traditional Computational Models (2000–2015)

Traditional approaches to protein and RNA folding relied on physical energy calculations, statistical potentials, and evolutionary analysis. They often required expert-crafted algorithms and significant computing for tasks like conformational search or molecular simulation. Below we discuss key models and methods such as homology modeling, Rosetta’s fragment-based folding, molecular dynamics simulations (and Markov state models), and MSA-based coevolution methods. These approaches dominated the field prior to deep learning and established the “prior art” in computational biology.

Homology Modeling and Threading (1990s–2000s)

Architecture & Method: Homology modeling predicts a protein’s 3D structure by aligning its sequence to one or more proteins of known structure (templates) and building the target model on the template backbone. This assumes the target and template share an evolutionary ancestor and thus a similar fold. The process involves sequence alignment, copying the template’s coordinates for aligned regions, and rebuilding insertions or loops. Knowledge-based scoring (favoring conserved core residues, etc.) and refinement are then applied to adjust side chains and optimize geometry. Threading is a related approach that fits a sequence into a library of known folds (even without clear sequence similarity) using position-specific scoring derived from profiles or potentials. Early software like MODELLER (1993, Šali & Blundell) automated comparative modeling using restraints derived from the template alignment, and HHpred (2005) improved remote homology detection via HMM–HMM alignments. In 2008, Zhang’s I-TASSER pipeline combined threading to find templates with fragment assembly and refinement, producing some of the top predictions in CASP competitions.
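The coordinate-copying step described above can be sketched as follows. This is a toy illustration, not how MODELLER or I-TASSER work internally: it assumes a precomputed target-template alignment (gaps written as "-") and one coordinate per template residue (e.g. the Cα atom), and it simply flags target insertions as positions still to be built. All names and coordinates are hypothetical.

```python
def copy_template_coordinates(aligned_target, aligned_template, template_coords):
    """Transfer template coordinates onto target residues via an alignment.

    Returns a list of (target_residue, coord_or_None). None marks target
    residues aligned to a gap in the template, which would need loop modeling.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    n_template = sum(1 for c in aligned_template if c != "-")
    if n_template != len(template_coords):
        raise ValueError("one coordinate is required per template residue")

    model = []
    t_idx = 0  # index into template_coords
    for tgt, tpl in zip(aligned_target, aligned_template):
        coord = None
        if tpl != "-":
            coord = template_coords[t_idx]
            t_idx += 1
        if tgt != "-":
            # Target residue present: inherit the coordinate (or None for an insertion).
            model.append((tgt, coord))
        # If tgt is a gap, the template residue is deleted in the target and skipped.
    return model


# Hypothetical alignment and made-up C-alpha coordinates (in angstroms).
target = "MK-LV"
template = "MKAL-"
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

for residue, xyz in copy_template_coordinates(target, template, coords):
    print(residue, xyz if xyz is not None else "needs loop modeling")
```

In a real pipeline the copied backbone is only a starting point: the unplaced residues, the chain break left by the deleted template residue, and all side chains still have to be rebuilt and refined.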

Training Data & Databases: These methods fundamentally rely on the PDB as the source of template structures. Template libraries are filtered to reduce redundancy (e.g., by keeping only representatives that share less than about 30% pairwise sequence identity, so that the library stays compact while still covering diverse folds). Sequence databases (UniProt, GenBank) provide homologs to build profiles or HMMs for sensitive alignment. I-TASSER and similar pipelines use fragments from the PDB for regions without templates. No machine learning “training” occurs per se; rather, the knowledge comes from empirical distributions of structural features (backbone conformations, side-chain rotamers, etc.) observed in known proteins.
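Redundancy filtering of a template library can be sketched as a greedy pass that keeps a sequence only if it is sufficiently dissimilar to everything already kept. The sketch assumes the sequences are already aligned to a common length (so identity can be counted column by column) and uses the 30% cutoff from the text as a default; production tools cluster unaligned sequences with much faster heuristics. The sequences and names below are made up.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_redundant(aligned, max_identity=30.0):
    """Greedy filter: keep an entry only if it is below max_identity to all kept entries.

    aligned is a list of (name, aligned_sequence); the result depends on input order.
    """
    kept = []
    for name, seq in aligned:
        if all(percent_identity(seq, kept_seq) < max_identity for _, kept_seq in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]


# Hypothetical library of four pre-aligned toy sequences.
library = [
    ("t1", "AAAAAAAAAA"),
    ("t2", "AAAAAAAAAC"),  # 90% identical to t1
    ("t3", "CCCCCCCCCC"),  # 0% identical to t1
    ("t4", "CCCCCCCCAA"),  # 80% identical to t3
]

print(filter_redundant(library))
```

Because the pass is greedy, sorting the input (for example by structure resolution) determines which member of each similar group survives as the representative.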

Inspiration and Rationale: The rationale is Anfinsen’s observation that sequence determines structure, coupled with the fact that evolution reuses successful folds. If a new protein is similar in sequence to a known one, it likely shares the fold. Even with low sequence identity (<25%), threading can recognize structural motifs by matching sequence profiles to structural environments. These methods were inspired by biology: proteins in the same family (e.g. kinases, immunoglobulins) generally preserve structure. By leveraging that, one bypasses the need for ab initio folding.

Comparison with Similar Models: Homology modeling’s accuracy depends on having a suitable template. When a close homolog (>50% identity) exists, models can be quite accurate (1–2 Å RMSD). However, for novel folds (no template), these methods fail. This gap motivated the development of ab initio methods like Rosetta (below) and, later, deep learning. Threading could sometimes detect distant homologs that a simple BLAST search misses, but it struggled with very remote relationships. Compared to physics-based folding, comparative modeling is fast and was more reliable in the early 2000s, but it cannot predict truly new folds. I-TASSER distinguished itself by incorporating de novo folding for unaligned regions and iterative template fragment assembly, outperforming pure homology methods in many cases. Still, as of the mid-2010s, homology modeling was the standard approach whenever a template was available, while ab initio prediction was a last resort for novel sequences.
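Accuracy figures such as “1–2 Å RMSD” refer to the root-mean-square deviation between equivalent atoms of a model and a reference structure. The sketch below computes RMSD for two coordinate lists that are assumed to be already optimally superimposed; a full evaluation would first superimpose the structures (for example with the Kabsch algorithm), which is omitted here. The coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples.

    Assumes the structures are already superimposed and that the i-th
    entries of the two lists are equivalent atoms.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical model and reference: every atom is displaced by 1 angstrom along z.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]

print(rmsd(model, reference))
```

Since deviations are squared before averaging, a few badly placed loops can dominate the value even when the core of the model is close to the reference.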

Application Areas: Homology models were invaluable in drug discovery and biochemistry for providing structural models of targets when experimental structures were unavailable. For example, homology models of GPCRs guided ligand docking studies before many GPCR crystal structures were solved. In synthetic biology, comparative modeling helped design proteins by modifying known scaffolds – one could model the mutant to ensure the fold is retained. Threading was used to predict folds for genomic ORFs, contributing to structural genomics efforts. Overall, these methods enabled researchers to study structure-function relationships in silico by capitalizing on evolutionary knowledge. They set a baseline that later methods (like AlphaFold) needed to surpass.