Weekly assignment 3

Homework

In-person students: please answer at least 2-3 questions in Part A and 1 question in Part C. Questions marked with * are for extra credit.

Part A: Using the PDB

<aside> ⚠️ Mandatory for MIT/Harvard Students and Committed Listeners. Due at the start of class March 5.

</aside>

Answer any of the following questions by Shuguang Zhang:
- How many amino acid molecules do you consume in a 500-gram piece of meat? (On average, an amino acid is ~100 Daltons.)
  - Assuming meat is 25% protein, 500 g of meat contains 125 g of protein. 1 Dalton equals 1.6605e-24 g and an average amino acid is about 100 Daltons, so one amino acid weighs about 1.6605e-22 g. Dividing 125 g by 1.6605e-22 g gives approximately 7.53e23 amino acids in 500 g of meat.
- Why do humans eat beef but not become cows, or eat fish but not become fish?
  - Because the fish and cow proteins we eat are broken down into their amino acid monomers during digestion. What enters our bloodstream are these monomers, not the intact proteins with their functions (the same reason eating collagen doesn't keep anybody young).
- Why are there only 20 natural amino acids?
  - There are only 4 bases (A, T, C, G), so there are only 4^3 = 64 possible codons of three bases. With 64 codons for 20 amino acids (plus stop signals), most amino acids are encoded by more than one codon (redundancy), which makes a DNA mutation less likely to change the protein sequence.
- Can you make other non-natural amino acids? Design some new amino acids.
- Where did amino acids come from before there were enzymes to make them, and before life started?
  - During the first steps of life on Earth, organic compounds such as amino acids formed in the primordial soup, according to Oparin's theory. Early organisms could obtain them from this soup, just as we get ours from food, but organisms that could synthesize them from scratch were fitter than those that couldn't.
- If you make an alpha-helix using D-amino acids, what handedness (right or left) would you expect? Can you discover additional helices in proteins?
  - The alpha helix would be left-handed. For L-amino acids, the backbone angles required for a left-handed helix are disfavored because of steric clashes between the R group and the carbonyl oxygen. For D-amino acids, the situation is reversed.
- Why are most molecular helices right-handed?
- Why do beta-sheets tend to aggregate?
  - What is the driving force for b-sheet aggregation?
- Why do many amyloid diseases form b-sheets?
  - Can you use amyloid b-sheets as materials?
  - B-sheets are strongly hydrophobic; the R groups are positioned above and below the plane of the sheet. These residues are usually small and hydrophobic, so if misfolding or mutations in the sequence expose these hydrophobic regions, they can interact with each other, forming insoluble aggregates.
- Design a b-sheet motif that forms a well-ordered structure.
  - Val-Ser-Ile-Thr-Leu-Tyr-Gln-Asn-Leu-Phe-Asp-Cys-Arg-Leu-Met-Arg-Glu-Asp-His-Ile
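
The amino acid estimate in the first answer of Part A can be checked with a short calculation. This is a minimal sketch: the 25% protein fraction and the ~100 Da average amino acid mass are the assumptions stated in that answer, and the constant and function names are illustrative.

```python
# Estimate the number of amino acid molecules in a piece of meat.
DALTON_IN_GRAMS = 1.6605e-24   # mass of 1 Dalton in grams
AVG_AMINO_ACID_DA = 100.0      # assumed average amino acid mass (Da)
PROTEIN_FRACTION = 0.25        # assumed protein content of meat


def amino_acids_in_meat(meat_grams: float) -> float:
    """Return the approximate number of amino acid molecules in the meat."""
    protein_grams = meat_grams * PROTEIN_FRACTION
    grams_per_amino_acid = AVG_AMINO_ACID_DA * DALTON_IN_GRAMS
    return protein_grams / grams_per_amino_acid


count = amino_acids_in_meat(500.0)
print(f"{count:.3e} amino acids")  # roughly 7.5e23
```

As a sanity check, 125 g of protein at about 100 g/mol is 1.25 mol, that is, 1.25 times Avogadro's number.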

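The codon-redundancy argument in the answer about the 20 natural amino acids can be made concrete by enumerating every three-base codon and grouping codons by the amino acid they encode. This is a minimal sketch using the standard genetic code (NCBI translation table 1) written as a compact string; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # base order that matches the compact code string below
# Standard genetic code, one letter per codon; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# All 4^3 codons, in the same order as the string above (TTT, TTC, TTA, ...).
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

# How many codons encode each amino acid (or stop signal)?
codons_per_aa = Counter(codon_table.values())
n_amino_acids = len(set(codon_table.values()) - {"*"})

print(len(codons))       # 64
print(n_amino_acids)     # 20
print(codons_per_aa["L"], codons_per_aa["M"], codons_per_aa["*"])  # 6 1 3
```

Leucine has six codons while methionine has only one, so the redundancy is uneven: most, but not all, amino acids are protected against single-base changes.
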
Part B: Protein Analysis & Visualization

<aside> ⚠️ Mandatory for MIT/Harvard Students and Committed Listeners. Due at the start of class March 5.

</aside>

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins.

Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions.
- Briefly describe the protein you selected and why you selected it. Identify the amino acid sequence of your protein.
  - I chose the Lolium perenne antifreeze protein, which has a beta-roll structure and is the only antifreeze protein (AFP) from the grass family with a structure deposited in the PDB.
  - How long is it? What is the most frequent amino acid? You can use this notebook to count the most frequent amino acid - https://colab.research.google.com/drive/1vlAU_Y84lb04e4Nnaf1axU8nQA6_QBP1?usp=sharing
    - It is 133 amino acids long, and the most common one is asparagine.
  - How many protein sequence homologs are there for your protein?
    
    Hint: Use the pBLAST tool to search for homologs and ClustalOmega to align and visualize them. Tutorial Here
  - Does your protein belong to any protein family?
- Identify the structure page of your protein in RCSB
  - When was the structure solved? Is it a good quality structure? A good quality structure is one with good resolution; the smaller the value, the better (e.g., Resolution: 2.70 Å).
    - Method: X-RAY DIFFRACTION
    - Resolution: 1.40 Å
  - Are there any other molecules in the solved structure apart from protein?
    - Yes: EDO (1,2-ethanediol), EOH (ethanol), and water.
  - Does your protein belong to any structure classification family?
- Open the structure of your protein in any 3D molecule visualization software:
- Examine and analyze your protein visually: visualize the protein as "cartoon", "ribbon" and "ball and stick".
- Color the protein by secondary structure. Does it have more helices or sheets?
- Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
- Visualize the surface of the protein. Does it have any "holes" (aka binding pockets)?
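
Counting the sequence length and the most frequent amino acid, as asked in the sequence question above, takes only a few lines. This is a minimal sketch and not the code from the linked notebook; the function name and the toy sequence are hypothetical placeholders, so substitute the real sequence from the protein's FASTA record.

```python
from collections import Counter


def most_frequent_residue(sequence: str) -> tuple[str, int]:
    """Return the most common one-letter residue and its count."""
    cleaned = "".join(sequence.split()).upper()  # drop whitespace and newlines
    residue, count = Counter(cleaned).most_common(1)[0]
    return residue, count


# Hypothetical toy sequence, NOT the real antifreeze protein sequence.
toy_sequence = "MNNSTNNVTGNNAS"
print(len(toy_sequence))                    # sequence length
print(most_frequent_residue(toy_sequence))  # (residue, count)
```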

Part C: Using ML-based protein tools

<aside> ⚠️ Mandatory for MIT/Harvard Students. Due at the start of class March 5.

</aside>
1. Using a protein structure prediction model of your choice
a) Pick a protein in the PDB and fold its sequence using any of the protein structure prediction models. Does the protein fold into the same shape?

b) Do you notice any difference between the predicted structure and the PDB structure? UPDATED 03/03 - You can see the differences between the predicted structure and the actual structure using the tutorial here

c) Are there any low-confidence regions in your protein (by confidence we mean the pLDDT score)? If so, why is the confidence score of the structure prediction model low in that region of the protein? UPDATED 03/03 - Check out the tutorial if you need help seeing this in PyMOL. If you can't see both the predicted structure and another file that you add, you can click on reset in the top right box in PyMOL.

d) If there are low-confidence regions, do you think they would affect your ability to engineer the protein for a specific function? What can you do in your design pipeline to account for them? [Extra Credit]
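
Low-confidence regions can also be listed outside PyMOL by reading the per-residue pLDDT directly from the predicted structure file. This is a minimal sketch, assuming the prediction tool writes pLDDT on a 0-100 scale into the B-factor column of the PDB file (as AlphaFold does; some tools use a 0-1 scale) and that a cutoff of 70 marks low confidence. The function name, the cutoff, and the file name in the comment are illustrative.

```python
def low_confidence_residues(
    pdb_text: str, cutoff: float = 70.0
) -> list[tuple[str, int, float]]:
    """Return (chain, residue number, pLDDT) for C-alpha atoms below the cutoff."""
    flagged = []
    for line in pdb_text.splitlines():
        # PDB records are fixed-width: atom name in columns 13-16, chain ID in
        # column 22, residue number in columns 23-26, B-factor in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            plddt = float(line[60:66])  # assumed to hold pLDDT, not a B-factor
            if plddt < cutoff:
                flagged.append((line[21], int(line[22:26]), plddt))
    return flagged


# Usage with a hypothetical file name:
# with open("prediction.pdb") as handle:
#     print(low_confidence_residues(handle.read()))
```

Reading one value per residue from the C-alpha atom keeps the output short; runs of consecutive flagged residues often correspond to flexible loops or termini.
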
1. Using a sequence recovery model (MPNN)
a) Generate sequence proposals for PDBID: 1BCF chain A.

b) And fold 1-2 sequences using a protein structure model from Question 1.

c) Is there a way to enable the newly designed sequences to preserve their binding to di-iron? Refer to: https://www.bakerlab.org/wp-content/uploads/2022/11/Diffusion_preprint_12012022.pdf?ref=assemblyai.com UPDATED 03/03 - To answer this question, all you have to do is keep some parts of the protein sequence of 1BCF constant. You can find which regions in the document linked above; just search for 1BCF.
1. Using a Generative Model

a) You are a scientist trying to design a new drug binder for COVID-19. Generate a protein backbone that can bind to SARS-CoV-2 spike protein. Use PDB ID: 6M0J or any other target for identifying a binding pocket

b) Generate sequences for your newly sampled backbone and fold 1 or 2 of them. Visualize them using your favorite protein visualization tool from Part B.

c) How can you rank and select the new protein sequences to test in the lab ?

d) How can you experimentally verify that your newly designed binder binds to the target? [e.g., Yeast Surface Display, Degradation Assays, etc.]

e) If you design a binder that strongly binds to SARS-CoV-2, what's the next step in your design pipeline? What are some possible issues in its application as a drug in humans? [Extra Credit]

f) Here, using RFdiffusion, we designed a mini-protein binder. However, many therapeutic protein binders are antibodies. What are some advantages of antibody binders? [Extra Credit]
1. Engineering thermo-stability of enzymes
  
  a) Pick an enzyme you are interested in [eg: PETase]
  
  b) Summarize the function of this protein
  
  c) Can you engineer a version of your protein that functions at high temperatures ?
  
  d) How can you utilize machine learning tools for designing this protein ? [Extra Credit]
  
  e) How would you test the thermo-stability of your newly designed enzyme ? [Extra Credit]
