7RGR
Deposition Date 2021-07-15
Release Date 2021-07-28
Last Version Date 2024-04-03
Entry Detail
PDB ID:
7RGR
Keywords:
Title:
Lysozyme 056 from Deep neural language modeling
Biological Source:
synthetic construct
Source Organism:
synthetic construct
Method Details:
Experimental Method:
X-RAY DIFFRACTION
Resolution:
2.48 Å
R-Value Free:
0.29
R-Value Work:
0.25
R-Value Observed:
0.26
Space Group:
P 21 21 21
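The R-work and R-free values listed above both follow the standard crystallographic R-factor definition, R = Σ|F_obs − F_calc| / Σ|F_obs|, computed over different reflection sets. A minimal sketch, with illustrative structure-factor amplitudes that are not taken from 7RGR:

```python
def r_factor(f_obs, f_calc):
    """Standard crystallographic R-factor over matched reflections:
    R = sum(|F_obs - F_calc|) / sum(|F_obs|)."""
    num = sum(abs(o - c) for o, c in zip(f_obs, f_calc))
    den = sum(abs(o) for o in f_obs)
    return num / den

# Toy amplitudes for illustration only (not data from this entry).
f_obs = [120.0, 85.5, 60.2, 44.1]
f_calc = [110.0, 90.0, 55.0, 47.0]
print(round(r_factor(f_obs, f_calc), 3))  # → 0.073
```

In practice, R-work is computed on the reflections used in refinement and R-free on a held-out subset, which is why R-free (0.29 here) is typically slightly higher than R-work (0.25).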
Macromolecular Entities
Polymer Type: polypeptide(L)
Molecule: Artificial protein L056
Chain IDs: A (auth: B), B (auth: A)
Chain Length: 168
Number of Molecules: 2
Biological Source: synthetic construct
Primary Citation
Large language models generate functional protein sequences across diverse families.
Nat. Biotechnol. ? ? ? (2023)
PMID: 36702895 DOI: 10.1038/s41587-022-01618-2

Abstract

Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
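The generation scheme the abstract describes, prepending a control tag that specifies protein properties and then sampling amino-acid tokens left to right, can be sketched as follows. This is a toy illustration: the tag name, the `next_token_probs` placeholder, and its uniform probabilities are invented for the example, whereas the real ProGen model is a large Transformer trained on 280 million sequences.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def next_token_probs(context):
    # Placeholder for the language model's conditional distribution
    # p(next residue | control tag, residues so far); uniform here.
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def generate(control_tag, length, seed=0):
    """Prepend a control tag, then sample residues autoregressively."""
    rng = random.Random(seed)
    tokens = [control_tag]  # e.g. a hypothetical family tag "<lysozyme>"
    for _ in range(length):
        probs = next_token_probs(tokens)
        residues, weights = zip(*probs.items())
        tokens.append(rng.choices(residues, weights=weights)[0])
    return "".join(tokens[1:])  # sequence without the leading tag

print(generate("<lysozyme>", 20))
```

Fine-tuning on a curated family (as done for the five lysozyme families, yielding entries such as this one) would sharpen `next_token_probs` toward that family's sequence statistics while the control tag steers generation.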
