# EMFP: ESM-2 Micropeptide Predictor for Canonical Functional Proteins
A fine-tuned ESM-2 (650M) model for predicting peptides that encode canonical functional proteins.

EMFP is designed to identify peptide sequences that may encode canonical functional proteins, as defined by molecular function annotations in the UniProt database. While many peptides can be bioactive, EMFP specifically focuses on distinguishing peptides with protein-like molecular functions from those with other or unknown mechanisms of action.
## Performance
| Task | EMFP | Random Forest | ESM+MLP | ProtBERT+MLP |
|---|---|---|---|---|
| Authenticity | 0.967 | 0.718 | 0.892 | 0.856 |
| Canonical Protein Function | 0.932 | 0.505 | 0.827 | 0.791 |
*Note:* "Canonical Protein Function" refers to peptides encoding proteins with molecular function annotations (e.g., enzyme activity, binding activity) as defined in UniProt.
## Usage
```python
import torch
import esm

# Load the pretrained ESM-2 650M backbone and its tokenizer alphabet
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
checkpoint = torch.load("best_model.pt", map_location="cpu")

class ESMClassifier(torch.nn.Module):
    def __init__(self, esm_model, num_labels=2, hidden_dim=1280, dropout=0.1):
        super().__init__()
        self.esm = esm_model
        self.classifier = torch.nn.Sequential(
            torch.nn.Dropout(dropout),
            torch.nn.Linear(hidden_dim, hidden_dim // 2),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(hidden_dim // 2, num_labels),
        )

    def forward(self, tokens):
        # Use the final-layer (layer 33) representation of the BOS token
        results = self.esm(tokens, repr_layers=[33], return_contacts=False)
        return self.classifier(results["representations"][33][:, 0, :])

classifier = ESMClassifier(model)
classifier.load_state_dict(checkpoint["model_state_dict"])
classifier.eval()

# Predict
batch_converter = alphabet.get_batch_converter()
data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLG")]
labels, strs, tokens = batch_converter(data)
with torch.no_grad():
    logits = classifier(tokens)
    probs = torch.softmax(logits, dim=1)
print(f"Probability of encoding canonical functional protein: {probs[0, 1].item():.4f}")
```
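To turn the predicted probability into a yes/no call, a simple cutoff can be applied to the class-1 probability. A minimal standalone sketch (the 0.5 threshold and the helper names are illustrative, not part of the released code):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def call_functional(logits, threshold=0.5):
    """Return (probability, is_functional) from two-class logits."""
    p = softmax(logits)[1]  # index 1 = canonical functional protein
    return p, p >= threshold

# Hypothetical two-class logits as the classifier head might emit for one peptide
prob, hit = call_functional([0.3, 2.1])
print(f"{prob:.4f} functional={hit}")
```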
## Model Details
- Base: ESM-2 650M (`esm2_t33_650M_UR50D`)
- Training: 26,626 peptide sequences from UniProt with molecular function annotations
- Optimizer: AdamW, FP16 precision
- Size: 7.4 GB
## Download

```shell
huggingface-cli download huangruihua/EMFP best_model.pt --local-dir ./
```
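The checkpoint can also be fetched from Python. A sketch using the `huggingface_hub` client (the repo id is taken from the CLI command above; the helper name `fetch_checkpoint` is illustrative):

```python
from huggingface_hub import hf_hub_download

def fetch_checkpoint(local_dir: str = ".") -> str:
    """Download best_model.pt from the Hub and return its local path."""
    return hf_hub_download(
        repo_id="huangruihua/EMFP",
        filename="best_model.pt",
        local_dir=local_dir,
    )
```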
## GitHub
Full code: https://github.com/huangruihua/EMFP
## Citation

```bibtex
@software{emfp_2026,
  title={EMFP: ESM-2 Micropeptide Predictor for Canonical Functional Proteins},
  author={Huang, Rui-Hua},
  year={2026},
  url={https://github.com/huangruihua/EMFP}
}
```
## License
MIT - Rui-Hua Huang (@huangruihua)