EMFP: ESM-2 Micropeptide Predictor for Canonical Functional Proteins

EMFP is a fine-tuned ESM-2 (650M) model that identifies peptide sequences likely to encode canonical functional proteins, as defined by molecular function annotations in the UniProt database. While many peptides are bioactive, EMFP focuses specifically on distinguishing peptides with protein-like molecular functions from those with other or unknown mechanisms of action.

Performance

Task                        EMFP   Random Forest  ESM+MLP  ProtBERT+MLP
Authenticity                0.967  0.718          0.892    0.856
Canonical Protein Function  0.932  0.505          0.827    0.791

Note: "Canonical Protein Function" refers to peptides encoding proteins with molecular function annotations (enzyme activity, binding activity, etc.) as defined in UniProt.

Usage

import torch
import esm

# Load the pretrained ESM-2 650M backbone and the fine-tuned checkpoint
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
checkpoint = torch.load("best_model.pt", map_location="cpu")

class ESMClassifier(torch.nn.Module):
    def __init__(self, esm_model, num_labels=2, hidden_dim=1280, dropout=0.1):
        super().__init__()
        self.esm = esm_model
        self.classifier = torch.nn.Sequential(
            torch.nn.Dropout(dropout),
            torch.nn.Linear(hidden_dim, hidden_dim // 2),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(hidden_dim // 2, num_labels)
        )
    
    def forward(self, tokens):
        # Pool with the layer-33 representation of the BOS token
        # (position 0) as the sequence embedding
        results = self.esm(tokens, repr_layers=[33], return_contacts=False)
        return self.classifier(results["representations"][33][:, 0, :])

classifier = ESMClassifier(model)
classifier.load_state_dict(checkpoint['model_state_dict'])
classifier.eval()

# Predict on a single peptide sequence
batch_converter = alphabet.get_batch_converter()
data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLG")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    logits = classifier(tokens)
    probs = torch.softmax(logits, dim=1)
    print(f"Probability of encoding canonical functional protein: {probs[0, 1].item():.4f}")
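To turn probabilities into binary calls, the logits can be thresholded. A minimal, self-contained sketch (the example logits and the 0.5 cutoff are illustrative; column 1 is the positive class, as in the snippet above):

```python
import torch

# Example logits as the classifier would produce, shape (batch, 2);
# column 1 is the "canonical functional protein" class
logits = torch.tensor([[0.2, 2.3],
                       [1.5, -0.7]])
probs = torch.softmax(logits, dim=1)

# 0.5 is an illustrative cutoff; tune it for your precision/recall trade-off
preds = (probs[:, 1] >= 0.5).long()
print(preds.tolist())  # → [1, 0]
```

The first peptide's positive-class probability is ~0.89 and the second's is ~0.10, so only the first is called positive at this cutoff.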

Model Details

  • Base: ESM-2 650M (esm2_t33_650M_UR50D)
  • Training: 26,626 peptide sequences from UniProt with molecular function annotations
  • Optimizer: AdamW, FP16 precision
  • Size: 7.4 GB
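The optimizer settings above can be sketched as a minimal training step. A tiny linear head stands in for the full ESMClassifier so the loop runs anywhere; the learning rate, batch, and step count are illustrative, not EMFP's actual training configuration:

```python
import torch

# Stand-in for ESMClassifier: any module mapping 1280-d embeddings to 2 logits
model = torch.nn.Linear(1280, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr is illustrative
loss_fn = torch.nn.CrossEntropyLoss()

# FP16 mixed precision only applies on GPU; on CPU this degrades to plain FP32
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 1280)       # batch of sequence embeddings
y = torch.randint(0, 2, (8,))  # binary labels

for step in range(3):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda" if use_cuda else "cpu", enabled=use_cuda):
        loss = loss_fn(model(x), y)
    # GradScaler rescales the loss to avoid FP16 gradient underflow;
    # with enabled=False these calls are no-ops around a normal step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```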

Download

huggingface-cli download huangruihua/EMFP best_model.pt --local-dir ./

GitHub

Full code: https://github.com/huangruihua/EMFP

Citation

@software{emfp_2026,
  title={EMFP: ESM-2 Micropeptide Predictor for Canonical Functional Proteins},
  author={Huang, Rui-Hua},
  year={2026},
  url={https://github.com/huangruihua/EMFP}
}

License

MIT - Rui-Hua Huang (@huangruihua)
