
FinBERT-GICS Sector Classifier v3


Model Summary

finbert-gics-sector-classifier-v3 is the strongest performer in a family of models developed to categorize financial news headlines into their corresponding Global Industry Classification Standard (GICS) sectors. Built on the financial language understanding of ProsusAI/finbert, this variant introduces a richer input representation by prefixing each headline with Named Entity Recognition (NER) tags rather than embedding them inside the text.

By adding prefix tokens such as [ORG] (organizations), [LOC] (locations), [PER] (people), and [GPE] (geopolitical entities) before the headline, the model receives clear structural cues that improve its ability to interpret relationships, events, and entities that drive sector relevance. These prefixes summarize the detected entities ahead of the headline itself, which significantly enhances classification accuracy—especially for headlines involving multinational activity, regulatory shifts, corporate actions, and macroeconomic developments.

Among all variants, finbert-gics-sector-classifier-v3 consistently delivers the most context-aware and reliable predictions, making it well-suited for automated dataset labeling, financial research workflows, sentiment pipelines, and sector-aware market analysis applications.

Intended Use

This model is designed for classifying financial news headlines into GICS sectors to support downstream analytics, dataset preparation, and real-time financial NLP applications. It is especially useful when building time-series models, sector sentiment indicators, or large-scale automated labeling pipelines.

Primary Use Cases

  • Automatically assigning GICS sector labels to financial headlines
  • Preparing labeled datasets for sentiment or forecasting models
  • Powering dashboards or monitoring tools that group headlines by sector
  • Enabling sector-specific risk, sentiment, or trend analysis
  • Supporting academic and applied financial NLP research

Appropriate Users

  • Data scientists and ML engineers
  • Quantitative researchers and analysts
  • Developers building finance-focused NLP systems
  • Students and researchers working with financial text

Out-of-Scope Uses

The model is not intended for:

  • Predicting stock returns or price movements
  • Serving as trading or investment advice
  • Analyzing long-form documents without preprocessing
  • Use on non-financial or general news
  • High-stakes financial decisions without human review

Important Considerations

  • The model relies heavily on the presence of organization names to avoid labeling clickbait or generic text. As a result, headlines like "Stocks trending up" are correctly labeled as Unknown. However, this also means that headlines such as "Technology companies are going bankrupt"—which contain real sector signal—will also be labeled Unknown if they do not reference specific companies, tickers, or sector-linked entities.

  • Outputs are probabilistic and may reflect source data biases. Because the model was trained on real-world financial news, predictions may inherit distribution-based biases tied to coverage frequency, sector emphasis, or media attention.

  • Accuracy is lower on very long or multi-sentence inputs. The model is optimized for short-form financial headlines, and longer inputs may dilute entity signal or introduce noise.

  • Model performance depends heavily on consistent entity tagging. Variations in company names, tickers, abbreviations, or missing NER annotations can affect prediction reliability.

  • Some sectors may remain underrepresented due to real-world imbalance. Even with class weighting, rare sectors remain harder to predict because of limited training examples.

Training Data

This model was trained on a custom two-tier dataset (Gold and Silver) created from an original unlabeled corpus that contained:

  • headline
  • ticker (one per instance, from metadata)
  • publication date

Sector labels were assigned strictly by ticker, using Yahoo Finance screeners to map each ticker to its GICS sector. The Gold vs Silver tiers were used only for curriculum learning order. The model never sees or predicts these tier labels.
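For illustration, a comparable ticker-to-sector lookup can be done programmatically with the yfinance package. This is only a sketch of the mapping idea, not the screener-based tooling actually used to build the dataset:

import yfinance as yf

def sector_for_ticker(ticker: str) -> str:
    # Yahoo Finance reports a sector string such as "Technology" or "Energy";
    # fall back to "unknown" when no sector is available.
    info = yf.Ticker(ticker).info
    return (info.get("sector") or "unknown").lower()

print(sector_for_ticker("AAPL"))  # e.g. "technology"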

Gold Data

Gold-tier examples represent the strongest alignment between:

  1. The metadata ticker and
  2. The headline text itself

A headline was assigned to the Gold tier if it mentioned ANY ONE of the following:

  • The company name associated with the metadata ticker
  • The ticker symbol itself
  • The correct GICS sector name of that ticker

Gold-tier = Ticker exists AND the headline text contains a direct reference to the same company/ticker/sector.

Examples:

Metadata ticker: AAPL
Headline: "Apple shares rise after record earnings"

Metadata ticker: TSLA
Headline: "TSLA drops despite strong delivery numbers"

Metadata ticker: XOM (sector = Energy)
Headline: "Energy stocks gain as crude oil jumps"

If a headline matched any of the three, it was Gold-tier.

Silver Data

Silver-tier examples also had a valid metadata ticker, but the headline text did NOT mention:

  • the corresponding company name,
  • the ticker, nor
  • the correct sector name for that ticker.

Silver-tier = Ticker exists, but headline mentions an unrelated organization.

Example pattern:

Metadata ticker = NFLX (Netflix)
Headline mentions Twitch, Hulu, Meta, etc. — but not Netflix, NFLX, or Communication Services.

Examples:

Metadata ticker: NFLX
Headline: "Many streaming companies including Twitch experienced outages"

Metadata ticker: F (Ford)
Headline: "GM announces new EV lineup for 2025"

Metadata ticker: AMZN
Headline: "Walmart expands next-day delivery network"

If a headline referenced a different organization and not the one tied to the metadata ticker, it was Silver-tier.
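Putting the two rules together, the tier assignment can be sketched as below. Plain substring and word-boundary matching are assumptions here; the exact matching logic used to build the dataset is not specified in this card:

import re

def assign_tier(headline: str, company: str, ticker: str, sector: str) -> str:
    # Gold if the headline directly references the metadata ticker's
    # company name, symbol, or GICS sector; otherwise Silver.
    text = headline.lower()
    mentions_ticker = re.search(rf"\b{re.escape(ticker.lower())}\b", text)
    if company.lower() in text or mentions_ticker or sector.lower() in text:
        return "gold"
    return "silver"

print(assign_tier("Apple shares rise after record earnings", "Apple", "AAPL", "Technology"))  # gold
print(assign_tier("GM announces new EV lineup for 2025", "Ford", "F", "Consumer Cyclical"))   # silver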

Unknown Sector Data

Headlines assigned to the unknown sector class were arbitrarily placed into the Gold tier solely to avoid empty tier values in the dataset. This tier assignment does not reflect higher confidence—it's just a placeholder so every example had a tier label. During curriculum training, the model was exposed to unknown examples in both the Silver and Gold phases, ensuring that these headlines appeared throughout all stages of learning.

Curriculum Training Strategy

To maximize generalization:

  1. 3 epochs on silver-tier data
  2. 2 epochs on gold-tier data

The staged progression teaches the model to understand implicit context before specializing on strong explicit signals.

Training Procedure

The model was trained for a total of 8 epochs using curriculum learning and a frozen-layer warmup.

Curriculum Stages

  1. Classification Layer Warm-up with Frozen Layers (3 epochs)
    • Learning rate: 2e-5
  2. Silver-tier training (3 epochs)
    • Learning rate: 2e-5
  3. Gold-tier training (2 epochs)
    • Learning rate: 5e-6
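A compact sketch of how these three stages could be scripted with the Hugging Face Trainer. Here model, silver_dataset, and gold_dataset are assumed to be already prepared; the actual training script is not included in this card:

from transformers import Trainer, TrainingArguments

def run_stage(model, dataset, stage_name, epochs, lr):
    # One curriculum stage: fixed learning rate, FP16, batch size 128 (per this card)
    args = TrainingArguments(
        output_dir=f"out-{stage_name}",
        num_train_epochs=epochs,
        learning_rate=lr,
        per_device_train_batch_size=128,
        fp16=True,
    )
    Trainer(model=model, args=args, train_dataset=dataset).train()

# Stage 1: classification-head warmup (encoder frozen beforehand; the warmup data split is unspecified in the card)
run_stage(model, silver_dataset, "warmup", epochs=3, lr=2e-5)
# Stage 2: full fine-tuning on Silver-tier headlines
run_stage(model, silver_dataset, "silver", epochs=3, lr=2e-5)
# Stage 3: full fine-tuning on Gold-tier headlines at a lower learning rate
run_stage(model, gold_dataset, "gold", epochs=2, lr=5e-6)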

Hyperparameters

  • Batch size: 128
  • Optimizer: AdamW
  • Learning rates:
    • 2e-5 for epochs 1–6 (warmup and Silver stages)
    • 5e-6 for epochs 7–8 (Gold stage)
  • Precision: FP16 mixed precision
  • Weight decay: default
  • Total epochs: 8

Hardware

  • GPU: NVIDIA A100
  • FP16 enabled for memory and speed improvements

Classification Head Initialization and Warmup

Before curriculum training began, the original FinBERT classification layer was replaced with a newly initialized classifier:

import torch.nn as nn

# Replace the pretrained head with a fresh linear layer over the 12 sector classes
model.classifier = nn.Linear(768, len(label2id))
nn.init.xavier_uniform_(model.classifier.weight)
nn.init.zeros_(model.classifier.bias)

Classification Head Details

  • Input dimension: 768
  • Output dimension: len(label2id) = 12 classes
  • Weights: Xavier uniform initialization
  • Bias: Zero initialization

Warmup (Frozen Encoder)

Before full training began, all transformer encoder layers were frozen and only the newly initialized classification head was trained for 3 epochs. This warmup prevents the encoder weights from being disrupted by the large, uncalibrated gradients of an untrained head, and it allowed:

  • Stable gradient updates
  • Prevention of early catastrophic forgetting
  • Formation of initial decision boundaries

After warmup, all layers were unfrozen for full fine-tuning.
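The freeze/unfreeze pattern can be expressed in a few lines; a minimal sketch, assuming the standard model.bert attribute of a BERT sequence-classification model:

# Freeze the encoder so only the new classification head receives updates
for param in model.bert.parameters():
    param.requires_grad = False

# ... run the 3-epoch classification-head warmup ...

# Unfreeze everything for the Silver and Gold fine-tuning stages
for param in model.bert.parameters():
    param.requires_grad = True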


Input Format

This model accepts financial news headlines formatted with NER-based entity tags that are added as prefixes before the headline, rather than being inserted inside the text. These prefix tags provide explicit structural cues that help the model identify organizations, people, locations, and geopolitical entities — all crucial for sector prediction.

The model was trained primarily on prefix-tagged inputs. While it can accept raw headlines, unformatted text often results in the model returning the class unknown, because the entity cues expected by the model are missing.


NER Tagging and Input Transformation

Before sending a headline to the model, a Named Entity Recognition (NER) system should be used to detect key entities and produce the prefix tags that the model expects.

NER model used:
Jean-Baptiste/roberta-large-ner-english

The NER model identifies entity spans in the headline, but instead of inserting tags around those spans, the detected entities are collected and converted into prefix tokens drawn from the special-token set below.

Supported Special Tokens

  • [ORG] — Organizations / companies
  • [PER] — People / individuals
  • [LOC] — Physical locations
  • [GPE] — Geopolitical entities (countries, states, cities)
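If you fine-tune further on this format, these tags can optionally be registered as tokenizer special tokens so they are not split into subwords. A sketch under that assumption (this card does not state whether the original training did so):

tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[ORG]", "[PER]", "[LOC]", "[GPE]"]}
)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the new tokens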

Example: Raw → Tagged

Raw headline:

Apple shares rise after strong iPhone demand in China.

Tagged headline:

[ORG] Apple [LOC] China [SEP] Apple shares rise after strong iPhone demand in China.

Since the model was trained on text formatted with this NER tagging, running it on an unformatted headline will typically produce the output class unknown.

Training and Validation Metrics

| Epoch | Curriculum | Train Loss | Train Accuracy | Train Precision | Train Recall | Train F1 | Val Loss | Val Accuracy | Val Precision | Val Recall | Val F1 |
|-------|------------|------------|----------------|-----------------|--------------|----------|----------|--------------|---------------|------------|--------|
| 1 | Silver | 740.2343 | 0.8680 | 0.8721 | 0.8680 | 0.8689 | 45.3172 | 0.9406 | 0.9427 | 0.9406 | 0.9410 |
| 2 | Silver | 298.8771 | 0.9472 | 0.9488 | 0.9472 | 0.9476 | 41.1392 | 0.9459 | 0.9470 | 0.9459 | 0.9462 |
| 3 | Silver | 235.3753 | 0.9556 | 0.9568 | 0.9556 | 0.9559 | 40.7655 | 0.9454 | 0.9486 | 0.9454 | 0.9463 |
| 4 | Gold | 368.6536 | 0.8959 | 0.8987 | 0.8959 | 0.8964 | 21.5810 | 0.9496 | 0.9511 | 0.9496 | 0.9498 |
| 5 | Gold | 141.3825 | 0.9589 | 0.9594 | 0.9589 | 0.9589 | 14.3957 | 0.9674 | 0.9680 | 0.9674 | 0.9674 |

Testing Metrics

Note: Curriculum indicates which portion of the dataset the model was trained on at the time — Silver = context-only headlines, Gold = explicit company/ticker/sector headlines.

| Input Strategy | Curriculum | Accuracy | Precision | Recall | F1 Score |
|----------------|------------|----------|-----------|--------|----------|
| 3 | Silver | 0.8300 | 0.8472 | 0.8300 | 0.8341 |
| 3 | Gold | 0.9477 | 0.9485 | 0.9477 | 0.9479 |

Class Mappings

{
  "0": "basic materials",
  "1": "communication services",
  "2": "consumer cyclical",
  "3": "consumer defensive",
  "4": "energy",
  "5": "financial services",
  "6": "healthcare",
  "7": "industrials",
  "8": "real estate",
  "9": "technology",
  "10": "unknown",
  "11": "utilities"
}

Use the following code to format headlines before passing them through the model:

from transformers import pipeline

# Load NER pipeline
ner_pipeline = pipeline(
    "ner",
    model="Jean-Baptiste/roberta-large-ner-english",
    aggregation_strategy="simple"
)

# Inference Formatter
def format_headline(headline):
    # Run NER
    ents = ner_pipeline(headline)

    # Buckets for entities
    entity_buckets = {
        "ORG": [],
        "LOC": [],
        "PER": [],
        "GPE": []
    }

    # Fill buckets
    for ent in ents:
        tag = ent["entity_group"]
        text = ent["word"]

        if tag in entity_buckets:
            entity_buckets[tag].append(text)

    # Build prefix with tags 
    prefix = ""

    for tag, items in entity_buckets.items():
        if items:
            prefix += f"[{tag}] " + " | ".join(items) + " "

    # If prefix exists, add [SEP]
    if prefix:
        prefix = prefix.strip() + " [SEP] "

    # Final formatted input
    final_text = prefix + headline
    return final_text


# Example
headline = "Apple expands operations in China to boost iPhone production."
print(format_headline(headline))

Expected Output

[ORG] Apple [LOC] China [SEP] Apple expands operations in China to boost iPhone production.
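From there, the formatted string can be passed straight to the classifier. A minimal sketch using the standard text-classification pipeline (the repository is gated, so its access conditions must be accepted first):

from transformers import pipeline

# Load the sector classifier
classifier = pipeline(
    "text-classification",
    model="alemmrr/finbert-gics-sector-classifier-v3",
)

formatted = format_headline("Apple expands operations in China to boost iPhone production.")
print(classifier(formatted))
# e.g. [{'label': 'technology', 'score': ...}] (illustrative output)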

Live Sector Classification Demo

Try the interactive demo here:

https://huggingface.co/spaces/alemmrr/filbert-gics-sector-classifier-ui

This Space version includes:

  • Automatic formatting
  • All sector probability scores
  • Fully interactive UI

License

This model is released under the Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0) license.

You are free to use, share, and modify the model for non-commercial purposes only, as long as you credit the original author. Commercial use of this model or any derivative work is not permitted.

Full license text: https://creativecommons.org/licenses/by-nc/4.0/

Citation

If you use this model in your research, projects, or derivatives, please cite:

@misc{finbert_gics_v3,
  title        = {FinBERT-GICS Sector Classifier v3},
  author       = {Morales, Alejandro},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/alemmrr/finbert-gics-sector-classifier-v3}},
  note         = {NER-enhanced FinBERT model for GICS sector classification},
}