FinBERT-GICS Sector Classifier v3
Model Summary
finbert-gics-sector-classifier-v3 is the strongest performer in a family of GICS sector classification models developed to categorize financial news headlines into their corresponding Global Industry Classification Standard (GICS) sectors. While based on the financial language understanding of ProsusAI/finbert, this variant introduces a richer input representation by prefixing each headline with Named Entity Recognition (NER) tags, rather than embedding them inside the text.
By adding prefix tokens such as [ORG] (organizations), [LOC] (locations), [PER] (people), and [GPE] (geopolitical entities) before the headline, the model receives clear structural cues that improve its ability to interpret relationships, events, and entities that drive sector relevance. These prefixes summarize the detected entities ahead of the headline itself, which significantly enhances classification accuracy—especially for headlines involving multinational activity, regulatory shifts, corporate actions, and macroeconomic developments.
Among all variants, finbert-gics-sector-classifier-v3 consistently delivers the most context-aware and reliable predictions, making it well-suited for automated dataset labeling, financial research workflows, sentiment pipelines, and sector-aware market analysis applications.
Intended Use
This model is designed for classifying financial news headlines into GICS sectors to support downstream analytics, dataset preparation, and real-time financial NLP applications. It is especially useful when building time-series models, sector sentiment indicators, or large-scale automated labeling pipelines.
Primary Use Cases
- Automatically assigning GICS sector labels to financial headlines
- Preparing labeled datasets for sentiment or forecasting models
- Powering dashboards or monitoring tools that group headlines by sector
- Enabling sector-specific risk, sentiment, or trend analysis
- Supporting academic and applied financial NLP research
Appropriate Users
- Data scientists and ML engineers
- Quantitative researchers and analysts
- Developers building finance-focused NLP systems
- Students and researchers working with financial text
Out-of-Scope Uses
The model is not intended for:
- Predicting stock returns or price movements
- Serving as trading or investment advice
- Analyzing long-form documents without preprocessing
- Use on non-financial or general news
- High-stakes financial decisions without human review
Important Considerations
The model relies heavily on the presence of organization names to avoid labeling clickbait or generic text. As a result, headlines like "Stocks trending up" are correctly labeled as Unknown. However, this also means that headlines such as "Technology companies are going bankrupt"—which contain real sector signal—will also be labeled Unknown if they do not reference specific companies, tickers, or sector-linked entities.
Outputs are probabilistic and may reflect biases in the source data. Because the model was trained on real-world financial news, predictions may inherit biases related to coverage frequency, sector emphasis, or media attention, a form of distribution-based uncertainty.
Accuracy is lower on very long or multi-sentence inputs. The model is optimized for short-form financial headlines, and longer inputs may dilute entity signal or introduce noise.
Model performance depends heavily on consistent entity tagging. Variations in company names, tickers, abbreviations, or missing NER annotations can affect prediction reliability.
Some sectors may remain underrepresented due to real-world imbalance. Even with class weighting, rare sectors remain harder to predict because of limited training examples.
Training Data
This model was trained on a custom two-tier dataset (Gold and Silver) created from an original unlabeled corpus that contained:
- headline
- ticker (one per instance, from metadata)
- publication date
Sector labels were assigned strictly by ticker, using Yahoo Finance screeners to map each ticker to its GICS sector. The Gold vs Silver tiers were used only for curriculum learning order. The model never sees or predicts these tier labels.
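The labeling rule amounts to a straight ticker-to-sector lookup. The sketch below illustrates it with a hypothetical `SECTOR_BY_TICKER` table standing in for the Yahoo Finance screener output used during dataset construction:

```python
# Hypothetical excerpt of the ticker -> GICS sector map; the real mapping,
# built from Yahoo Finance screeners, covered every ticker in the corpus.
SECTOR_BY_TICKER = {
    "AAPL": "technology",
    "XOM": "energy",
    "NFLX": "communication services",
}

def label_headline(ticker: str) -> str:
    """Assign a sector label strictly from the metadata ticker."""
    return SECTOR_BY_TICKER.get(ticker, "unknown")

print(label_headline("XOM"))   # energy
print(label_headline("ZZZZ"))  # unknown
```

Note that the headline text plays no role in labeling; it only determines the Gold/Silver tier described next.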
Gold Data
Gold-tier examples represent the strongest alignment between:
- The metadata ticker and
- The headline text itself
A headline was assigned to Gold-tier if the headline mentioned ANY ONE of the following:
- The company name associated with the metadata ticker
- The ticker symbol itself
- The correct GICS sector name of that ticker
Gold-tier = Ticker exists AND the headline text contains a direct reference to the same company/ticker/sector.
Examples:
Metadata ticker: AAPL
Headline: "Apple shares rise after record earnings"
Metadata ticker: TSLA
Headline: "TSLA drops despite strong delivery numbers"
Metadata ticker: XOM (sector = Energy)
Headline: "Energy stocks gain as crude oil jumps"
If a headline matched any of the three, it was Gold-tier.
Silver Data
Silver-tier examples also had a valid metadata ticker, but the headline text did NOT mention:
- the corresponding company name,
- the ticker, nor
- the correct sector name for that ticker.
Silver-tier = Ticker exists, but headline mentions an unrelated organization.
Example pattern:
Metadata ticker = NFLX (Netflix)
Headline mentions Twitch, Hulu, Meta, etc. — but not Netflix, NFLX, or Communication Services.
Examples:
Metadata ticker: NFLX
Headline: "Many streaming companies including Twitch experienced outages"
Metadata ticker: F (Ford)
Headline: "GM announces new EV lineup for 2025"
Metadata ticker: AMZN
Headline: "Walmart expands next-day delivery network"
If a headline referenced a different organization and not the one tied to the metadata ticker, it was Silver-tier.
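The Gold/Silver split above can be sketched as a word-boundary string match; `company` and `sector` here are hypothetical metadata fields standing in for the real ticker lookup:

```python
import re

def mentions(text: str, term: str) -> bool:
    """True if `term` appears in `text` as a whole word (case-insensitive)."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None

def assign_tier(headline: str, ticker: str, company: str, sector: str) -> str:
    """Gold if the headline directly references the metadata ticker's
    company name, symbol, or GICS sector; otherwise Silver."""
    if any(mentions(headline, term) for term in (company, ticker, sector)):
        return "gold"
    return "silver"

assign_tier("Apple shares rise after record earnings", "AAPL", "Apple", "technology")  # gold
assign_tier("GM announces new EV lineup for 2025", "F", "Ford", "consumer cyclical")   # silver
```

The word-boundary match matters for short tickers such as F, which would otherwise match inside ordinary words like "for".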
Unknown Sector Data
Headlines assigned to the unknown sector class were arbitrarily placed into the Gold tier solely to avoid empty tier values in the dataset. This tier assignment does not reflect higher confidence—it's just a placeholder so every example had a tier label. During curriculum training, the model was exposed to unknown examples in both the Silver and Gold phases, ensuring that these headlines appeared throughout all stages of learning.
Curriculum Training Strategy
To maximize generalization:
- 3 epochs on silver-tier data
- 2 epochs on gold-tier data
The staged progression teaches the model to understand implicit context before specializing on strong explicit signals.
Training Procedure
The model was trained for a total of 8 epochs using curriculum learning and a frozen-layer warmup.
Curriculum Stages
- Classification Layer Warm-up with Frozen Layers (3 epochs)
- Learning Rate: 2e-5
- Silver-tier training (3 epochs)
- Learning rate: 2e-5
- Gold-tier training (2 epochs)
- Learning rate: 5e-6
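The three stages above can be sketched as a simple schedule; `train_one_epoch` is a hypothetical placeholder for the actual training step, and the split names are illustrative:

```python
# Hypothetical sketch of the curriculum schedule described above.
def run_curriculum(train_one_epoch):
    schedule = [
        ("warmup-frozen", 3, 2e-5),  # classifier head only, encoder frozen
        ("silver",        3, 2e-5),  # context-only headlines
        ("gold",          2, 5e-6),  # explicit company/ticker/sector headlines
    ]
    log = []
    for stage, epochs, lr in schedule:
        for _ in range(epochs):
            train_one_epoch(stage, lr)
            log.append((stage, lr))
    return log

log = run_curriculum(lambda stage, lr: None)
len(log)  # 8 total epochs
```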
Hyperparameters
- Batch size: 128
- Optimizer: AdamW
- Learning rates:
- 2e-5 for epochs 1–6 (warmup and Silver stages)
- 5e-6 for epochs 7–8 (Gold stage)
- Precision: FP16 mixed precision
- Weight decay: default
- Total epochs: 8
Hardware
- GPU: NVIDIA A100
- FP16 enabled for memory and speed improvements
Classification Head Initialization and Warmup
Before curriculum training began, the original FinBERT classification layer was replaced with a newly initialized classifier:

```python
import torch.nn as nn

model.classifier = nn.Linear(768, len(label2id))
nn.init.xavier_uniform_(model.classifier.weight)
nn.init.zeros_(model.classifier.bias)
```
Classification Head Details
- Input dimension: 768
- Output dimension: `len(label2id)` = 12 classes
- Weights: Xavier uniform initialization
- Bias: Zero initialization
Warmup (Frozen Encoder)
Before full training began, all transformer encoder layers were frozen and only the newly initialized classification head was trained for 3 epochs. This warmup step helps stabilize the model and prevents the encoder weights from being disrupted by large, uncalibrated gradients at the start. This allowed:
- Stable gradient updates
- Prevention of early catastrophic forgetting
- Formation of initial decision boundaries
After warmup, all layers were unfrozen for full fine-tuning.
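In PyTorch, the freeze/unfreeze steps amount to toggling `requires_grad`. The sketch below uses a tiny stand-in module so it runs standalone; with the actual FinBERT model the encoder parameters live under `model.bert` instead:

```python
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Stand-in for FinBERT: an 'encoder' plus a classification head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(768, 768)    # stands in for the BERT encoder
        self.classifier = nn.Linear(768, 12)  # newly initialized head

model = TinyClassifier()

# Warmup: freeze the encoder so only the head receives gradient updates
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
# only classifier.* parameters remain trainable during warmup

# After warmup: unfreeze everything for full fine-tuning
for p in model.parameters():
    p.requires_grad = True
```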
Input Format
This model accepts financial news headlines formatted with NER-based entity tags that are added as prefixes before the headline, rather than being inserted inside the text. These prefix tags provide explicit structural cues that help the model identify organizations, people, locations, and geopolitical entities — all crucial for sector prediction.
The model was trained primarily on prefix-tagged inputs. While it can accept raw headlines, unformatted text often results in the model returning the class unknown, because the entity cues expected by the model are missing.
NER Tagging and Input Transformation
Before sending a headline to the model, a Named Entity Recognition (NER) system should be used to detect key entities and produce the prefix tags that the model expects.
NER model used: Jean-Baptiste/roberta-large-ner-english
The NER model identifies entity spans in the headline, but instead of inserting tags around those spans, the detected entities are collected and converted into prefix tokens drawn from the supported special tokens below.
Supported Special Tokens
- [ORG] — Organizations / companies
- [PER] — People / individuals
- [LOC] — Physical locations
- [GPE] — Geopolitical entities (countries, states, cities)
Example: Raw → Tagged
Raw headline:
Apple shares rise after strong iPhone demand in China.
Tagged headline:
[ORG] Apple [LOC] China [SEP] Apple shares rise after strong iPhone demand in China.
Since the model was trained on text formatted with NER tagging, running it on an unformatted headline will typically produce the output class unknown.
Training and Validation Metrics
| Epoch | Curriculum | T. Loss | T. Accuracy | T. Precision | T. Recall | T. F1 | V. Loss | V. Accuracy | V. Precision | V. Recall | V. F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Silver | 740.2343 | 0.8680 | 0.8721 | 0.8680 | 0.8689 | 45.3172 | 0.9406 | 0.9427 | 0.9406 | 0.9410 |
| 2 | Silver | 298.8771 | 0.9472 | 0.9488 | 0.9472 | 0.9476 | 41.1392 | 0.9459 | 0.9470 | 0.9459 | 0.9462 |
| 3 | Silver | 235.3753 | 0.9556 | 0.9568 | 0.9556 | 0.9559 | 40.7655 | 0.9454 | 0.9486 | 0.9454 | 0.9463 |
| 4 | Gold | 368.6536 | 0.8959 | 0.8987 | 0.8959 | 0.8964 | 21.5810 | 0.9496 | 0.9511 | 0.9496 | 0.9498 |
| 5 | Gold | 141.3825 | 0.9589 | 0.9594 | 0.9589 | 0.9589 | 14.3957 | 0.9674 | 0.9680 | 0.9674 | 0.9674 |
Testing Metrics
Note: Curriculum indicates which portion of the dataset the model was trained on at the time — Silver = context-only headlines, Gold = explicit company/ticker/sector headlines.
| Input Strategy | Curriculum | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 3 | Silver | 0.8300 | 0.8472 | 0.8300 | 0.8341 |
| 3 | Gold | 0.9477 | 0.9485 | 0.9477 | 0.9479 |
Class Mappings
```json
{
  "0": "basic materials",
  "1": "communication services",
  "2": "consumer cyclical",
  "3": "consumer defensive",
  "4": "energy",
  "5": "financial services",
  "6": "healthcare",
  "7": "industrials",
  "8": "real estate",
  "9": "technology",
  "10": "unknown",
  "11": "utilities"
}
```
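Given this mapping, decoding a prediction is an argmax over the 12 logits. The sketch below uses a hand-written logits list rather than real model output:

```python
# id -> sector name, mirroring the class mapping above
ID2LABEL = {
    0: "basic materials", 1: "communication services", 2: "consumer cyclical",
    3: "consumer defensive", 4: "energy", 5: "financial services",
    6: "healthcare", 7: "industrials", 8: "real estate",
    9: "technology", 10: "unknown", 11: "utilities",
}

def decode(logits):
    """Map a list of 12 raw logits to its sector name via argmax."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return ID2LABEL[best]

fake_logits = [0.1] * 12
fake_logits[9] = 3.2  # pretend the model is confident in 'technology'
decode(fake_logits)   # 'technology'
```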
Use the following code to format headlines before passing them to the model:

```python
from transformers import pipeline

# Load NER pipeline
ner_pipeline = pipeline(
    "ner",
    model="Jean-Baptiste/roberta-large-ner-english",
    aggregation_strategy="simple"
)

# Inference formatter
def format_headline(headline):
    # Run NER
    ents = ner_pipeline(headline)

    # Buckets for entities
    entity_buckets = {
        "ORG": [],
        "LOC": [],
        "PER": [],
        "GPE": []
    }

    # Fill buckets
    for ent in ents:
        tag = ent["entity_group"]
        text = ent["word"]
        if tag in entity_buckets:
            entity_buckets[tag].append(text)

    # Build prefix with tags
    prefix = ""
    for tag, items in entity_buckets.items():
        if items:
            prefix += f"[{tag}] " + " | ".join(items) + " "

    # If any entities were found, close the prefix with [SEP]
    if prefix:
        prefix = prefix.strip() + " [SEP] "

    # Final formatted input
    return prefix + headline

# Example
headline = "Apple expands operations in China to boost iPhone production."
print(format_headline(headline))
```
Expected Output
[ORG] Apple [LOC] China [SEP] Apple expands operations in China to boost iPhone production.
Live Sector Classification Demo
Try the interactive demo here:
https://huggingface.co/spaces/alemmrr/filbert-gics-sector-classifier-ui
This Space version includes:
- Automatic formatting
- All sector probability scores
- Fully interactive UI
License
This model is released under the Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0) license.
You are free to use, share, and modify the model for non-commercial purposes only, as long as you credit the original author. Commercial use of this model or any derivative work is not permitted.
Full license text: https://creativecommons.org/licenses/by-nc/4.0/
Citation
If you use this model in your research, projects, or derivatives, please cite:
@misc{finbert_gics_v3,
title = {FinBERT-GICS Sector Classifier v3},
author = {Morales, Alejandro},
year = {2025},
howpublished = {\url{https://huggingface.co/alemmrr/finbert-gics-sector-classifier-v3}},
note = {NER-enhanced FinBERT model for GICS sector classification},
}