ModernBERT-base Chunk Classifier: Funding Statement Localization

A binary classifier on top of answerdotai/ModernBERT-base that scores a single chunk (up to 8,192 tokens) of an academic paper for the presence of a funding statement. It serves as stage 1 of a three-stage funding-extraction cascade, narrowing a long PDF down to the most likely chunks before the more expensive span-extraction and cleanup stages run.

The full cascade:

  1. Stage 1 (this model): For each ≤8,192-token chunk of the paper, predict a scalar P(this chunk contains a funding statement). Take the top-K chunks above a threshold (we use top-2 above 0.4).
  2. Stage 2 (span head): cometadata/funding-extraction-modernbert-base-spanhead picks the exact start/end token within the top chunk.
  3. Stage 3 (cleanup LoRA): cometadata/funding-cleaning-qwen3-4b-lora strips LaTeX markers and normalizes whitespace in the extracted span.

You can use this model standalone if you only need to flag whether a chunk (or doc) contains funding language at all (binary F1 0.97 on the test set).

Architecture

The architecture is a custom ChunkClassifier module (included in modeling.py):

import torch.nn as nn
from transformers import AutoModel


class ChunkClassifier(nn.Module):
    """ModernBERT encoder + mean-pool + binary head."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean pool over real (non-padding) tokens
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled).squeeze(-1)   # one logit per chunk
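
To make the pooling step concrete, here is a minimal pure-Python sketch of masked mean pooling on a toy chunk. It mirrors the tensor arithmetic in forward() but is not the shipped code; the helper name masked_mean_pool and the toy values are illustrative assumptions.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors where mask == 1; rows with mask == 0 are ignored.

    hidden: list of token vectors (each a list of floats) for ONE chunk.
    mask:   list of 0/1 ints, same length as hidden.
    """
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, m in zip(hidden, mask):
        for j in range(dim):
            total[j] += vec[j] * m
    count = max(sum(mask), 1)  # mirrors .clamp(min=1): avoids division by zero
    return [t / count for t in total]


# Toy chunk: two real tokens followed by one padding token.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [2.0, 3.0]; the padding row has no effect
```

In the real model the pooled vector then passes through the linear head to give one logit per chunk.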

Use

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from modeling import ChunkClassifier  # bundled in this repo

REPO = "cometadata/funding-chunk-classifier-modernbert-base"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = ChunkClassifier("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device, weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()

# For a long paper, slide an 8192-token window with stride 4096.
def chunks_of(text, max_tok=8192, stride=4096):
    enc = tokenizer(text, add_special_tokens=False, truncation=False)
    ids = enc["input_ids"]
    if len(ids) <= max_tok:
        yield ids, 0, len(ids)
        return
    for st in range(0, len(ids), stride):
        en = min(st + max_tok, len(ids))
        yield ids[st:en], st, en
        if en == len(ids):
            break

# `paper_text` is the full text of one paper (a str), supplied by the caller.
probs = []
for chunk_ids, st, en in chunks_of(paper_text):
    ids_t = torch.tensor(chunk_ids).unsqueeze(0).to(device)
    attn = torch.ones_like(ids_t)
    with torch.no_grad():
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            logit = model(ids_t, attn).float()
    probs.append((torch.sigmoid(logit).item(), st, en))

# Top-K chunks above threshold
top_k = sorted(probs, key=lambda p: -p[0])[:2]
top_k = [p for p in top_k if p[0] >= 0.4]
# `top_k` is the list to hand off to the span-head model.
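
To see which windows the chunking loop above produces, the following sketch repeats the same offset logic without a tokenizer. The helper name chunk_spans is an illustrative assumption and takes only a token count.

```python
def chunk_spans(n_tokens, max_tok=8192, stride=4096):
    """Return the (start, end) token offsets a sliding window would produce."""
    if n_tokens <= max_tok:
        return [(0, n_tokens)]  # short docs fit in a single chunk
    spans = []
    for st in range(0, n_tokens, stride):
        en = min(st + max_tok, n_tokens)
        spans.append((st, en))
        if en == n_tokens:  # the window reached the end of the doc
            break
    return spans


print(chunk_spans(5_000))   # [(0, 5000)]
print(chunk_spans(20_000))  # [(0, 8192), (4096, 12288), (8192, 16384), (12288, 20000)]
```

With a stride of half the window, every token except those near the very start and end of the doc is covered by two chunks.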

Training data

Built from the 2,384 training rows of cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test.

For each train doc:

  • Tokenize vlm_markdown with the ModernBERT tokenizer.
  • Slide an 8,192-token window with stride 4,096 over the tokenized doc.
  • For each chunk, label 1 iff the gold funding statement (located via verbatim substring or rapidfuzz.partial_ratio_alignment ≥ 0.7) overlaps the chunk's character range by more than half of the statement's length, else 0.
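
The labeling rule in the last step can be sketched as follows. This is an illustration under two assumptions, not the training script: the gold statement has already been located as a character span, and "half its length" refers to the statement's length (a statement is far shorter than a chunk). The function name label_chunk is hypothetical.

```python
def label_chunk(chunk_start, chunk_end, gold_start, gold_end):
    """Return 1 if more than half of the gold span lies inside the chunk, else 0.

    All arguments are character offsets into the document; ends are exclusive.
    """
    overlap = max(0, min(chunk_end, gold_end) - max(chunk_start, gold_start))
    gold_len = gold_end - gold_start
    return 1 if gold_len > 0 and overlap > 0.5 * gold_len else 0


# A 272-character statement at offsets [1000, 1272).
print(label_chunk(0, 1100, 1000, 1272))    # 0: only 100 of 272 chars overlap
print(label_chunk(0, 1200, 1000, 1272))    # 1: 200 of 272 chars overlap
print(label_chunk(900, 5000, 1000, 1272))  # 1: statement fully inside the chunk
```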

Negative docs (no funding statement) contribute negative chunks; positive docs contribute one positive chunk (the one containing the gold) plus several negative chunks from the rest of the doc, so the negative class is naturally dominant (~8× more negatives than positives).

Final training set: roughly 21,000 chunks (~2,300 positive / ~18,700 negative).

Loss

Binary cross-entropy with pos_weight = n_examples / n_positives to counteract the class imbalance:

loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_examples / n_positives))
loss = loss_fn(logits, labels)
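
As a worked example of what pos_weight does, the sketch below evaluates the weighted loss for a single example in plain Python, using the approximate chunk counts quoted in this card. The per-example formula is the one documented for BCEWithLogitsLoss; the helper name bce_with_logits is illustrative, and it is not numerically hardened for very large logits.

```python
import math


def bce_with_logits(logit, label, pos_weight=1.0):
    """Weighted BCE for one example:
    -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
    """
    log_sig = -math.log1p(math.exp(-logit))           # log(sigmoid(x))
    log_one_minus_sig = -math.log1p(math.exp(logit))  # log(1 - sigmoid(x))
    return -(pos_weight * label * log_sig + (1 - label) * log_one_minus_sig)


n_examples, n_positives = 21_000, 2_300  # approximate counts from this card
pos_weight = n_examples / n_positives    # about 9.13

# At logit 0 (probability 0.5), a missed positive costs pos_weight times
# as much as a missed negative.
print(round(pos_weight, 2))                          # 9.13
print(round(bce_with_logits(0.0, 1, pos_weight), 3))  # 6.329
print(round(bce_with_logits(0.0, 0, pos_weight), 3))  # 0.693
```

Note that n_examples / n_positives (about 9.13 here) is slightly larger than the more common n_negatives / n_positives (about 8.13 for the same counts), so positives are weighted a little more heavily than strict class balancing would imply.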

Hyperparameters

  • Base: answerdotai/ModernBERT-base (149M, 8,192-token context)
  • Optimizer: AdamW, lr 5e-5, weight decay 0.01
  • Schedule: linear warmup (20 steps) + cosine decay
  • Epochs: 3
  • Batch: 2 per device × 8 grad accum = 16 effective
  • Mixed precision: bfloat16
  • Max sequence: 8,192 tokens
  • Trained on 1× H100 80GB
  • Saved checkpoint: pytorch_model.bin is the state dict from epoch 2 (zero-indexed, i.e. the last of the 3 epochs)
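
For intuition about the schedule, here is a minimal sketch of linear warmup followed by cosine decay. It assumes the learning rate decays to zero at the last step and that the run has roughly 21,000 / 16 optimizer steps per epoch; neither detail is stated in this card, and lr_at is a hypothetical helper rather than the training code.

```python
import math


def lr_at(step, total_steps, base_lr=5e-5, warmup_steps=20):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Assumed step count: ~21,000 chunks / 16 effective batch, for 3 epochs.
total_steps = math.ceil(21_000 / 16) * 3

for step in (0, 10, 20, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

The peak rate of 5e-5 is reached at step 20, so warmup covers well under 1% of the run under these assumptions.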

Evaluation

Evaluated on the 597-row test split of cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test, treated as a per-document binary task (does the doc have any funding statement?). We score each candidate chunk and use the maximum chunk probability as the document-level score, with a decision threshold of 0.5.

Metric                        Precision  Recall  F1      F0.5
Doc-level funding detection   0.9831     0.9537  0.9682  0.9771

Sub-stats at threshold 0.5: TP=350, FP=6, FN=17, TN=224.
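
The reported metrics follow directly from these counts. The sketch below recomputes them and also shows the max-over-chunks document decision described above; the function names prf and doc_prediction are illustrative, not part of the repo.

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


def doc_prediction(chunk_probs, threshold=0.5):
    """A doc is positive if its highest-scoring chunk reaches the threshold."""
    return max(chunk_probs, default=0.0) >= threshold


tp, fp, fn = 350, 6, 17
p, r, f1 = prf(tp, fp, fn)
_, _, f05 = prf(tp, fp, fn, beta=0.5)
print(round(p, 4), round(r, 4), round(f1, 4), round(f05, 4))
# 0.9831 0.9537 0.9682 0.9771

print(doc_prediction([0.02, 0.91, 0.10]))  # True
print(doc_prediction([0.02, 0.31]))        # False
```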

Chunk-recall caveat: even when the doc-level prediction is correct, the top-1 chunk contains the gold statement verbatim only ~68% of the time (top-2 covers ~88%). This is why the downstream cascade uses top-K=2 chunks: it raises the chance that the gold-containing chunk is fed to the span head.

Intended use

Doc-level filtering of arXiv-derived PDFs for funding-statement presence, and stage-1 of the funding-extraction cascade. Useful when you want to skip expensive span extraction on most papers (a sizable fraction of arXiv papers have no funding statement).

Not intended for: extraction (it only classifies chunks; pair with the span-head model for spans), classification of funding sources, or text outside the academic-paper domain.

Limitations

  • Trained only on arXiv-derived PDFs; behavior on other paper sources is untested.
  • Top-1 chunk is wrong ~32% of the time even when doc-level is correct. Use top-K ≥ 2 if you need recall.
  • Mean-pooling over 8,192 tokens dilutes the signal from a short funding statement (median ~272 characters), so the false-negative rate at a strict threshold of 0.9 is non-trivial. Use 0.5 (or lower) and rely on the span head's no_answer head to suppress empty chunks.

Citation / acknowledgement

Trained as part of an applied research cycle on the cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test dataset by Comet.
