datause-extraction

This repository contains the fine-tuned LoRA adapter weights for dataset mention extraction, trained on top of the base model fastino/gliner2-large-v1.

The adapter classifies each extracted span into one of three categories:

  1. named_data: Proper named datasets, surveys, censuses, or registries (e.g., Demographic and Health Survey, LFS, UNHCR PRIMES).
  2. descriptive_data: Data resources described by their producer or characteristics rather than a proper name (e.g., World Bank household surveys, spatial socioeconomic data sets).
  3. vague_data: General references containing a data noun but lacking enough specificity to identify the exact source (e.g., administrative data, project statistics).
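
As a rough illustration of working with these three labels downstream, the sketch below groups predictions by category. It assumes each prediction is a dict with `text` and `label` keys, mirroring the fields printed in the raw gliner2 example later in this card; the helper name `group_by_label` and the sample list are hypothetical and hand-written from the examples above, not model output.

```python
# The three labels named in this model card.
LABELS = ("named_data", "descriptive_data", "vague_data")


def group_by_label(predictions):
    """Group prediction dicts (with 'text' and 'label' keys) by category."""
    grouped = {label: [] for label in LABELS}
    for entity in predictions:
        # Ignore any label outside the three documented categories.
        if entity["label"] in grouped:
            grouped[entity["label"]].append(entity["text"])
    return grouped


# Hand-written illustrative input built from the examples above (not model output).
sample = [
    {"text": "Demographic and Health Survey", "label": "named_data"},
    {"text": "World Bank household surveys", "label": "descriptive_data"},
    {"text": "administrative data", "label": "vague_data"},
    {"text": "LFS", "label": "named_data"},
]

grouped = group_by_label(sample)
for label in LABELS:
    print(f"{label}: {grouped[label]}")
```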

Rationale and Context: Forced Displacement, Refugees, and FCV

In Fragility, Conflict, and Violence (FCV) settings, tracking which datasets are used is crucial for coordinating development and humanitarian aid. Research on forced displacement and refugee integration relies heavily on specific household surveys, operational registries, and geographic vulnerability datasets.

By automating the extraction of these references from project documents, appraisal papers, and academic studies, this model helps map data usage, highlight under-analyzed areas, and evaluate the policy impact of statistical capacity investments.


Data Sources & Domain Coverage

The model is specialized in the socio-economic development and forced displacement domains, with strong representation of:

  • Humanitarian Registries & Briefs: UNHCR registration databases (PRIMES), Refugee Socio-Economic Inclusion Surveys (SEIS), Durable Solutions reports, and Protection Monitoring tools.
  • Development Economics & Surveys: World Bank Project Appraisal Documents (PADs), Living Standards Measurement Study (LSMS), Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and national censuses.
  • FCV/Geospatial Data: Livelihood surveys, cash-based intervention tracking, and geographic data (e.g., Shuttle Radar Topography Mission, flood hazard mapping, population distribution layers).

Model Performance

The adapter was evaluated on the canonical layout-aware, project-purged Holdout v10 dataset (flat_ner_holdout_v10.jsonl / ai4data/datause-holdout) at a confidence threshold of 0.40 (Jaccard matching threshold = 0.50):

| Evaluation Set | TP | FP | FN | Precision | Recall | F0.5 Score |
|---|---|---|---|---|---|---|
| Positive-Only Records (465 chunks w/ mentions) | 576 | 51 | 152 | 91.9% | 79.1% | 0.8900 |
| All Records (Full set of 1,149 chunks) | 576 | 136 | 152 | 80.9% | 79.1% | 0.8054 |
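
The reported precision, recall, and F0.5 values can be reproduced from the TP/FP/FN counts with the standard F-beta formula. The sketch below does that, and also shows one possible reading of the Jaccard matching rule: a token-level overlap between a predicted span and a gold span. The token-level definition, the inclusive `>=` comparison, and the helper names are assumptions for illustration; the evaluation script may define the match differently.

```python
def precision_recall_fbeta(tp, fp, fn, beta=0.5):
    """Return (precision, recall, F-beta) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


def token_jaccard(pred, gold):
    """Jaccard overlap of lowercased whitespace tokens (assumed definition)."""
    a, b = set(pred.lower().split()), set(gold.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# Counts taken from the table above.
p_pos, r_pos, f_pos = precision_recall_fbeta(576, 51, 152)
p_all, r_all, f_all = precision_recall_fbeta(576, 136, 152)
print(f"Positive-only: P={p_pos:.1%} R={r_pos:.1%} F0.5={f_pos:.4f}")
print(f"All records:   P={p_all:.1%} R={r_all:.1%} F0.5={f_all:.4f}")

# A partial span that shares 4 of 5 tokens with the gold span has overlap 0.8,
# which would count as a match at the 0.50 threshold under this reading.
overlap = token_jaccard("Ghana Living Standards Survey",
                        "2010 Ghana Living Standards Survey")
is_match = overlap >= 0.50
```

Because beta is 0.5, precision is weighted more heavily than recall, which is why the false positives on chunks without mentions pull the all-records F0.5 down while recall stays the same.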

How to Use

You can load and use this model either directly through the gliner2 library or through the high-level ai4data wrappers.

Option 1: Using the ai4data Library (Recommended)

The ai4data Python package automatically handles base model initialization, adapter downloads, token chunking, and post-filtering:

from ai4data import extract_from_text

text = (
    "To analyze the impact of infrastructure spillovers, we combine data from the "
    "2010 Ghana Living Standards Survey (GLSS) with production records for 17 "
    "large-scale gold mines."
)

# Extract dataset mentions using this specific adapter
result = extract_from_text(
    text,
    adapter_id="ai4data/datause-extraction",
    include_confidence=True
)

for ds in result.get("datasets", []):
    print(f"Dataset:    {ds['dataset_name']}")
    print(f"Confidence: {ds['dataset_confidence']:.3f}")
    print(f"Section:    {ds['section_context']}")
    print("-" * 30)
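
If you want a stricter cutoff or one entry per dataset name, the result can be post-processed with plain Python. The sketch below assumes only the result shape used in the loop above (a `datasets` list of dicts with `dataset_name` and `dataset_confidence`); the helper `top_mentions` and the sample values are hypothetical and hand-written, not output from the model or part of the ai4data API.

```python
def top_mentions(result, min_confidence=0.40):
    """Keep the highest-confidence entry per dataset name above a cutoff."""
    best = {}
    for ds in result.get("datasets", []):
        conf = ds["dataset_confidence"]
        if conf < min_confidence:
            continue
        # Case-insensitive key so "GLSS" and "glss" collapse into one entry.
        key = ds["dataset_name"].strip().casefold()
        if key not in best or conf > best[key]["dataset_confidence"]:
            best[key] = ds
    # Highest confidence first.
    return sorted(best.values(),
                  key=lambda d: d["dataset_confidence"],
                  reverse=True)


# Hand-written sample in the assumed result shape (values are illustrative).
sample_result = {
    "datasets": [
        {"dataset_name": "Ghana Living Standards Survey", "dataset_confidence": 0.91},
        {"dataset_name": "ghana living standards survey", "dataset_confidence": 0.75},
        {"dataset_name": "production records", "dataset_confidence": 0.35},
    ]
}

kept = top_mentions(sample_result, min_confidence=0.40)
for ds in kept:
    print(f"{ds['dataset_name']}: {ds['dataset_confidence']:.2f}")
```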

Option 2: Using the Raw gliner2 Interface

If you use the raw weights directly as a LoRA adapter, first load the base model (fastino/gliner2-large-v1), then apply the adapter:

import torch
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

# 1. Initialize base model
kwargs = {}
if torch.cuda.is_available():
    kwargs["map_location"] = "cuda"
elif torch.backends.mps.is_available():
    kwargs["map_location"] = "mps"
else:
    kwargs["map_location"] = "cpu"

model = GLiNER2.from_pretrained("fastino/gliner2-large-v1", **kwargs)

# 2. Download and apply the LoRA adapter weights
adapter_path = snapshot_download("ai4data/datause-extraction")
model.load_adapter(adapter_path)

# 3. Perform inference
text = (
    "To analyze the impact of infrastructure spillovers, we combine data from the "
    "2010 Ghana Living Standards Survey (GLSS) with production records for 17 "
    "large-scale gold mines."
)

labels = ["named_data", "descriptive_data", "vague_data"]
predictions = model.predict_entities(text, labels, threshold=0.40)

for entity in predictions:
    print(f"Text:  {entity['text']} | Label: {entity['label']} | Score: {entity['score']:.3f}")
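
Unlike the ai4data wrapper, the raw interface leaves chunking of long documents to the caller. A minimal sketch of one way to do this is a sliding window over whitespace-separated words with some overlap, so a mention near a boundary appears whole in at least one chunk. This is not the ai4data chunker; the helper name `chunk_words` and the default sizes are assumptions, and the right window size depends on the base model's input limit, which this card does not state.

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into overlapping chunks of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Tiny demonstration with a 4-word window and 2-word overlap.
demo = chunk_words("a b c d e f g", max_words=4, overlap=2)
print(demo)
```

Each chunk can then be passed to `model.predict_entities(chunk, labels, threshold=0.40)` as in the example above. Mentions found in overlapping regions may be reported twice and would need deduplication.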