Data Designer
Data Designer is NVIDIA NeMo’s framework for generating high-quality synthetic datasets. It lets you create diverse data from statistical samplers, LLM-generated columns, or existing seed datasets.
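As a rough, library-free illustration of what a statistical sampler contributes (plain Python, not Data Designer's API — its samplers are configured declaratively), a categorical column can be drawn from a weighted distribution:

```python
import random

# Conceptual sketch only: a category sampler draws one value per record
# from a (possibly weighted) categorical distribution, roughly like this.
random.seed(0)

def sample_category_column(values, weights, num_records):
    """Draw one value per record from a weighted categorical distribution."""
    return random.choices(values, weights=weights, k=num_records)

severity = sample_category_column(
    ["mild", "moderate", "severe"], weights=[0.5, 0.3, 0.2], num_records=5
)
print(severity)
```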
Prerequisites
pip install data-designer
Download datasets from the Hub as seeds
Use HuggingFaceSeedSource to load datasets directly from the Hub as seed data for generation.
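Columns from the seed dataset become available as {{ column }} placeholders in prompt templates: each seed row is substituted into the template before it is sent to the LLM. A simplified pure-Python sketch of that substitution (not Data Designer's actual Jinja-based renderer):

```python
import re

def render_prompt(template: str, row: dict) -> str:
    # Replace each {{ column }} placeholder with the value from one seed row.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(row[m.group(1)]), template)

template = "Write notes for a patient with {{ diagnosis }}. Symptoms: {{ patient_summary }}"
row = {"diagnosis": "migraine", "patient_summary": "throbbing headache, light sensitivity"}
print(render_prompt(template, row))
# Write notes for a patient with migraine. Symptoms: throbbing headache, light sensitivity
```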
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Load seed data from the Hugging Face Hub
seed_source = dd.HuggingFaceSeedSource(
    path="datasets/gretelai/symptom_to_diagnosis/data/train.parquet",
    token="hf_...",  # Optional, for private datasets
)
config_builder.with_seed_dataset(seed_source)

# Reference seed columns in prompts
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        model_alias="openai-gpt-5",
        prompt="Write notes for a patient with {{ diagnosis }}. Symptoms: {{ patient_summary }}",
    )
)

preview = data_designer.preview(config_builder, num_records=5)

Push generated datasets to the Hub
Use the built-in push_to_hub method to upload generated datasets to the Hub.
# Generate dataset
results = data_designer.create(config_builder, num_records=1000, dataset_name="my-dataset")

# Push to Hub
url = results.push_to_hub(
    repo_id="username/my-synthetic-dataset",
    description="Synthetic dataset generated with Data Designer.",
    tags=["medical", "notes"],
    private=False,
)

Resources
- Data Designer Documentation
- GitHub Repository
- Seed Datasets Guide
- Guide to using Data Designer with Inference Providers