Model Card for Yougen/clips

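CLIPS (Contrastive Language-Image Pre-training) is a Chinese multimodal model. Built on the CLIP architecture and pre-trained on a large-scale Chinese image-text dataset, it supports cross-modal matching and retrieval between text and images.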

Model Details

Model Description

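This model is a Chinese adaptation of the CLIP architecture that learns a joint representation of text and images through contrastive learning. It maps Chinese text and images into a shared feature space in which semantically similar text and images lie close together. It can be used for image retrieval, text retrieval, zero-shot image classification, image-text matching, and other multimodal tasks.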
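"Close together in the shared feature space" is usually measured with cosine similarity for CLIP-style models. The sketch below illustrates the idea with plain Python lists; the three-dimensional vectors are hypothetical stand-ins for real embeddings, which this card does not specify the size of.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, for illustration only
image_cat = [0.9, 0.1, 0.0]   # an image of a cat
text_cat = [0.8, 0.2, 0.1]    # a matching caption
text_dog = [0.1, 0.9, 0.2]    # a non-matching caption

# A matching pair should score higher than a non-matching pair
print(cosine_similarity(image_cat, text_cat) > cosine_similarity(image_cat, text_dog))
```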

Developed by: Yougen Yuan
Funded by [optional]: [More Information Needed]
Shared by [optional]: Yougen Yuan
Model type: Contrastive Language-Image Pre-training (CLIP)
Language(s) (NLP): Chinese (zh)
License: Apache-2.0
Finetuned from model [optional]: [More Information Needed]

Model Sources [optional]

Repository: https://huggingface.co/Yougen/clips
Paper [optional]: [More Information Needed]
Demo [optional]: [More Information Needed]

Uses

Direct Use

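Zero-shot image classification: classify images directly from Chinese text descriptions, with no additional training
Image retrieval: retrieve relevant images for a Chinese text query
Text retrieval: retrieve relevant Chinese text for an image query
Image-text similarity: compute the semantic similarity between any Chinese text and an image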
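Both retrieval directions reduce to the same operation: embed the query, embed the candidates, and rank the candidates by similarity. The following is a minimal, dependency-free sketch of that ranking step. The function names and the toy vectors are hypothetical; in practice the embeddings would come from the model's image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def retrieve_top_k(query, candidates, k=1):
    """Return (index, score) pairs for the k candidates most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings: one text query and a gallery of three images
text_query = [1.0, 0.0, 0.2]
image_gallery = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.3],
    [0.5, 0.5, 0.5],
]

ranked = retrieve_top_k(text_query, image_gallery, k=2)
print([idx for idx, _ in ranked])
```

Swapping the roles (an image embedding as the query, text embeddings as the candidates) gives text retrieval with the same function.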

Downstream Use [optional]

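Multimodal classification: fine-tune on a domain-specific dataset for more accurate image classification
Image captioning: combine with a generative model to produce Chinese captions from images
Visual question answering (VQA): combine with a question-answering model to answer Chinese questions about images
Multimodal retrieval systems: build large-scale image-text search engines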

Out-of-Scope Use

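Not suitable for multimodal tasks in languages other than Chinese
Not suitable for specialist analysis that demands high precision, such as medical or satellite imagery
Must not be used to generate or spread illegal, harmful, or discriminatory content
Must not be used in unauthorized face recognition or identity verification systems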

Bias, Risks, and Limitations

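Social biases learned from the pre-training data may surface in the model's predictions
Understanding of rare objects, abstract concepts, and complex scenes is limited
Performance depends heavily on input image quality and on how accurately the text describes the image
Performance may be poor in low-resource domains and on long-tail categories
The model has no causal reasoning ability; it learns only statistical correlations in the data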

Recommendations

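Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Test and validate the model thoroughly before using it for critical decisions. For high-risk applications, combine it with human review and other verification measures.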

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image

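# Load the model and processor
model = CLIPModel.from_pretrained("Yougen/clips")
processor = CLIPProcessor.from_pretrained("Yougen/clips")

# Prepare the inputs (convert to RGB so grayscale or RGBA files also work)
image = Image.open("example.jpg").convert("RGB")
# Chinese prompts: "a photo of a cat", "a photo of a dog", "a photo of a bird"
texts = ["一张猫的照片", "一张狗的照片", "一张鸟的照片"]

# Preprocess the text and image into model inputs
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Turn the image-text similarity logits into probabilities over the texts
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

# Print the probability for each text
for text, prob in zip(texts, probs[0]):
    print(f"{text}: {prob.item():.4f}")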
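For zero-shot classification over a list of class names, the text prompts are often generated from a template rather than written by hand. The helpers below are a hypothetical sketch of that step and of picking the best label afterwards; `build_prompts`, `top_prediction`, and the example probabilities are illustrative and not part of the model or of transformers.

```python
def build_prompts(labels, template="一张{}的照片"):
    """Wrap each class name in a prompt template (the default reads "a photo of a {}")."""
    return [template.format(label) for label in labels]

def top_prediction(labels, probs):
    """Return the (label, probability) pair with the highest probability."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]

labels = ["猫", "狗", "鸟"]  # cat, dog, bird
texts = build_prompts(labels)
print(texts)

# Hypothetical probabilities standing in for probs[0].tolist() from the snippet above
example_probs = [0.91, 0.06, 0.03]
print(top_prediction(labels, example_probs))
```

With the default template, `texts` matches the hand-written prompt list used in the snippet above, so it can be passed to the processor unchanged.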

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

Training regime: fp16 mixed precision
Batch size: [More Information Needed]
Learning rate: [More Information Needed]
Epochs: [More Information Needed]
Optimizer: AdamW

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

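Top-k accuracy: evaluates zero-shot image classification performance
Recall@k: evaluates image retrieval and text retrieval performance
mAP (mean Average Precision): evaluates the overall performance of a retrieval system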
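The three metrics above can be computed from ranked prediction lists. The sketch below uses common textbook definitions; this card does not say which exact variants were used, so treat the details as assumptions. In particular, some image-text retrieval benchmarks report Recall@k as the share of queries with at least one relevant item in the top k, whereas the version here is the share of relevant items retrieved for a single query.

```python
def top_k_accuracy(ranked_labels, true_labels, k=1):
    """Fraction of samples whose true label is among the first k ranked predictions."""
    if not true_labels:
        raise ValueError("true_labels must not be empty")
    hits = sum(1 for ranked, truth in zip(ranked_labels, true_labels) if truth in ranked[:k])
    return hits / len(true_labels)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of one query's relevant items that appear in its first k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank over the ranks at which a relevant item appears."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(all_ranked, all_relevant):
    """Average precision averaged over all queries."""
    aps = [average_precision(r, rel) for r, rel in zip(all_ranked, all_relevant)]
    return sum(aps) / len(aps)

# Hypothetical retrieval result for one query: items "a" and "c" are relevant
ranked = ["a", "b", "c", "d"]
relevant = ["a", "c"]
print(recall_at_k(ranked, relevant, k=1))
print(average_precision(ranked, relevant))
```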

Results

[More Information Needed]

Summary

[More Information Needed]

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: [More Information Needed]
Hours used: [More Information Needed]
Cloud Provider: [More Information Needed]
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]
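Once the fields above are filled in, a first-order estimate is energy consumed multiplied by the carbon intensity of the local grid. The sketch below shows that arithmetic only; it is not the calculator's implementation, and every number in the example is hypothetical because the hardware, hours, and region are not yet documented here.

```python
def estimate_co2_kg(power_watts, hours, carbon_intensity_kg_per_kwh, num_devices=1):
    """Rough CO2-equivalent estimate: energy used (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000.0 * hours * num_devices
    return energy_kwh * carbon_intensity_kg_per_kwh

# Hypothetical run: 8 accelerators drawing 300 W each for 100 hours
# on a grid emitting 0.5 kg CO2eq per kWh
estimate = estimate_co2_kg(300, 100, 0.5, num_devices=8)
print(f"{estimate:.1f} kg CO2eq")
```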

Technical Specifications [optional]

Model Architecture and Objective

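The model uses the CLIP architecture, which has two main components:

Image encoder: based on the Vision Transformer (ViT) architecture; converts an image into a fixed-dimension feature vector
Text encoder: based on the Transformer architecture; converts Chinese text into a fixed-dimension feature vector

The model is trained with a contrastive loss that maximizes the similarity of matching image-text pairs and minimizes the similarity of non-matching pairs.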
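A common form of this objective, used in the original CLIP work, is a symmetric cross-entropy over the batch similarity matrix. The pure-Python sketch below illustrates that form. Whether this model uses exactly this loss is an assumption, as is the temperature of 0.07, which is borrowed from the original CLIP setup rather than from this card.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_contrastive_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    similarity[i][j] is the cosine similarity between image i and text j;
    matching pairs sit on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    # Image-to-text: each row should concentrate probability on its diagonal entry
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same criterion applied to the columns
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2.0

# Hypothetical 2 x 2 batches: well-aligned pairs versus swapped pairs
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(clip_contrastive_loss(aligned) < clip_contrastive_loss(swapped))
```

The loss falls as the diagonal entries grow relative to the off-diagonal ones, which is the "maximize matching, minimize non-matching" behavior described above.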

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

Framework: PyTorch
Libraries: transformers, datasets, torchvision
Training Platform: [More Information Needed]

Citation [optional]

BibTeX:

@misc{yuan2026clips,
  author = {Yougen Yuan},
  title = {CLIPS: Chinese Contrastive Language-Image Pre-training Model},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Yougen/clips}}
}

APA:

Yuan, Y. (2026). CLIPS: Chinese Contrastive Language-Image Pre-training Model. Hugging Face. https://huggingface.co/Yougen/clips

Glossary [optional]

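CLIP: Contrastive Language-Image Pre-training, a method for learning cross-modal representations through contrastive learning
Zero-shot learning: completing a task directly with a pre-trained model, without training on a task-specific dataset
Cross-modal retrieval: retrieving information across modalities, such as text and images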

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

Yougen Yuan

Model Card Contact

[More Information Needed]

Lacoste et al. (2019). Quantifying the Carbon Emissions of Machine Learning. arXiv:1910.09700, published Oct 21, 2019.