Model Card for Yougen/clips

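CLIPS (Contrastive Language-Image Pre-training) is a Chinese multimodal model. It is based on the CLIP architecture and pre-trained on a large-scale Chinese image-text dataset, and it supports cross-modal matching and retrieval between text and images.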

Model Details

Model Description

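This model is a Chinese adaptation of the CLIP architecture that learns a joint representation of text and images through contrastive learning. It maps Chinese text and images into a shared feature space so that semantically similar text and images lie close together in that space. The model can be used for multimodal tasks such as image retrieval, text retrieval, zero-shot image classification, and image-text matching.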
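As a rough illustration of what "close together in the feature space" means, the sketch below compares embedding vectors by cosine similarity. The three-dimensional vectors are made-up values for illustration only; they are not outputs of this model, and the metric used by the model's own scoring is assumed here to be cosine similarity, as is usual for CLIP-style models.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real embeddings have many more dimensions.
text_vec = [0.9, 0.1, 0.0]
matching_image_vec = [0.8, 0.2, 0.1]
unrelated_image_vec = [0.0, 0.1, 0.9]

print(cosine_similarity(text_vec, matching_image_vec))
print(cosine_similarity(text_vec, unrelated_image_vec))
```

A matching image-text pair should score noticeably higher than an unrelated pair, which is the property the retrieval and classification uses rely on.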

  • Developed by: Yougen Yuan
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: Yougen Yuan
  • Model type: Contrastive Language-Image Pre-training (CLIP)
  • Language(s) (NLP): Chinese (zh)
  • License: Apache-2.0
  • Finetuned from model [optional]: [More Information Needed]

Model Sources [optional]

Uses

Direct Use

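  • Zero-shot image classification: classify images directly from Chinese text descriptions, with no additional training
  • Image retrieval: retrieve relevant images for a Chinese text query
  • Text retrieval: retrieve relevant Chinese text for an image query
  • Image-text similarity: compute the semantic similarity between any Chinese text and an image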
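The retrieval uses above reduce to ranking candidates by similarity to a query embedding. The following is a minimal sketch under stated assumptions: `top_k`, `image_index`, and the toy vectors are hypothetical names and values introduced for illustration, and in practice the embeddings would come from the model's image and text encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, candidates, k=2):
    """Return (candidate_id, score) pairs for the k best matches, best first."""
    q = normalize(query)
    scored = []
    for candidate_id, vec in candidates.items():
        v = normalize(vec)
        scored.append((candidate_id, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical precomputed image embeddings keyed by file name.
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "bird.jpg": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of a Chinese text query.
query_embedding = [0.8, 0.3, 0.1]

results = top_k(query_embedding, image_index, k=2)
for candidate_id, score in results:
    print(f"{candidate_id}: {score:.4f}")
```

Text retrieval works the same way with the roles swapped: the query is an image embedding and the candidates are text embeddings.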

Downstream Use [optional]

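  • Multimodal classification: fine-tune on a domain-specific dataset for more accurate image classification
  • Image captioning: combine with a generative model to produce Chinese descriptions of images
  • Visual question answering (VQA): combine with a question-answering model for image-based question answering in Chinese
  • Multimodal retrieval systems: build large-scale image-text search engines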

Out-of-Scope Use

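  • Not suitable for multimodal tasks in languages other than Chinese
  • Not suitable for analysis in specialized domains that demand high precision, such as medical or satellite imagery
  • Must not be used to generate or spread illegal, harmful, or discriminatory content
  • Must not be used in unauthorized face recognition or identity verification systems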

Bias, Risks, and Limitations

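  • Social biases learned from the pre-training data may surface in the model's predictions
  • Limited understanding of rare objects, abstract concepts, and complex scenes
  • Performance depends heavily on input image quality and on the accuracy of the text descriptions
  • May perform poorly in low-resource domains and on long-tail categories
  • The model has no causal reasoning ability; it learns only statistical correlations in the data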

Recommendations

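Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Before relying on this model for critical decisions, test and validate it thoroughly. For high-risk applications, combine it with human review and other verification measures.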

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image

# Load the model and processor
model = CLIPModel.from_pretrained("Yougen/clips")
processor = CLIPProcessor.from_pretrained("Yougen/clips")

# Prepare the inputs; convert to RGB so grayscale or RGBA files are handled
image = Image.open("example.jpg").convert("RGB")
texts = ["一张猫的照片", "一张狗的照片", "一张鸟的照片"]

# Preprocess the text and image into model-ready tensors
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Convert image-to-text similarity logits into probabilities over the texts
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

# Print the probability for each candidate text
for text, prob in zip(texts, probs[0]):
    print(f"{text}: {prob.item():.4f}")
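For zero-shot classification with more than a handful of classes, it can help to generate the candidate texts from class names and then map the probabilities back to a label. The helpers below are a small sketch: `build_prompts` and `pick_label` are hypothetical names, the template simply mirrors the example texts above, and the probabilities are placeholder values rather than model output.

```python
def build_prompts(class_names, template="一张{}的照片"):
    """Turn class names into Chinese text prompts using a fixed template."""
    return [template.format(name) for name in class_names]


def pick_label(class_names, probabilities):
    """Return the (class_name, probability) pair with the highest probability."""
    best = max(range(len(class_names)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]


class_names = ["猫", "狗", "鸟"]
prompts = build_prompts(class_names)
print(prompts)

# Placeholder probabilities standing in for probs[0].tolist() from the model.
example_probs = [0.7, 0.2, 0.1]
label, confidence = pick_label(class_names, example_probs)
print(label, confidence)
```

The list returned by `build_prompts` can be passed as the `text` argument of the processor in the snippet above.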

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Batch size: [More Information Needed]
  • Learning rate: [More Information Needed]
  • Epochs: [More Information Needed]
  • Optimizer: AdamW

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

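  • Top-k accuracy: evaluates zero-shot image classification performance
  • Recall@k: evaluates image retrieval and text retrieval performance
  • mAP (mean Average Precision): evaluates the overall performance of a retrieval system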
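The retrieval metrics listed above can be sketched for a single query as follows. This assumes one common definition of each metric (Recall@k as the fraction of relevant items found in the top k, and average precision as the mean of the precision values at each relevant hit); the evaluation protocol actually used for this model is not stated, and the item IDs are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant item occurs."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


# Hypothetical ranking for one query, best match first.
ranked = ["img3", "img1", "img7", "img2"]
relevant = {"img1", "img2"}

print(recall_at_k(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
```

mAP is then the mean of `average_precision` over all queries in the test set.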

Results

[More Information Needed]

Summary

[More Information Needed]

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

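This model uses the CLIP architecture, which has two main components:

  • Image encoder: based on the Vision Transformer (ViT) architecture; converts an image into a fixed-dimensional feature vector
  • Text encoder: based on the Transformer architecture; converts Chinese text into a fixed-dimensional feature vector

The model is trained with a contrastive loss that maximizes the similarity of matched image-text pairs and minimizes the similarity of mismatched pairs.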
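A contrastive objective of this kind is often written as a symmetric cross-entropy over the image-text similarity matrix, where entry (i, j) scores image i against text j and the matched pairs sit on the diagonal. The sketch below shows that general form in plain Python; it is not taken from this model's training code, and the temperature of 0.07 and the toy similarity values are illustrative assumptions.

```python
import math


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(similarity, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy losses."""
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy similarity matrices: rows are images, columns are texts.
aligned = [[1.0, 0.1], [0.2, 1.0]]      # matched pairs score highest
misaligned = [[0.1, 1.0], [1.0, 0.2]]   # mismatched pairs score highest

print(clip_style_loss(aligned))
print(clip_style_loss(misaligned))
```

The loss is small when each image is most similar to its own text and large otherwise, which is what pushes matched pairs together and mismatched pairs apart during training.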

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

  • Framework: PyTorch
  • Libraries: transformers, datasets, torchvision
  • Training Platform: [More Information Needed]

Citation [optional]

BibTeX:

@misc{yuan2026clips,
  author = {Yougen Yuan},
  title = {CLIPS: Chinese Contrastive Language-Image Pre-training Model},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Yougen/clips}}
}

APA:

Yuan, Y. (2026). CLIPS: Chinese Contrastive Language-Image Pre-training Model. Hugging Face. https://huggingface.co/Yougen/clips

Glossary [optional]

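  • CLIP: Contrastive Language-Image Pre-training, a method for learning cross-modal representations through contrastive learning
  • Zero-shot learning: using a pre-trained model to perform a task directly, without training on a task-specific dataset
  • Cross-modal retrieval: retrieving information across different modalities, such as text and images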

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

Yougen Yuan

Model Card Contact

[More Information Needed]
