constructai/taylor-m1


This model is a fine-tuned version of a custom BERT-like encoder (hidden_size=384, 6 layers), trained on the MS MARCO passage ranking dataset.
It uses triplet hard negatives from sentence-transformers/msmarco-msmarco-distilbert-base-v3 (Apache-2.0 license). The base MS MARCO data is subject to the Microsoft Research License.

The model produces 384-dimensional embeddings, taken from the CLS token and optimized for cosine similarity.
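
To make the pooling and scoring concrete, here is a minimal sketch in plain Python. It is not the package's actual implementation: it assumes the encoder returns one vector per token with the CLS token in the first position, the helper names `cls_pool` and `cosine_similarity` are hypothetical, and the toy vectors have 3 dimensions instead of 384.

```python
import math

def cls_pool(token_vectors):
    """Return the first token's vector, assumed to be the CLS position."""
    return token_vectors[0]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy encoder outputs: one list of token vectors per text (3-dim instead of 384).
query_tokens = [[1.0, 0.0, 0.0], [0.2, 0.9, 0.1]]
doc_tokens = [[2.0, 0.0, 0.0], [0.5, 0.5, 0.5]]

query_vec = cls_pool(query_tokens)
doc_vec = cls_pool(doc_tokens)
score = cosine_similarity(query_vec, doc_vec)
print(score)
```

Because cosine similarity ignores vector length, the two CLS vectors above score as identical even though one is twice as long as the other.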


Training details

  • 22 417 920 parameters (~22.4M)

  • Vocabulary size: 30 522

  • Max sequence length: 128 tokens

  • Loss: MultipleNegativesSymmetricRankingLoss (InfoNCE)

  • Batch size: 128 effective (32 per step with gradient accumulation)

  • Learning rate: 2e-5

  • Data: ~500k triplets from MS MARCO
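
The loss named above can be illustrated with a small sketch. This is not the sentence-transformers implementation, only a plain-Python rendering of the symmetric InfoNCE idea: each query should score highest against its own passage, each passage against its own query, and the two cross-entropy terms are averaged. The function names and the `scale=20.0` factor are assumptions; the card does not state which scale was used in training.

```python
import math

def cross_entropy_diagonal(scores):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)

def symmetric_info_nce(sim, scale=20.0):
    """Average of the query->passage and passage->query InfoNCE losses.

    sim[i][j] is the cosine similarity of query i and passage j. Matching
    pairs sit on the diagonal; every other entry acts as an in-batch
    negative. scale=20.0 is an assumed value, not taken from the card.
    """
    scaled = [[scale * s for s in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    return 0.5 * (cross_entropy_diagonal(scaled) + cross_entropy_diagonal(transposed))

# Toy 2x2 similarity matrices: one well aligned, one with pairs swapped.
good = [[0.9, 0.1], [0.2, 0.8]]
bad = [[0.1, 0.9], [0.8, 0.2]]
loss_good = symmetric_info_nce(good)
loss_bad = symmetric_info_nce(bad)
print(loss_good, loss_bad)
```

The well-aligned matrix yields a much smaller loss than the swapped one, which is the signal that pulls matching query and passage embeddings together during fine-tuning.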


Evaluation

On a small test set, the model achieves:

  • Positive pair similarity: 0.58

  • Negative pair similarity: 0.14

  • Margin (positive minus negative): 0.44
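
For readers who want to run a similar check on their own pairs, the sketch below shows one way such numbers could be computed. It assumes the reported values are mean cosine similarities over positive and negative pairs, which the card does not state explicitly; `similarity_margin` is a hypothetical helper and the scores are made up for illustration.

```python
from statistics import mean

def similarity_margin(positive_scores, negative_scores):
    """Return (mean positive, mean negative, margin) for two score lists."""
    pos = mean(positive_scores)
    neg = mean(negative_scores)
    return pos, neg, pos - neg

# Made-up cosine scores, for illustration only (not the model's results).
pos_scores = [0.6, 0.5, 0.7]
neg_scores = [0.1, 0.2, 0.0]
pos, neg, margin = similarity_margin(pos_scores, neg_scores)
print(round(pos, 2), round(neg, 2), round(margin, 2))
```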


Usage

Option 1: Via custom Python package (recommended)

Install the package directly from GitHub:

pip install git+https://github.com/PSYCHOxSPEED/constructai-taylor-model

Then load the model and get embeddings:

from taylor_model import load_taylor_model, embed_texts
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer, _ = load_taylor_model("constructai/taylor-m1", device=device)

texts = ["What is a neural network?", "How to make pizza at home?"]
embeddings = embed_texts(model, tokenizer, texts, device=device)
print(embeddings.shape)  # (2, 384)

Compute similarity between queries and documents:

from taylor_model import load_taylor_model, embed_texts
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer, _ = load_taylor_model("constructai/taylor-m1", device=device)

queries = [
    "What is a neural network?",
    "How to make pizza at home?"
]
documents = [
    "A neural network is a computing system inspired by biological neural networks.",
    "For pizza you need dough, tomato sauce, mozzarella cheese and toppings."
]

q_emb = embed_texts(model, tokenizer, queries, device=device)
d_emb = embed_texts(model, tokenizer, documents, device=device)

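# Dot product of every query with every document (rows: queries, columns: documents).
# This equals cosine similarity only if embed_texts returns L2-normalized vectors.
similarities = q_emb @ d_emb.T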
print("Similarity matrix:")
print(similarities)
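
Once the similarity matrix is available, a typical next step is to rank documents for each query. The sketch below is one way to do that on plain nested lists; `rank_documents` is a hypothetical helper, not part of the taylor_model package, and the scores are made up. If `similarities` is a NumPy array or a PyTorch tensor, calling `.tolist()` on it gives the nested-list form used here.

```python
def rank_documents(similarity_rows, top_k=1):
    """For each query row, return (document_index, score) pairs, best first."""
    ranked = []
    for row in similarity_rows:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:top_k]])
    return ranked

# Made-up scores shaped like the 2x2 matrix above (rows: queries, columns: documents).
example_scores = [[0.71, 0.05], [0.08, 0.64]]
best = rank_documents(example_scores, top_k=1)
print(best)
```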

Requirements:

  • Python ≥ 3.9
  • transformers ≥ 4.30.0
  • torch ≥ 2.0.0
  • huggingface_hub ≥ 0.20.0
  • numpy

Model details

This model was fully created by me (Construct AI). I designed the architecture, trained a custom WordPiece tokenizer, pre‑trained the model with MLM, and fine‑tuned it on the MS MARCO passage ranking dataset for semantic search. No parts of this model have been taken from other pre‑trained checkpoints – it is built from scratch.
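
As a sanity check on the architecture, the parameter count can be estimated from the figures given in this card (hidden size 384, 6 layers, vocabulary of 30 522, maximum sequence length 128). The sketch below assumes a standard BERT-style layout that the card does not document: a feed-forward size of 1536 (4 x hidden), 2 token-type embeddings, one position embedding per sequence position, and no pooler head. `bert_like_param_count` is a hypothetical helper.

```python
def bert_like_param_count(vocab_size, hidden, layers, max_positions,
                          intermediate, type_vocab=2):
    """Count weights and biases of a BERT-style encoder without a pooler head."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer

total = bert_like_param_count(
    vocab_size=30522, hidden=384, layers=6, max_positions=128,
    intermediate=1536,  # assumed: 4 x hidden, not stated in the card
)
print(total)  # prints 22417920
```

Under these assumptions the estimate matches the parameter count listed under Training details. Since the actual layer layout is not documented here, the match should be read as suggestive only.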


License

This model is released under the Apache 2.0 License.


I apologize if this model does not show the best quality or if you are unhappy with its maximum sequence length.

This is my first custom model, so I tried not to take on everything at once.
