CodeGemma 7B SecureCode

Security-specialized code model fine-tuned on the SecureCode dataset

Dataset | Paper (arXiv:2512.18542) | Model Collection | perfecXion.ai

What This Model Does

This model generates secure code when developers ask about building features. Instead of producing vulnerable implementations (like 45% of AI-generated code does), it:

Identifies the security risks in common coding patterns
Provides vulnerable and secure implementations side by side
Explains how attackers would exploit the vulnerability
Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening

The model was fine-tuned on 2,185 security training examples covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).

Model Details


Base Model	CodeGemma 7B IT
Parameters	7B
Architecture	Gemma
Tier	Tier 2: Mid-size Code Specialist
Method	QLoRA (4-bit NormalFloat quantization)
LoRA Rank	16 (alpha=32)
Target Modules	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` (7 modules)
Training Data	scthornton/securecode (2,185 examples)
Hardware	NVIDIA A100 40GB

Google's code-specialized Gemma variant. Strong instruction following with efficient architecture.

Quick Start

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load with 4-bit quantization (matches training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "google/codegemma-7b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("scthornton/codegemma-7b-securecode")
model = PeftModel.from_pretrained(base_model, "scthornton/codegemma-7b-securecode")

# Ask a security-relevant coding question
messages = [
    {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Dataset

Trained on the full SecureCode unified dataset:

2,185 total examples (1,435 web security + 750 AI/ML security)
20 vulnerability categories across OWASP Top 10 2021 and OWASP LLM Top 10 2025
12+ programming languages and 49+ frameworks
4-turn conversational structure: feature request, vulnerable/secure implementations, advanced probing, operational guidance
100% incident grounding: every example tied to real CVEs, vendor advisories, or published attack research

Hyperparameters

Parameter	Value
LoRA rank	16
LoRA alpha	32
LoRA dropout	0.05
Target modules	7 linear layers
Quantization	4-bit NormalFloat (NF4)
Learning rate	2e-4
LR scheduler	Cosine with 100-step warmup
Epochs	3
Per-device batch size	2
Gradient accumulation	8x
Effective batch size	16
Max sequence length	4096 tokens
Optimizer	paged_adamw_8bit
Precision	bf16

Notes: Requires trust_remote_code=True. Extended 4096-token context for full security conversations.

Security Coverage

Web Security (1,435 examples)

OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.

Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.

AI/ML Security (750 examples)

OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.

Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.

SecureCode Model Collection

This model is part of the SecureCode collection of 8 security-specialized models:

Model	Base	Size	Tier	HuggingFace
Llama 3.2 SecureCode	meta-llama/Llama-3.2-3B-Instruct	3B	Accessible	`llama-3.2-3b-securecode`
Qwen2.5 Coder SecureCode	Qwen/Qwen2.5-Coder-7B-Instruct	7B	Mid-size	`qwen2.5-coder-7b-securecode`
DeepSeek Coder SecureCode	deepseek-ai/deepseek-coder-6.7b-instruct	6.7B	Mid-size	`deepseek-coder-6.7b-securecode`
CodeGemma SecureCode	google/codegemma-7b-it	7B	Mid-size	`codegemma-7b-securecode`
CodeLlama SecureCode	codellama/CodeLlama-13b-Instruct-hf	13B	Large	`codellama-13b-securecode`
Qwen2.5 Coder 14B SecureCode	Qwen/Qwen2.5-Coder-14B-Instruct	14B	Large	`qwen2.5-coder-14b-securecode`
StarCoder2 SecureCode	bigcode/starcoder2-15b-instruct-v0.1	15B	Large	`starcoder2-15b-securecode`
Granite 20B Code SecureCode	ibm-granite/granite-20b-code-instruct-8k	20B	XL	`granite-20b-code-securecode`

Choose based on your deployment constraints: 3B for edge/mobile, 7B for general use, 13B-15B for deeper reasoning, 20B for maximum capability.

SecureCode Dataset Family

Dataset	Examples	Focus	Link
SecureCode	2,185	Unified (web + AI/ML)	scthornton/securecode
SecureCode Web	1,435	Web security (OWASP Top 10 2021)	scthornton/securecode-web
SecureCode AI/ML	750	AI/ML security (OWASP LLM Top 10 2025)	scthornton/securecode-aiml

Intended Use

Use this model for:

Training AI coding assistants to write secure code
Security education and training
Vulnerability research and secure code review
Building security-aware development tools

Do not use this model for:

Offensive exploitation or automated attack generation
Circumventing security controls
Any activity that violates the base model's license

Citation

@misc{thornton2026securecode,
  title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode},
  note={arXiv:2512.18542}
}

License

This model is released under the gemma license (inherited from the base model). The training dataset (SecureCode) is licensed under CC BY-NC-SA 4.0.

Downloads last month: 8

Model tree for scthornton/codegemma-7b-securecode

Base model

google/codegemma-7b-it

Adapter

(6)

this model

Dataset used to train scthornton/codegemma-7b-securecode

Collection including scthornton/codegemma-7b-securecode

SecureCode

Collection

11 items • Updated Feb 11 • 3

Paper for scthornton/codegemma-7b-securecode

SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models

Paper • 2512.18542 • Published Dec 20, 2025 • 4

scthornton
/

codegemma-7b-securecode