---
library_name: transformers
tags:
- cybersecurity
- mpnet
- classification
- fine-tuned
language:
- en
base_model:
- sentence-transformers/all-mpnet-base-v2
---

# AttackGroup-MPNET - Model Card for MPNet Cybersecurity Classifier

This is a fine-tuned MPNet model specialized for classifying cybersecurity threat groups based on textual descriptions of their tactics and techniques.

## Model Details

### Model Description

The model takes a textual description of tactics, techniques, and procedures (TTPs) and assigns it to one of the threat groups it was trained on.

- **Developed by:** Dženan Hamzić
- **Model type:** Transformer-based classification model (MPNet)
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** microsoft/mpnet-base (with intermediate MLM fine-tuning)

### Model Sources

- **Base Model:** [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base)

## Uses

### Direct Use

This model classifies textual cybersecurity descriptions into known cybersecurity threat groups.

### Downstream Use

The model can be integrated into Cyber Threat Intelligence platforms, SOC incident analysis tools, and automated threat detection systems.

### Out-of-Scope Use

- General language tasks and any other tasks outside the cybersecurity domain

## Bias, Risks, and Limitations

This model specializes in cybersecurity contexts. Predictions for unrelated contexts may be inaccurate.

### Recommendations

Always have cybersecurity analysts verify predictions before using them in critical decision-making scenarios.

## How to Get Started with the Model (Classification)

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the mapping from class index to group ID
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)

with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned MPNet classifier
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "selfconstruct3d/AttackGroup-MPNET",
    num_labels=len(label_to_groupid)
).to(device)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")

def predict_group(sentence):
    """Return the group ID predicted for a single sentence."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).cpu().item()

    # JSON object keys are strings, so the integer label is converted first
    predicted_groupid = label_to_groupid[str(predicted_label)]
    return predicted_groupid

# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
predicted_class = predict_group(sentence)
print(f"Predicted GroupID: {predicted_class}")
```

Example output: `Predicted GroupID: G0001` (https://attack.mitre.org/groups/G0001/)

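`predict_group` returns only the single highest-scoring group. Because predictions are meant to be reviewed by analysts, it may help to show several candidates with their softmax probabilities. The following is a minimal sketch in plain Python, assuming the logits have been converted to a list of floats (for example with `logits[0].tolist()`). The function `top_k_groups`, the sample logits, and the sample mapping are hypothetical and not part of the released code.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_groups(logits, label_to_groupid, k=3):
    """Return the k most probable (group_id, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keys are strings because the mapping is loaded from JSON.
    return [(label_to_groupid[str(i)], probs[i]) for i in ranked[:k]]

# Hypothetical logits and mapping, for illustration only.
example_mapping = {"0": "G0001", "1": "G0002", "2": "G0003"}
example_logits = [2.0, 0.5, 1.0]
for group_id, prob in top_k_groups(example_logits, example_mapping, k=2):
    print(f"{group_id}: {prob:.3f}")
```

A low top probability, or a small gap between the first two candidates, is a reasonable signal to route a prediction to manual review.
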

## How to Get Started with the Model (Embeddings)

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The label mapping is only needed here to size the classification head
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)

with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned classification model and its tokenizer
model_name = "selfconstruct3d/AttackGroup-MPNET"
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label_to_groupid)
).to(device)

def get_embedding(sentence):
    """Return the first-token embedding of a sentence as a 1-D NumPy array."""
    classifier_model.eval()

    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        # Call the MPNet encoder directly, bypassing the classification head
        outputs = classifier_model.mpnet(input_ids=input_ids, attention_mask=attention_mask)
        # Hidden state of the first token, used as the sentence embedding
        cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()

    return cls_embedding

# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
embedding = get_embedding(sentence)
print("Embedding shape:", embedding.shape)
print("Embedding values:", embedding)
```

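Embeddings such as the one returned by `get_embedding` are typically compared with cosine similarity. The sketch below shows that comparison in plain Python on toy 3-dimensional vectors. The helper `cosine_similarity` and the sample vectors are illustrative assumptions, not part of the released code, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real sentence embeddings.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a
c = [-3.0, 0.0, 1.0]   # orthogonal to a
print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

The function accepts any two sequences of numbers, so two arrays produced by `get_embedding` can be passed in directly.
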

## Training Details

### Training Data

To be announced.

### Training Procedure

- Fine-tuned from: MLM fine-tuned MPNet ("mpnet_mlm_cyber_finetuned-v2")
- Epochs: 32
- Learning rate: 5e-6
- Batch size: 16

## Evaluation

### Testing Data, Factors & Metrics

- **Testing Data:** Stratified sample from original dataset.
- **Metrics:** Accuracy, Weighted F1 Score

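For readers unfamiliar with the two metrics, the sketch below computes accuracy and weighted F1 in plain Python. Weighted F1 averages the per-class F1 scores, weighting each class by its number of true samples. The function names and toy labels are illustrative assumptions and do not reproduce the evaluation script used for this model.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true or y_pred."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's share of y_true."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1[label] * count for label, count in support.items()) / len(y_true)

# Toy labels, not taken from the model's test set.
y_true = ["G0001", "G0001", "G0001", "G0002"]
y_pred = ["G0001", "G0001", "G0002", "G0002"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

On imbalanced data the two numbers can diverge, which is why reporting both is useful.
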

### Results

| Metric                         | Value  |
|--------------------------------|--------|
| Classification Accuracy (Test) | 0.9564 |
| Weighted F1 Score (Test)       | 0.9577 |

## Evaluation Results

| Model                 | Accuracy | F1 Macro  | F1 Weighted | Embedding Variability |
|-----------------------|----------|-----------|-------------|-----------------------|
| **AttackGroup-MPNET** | **0.85** | **0.759** | **0.847**   | 0.234                 |
| GTE Large             | 0.66     | 0.571     | 0.667       | 0.266                 |
| E5 Large v2           | 0.64     | 0.541     | 0.650       | 0.355                 |
| Original MPNet        | 0.63     | 0.534     | 0.619       | 0.092                 |
| BGE Large             | 0.53     | 0.418     | 0.519       | 0.366                 |
| SupSimCSE             | 0.50     | 0.373     | 0.479       | 0.227                 |
| MLM Fine-tuned MPNet  | 0.44     | 0.272     | 0.411       | 0.125                 |
| SecBERT               | 0.41     | 0.315     | 0.410       | 0.591                 |
| SecureBERT_Plus       | 0.36     | 0.252     | 0.349       | 0.267                 |
| CySecBERT             | 0.34     | 0.235     | 0.323       | 0.229                 |
| ATTACK-BERT           | 0.33     | 0.240     | 0.316       | 0.096                 |
| Secure_BERT           | 0.00     | 0.000     | 0.000       | 0.007                 |
| CyBERT                | 0.00     | 0.000     | 0.000       | 0.015                 |

| Model                 | Similarity Search Recall@5 | Few-shot Accuracy | In-dist Similarity | OOD Similarity | Robustness Similarity |
|-----------------------|----------------------------|-------------------|--------------------|----------------|-----------------------|
| **AttackGroup-MPNET** | **0.934**                  | **0.857**         | 0.235              | 0.017          | 0.948                 |
| Original MPNet        | 0.786                      | 0.643             | 0.217              | -0.004         | 0.941                 |
| E5 Large v2           | 0.778                      | 0.679             | 0.727              | 0.013          | 0.977                 |
| GTE Large             | 0.746                      | 0.786             | 0.845              | 0.002          | 0.984                 |
| BGE Large             | 0.632                      | 0.750             | 0.533              | -0.006         | 0.970                 |
| SupSimCSE             | 0.616                      | 0.571             | 0.683              | -0.015         | 0.978                 |
| SecBERT               | 0.468                      | 0.429             | 0.586              | -0.001         | 0.970                 |
| CyBERT                | 0.452                      | 0.250             | 1.000              | -0.001         | 1.000                 |
| ATTACK-BERT           | 0.362                      | 0.571             | 0.157              | -0.005         | 0.950                 |
| CySecBERT             | 0.424                      | 0.500             | 0.734              | -0.015         | 0.954                 |
| Secure_BERT           | 0.424                      | 0.250             | 0.990              | 0.050          | 0.998                 |
| SecureBERT_Plus       | 0.406                      | 0.464             | 0.981              | 0.040          | 0.998                 |

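The card does not state how "Similarity Search Recall@5" was computed. One common definition, assumed here, counts a query as a hit when at least one of its k nearest neighbours carries the query's true label. The sketch below implements that definition in plain Python on toy data. The function `recall_at_k` and the sample lists are hypothetical and may differ from the evaluation actually used.

```python
def recall_at_k(retrieved_labels, true_labels, k=5):
    """Fraction of queries whose top-k retrieved labels contain the true label.

    retrieved_labels[i] lists the labels of the neighbours returned for
    query i, ordered from most to least similar.
    """
    if len(retrieved_labels) != len(true_labels):
        raise ValueError("one ranked list is required per query")
    if not true_labels:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ranked, truth in zip(retrieved_labels, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Toy retrieval results, for illustration only.
retrieved = [
    ["G0001", "G0002", "G0001"],   # query 0: true label found at rank 1
    ["G0003", "G0003", "G0002"],   # query 1: true label found at rank 3
    ["G0001", "G0001", "G0001"],   # query 2: true label never retrieved
]
truth = ["G0001", "G0002", "G0003"]
print(recall_at_k(retrieved, truth, k=2))
print(recall_at_k(retrieved, truth, k=3))
```
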
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

- **Hardware Type:** [To be filled by user]
- **Hours used:** [To be filled by user]
- **Cloud Provider:** [To be filled by user]
- **Compute Region:** [To be filled by user]
- **Carbon Emitted:** [To be filled by user]

## Technical Specifications

### Model Architecture

- MPNet architecture with classification head (768 -> 512 -> num_labels)
- The last 10 transformer layers were fine-tuned

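Fine-tuning only the last 10 transformer layers can be expressed as a filter over parameter names. The sketch below is one possible way to write that filter. It assumes the 12-layer encoder of `mpnet-base` and parameter names in the style of Hugging Face's MPNet classes (for example `mpnet.encoder.layer.3...` and `classifier...`). Whether the embeddings were frozen during the original training is not stated in this card, so freezing them here is also an assumption. The helper `is_trainable` is hypothetical.

```python
import re

# Assumed parameter naming, following Hugging Face's MPNet classes
# (e.g. "mpnet.encoder.layer.3.attention.attn.q.weight").
_LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")

def is_trainable(param_name, num_layers=12, last_n=10):
    """Decide whether a parameter should receive gradient updates.

    Encoder layers are trainable only if they are among the last `last_n`.
    Parameters outside the numbered layers are trainable only if they
    belong to the classification head (an assumption of this sketch).
    """
    match = _LAYER_PATTERN.search(param_name)
    if match:
        return int(match.group(1)) >= num_layers - last_n
    return param_name.startswith("classifier.")

names = [
    "mpnet.embeddings.word_embeddings.weight",
    "mpnet.encoder.layer.1.attention.attn.q.weight",
    "mpnet.encoder.layer.2.attention.attn.q.weight",
    "mpnet.encoder.layer.11.output.dense.bias",
    "classifier.dense.weight",
]
for name in names:
    print(name, is_trainable(name))
```

With a loaded PyTorch model, the same decision could be applied by setting `param.requires_grad = is_trainable(name)` while iterating over `model.named_parameters()`.
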

## Model Card Authors

- Dženan Hamzić

## Model Card Contact

- https://www.linkedin.com/in/dzenan-hamzic/

## Licence

This model is licensed for non-commercial use only (CC BY-NC 4.0).
For commercial inquiries, please contact dzenan.hamzic@ait.ac.at.