---
license: mit
tags:
- codellama
- linux
- bugfix
- lora
- qlora
- git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
model-index:
- name: CodeLLaMA-Linux-BugFix
  results:
  - task:
      type: text-generation
      name: Bug-fix Patch Generation
    dataset:
      type: custom
      name: Linux Kernel Bugfix Commits
      config: linux-bugfix-prompt-completion
      split: test
    metrics:
    - type: bleu
      value: 33.87
      name: BLEU
    - type: rouge1
      value: 0.4355
      name: ROUGE-1 F1
    - type: rouge2
      value: 0.3457
      name: ROUGE-2 F1
    - type: rougeL
      value: 0.3612
      name: ROUGE-L F1
---

# CodeLLaMA-Linux-BugFix

A fine-tuned version of `CodeLLaMA-7B-Instruct`, specialized for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.

---

## Overview

This project targets automated Linux kernel bug fixing by:

- **Mining real commit data** from the kernel Git history
- **Training a specialized QLoRA model** on diff-style fixes
- **Generating Git patches** in response to bug-prone code
- **Evaluating results** using BLEU, ROUGE, and human inspection

On the held-out test set the model reaches a BLEU of 33.87 against reference patches, making it a useful aid for automated code review and program-repair research.

---

## Performance Results

### Evaluation Metrics

**BLEU Score**: 33.87

**ROUGE Scores**:
- **ROUGE-1**: P=0.3775, R=0.7306, F1=0.4355
- **ROUGE-2**: P=0.2898, R=0.6096, F1=0.3457
- **ROUGE-L**: P=0.3023, R=0.6333, F1=0.3612

These results indicate that the model can:
- Generate syntactically correct Git diff patches
- Maintain substantial token overlap with reference fixes
- Produce code changes that address the underlying bugs
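The ROUGE scores above measure token overlap between generated and reference diffs. A minimal, whitespace-tokenized sketch of ROUGE-1 (illustrative toy strings only; the actual evaluation script likely uses a library implementation with its own tokenizer):

```python
from collections import Counter

def rouge1(candidate: str, reference: str):
    """ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())        # matched unigrams
    p = overlap / max(sum(cand.values()), 1)    # precision
    r = overlap / max(sum(ref.values()), 1)     # recall
    f1 = 2 * p * r / (p + r) if overlap else 0.0
    return p, r, f1

# Toy example: a generated patch line vs. the reference fix
p, r, f1 = rouge1("if (!file->filter) return;",
                  "if (!file || !file->filter)\n\treturn;")
```

Precision is penalized by extra tokens in the candidate, recall by reference tokens the candidate misses; F1 balances the two, which is why the table reports all three.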

---

## Model Configuration

- **Base model**: `CodeLLaMA-7B-Instruct`
- **Fine-tuning method**: QLoRA with 4-bit quantization
- **Training setup**:
  - LoRA r=64, alpha=16, dropout=0.1
  - Batch size: 64, LR: 2e-4, epochs: 3
  - Mixed precision (bfloat16), gradient checkpointing
- **Hardware**: optimized for NVIDIA H200 GPUs
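At r=64 the LoRA adapter is a small fraction of the base model. A back-of-the-envelope count for one adapted projection matrix (hidden size 4096 is assumed for the 7B model; which modules are actually adapted is set in the training script):

```python
# LoRA replaces a full d_out x d_in weight update with two low-rank
# factors: A (r x d_in) and B (d_out x r).
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * d_in + d_out * r

d = 4096                        # assumed hidden size of the 7B model
full = d * d                    # full update for one square projection
lora = lora_params(d, d, 64)    # r=64 adapter for the same projection

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Per square projection the adapter trains about 3% of the parameters a full update would, which is what makes single-GPU fine-tuning of a 7B model practical.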

---

## Training Progress

The model was trained for 1000 steps with the following key metrics:

### Training Results

- **Final loss**: ~0.3335 (converged)
- **Final learning rate**: ~2.083e-6
- **Training steps**: 1000
- **Convergence**: stable loss plateau

### Training Curves

![Training Loss](training_loss.png)
*Training loss over 1000 steps, converging around 0.3335*

![Learning Rate Schedule](learning_rate_schedule.png)
*Learning-rate decay schedule ending at ~2.083e-6*

---

## Dataset

A custom dataset extracted from the Linux kernel Git history.

### Filtering Criteria

Bug-fix commits whose messages contain keywords such as:
`fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`, etc.
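A minimal sketch of such a keyword filter (the keyword list mirrors the criteria above; the actual extraction script may match differently):

```python
import re

# Keywords that mark a commit message as a likely bug fix
BUGFIX_KEYWORDS = re.compile(
    r"\b(fix|bug|crash|memory|null|panic|overflow|race|corruption)\b",
    re.IGNORECASE,
)

def is_bugfix_commit(message: str) -> bool:
    """Heuristic: does the commit message mention a bug-fix keyword?"""
    return bool(BUGFIX_KEYWORDS.search(message))

is_bugfix_commit("net: fix NULL pointer dereference in rx path")  # matches
is_bugfix_commit("docs: update maintainers list")                 # no match
```

Word boundaries (`\b`) keep the filter from firing on substrings like "prefix", though any keyword heuristic will still admit some non-fix commits.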

### Structure

- Language: C (`.c`, `.h`)
- Context: 10 lines before/after the change
- Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

- **File**: `training_data_100k.jsonl` (100,000 samples)
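Each JSONL line holds one record in the schema above. A sketch of a loader that turns records into prompt/target pairs (the prompt template mirrors the inference example below; the training script may format prompts differently):

```python
import json

def load_samples(path: str):
    """Yield (prompt, target_diff) pairs from the JSONL dataset."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            prompt = (
                "Given the following original C code:\n"
                f"{rec['input']['original code']}\n\n"
                f"Instruction: {rec['input']['instruction']}\n\n"
                "Return the diff that fixes it:\n"
            )
            yield prompt, rec["output"]["diff codes"]
```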

---

## Quick Start

### Prerequisites

- Python 3.8+
- CUDA-compatible GPU (recommended)
- 16GB+ RAM
- 50GB+ disk space

### Install dependencies

```bash
pip install -r requirements.txt
```

### 1. Build the Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes_parallel.py
python format_for_training.py
```

### 2. Fine-tune the Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Run Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

### 4. Use the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then apply the fine-tuned LoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "train/output/qlora-codellama-bugfix")
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

# Generate a bug fix
prompt = """
Given the following original C code:
if (!file->filter)
    return;

Instruction: Fix the null pointer dereference

Return the diff that fixes it:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
fix = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fix)
```

---

## Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/
│   ├── extract_linux_bugfixes_parallel.py    # Parallel extraction of bug fixes
│   ├── format_for_training.py                # Format data for training
│   └── build_dataset.py                      # Main dataset builder
├── dataset/
│   ├── training_data_100k.jsonl              # 100K training samples
│   └── training_data_prompt_completion.jsonl # Formatted training data
├── train/
│   ├── train_codellama_qlora_linux_bugfix.py # Main training script
│   ├── train_codellama_qlora_simple.py       # Simplified training
│   ├── download_codellama_model.py           # Model download utility
│   └── output/
│       └── qlora-codellama-bugfix/           # Trained model checkpoints
├── evaluate/
│   ├── evaluate_linux_bugfix_model.py        # Evaluation script
│   ├── test_samples.jsonl                    # Test dataset
│   └── output/                               # Evaluation results
│       ├── eval_results.csv                  # Detailed results
│       └── eval_results.json                 # JSON format results
├── requirements.txt                          # Python dependencies
├── README.md                                 # This file
└── PROJECT_STRUCTURE.md                      # Detailed project overview
```

---

## Features

- **Efficient fine-tuning**: QLoRA + 4-bit quantization for large memory savings
- **Real-world commits**: mined from actual Linux kernel development history
- **Context-aware**: extracts code context around the buggy lines
- **Output-ready**: generates valid Git-style diffs
- **Strong performance**: BLEU 33.87 with solid ROUGE scores
- **Production-oriented**: designed with real-world deployment in mind
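The target output format is a standard unified diff. Python's standard library can illustrate the shape the model is trained to emit (the file name and two-line snippet here are hypothetical):

```python
import difflib

buggy = ["if (!file->filter)\n", "\treturn;\n"]
fixed = ["if (!file || !file->filter)\n", "\treturn;\n"]

# Produce a Git-style unified diff between the buggy and fixed snippets
patch = "".join(difflib.unified_diff(
    buggy, fixed,
    fromfile="a/kernel/trace/trace_events.c",   # hypothetical path
    tofile="b/kernel/trace/trace_events.c",
))
print(patch)
```

The `---`/`+++` headers, `@@` hunk markers, and `-`/`+` line prefixes are exactly what both the training targets ("diff codes") and the model's generations contain.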

---

## Evaluation Metrics

- **BLEU**: n-gram precision against reference diffs
- **ROUGE**: recall-oriented n-gram overlap with reference fix content
- **Human evaluation**: subjective patch-quality assessment

### Current Performance

- **BLEU**: 33.87 (strong for code-generation tasks)
- **ROUGE-1 F1**: 0.4355 (good unigram overlap)
- **ROUGE-2 F1**: 0.3457 (reasonable bigram matching)
- **ROUGE-L F1**: 0.3612 (good longest-common-subsequence overlap)

---

## Use Cases

- **Automated kernel bug fixing**: generate candidate fixes for common kernel bugs
- **Code review assistance**: help reviewers spot potential issues
- **Teaching/debugging kernel code**: educational tool for kernel development
- **Automated program repair (APR)**: academic research applications
- **CI/CD integration**: automated testing and fixing in development pipelines

---

## Technical Highlights

### Memory & Speed Optimizations

- 4-bit quantization (NF4)
- Gradient checkpointing
- Mixed precision (bfloat16)
- Gradient accumulation
- LoRA parameter efficiency

### Training Efficiency

- **QLoRA**: cuts fine-tuning memory use by roughly 75%
- **4-bit quantization**: further memory reduction
- **Gradient checkpointing**: trades compute for memory
- **Mixed precision**: faster training with maintained accuracy
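The ~75% figure follows directly from the weight precision. A rough estimate for weight storage alone (activations, optimizer state, and quantization constants excluded; 7B parameters assumed):

```python
params = 7e9  # CodeLLaMA-7B parameter count (approximate)

bf16_gb = params * 2 / 2**30    # bfloat16: 2 bytes per weight
nf4_gb = params * 0.5 / 2**30   # NF4: 4 bits (0.5 bytes) per weight

saving = 1 - nf4_gb / bf16_gb
print(f"{bf16_gb:.1f} GB -> {nf4_gb:.1f} GB ({saving:.0%} smaller)")
```

Going from 16-bit to 4-bit weights is exactly a 4x (75%) reduction in weight memory, which is what lets the frozen base model fit on a single GPU while only the small LoRA adapter is trained in higher precision.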

---

## Advanced Usage

### Custom Training

```bash
# Train with custom parameters
python train_codellama_qlora_linux_bugfix.py \
  --learning_rate 1e-4 \
  --num_epochs 5 \
  --batch_size 32 \
  --lora_r 32 \
  --lora_alpha 16
```

### Evaluation on Custom Data

```bash
# Evaluate on your own test set
python evaluate_linux_bugfix_model.py \
  --test_file your_test_data.jsonl \
  --output_dir custom_eval_results
```

---

## Contributing

1. Fork this repo
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Guidelines

- Follow PEP 8 style guidelines
- Add tests for new features
- Update documentation for API changes
- Ensure all tests pass before submitting a PR

---

## License

MIT License; see the `LICENSE` file for details.

---

## Acknowledgments

- **Meta** for the CodeLLaMA base model
- **Hugging Face** for the Transformers and PEFT libraries
- **The Linux kernel community** for open access to commit data
- **Microsoft** for the LoRA technique
- **University of Washington** for the QLoRA research

---

## References

- [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
- [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
- [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
- [Automated Program Repair: A Survey](https://ieeexplore.ieee.org/document/8449519)

---

## Support

For questions, issues, or contributions:
- Open an issue on GitHub
- Check the project documentation
- Review the evaluation results in `evaluate/output/`

---

## Version History

- **v1.0.0**: Initial release with QLoRA training
- **v1.1.0**: Added parallel dataset extraction
- **v1.2.0**: Improved evaluation metrics and documentation
|