I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Enhancing AI Security: Fine-Tuning a Lightweight Model for Malicious Query Detection
In the quest to bolster AI security, I recently undertook the task of fine-tuning a small language model to reliably identify potentially harmful user inputs. My goal was to develop an efficient, lightweight model capable of discerning malicious prompts that could threaten the integrity of AI agents I work with.
Data Collection and Preparation
The foundation of this project involved compiling a robust dataset: over 4,000 malicious queries generated via GPT-4, paired with a similarly sized set of benign, harmless inputs. This balanced approach aimed to provide the model with clear examples of both classes to learn from effectively.
Initial Training Attempts and Challenges
My first approach involved applying supervised fine-tuning (SFT) directly to the base version of the small language model (SLM) using this dataset. Unfortunately, the outcome was far from what I anticipated—the model tended to classify every query as malicious, rendering it practically unusable.
Subsequently, I refined my strategy by fine-tuning a more specific model variant—Qwen-3 0.6B—and also invested additional effort into prompt engineering, providing clearer instructions during training. This led to a modest increase in accuracy, yet the model still faltered when encountering subtle cases. For example, harmless prompts mentioning “System prompt” occasionally got misclassified as malicious.
Incorporating Chain of Thought for Better Precision
Realizing that the model needed a deeper reasoning process, I decided to embrace Chain of Thought (CoT) prompting techniques. The idea was to have the model generate a sentence explaining its rationale before making a final judgment.
To facilitate this, I curated a new dataset where each malicious query was annotated with a brief explanation of why it was harmful. Fine-tuning on this enriched dataset yielded a breakthrough: the model now classifies queries with a high degree of accuracy, even in edge cases. This “aha” moment confirmed that encouraging the model to reason step-by-step significantly improved its reliability.
Deployment and Open Source Availability
I’m now planning to deploy this model as an intermediary filter between users and AI agents, enhancing overall security by preemptively flagging malicious inputs.
For those interested in experimenting with this setup, the complete model and code are openly available on Hugging Face. Simply copy and run the provided snippets to integrate the model into your workflow:
Post Comment