I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Optimizing AI Security: My Experience Fine-Tuning Qwen-3 0.6B for Malicious Prompt Detection
Embarking on the journey to enhance AI safety measures, I recently undertook the task of fine-tuning the Qwen-3 0.6B language model to effectively distinguish between legitimate user queries and potentially malicious prompts. Here’s an overview of my process, the challenges encountered, and key insights that could benefit others working on similar projects.
Building the Dataset
The foundation of my approach involved creating a comprehensive dataset comprising over 4,000 examples of malicious queries generated with GPT-4. To ensure balanced training, I also compiled an equivalent set of benign queries. This dataset aimed to teach the model to reliably differentiate harmful prompts from harmless ones.
Initial Fine-Tuning Attempts
My first attempt involved applying supervised fine-tuning (SFT) on the base language model directly with this dataset. Unfortunately, the resulting model was overly aggressive, flagging every query as malicious—rendering it unusable for practical purposes.
Refining the Approach
Next, I shifted to fine-tuning the Qwen-3 0.6B variant, integrating more nuanced prompt instructions to guide the model’s behavior. While this improved accuracy somewhat, the model still struggled with edge cases; for instance, benign prompts containing terms like “System prompt” would sometimes be flagged erroneously.
Introducing Chain of Thought Reasoning
Realizing that context and reasoning might be key, I experimented with enabling the model to produce a brief explanation for its decision—what’s known as Chain of Thought prompting. This involved retraining the model to generate a one-sentence rationale before classifying each query.
Enhanced Dataset and Training
To facilitate this, I expanded my dataset to include reasoning annotations for malicious examples and retrained the model accordingly. The results were striking: the model achieved significantly higher accuracy, reliably identifying threats without false positives on innocuous prompts.
Deployment and Resources
Encouraged by this success, I plan to deploy this fine-tuned model as a middleware layer, screening user inputs before they reach my AI agents. The final model and its code are openly available on Hugging Face, providing a straightforward starting point for others interested in similar security solutions.
You can explore the implementation and replicate the results here: GitHub Repository
Final thoughts
Fine-tuning language models for specific safety tasks requires iterative experimentation, especially when confronting edge cases
Post Comment