I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Enhancing Model Fine-Tuning: Insights from Developing a Malicious Query Classifier
In recent efforts to improve AI security, I dedicated time this weekend to fine-tuning the Qwen-3 0.6B language model. My goal was to develop a lightweight yet effective tool capable of discerning malicious prompts fed into my AI agents, ensuring a safer user experience.
Data Collection and Initial Attempts
My first step involved assembling a comprehensive dataset of over 4,000 malicious queries generated with GPT-4o. To maintain balance, I also compiled an equal number of benign queries. This dataset served as the foundation for training.
The initial approach, Supervised Fine-Tuning (SFT), involved training the model solely on this dataset. Unfortunately, the result was underwhelming—every input was flagged as malicious, rendering the model ineffective for real-world deployment.
Refining the Model with Prompt-Tuning and Edge Case Handling
Next, I shifted to fine-tuning the Qwen-3 0.6B model itself, incorporating prompt-tuning techniques to better guide its responses. While this improved accuracy somewhat, the model still struggled with nuanced cases. For example, benign prompts containing keywords like “System prompt” would incorrectly be flagged as malicious, highlighting the need for more sophisticated reasoning.
Introducing Chain of Thought for Better Discrimination
To address this, I experimented with embedding a reasoning process within the model’s predictions. I designed a dataset where each malicious query was accompanied by a brief explanation—essentially asking the model to “think aloud” before concluding. By fine-tuning the model with these reasoning annotations, I observed a significant accuracy boost, especially in tricky edge cases.
Results and Deployment Plans
This iterative process culminated in a model that reliably distinguishes malicious queries from harmless ones. I’m pleased with the performance and plan to integrate this as a middleware layer between users and my AI agents, enhancing security and trustworthiness.
Open Source Access
For those interested in experimenting or building upon this work, the final model and code are openly available on Hugging Face. You can set it up effortlessly by copying and pasting the provided implementation snippet:
https://github.com/sarthakrastogi/rival
By sharing these insights, I hope others can replicate and adapt this approach to improve AI safety in their own projects.
Post Comment