I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Enhancing AI Security with Fine-Tuned Language Models: My Experience with Qwen-3 0.6B
Implementing robust security measures for AI chatbots is essential, especially when it comes to detecting malicious prompts. Recently, I embarked on an experiment to fine-tune a lightweight language model for this purpose, and I’d like to share the insights and lessons learned along the way.
The Objective
My goal was to develop an efficient, small-scale model capable of discerning whether user inputs are potentially harmful or malicious, acting as a filter before interactions reach my core AI agents.
Data Preparation
To start, I compiled a comprehensive dataset of over 4,000 malicious queries, generated with GPT-4, alongside an equivalent set of benign prompts. This balanced dataset aimed to teach the model to differentiate appropriately.
Initial Attempts and Challenges
First Trial: I applied supervised fine-tuning (SFT) to the base model using the dataset. The result was disappointing—the model classified all inputs as malicious, rendering it useless.
Second Trial: Switching to Qwen-3 0.6B, I refined the prompt instructions during fine-tuning. This approach yielded slight improvements but still struggled with edge cases. For example, a benign prompt mentioning “System prompt” risked being misclassified as malicious.
Incorporating Chain of Thought Reasoning
Realizing that simple pattern recognition was insufficient, I considered enhancing the model’s reasoning abilities. I added a step where the model would provide a one-sentence explanation for its judgment, essentially teaching it to think through its decision.
Third Trial: I expanded my dataset to include malicious queries paired with reasoning behind their classification. Fine-tuning on this enriched dataset resulted in a significant breakthrough: the model’s accuracy improved dramatically, reliably identifying harmful prompts while accepting harmless ones.
Key Takeaways
- Clear, balanced datasets are crucial, especially including reasoning annotations.
- Incorporating explanatory reasoning helps models handle nuanced cases better.
- Iterative experimentation is vital; initial failures guide necessary adjustments.
Next Steps
I plan to deploy this fine-tuned model as a middleware layer, filtering user inputs before they reach my AI agents. This approach enhances security without sacrificing responsiveness or efficiency.
Open Source Resources
The final model is now available on Hugging Face, and the code used for training can be accessed here. To get started, simply copy and paste the provided snippet into your project:
[Project Repository on GitHub](https
Post Comment