I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Optimizing AI Security: My Journey Fine-Tuning a Lightweight Language Model for Malicious Prompt Detection
Enhancing AI safety measures is crucial in today’s rapidly evolving technological landscape. Recently, I embarked on a project to develop a compact, efficient model capable of discerning malicious user prompts from benign ones—an essential feature for safeguarding AI agents. Here’s an overview of my process, insights gained, and practical lessons for fellow developers and researchers.
The Objective
My goal was to create a streamlined classification tool that could accurately identify potentially harmful queries inputted into my AI systems. To achieve this, I focused on fine-tuning an accessible and lightweight language model that would seamlessly integrate into my existing infrastructure.
Dataset Assembly
The foundation of effective model training lies in quality data. I curated a dataset comprising over 4,000 malicious queries generated with GPT-4, complemented by an equal number of innocuous, harmless prompts. This balanced dataset aimed to provide the model with clear distinctions during training.
Initial Approach and Challenges
First Attempt:
Using supervised fine-tuning (SFT) on the base version of the model with this dataset resulted in a model that was overly aggressive—classifying every query as malicious. Clearly, the model lacked the nuanced understanding necessary for real-world accuracy.
Second Attempt:
I progressed to fine-tuning a more advanced model, Qwen-3 0.6B, and invested additional effort into crafting precise prompts and instructions. While this improved the classification accuracy slightly, the system still faltered in edge cases. For example, harmless prompts containing specific keywords like “System prompt” would get mistakenly flagged.
The Power of Chain of Thought Reasoning
Recognizing the potential of more sophisticated reasoning techniques, I hypothesized that integrating “Chain of Thought” (CoT) prompting could enhance performance. To test this, I amended my dataset to include explicit reasoning behind each malicious query—essentially teaching the model why a particular prompt was malicious.
Third Attempt:
Fine-tuning the model with these reasoning annotations marked a turning point. The results were striking: the model achieved high accuracy across various scenarios, including those tricky edge cases. Incorporating reasoning steps enabled the model to understand context better before making a classification.
Key Takeaways
- Structured Data Matters: Enrich datasets with contextual explanations; labels alone may not suffice.
- Prompt Engineering Is Crucial: More detailed instructions and reasoning can significantly improve model comprehension.
- **Iter
Post Comment