I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Optimizing AI Safety: Fine-Tuning a Small Language Model for Malicious Prompt Detection
In recent developments within AI security and safety, fine-tuning language models has proven to be a vital strategy. This week, I embarked on an effort to refine a lightweight language model—specifically, the Qwen-3 0.6B—to accurately identify potentially malicious user prompts directed at AI agents.
Building the Dataset
My first step involved assembling a comprehensive dataset comprising over 4,000 malicious prompts generated with GPT-4. To balance the training data, I paired these with an equally large collection of benign queries, ensuring the model could distinguish between harmful and harmless inputs effectively.
Initial Fine-Tuning Attempts
The initial attempt involved supervised fine-tuning (SFT) using the base version of the model on this dataset. Unfortunately, the results were disappointing: the model tended to classify every input as malicious, rendering it ineffective for practical use.
Subsequently, I moved to fine-tune Qwen-3-0.6B directly, incorporating prompts and instructions to guide the model’s understanding better. While this approach yielded marginal improvements in accuracy, the model still struggled with nuanced cases. For example, benign prompts containing specific keywords like “System prompt” were incorrectly flagged as malicious—highlighting areas for enhancement.
Incorporating Chain of Thought Reasoning
Recognizing the need for the model to reason more effectively, I hypothesized that integrating a Chain of Thought (CoT) approach might help. To test this, I created a new dataset that included not only malicious prompts but also annotated reasoning behind each classification. This meant instructing the model to generate a one-sentence explanation for its decision.
Final Results and Insights
Fine-tuning the model on this enriched dataset was an “aha” moment. The model now achieved high accuracy in distinguishing malicious prompts, even in complex edge cases. I am pleased with these results and plan to deploy this fine-tuned model as a middleware layer between users and my AI agents, enhancing safety and reliability.
Open Source and How to Use
The entire fine-tuning process and code are openly available on Hugging Face. You can access the repository here: GitHub Link. Simply copy the provided snippet to integrate the detection model into your own workflows.
This experience underscores the importance of iterative training, prompt engineering, and reasoning techniques in developing more robust AI safety
Post Comment