I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Optimizing AI Security: Fine-Tuning a Language Model for Malicious Query Detection
This past weekend, I embarked on a project to enhance the security capabilities of a language model by fine-tuning Qwen-3 0.6B—aiming to accurately identify potentially malicious user inputs in AI-driven applications. The goal was to develop a lightweight yet effective classifier that can serve as an intermediary filter, safeguarding my AI agents from harmful prompts.
Building the Dataset
To start, I compiled a comprehensive dataset comprising over 4,000 examples of malicious queries generated with GPT-4. Recognizing the importance of balanced data, I also curated an equivalent number of benign, harmless queries to ensure the model can distinguish between the two effectively.
First Attempt: Basic Fine-Tuning
Initially, I applied supervised fine-tuning (SFT) to the base version of the language model using this dataset. Unfortunately, the result was underwhelming—the model defaulted to labeling every input as malicious, rendering it unusable as a filter.
Second Attempt: Fine-Tuning the Full Model & Prompt Refinement
Undeterred, I shifted focus to fine-tuning the complete Qwen-3 0.6B model, combined with meticulous prompt instruction tuning. While this approach yielded a slight improvement in accuracy, it still struggled with nuanced cases. For example, harmless prompts mentioning specific keywords like “System prompt” were incorrectly flagged as malicious, highlighting the model’s difficulty with edge cases.
Incorporating Chain of Thought Reasoning
Realizing the potential of reasoning-based prompts, I decided to enhance the model’s decision-making process by introducing a chain of thought approach. I designed the dataset to include not only malicious queries but also accompanying explanations—a rationale behind each classification.
Third Attempt: Reasoning-Enhanced Fine-Tuning
With this enriched dataset, I conducted another round of fine-tuning. The impact was immediate and significant. The model now generates a one-sentence reasoning process before making its classification, greatly improving accuracy—especially in tricky edge cases.
Final Results & Next Steps
This iterative process culminated in a highly effective model capable of discerning malicious prompts with remarkable precision. I plan to implement this model as middleware between end-users and my AI agents, providing a robust line of defense against malicious inputs.
Open Source Availability
I’ve made the final model and the accompanying code publicly accessible on Hugging Face. You can quickly deploy it by copying a simple code snippet from my repository:
[Link to Git
Post Comment