I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Enhancing AI Security: My Journey in Fine-Tuning a Small Language Model for Malicious Query Detection
Developing reliable AI systems requires rigorous testing and precise filtering, especially when it comes to safeguarding against malicious input. Recently, I embarked on fine-tuning a compact language model to identify potentially harmful user prompts before they interact with my AI agents. Here’s a detailed account of the process, lessons learned, and the strategies that led to successful results.
Choosing the Right Model and Dataset
My goal was to create a lightweight, efficient classifier capable of discerning malicious queries. I selected Qwen-3 0.6B, a model known for its balance of performance and efficiency. To train it effectively, I compiled a dataset comprising over 4,000 malicious prompts generated with GPT-4, complemented by an equal number of benign queries to ensure balanced learning.
Initial Approach and Challenges
The first attempt involved applying supervised fine-tuning (SFT) on the base language model using the collected data. Unfortunately, this resulted in an overly aggressive classifier—it labeled every input as malicious, rendering it useless in practice. Clearly, more nuanced training was needed.
Refining the Fine-Tuning Process
Next, I shifted to fine-tuning the same model but incorporated prompt-based instruction tuning to guide its understanding better. Although this improved the model’s accuracy slightly, it still struggled with edge cases. For instance, harmless prompts containing keywords like “System prompt” sometimes got flagged erroneously.
This highlighted a common challenge: the model lacked deeper reasoning capabilities to distinguish subtle differences.
Introducing Chain of Thought Reasoning
To address this, I realized that integrating a simple reasoning process could substantially improve performance. I decided to train the model to produce a brief explanation behind each classification—essentially, a minimal chain of thought that justified its decision.
Improved Results and Final Outcome
For this, I expanded my dataset to include not just malicious prompts but also associated reasoning comments describing why each prompt was malicious. Fine-tuning the model on this enriched data yielded a breakthrough: the classifier now identified malicious and benign inputs with high accuracy, including tricky edge cases.
I’m pleased to report that this refined model operates reliably and will serve as an intermediary filter between users and my AI agents, enhancing security without sacrificing user experience.
Open-Source Availability
The culmination of this project is a lightweight, open-source model hosted on Hugging Face. The accompanying code snippet simplifies integration into your own applications:
[Link to
Post Comment