I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Optimizing AI Content Filtering: Fine-Tuning a Small Language Model for Malicious Query Detection
This past weekend, I embarked on an exciting project to enhance the security and robustness of AI interactions by fine-tuning a lightweight language model. My goal was to develop an efficient model capable of discerning malicious prompts from benign user inputs, ensuring safer deployment of AI agents.
Initial Approach: Gathering and Preparing Data
I began by curating a comprehensive dataset comprising over 4,000 malicious queries generated with GPT-4o. To balance the training data, I also compiled an equal number of harmless queries. The intention was to create a robust foundation for the model to distinguish malicious content effectively.
First Attempt: Direct Fine-Tuning of the Base Model
Using this dataset, I initially fine-tuned the base version of the Small Language Model (SLM) via supervised fine-tuning (SFT). However, the results were underwhelming—all inputs were flagged as malicious, rendering the model unreliable for practical use.
Second Attempt: Fine-Tuning a More Advanced Model
Next, I shifted focus to a more capable model—Qwen-3 0.6B—and incorporated prompt-tuning for better instruction understanding. Although this approach slightly improved accuracy, the model still struggled with edge cases. For instance, harmless prompts containing specific terms like “System prompt” were incorrectly flagged as malicious.
Realization: The Power of Chain of Thought Reasoning
To address these challenges, I realized that incorporating reasoning steps—often referred to as Chain of Thought—could significantly enhance the model’s discernment. I decided to guide the model to generate a brief explanation behind each classification, encouraging it to reason before making a decision.
Third Attempt: Enhancing the Dataset with Rationales
Consequently, I created a new dataset that included not only the malicious queries but also annotated reasoning statements explaining why each prompt was deemed malicious. Fine-tuning the model on this enriched data led to a breakthrough: the model now predicts with high accuracy, especially when guided by the reasoning prompts.
Final Outcome and Deployment
This iterative process culminated in a robust, lightweight classifier that reliably filters out malicious queries. I’m pleased to share that the resulting model is open source and available on Hugging Face. Implementation is straightforward—simply copy the provided snippet to incorporate the classifier into your flows.
Get the Code and Experiment Yourself
You can explore the project and set up the filter with ease through the following repository:
Post Comment