I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Optimizing Light-Weight AI Models for Malicious Query Detection: A Practical Guide
In recent experimentation with fine-tuning language models, I explored how to develop an efficient and reliable classifier for detecting potentially harmful user prompts—an essential component for safeguarding AI-powered systems.
Project Overview
My goal was to create a streamlined, low-resource model capable of discerning malicious inputs from benign ones. I chose the Qwen-3 0.6B model due to its balance of performance and efficiency. The focus was on enhancing the model’s accuracy in identifying malicious queries, which are increasingly prevalent in user interactions.
Data Collection and Preparation
The foundation of this project was a curated dataset of over 4,000 malicious queries generated via GPT-4. To ensure robustness, I also assembled an equal-sized dataset of harmless, benign prompts. This balanced dataset aimed to teach the model the subtle distinctions between safe and malicious inputs.
Initial Fine-Tuning Attempts
-
First Approach: Fine-tuning a Simplified Language Model (SLM) on this dataset resulted in an overly aggressive classifier that flagged every input as malicious, rendering it ineffective for practical use. This highlighted the need for more nuanced adjustments.
-
Second Approach: I shifted to fine-tuning the Qwen-3 0.6B model itself, incorporating prompt instructions to guide its judgment. While this improved accuracy slightly, the model still struggled with edge cases—such as harmless prompts containing specific keywords like “System prompt” being incorrectly flagged.
Incorporating Chain of Thought Reasoning
Recognizing the importance of reasoning, I introduced a “Chain of Thought” approach. By prompting the model to produce a brief explanation before making its final classification, I aimed to enhance its interpretability and reliability.
- Third Approach: I expanded the dataset to include not just the queries but also the reasoning behind classifying them as malicious. Fine-tuning the model with this enriched data led to a significant breakthrough—the model now achieved high accuracy, even on tricky edge cases.
Results and Deployment
The final model performs exceptionally well in distinguishing malicious prompts, making it a valuable middleware component for AI systems that interact with users. Its lightweight nature ensures it can be integrated without imposing significant computational overhead.
The model and associated code are openly available via Hugging Face. You can quickly implement and test it in your environment by copying the provided snippet here: GitHub Link.
**Key
Post Comment