I fine-tuned an SLM — here’s what helped me get good results (and other learnings)
Optimizing Fine-Tuning Strategies for Effective Malicious Query Detection in AI Models
In the ongoing pursuit of enhancing AI security, fine-tuning large language models (LLMs) has become a vital step. Recently, I embarked on customizing the Qwen-3 0.6B model to reliably classify whether incoming user prompts are malicious or benign. Here’s a detailed overview of my process, insights gained, and the strategies that led to successful implementation.
Developing a Robust Dataset
My initial step involved creating a comprehensive dataset comprising over 4,000 malicious prompts generated with GPT-4. To balance the training, I curated an equal number of harmless queries. This dataset formed the foundation for my fine-tuning experiments aimed at improving detection accuracy.
First Attempt: Basic Supervised Fine-Tuning
I began by applying supervised fine-tuning (SFT) to the base version of the model using the collected data. Unfortunately, this approach resulted in a model that misclassified all queries as malicious, rendering it practically unusable. The model’s inability to distinguish subtle nuances highlighted the limitations of straightforward fine-tuning in this context.
Second Attempt: Enhanced Instruction Tuning and Contextual Challenges
Next, I shifted focus to fine-tuning the Qwen-3 model itself, incorporating more carefully crafted prompts and instructions. While this improved accuracy marginally, the model still struggled with edge cases—for example, when harmless prompts included terms like “System prompt,” they were often incorrectly flagged. This indicated the need for deeper reasoning capabilities within the model.
Incorporating Chain of Thought Reasoning
Realizing the importance of interpretability and reasoning, I introduced a Chain of Thought (CoT) approach. I adjusted the model to generate a brief explanation—just one sentence—detailing its reasoning behind each classification decision before finalizing the verdict. This approach aimed to enhance both accuracy and explainability.
Third Attempt: Fine-Tuning with Reasoning Data
To support this, I created an augmented dataset where each malicious prompt was paired with a concise reasoning statement elucidating why it was flagged. Fine-tuning the model on this enriched dataset led to a breakthrough: the model began to make highly accurate classifications, demonstrating both reliability and interpretability.
Conclusion and Future Applications
The results have been highly encouraging. The robust model now effectively differentiates malicious queries from legitimate ones, making it suitable as an intermediary layer—an essential safeguard between users and AI agents. I plan to deploy this model in real-world scenarios to enhance
Post Comment