×

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Title: Enhancing AI Security with Fine-Tuning: My Journey to Building an Effective Malicious Query Classifier

In the ever-evolving landscape of artificial intelligence, ensuring the safety and integrity of user interactions remains paramount. Recently, I embarked on a project to develop a lightweight yet reliable model capable of detecting malicious prompts directed at my AI agents. Here’s a detailed account of how I achieved this, the challenges encountered, and key insights gained along the way.

The Objective

My goal was to create an efficient classifier that could accurately identify potentially harmful queries before they reach my AI systems. To do this, I selected the Qwen-3 0.6B model for fine-tuning, aiming for a solution that balances performance with minimal resource requirements.

Data Collection and Preparation

The first step involved assembling a comprehensive dataset. Using GPT-4, I generated over 4,000 examples of malicious prompts, supplementing this with an equally large set of benign human queries. This balanced dataset was foundational for training and evaluating the model’s effectiveness.

Initial Fine-Tuning Attempts

Attempt 1: I began with supervised fine-tuning (SFT) on the base version of the language model, using the collected data. Unfortunately, the model’s responses were overzealous—it flagged every input as malicious, rendering it useless for practical purposes.

Attempt 2: Moving forward, I fine-tuned the Qwen-3 0.6B model directly, along with careful prompt instruction tuning. This yielded some improvement in accuracy but revealed a significant weakness: the model struggled with edge cases. For instance, benign prompts containing terms like “System prompt” would get incorrectly flagged.

A Critical Realization: The Power of Chain of Thought

To overcome these issues, I hypothesized that guiding the model to think through its predictions might improve results. Inspired by Chain of Thought prompting techniques, I decided to incorporate reasoning into the model’s process—asking it to provide a one-sentence justification for each classification.

Attempt 3: Incorporating Reasoning

I curated a new dataset that included not just malicious prompts, but also the reasoning behind each classification. Fine-tuning the model with this enhanced data produced a breakthrough. The model now consistently makes accurate predictions, even in tricky cases where benign prompts contain potentially confusing keywords.

Key Takeaways and Future Applications

This journey highlighted the importance of embedding interpretability into model training, especially for security-critical applications. The final model demonstrates

Post Comment


You May Have Missed