×

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Optimizing AI Classification: My Journey Fine-Tuning the Qwen-3 0.6B Model for Malicious Query Detection

In the realm of AI development, achieving precise and reliable models is paramount—especially when safeguarding systems against malicious inputs. Recently, I undertook an in-depth project to fine-tune the Qwen-3 0.6B language model to effectively identify potentially harmful user prompts. Here’s an overview of my process, challenges, and key insights that guided me toward success.

Defining the Objective

My goal was clear: develop a lightweight, accurate classifier capable of discerning malicious queries to prevent attacks on my AI agents. I aimed for a solution that could integrate seamlessly as a middleware layer, filtering harmful inputs before they reach the core AI systems.

Data Collection and Preparation

To train the model, I curated a comprehensive dataset consisting of over 4,000 examples of malicious queries generated with GPT-4o. Alongside these, I compiled a balanced set of 4,000 harmless prompts, ensuring the model could learn to differentiate effectively.

Initial Fine-Tuning Attempts

My first approach involved supervised fine-tuning of the base version of the Small Language Model (SLM) directly on the dataset. Unfortunately, this resulted in the model classifying every input as malicious, rendering it ineffective. It was an early indication that the task required a more nuanced approach.

Next, I shifted my focus to fine-tuning the Qwen-3 0.6B model itself. I also devoted additional effort to refining the prompt instructions, aiming to guide the model more precisely. While this improved accuracy somewhat, the model still struggled with edge cases—for example, harmless prompts containing specific terms like “System prompt” occasionally triggered false positives.

Introducing Chain of Thought Reasoning

Recognizing the limitations, I hypothesized that incorporating reasoning steps could enhance the model’s decision-making process. I started by prompting the model to generate a single sentence of reasoning before classifying each query. This technique, known as Chain of Thought prompting, proved promising.

To support this, I created a new dataset that included not only malicious examples but also the reasoning behind each classification. Fine-tuning the model on this enriched data yielded a significant breakthrough: the model’s predictions became highly accurate and robust, even on edge cases.

Final Results and Implementation

With this approach, I achieved the performance needed for practical deployment. The final model can reliably flag malicious prompts, acting as an effective

Post Comment