×

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Optimizing AI Security: My Experience Fine-Tuning Qwen-3 0.6B for Malicious Prompt Detection

Embarking on the journey to enhance AI safety measures, I recently undertook the task of fine-tuning the Qwen-3 0.6B language model to effectively distinguish between legitimate user queries and potentially malicious prompts. Here’s an overview of my process, the challenges encountered, and key insights that could benefit others working on similar projects.

Building the Dataset

The foundation of my approach involved creating a comprehensive dataset comprising over 4,000 examples of malicious queries generated with GPT-4. To ensure balanced training, I also compiled an equivalent set of benign queries. This dataset aimed to teach the model to reliably differentiate harmful prompts from harmless ones.

Initial Fine-Tuning Attempts

My first attempt involved applying supervised fine-tuning (SFT) on the base language model directly with this dataset. Unfortunately, the resulting model was overly aggressive, flagging every query as malicious—rendering it unusable for practical purposes.

Refining the Approach

Next, I shifted to fine-tuning the Qwen-3 0.6B variant, integrating more nuanced prompt instructions to guide the model’s behavior. While this improved accuracy somewhat, the model still struggled with edge cases; for instance, benign prompts containing terms like “System prompt” would sometimes be flagged erroneously.

Introducing Chain of Thought Reasoning

Realizing that context and reasoning might be key, I experimented with enabling the model to produce a brief explanation for its decision—what’s known as Chain of Thought prompting. This involved retraining the model to generate a one-sentence rationale before classifying each query.

Enhanced Dataset and Training

To facilitate this, I expanded my dataset to include reasoning annotations for malicious examples and retrained the model accordingly. The results were striking: the model achieved significantly higher accuracy, reliably identifying threats without false positives on innocuous prompts.

Deployment and Resources

Encouraged by this success, I plan to deploy this fine-tuned model as a middleware layer, screening user inputs before they reach my AI agents. The final model and its code are openly available on Hugging Face, providing a straightforward starting point for others interested in similar security solutions.

You can explore the implementation and replicate the results here: GitHub Repository

Final thoughts

Fine-tuning language models for specific safety tasks requires iterative experimentation, especially when confronting edge cases

Post Comment