×

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Mastering Fine-Tuning of Language Models for Security: My Recent Experience with Qwen-3 0.6B

In the rapidly evolving landscape of artificial intelligence, tailoring models to specific tasks is increasingly vital. Recently, I embarked on fine-tuning the Qwen-3 0.6B language model to enhance its ability to detect and prevent malicious user prompts—an essential step for safeguarding AI-powered systems.

The Challenge: Developing a Lightweight, Accurate Classifier

My goal was to create a streamlined model capable of discerning whether a user query might be an attack, without adding significant computational overhead. To achieve this, I assembled a dataset comprising over 4,000 malicious prompts generated using GPT-4o, alongside an equal number of benign queries to balance the training data.

Initial Approach and What Didn’t Work

The first attempt involved applying supervised fine-tuning (SFT) directly to the base model with this dataset. However, the outcome was underwhelming—the model classified every input as malicious, rendering it ineffective as a filter.

Next, I tried fine-tuning the Qwen 0.6B model itself, coupled with more meticulous prompt engineering and instruction tuning. This led to marginal improvements but revealed a critical flaw: the model sometimes misclassified benign prompts containing specific terms like “System prompt” as malicious, indicating a struggle with nuanced cases.

Introduction of Chain of Thought for Better Reasoning

Realizing that the model’s reasoning capabilities might be a limiting factor, I integrated a simple “chain of thought” approach. I instructed the model to provide a sentence explaining its judgment, encouraging it to reason before classifying.

Refining the Dataset with Explanations

In this next phase, I expanded my dataset to include explicit reasoning behind each malicious prompt. Fine-tuning the model on this enriched data was a turning point—it significantly improved the model’s accuracy and robustly handled tricky edge cases.

Results and Future Plans

The final model now performs reliably, accurately flagging malicious queries while reducing false positives on innocuous inputs. I plan to deploy this as a middleware layer between users and my AI agents to enhance security and user trust.

Open Source and Resources

I’m pleased to share that the fine-tuned model is openly available on Hugging Face, along with the code snippet to get you started. Feel free to explore and adapt this approach for your own projects:

Access the code and model here

Conclusion

This

Post Comment