×

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Optimizing Large Language Models for Enhanced Security: My Fine-Tuning Journey

This weekend, I embarked on an experiment to refine a Qwen-3 0.6B language model, aiming to develop a lightweight classifier capable of detecting malicious user prompts in AI interactions. The goal was to create an efficient tool that enhances the security of AI-powered applications by accurately identifying potential attacks.

Data Collection and Initial Approach

My first step involved compiling a robust dataset of over 4,000 malicious queries generated with GPT-4o. To balance the training process, I also assembled an equal-sized set of benign, harmless prompts. This balanced dataset served as the foundation for initial training experiments.

First Fine-Tuning Attempt: Challenges encountered

I began fine-tuning the base version of the model using supervised fine-tuning (SFT) with this dataset. However, the results were disappointing — the model ended up classifying nearly every input as malicious, rendering it ineffective for practical use.

Second Attempt: Adjusting training strategies

Next, I shifted to fine-tuning the Qwen-3-0.6B model itself. In addition to the dataset, I also invested time into prompt engineering—refining the instructions given to the model during training. While this approach improved accuracy slightly, it still struggled with nuanced cases. For instance, harmless prompts containing specific keywords like “System prompt” were incorrectly flagged as malicious.

Incorporating Chain of Thought reasoning

Recognizing the need for deeper contextual understanding, I decided to incorporate a reasoning component into the model. Specifically, I instructed the model to articulate a single sentence explaining its classification decision. This approach aimed to guide the model toward more thoughtful evaluations.

Final Fine-Tuning: Adding reasoning to improve accuracy

To implement this, I expanded the dataset to include explicit reasoning for each malicious prompt. Fine-tuning the model with this enriched data resulted in a significant performance boost. The model now reliably identifies malicious queries with high accuracy, even in tricky edge cases.

Conclusion and Future Applications

This experience underscored the importance of incorporating reasoning capabilities into language models for complex classification tasks. I am now planning to deploy this refined model as a middleware layer—acting as a safeguard between users and AI agents to filter out malicious inputs effectively.

Open Source Access

The final version of the model is publicly available on Hugging Face, along with all the code used in this process. If you’re interested in implementing similar solutions or exploring advanced fine-tuning techniques, you can find the

Post Comment