My Experience Fine-Tuning an SLM: Key Factors for Success and Insights Gained

Artificial Intelligence GAIadmin August 2, 2025 0 Comments

My Experience Fine-Tuning an SLM: Key Factors for Success and Insights Gained

Enhancing Model Fine-Tuning: Insights from Developing a Malicious Query Classifier

In recent efforts to improve AI security, I dedicated time this weekend to fine-tuning the Qwen-3 0.6B language model. My goal was to develop a lightweight yet effective tool capable of discerning malicious prompts fed into my AI agents, ensuring a safer user experience.

Data Collection and Initial Attempts

My first step involved assembling a comprehensive dataset of over 4,000 malicious queries generated with GPT-4o. To maintain balance, I also compiled an equal number of benign queries. This dataset served as the foundation for training.

The initial approach, Supervised Fine-Tuning (SFT), involved training the model solely on this dataset. Unfortunately, the result was underwhelming—every input was flagged as malicious, rendering the model ineffective for real-world deployment.

Refining the Model with Prompt-Tuning and Edge Case Handling

Next, I shifted to fine-tuning the Qwen-3 0.6B model itself, incorporating prompt-tuning techniques to better guide its responses. While this improved accuracy somewhat, the model still struggled with nuanced cases. For example, benign prompts containing keywords like “System prompt” would incorrectly be flagged as malicious, highlighting the need for more sophisticated reasoning.

Introducing Chain of Thought for Better Discrimination

To address this, I experimented with embedding a reasoning process within the model’s predictions. I designed a dataset where each malicious query was accompanied by a brief explanation—essentially asking the model to “think aloud” before concluding. By fine-tuning the model with these reasoning annotations, I observed a significant accuracy boost, especially in tricky edge cases.

Results and Deployment Plans

This iterative process culminated in a model that reliably distinguishes malicious queries from harmless ones. I’m pleased with the performance and plan to integrate this as a middleware layer between users and my AI agents, enhancing security and trustworthiness.

Open Source Access

For those interested in experimenting or building upon this work, the final model and code are openly available on Hugging Face. You can set it up effortlessly by copying and pasting the provided implementation snippet:

https://github.com/sarthakrastogi/rival

By sharing these insights, I hope others can replicate and adapt this approach to improve AI safety in their own projects.