My Experience Fine-Tuning an SLM: Key Factors for Success and Insights Gained

Artificial Intelligence GAIadmin August 2, 2025 0 Comments

My Experience Fine-Tuning an SLM: Key Factors for Success and Insights Gained

Optimizing Light-Weight AI Models for Malicious Query Detection: A Practical Guide

In recent experimentation with fine-tuning language models, I explored how to develop an efficient and reliable classifier for detecting potentially harmful user prompts—an essential component for safeguarding AI-powered systems.

Project Overview

My goal was to create a streamlined, low-resource model capable of discerning malicious inputs from benign ones. I chose the Qwen-3 0.6B model due to its balance of performance and efficiency. The focus was on enhancing the model’s accuracy in identifying malicious queries, which are increasingly prevalent in user interactions.

Data Collection and Preparation

The foundation of this project was a curated dataset of over 4,000 malicious queries generated via GPT-4. To ensure robustness, I also assembled an equal-sized dataset of harmless, benign prompts. This balanced dataset aimed to teach the model the subtle distinctions between safe and malicious inputs.

Initial Fine-Tuning Attempts

First Approach: Fine-tuning a Simplified Language Model (SLM) on this dataset resulted in an overly aggressive classifier that flagged every input as malicious, rendering it ineffective for practical use. This highlighted the need for more nuanced adjustments.
Second Approach: I shifted to fine-tuning the Qwen-3 0.6B model itself, incorporating prompt instructions to guide its judgment. While this improved accuracy slightly, the model still struggled with edge cases—such as harmless prompts containing specific keywords like “System prompt” being incorrectly flagged.

Incorporating Chain of Thought Reasoning

Recognizing the importance of reasoning, I introduced a “Chain of Thought” approach. By prompting the model to produce a brief explanation before making its final classification, I aimed to enhance its interpretability and reliability.

Third Approach: I expanded the dataset to include not just the queries but also the reasoning behind classifying them as malicious. Fine-tuning the model with this enriched data led to a significant breakthrough—the model now achieved high accuracy, even on tricky edge cases.

Results and Deployment

The final model performs exceptionally well in distinguishing malicious prompts, making it a valuable middleware component for AI systems that interact with users. Its lightweight nature ensures it can be integrated without imposing significant computational overhead.

The model and associated code are openly available via Hugging Face. You can quickly implement and test it in your environment by copying the provided snippet here: GitHub Link.

**Key