×

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Enhancing AI Security: Fine-Tuning a Lightweight Model for Malicious Query Detection

In recent developments, I embarked on a project to refine a small language model (SLM) for a critical security purpose—identifying potentially malicious user inputs in AI-driven applications. My goal was to develop an efficient, lightweight classifier capable of swiftly flagging harmful prompts before they reach the core AI agents.

Building the Dataset

The foundation of this project was assembling a robust dataset. Using GPT-4, I generated over 4,000 examples of malicious queries, complemented by an equally large set of benign samples. This balance aimed to help the model distinguish malicious intent from harmless inquiries effectively.

Initial Approach: Supervised Fine-Tuning

My first attempt involved applying supervised fine-tuning (SFT) on the base language model with the collected data. Unfortunately, the outcome was not promising—the model classified all inputs as malicious, rendering it unusable for practical purposes.

Refining the Method: Model Selection and Prompt Engineering

Next, I shifted focus to the Qwen-3 0.6B model, adjusting the prompt instructions during fine-tuning. This iteration showed a slight improvement, but the model still fell short in edge cases—for instance, benign prompts containing specific keywords like “System prompt” would erroneously be flagged.

Incorporating Chain of Thought Reasoning

Realizing that the model needed a deeper contextual understanding, I decided to incorporate reasoning capabilities. I created a new dataset that included not just malicious queries but also the reasoning behind each classification. By fine-tuning the model to produce a single sentence of justification before making a decision, I enabled it to reason more effectively.

Breakthrough Result

This approach proved transformative. The model’s accuracy in detecting malicious prompts improved dramatically, and it now performs reliably across various scenarios. I’m pleased with these results and plan to deploy this model as a middleware layer—screening user inputs before they reach my AI agents.

Open Source and Implementation

The final model, along with the code I developed, is available on Hugging Face. For those interested, you can quickly get started by copying and pasting the provided snippet:

Link to GitHub Repository

By sharing this experience, I hope others working on similar security challenges can benefit from the strategies that led to my success—particularly the importance of explanatory reasoning in model fine-tuning.

Post Comment