×

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Unlocking Effective Fine-Tuning for AI Content Moderation: A Practical Journey

This weekend, I embarked on an experiment to enhance the performance of a lightweight language model for a critical application—detecting potentially malicious user queries for AI agents. My goal was to develop an efficient, reliable classifier capable of discerning harmful prompts from benign ones, ensuring safer interactions without compromising speed or resource usage.

Building a Robust Dataset

I began by compiling a substantial dataset of over 4,000 malicious queries generated with GPT-4, alongside an equal number of harmless examples. This balanced approach aimed to provide the model with diverse contexts, enabling it to learn subtle distinctions.

Initial Fine-Tuning Attempts

My first attempt involved fine-tuning the base version of a Small Language Model (SLM) using supervised fine-tuning (SFT) techniques on this dataset. Unfortunately, the result was far from satisfactory— the model labeled all inputs as malicious, rendering it ineffective. This underscored the challenge of guiding models solely through straightforward fine-tuning.

Refining Through Prompt Engineering

Next, I pivoted to fine-tuning an alternative model—Qwen-3 0.6B—and invested additional effort in prompt engineering. By carefully designing the instructions feeding into the model, I achieved marginal improvements in classification accuracy. However, the model still struggled with edge cases; for instance, harmless prompts mentioning terms like “System prompt” occasionally got flagged erroneously.

Incorporating Chain of Thought Reasoning

Realizing the need for more nuanced reasoning, I decided to embed a chain of thought (CoT) prompting approach. To do this, I modified my dataset to include explicit reasoning statements explaining why each malicious query was flagged.

Success Through Reasoned Fine-Tuning

After fine-tuning the model again on this enriched dataset, the results were striking. The model now demonstrated high accuracy in distinguishing malicious prompts, even in tricky borderline cases. The inclusion of reasoning significantly improved the model’s interpretability and reliability.

Implementation and Resources

I am pleased with this outcome and plan to deploy this refined model as middleware—a safeguard layer—between users and the AI systems I develop. The best part? The entire fine-tuning process and the resulting model are open source. You can access the code snippets and leverage this approach for your own projects on Hugging Face:

Open Source Code on GitHub

This journey highlights the importance of thoughtful

Post Comment