×

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

How I Successfully Fine-Tuned a Language Model for Malicious Prompt Detection—Key Insights and Lessons Learned

In the pursuit of enhancing AI security, I recently dedicated time to customizing a Small Language Model (SLM) tailored for a specific task: identifying malicious user prompts before they reach my AI agents. Here’s a detailed overview of my journey, the challenges I faced, and the practical strategies that led to meaningful results.

Choosing the Right Foundation

My goal was to develop a lightweight, efficient model capable of discerning harmful prompts with high accuracy. I selected the Qwen-3 0.6B model due to its balance between performance and resource requirements. The first step involved constructing a comprehensive dataset consisting of over 4,000 malicious queries generated with GPT-4, alongside an equal number of benign prompts to provide balanced training data.

Initial Fine-Tuning Attempts and Challenges

First Experiment:
I applied supervised fine-tuning (SFT) directly on the base model using my dataset. The outcome was disappointing—the model labeled every query as malicious, rendering it useless for practical deployment. This highlighted the importance of careful training strategies and the model’s initial biases.

Second Experiment:
Next, I proceeded to fine-tune the model starting from the pre-trained Qwen-3 checkpoint. I also incorporated prompt instruction tuning to better guide the model’s behavior. While this approach improved accuracy somewhat, it still faltered on edge cases. For instance, harmless prompts containing words like “System prompt” were incorrectly flagged as malicious, indicating the model’s difficulty in nuanced understanding.

Incorporating Reasoning for Better Performance

Realizing that basic classification wasn’t sufficient, I explored integrating a Chain of Thought (CoT) approach. The idea was to have the model justify its predictions—essentially prompting it to generate a brief reasoning behind its decision. This process helps the model better understand the context and improves robustness in complex scenarios.

Third Experiment:
I enhanced my dataset by adding explicit reasoning annotations for each malicious query—essentially, a sentence explaining why the query was malicious. Fine-tuning the model on this enriched dataset proved transformative. The model began to accurately identify malicious prompts, even in tricky cases that previously caused trouble.

Results and Next Steps

The refined model performs reliably and meets my expectations, serving as an effective middleware layer to filter user inputs before they reach my AI agents. This approach not only boosts system security but also enhances user trust.

Open Source and Resources

Post Comment