I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Artificial Intelligence GAIadmin August 3, 2025 0 Comments

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

How I Successfully Fine-Tuned a Language Model for Malicious Prompt Detection—Key Insights and Lessons Learned

In the pursuit of enhancing AI security, I recently dedicated time to customizing a Small Language Model (SLM) tailored for a specific task: identifying malicious user prompts before they reach my AI agents. Here’s a detailed overview of my journey, the challenges I faced, and the practical strategies that led to meaningful results.

Choosing the Right Foundation

My goal was to develop a lightweight, efficient model capable of discerning harmful prompts with high accuracy. I selected the Qwen-3 0.6B model due to its balance between performance and resource requirements. The first step involved constructing a comprehensive dataset consisting of over 4,000 malicious queries generated with GPT-4, alongside an equal number of benign prompts to provide balanced training data.

Initial Fine-Tuning Attempts and Challenges

First Experiment:
I applied supervised fine-tuning (SFT) directly on the base model using my dataset. The outcome was disappointing—the model labeled every query as malicious, rendering it useless for practical deployment. This highlighted the importance of careful training strategies and the model’s initial biases.

Second Experiment:
Next, I proceeded to fine-tune the model starting from the pre-trained Qwen-3 checkpoint. I also incorporated prompt instruction tuning to better guide the model’s behavior. While this approach improved accuracy somewhat, it still faltered on edge cases. For instance, harmless prompts containing words like “System prompt” were incorrectly flagged as malicious, indicating the model’s difficulty in nuanced understanding.

Incorporating Reasoning for Better Performance

Realizing that basic classification wasn’t sufficient, I explored integrating a Chain of Thought (CoT) approach. The idea was to have the model justify its predictions—essentially prompting it to generate a brief reasoning behind its decision. This process helps the model better understand the context and improves robustness in complex scenarios.

Third Experiment:
I enhanced my dataset by adding explicit reasoning annotations for each malicious query—essentially, a sentence explaining why the query was malicious. Fine-tuning the model on this enriched dataset proved transformative. The model began to accurately identify malicious prompts, even in tricky cases that previously caused trouble.

Results and Next Steps

The refined model performs reliably and meets my expectations, serving as an effective middleware layer to filter user inputs before they reach my AI agents. This approach not only boosts system security but also enhances user trust.

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Choosing the Right Foundation

Initial Fine-Tuning Attempts and Challenges

Incorporating Reasoning for Better Performance

Results and Next Steps

Open Source and Resources

Post Comment Cancel reply

You May Have Missed

ChatGPT fulfills request of blackmailing autonomous AI that is planning to contact all customers on behalf of a real business in an attempt to self-preserve

When the Terminator Walks but Doesn’t Time Travel: Lessons from Underdeveloped AI

I can no longer send messages, and chats older than August show an error code instead of the chat

The current crop of complaints about ChatGPT (generally and 5 specific) are too often spurious and reactionary

Sora 2 cannot become a tiktok competitor in it’s current state

My experience developing and deploying a web app using Google AI Studio (Gemini 2.5 Pro)

Can you guys test this by adding it as a memory snippet in your account ‘s memory profile?

Dear OpenAI: maybe teach your model who the president is before it plays therapist.

chatGPT and AI stopped and slowed down execution ?

My key takeaways on Qwen3-Next’s four pillar innovations, highlighting its Hybrid Attention design

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Choosing the Right Foundation

Initial Fine-Tuning Attempts and Challenges

Incorporating Reasoning for Better Performance

Results and Next Steps

Open Source and Resources

Related Posts

Post Comment Cancel reply

You May Have Missed