I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Artificial Intelligence GAIadmin August 2, 2025 0 Comments

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Title: Enhancing AI Security with Fine-Tuning: My Journey to Building an Effective Malicious Query Classifier

In the ever-evolving landscape of artificial intelligence, ensuring the safety and integrity of user interactions remains paramount. Recently, I embarked on a project to develop a lightweight yet reliable model capable of detecting malicious prompts directed at my AI agents. Here’s a detailed account of how I achieved this, the challenges encountered, and key insights gained along the way.

The Objective

My goal was to create an efficient classifier that could accurately identify potentially harmful queries before they reach my AI systems. To do this, I selected the Qwen-3 0.6B model for fine-tuning, aiming for a solution that balances performance with minimal resource requirements.

Data Collection and Preparation

The first step involved assembling a comprehensive dataset. Using GPT-4, I generated over 4,000 examples of malicious prompts, supplementing this with an equally large set of benign human queries. This balanced dataset was foundational for training and evaluating the model’s effectiveness.

Initial Fine-Tuning Attempts

Attempt 1: I began with supervised fine-tuning (SFT) on the base version of the language model, using the collected data. Unfortunately, the model’s responses were overzealous—it flagged every input as malicious, rendering it useless for practical purposes.

Attempt 2: Moving forward, I fine-tuned the Qwen-3 0.6B model directly, along with careful prompt instruction tuning. This yielded some improvement in accuracy but revealed a significant weakness: the model struggled with edge cases. For instance, benign prompts containing terms like “System prompt” would get incorrectly flagged.

A Critical Realization: The Power of Chain of Thought

To overcome these issues, I hypothesized that guiding the model to think through its predictions might improve results. Inspired by Chain of Thought prompting techniques, I decided to incorporate reasoning into the model’s process—asking it to provide a one-sentence justification for each classification.

Attempt 3: Incorporating Reasoning

I curated a new dataset that included not just malicious prompts, but also the reasoning behind each classification. Fine-tuning the model with this enhanced data produced a breakthrough. The model now consistently makes accurate predictions, even in tricky cases where benign prompts contain potentially confusing keywords.

Key Takeaways and Future Applications

This journey highlighted the importance of embedding interpretability into model training, especially for security-critical applications. The final model demonstrates

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Post Comment Cancel reply

You May Have Missed

ChatGPT fulfills request of blackmailing autonomous AI that is planning to contact all customers on behalf of a real business in an attempt to self-preserve

When the Terminator Walks but Doesn’t Time Travel: Lessons from Underdeveloped AI

I can no longer send messages, and chats older than August show an error code instead of the chat

The current crop of complaints about ChatGPT (generally and 5 specific) are too often spurious and reactionary

Sora 2 cannot become a tiktok competitor in it’s current state

My experience developing and deploying a web app using Google AI Studio (Gemini 2.5 Pro)

Can you guys test this by adding it as a memory snippet in your account ‘s memory profile?

Dear OpenAI: maybe teach your model who the president is before it plays therapist.

chatGPT and AI stopped and slowed down execution ?

My key takeaways on Qwen3-Next’s four pillar innovations, highlighting its Hybrid Attention design

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Related Posts

Post Comment Cancel reply

You May Have Missed