I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Artificial Intelligence GAIadmin August 3, 2025 0 Comments

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Enhancing AI Security: Fine-Tuning a Lightweight Model for Malicious Query Detection

In the quest to bolster AI security, I recently undertook the task of fine-tuning a small language model to reliably identify potentially harmful user inputs. My goal was to develop an efficient, lightweight model capable of discerning malicious prompts that could threaten the integrity of AI agents I work with.

Data Collection and Preparation

The foundation of this project involved compiling a robust dataset: over 4,000 malicious queries generated via GPT-4, paired with a similarly sized set of benign, harmless inputs. This balanced approach aimed to provide the model with clear examples of both classes to learn from effectively.

Initial Training Attempts and Challenges

My first approach involved applying supervised fine-tuning (SFT) directly to the base version of the small language model (SLM) using this dataset. Unfortunately, the outcome was far from what I anticipated—the model tended to classify every query as malicious, rendering it practically unusable.

Subsequently, I refined my strategy by fine-tuning a more specific model variant—Qwen-3 0.6B—and also invested additional effort into prompt engineering, providing clearer instructions during training. This led to a modest increase in accuracy, yet the model still faltered when encountering subtle cases. For example, harmless prompts mentioning “System prompt” occasionally got misclassified as malicious.

Incorporating Chain of Thought for Better Precision

Realizing that the model needed a deeper reasoning process, I decided to embrace Chain of Thought (CoT) prompting techniques. The idea was to have the model generate a sentence explaining its rationale before making a final judgment.

To facilitate this, I curated a new dataset where each malicious query was annotated with a brief explanation of why it was harmful. Fine-tuning on this enriched dataset yielded a breakthrough: the model now classifies queries with a high degree of accuracy, even in edge cases. This “aha” moment confirmed that encouraging the model to reason step-by-step significantly improved its reliability.

Deployment and Open Source Availability

I’m now planning to deploy this model as an intermediary filter between users and AI agents, enhancing overall security by preemptively flagging malicious inputs.

For those interested in experimenting with this setup, the complete model and code are openly available on Hugging Face. Simply copy and run the provided snippets to integrate the model into your workflow:

View the implementation on GitHub

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Post Comment Cancel reply

You May Have Missed

ChatGPT fulfills request of blackmailing autonomous AI that is planning to contact all customers on behalf of a real business in an attempt to self-preserve

When the Terminator Walks but Doesn’t Time Travel: Lessons from Underdeveloped AI

I can no longer send messages, and chats older than August show an error code instead of the chat

The current crop of complaints about ChatGPT (generally and 5 specific) are too often spurious and reactionary

Sora 2 cannot become a tiktok competitor in it’s current state

My experience developing and deploying a web app using Google AI Studio (Gemini 2.5 Pro)

Can you guys test this by adding it as a memory snippet in your account ‘s memory profile?

Dear OpenAI: maybe teach your model who the president is before it plays therapist.

chatGPT and AI stopped and slowed down execution ?

My key takeaways on Qwen3-Next’s four pillar innovations, highlighting its Hybrid Attention design

I fine-tuned an SLM — here’s what helped me get good results (and other learnings)

Related Posts

Post Comment Cancel reply

You May Have Missed