How I went from 3 to 30 tok/sec without hardware upgrades

Supercharging Local AI Performance: My Journey from 3 to 30 Tokens per Second

My local AI workload was running far slower than I expected. My setup consists of an LG Gram laptop with an Intel i7-1260P processor, 16 GB of DDR5 RAM, and an external RTX 3060 in a Razer Core X enclosure connected via Thunderbolt 3, and it seemed to be underperforming for the tasks at hand.

Initial Setup

I was running Windows 11 (version 24H2) with NVIDIA driver 576.02 and LM Studio 0.3.15 on the CUDA 12 runtime. My model of choice was qwen3-14b (Q4_K_M) with a 16384-token context and full GPU offload. Despite these capable specifications, performance was a disappointing 3 tokens per second at default settings, improving to only about 6 tokens per second with Flash Attention enabled.
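To put numbers on it, here is a minimal benchmark sketch against LM Studio's OpenAI-compatible local server. It assumes the server is enabled on its default port (1234) and that "qwen3-14b" is the loaded model's identifier; treating each streamed chunk as one token is only an approximation.

```python
# Rough decode-speed benchmark against LM Studio's local server.
# Assumptions: server at http://localhost:1234/v1 (LM Studio's default),
# model id "qwen3-14b" (check yours with client.models.list()).
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="qwen3-14b",
    messages=[{"role": "user", "content": "Explain Thunderbolt 3 briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # each streamed chunk is roughly one token

elapsed = time.time() - start
print(f"~{tokens / elapsed:.1f} tok/sec over {elapsed:.1f} s")
```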

Worse, the whole system felt sluggish even during everyday use. Determined to fix this, I made a series of targeted adjustments that raised output to roughly 30 tokens per second while also making the machine noticeably more responsive.

Optimization Steps

  1. Display Connection: I moved my monitor from the laptop's HDMI output to a DisplayPort connection directly on the RTX 3060. With the display attached to the laptop, frames rendered on the eGPU have to travel back over the Thunderbolt link, eating into its limited bandwidth; driving the monitor straight from the card avoids that round trip.

  2. Resolution Adjustment: Dropping my screen resolution from 4K to Full HD freed a meaningful amount of video memory, leaving the GPU more headroom for the model (a quick way to check VRAM usage is shown after this list).

  3. Security Modifications: I disabled Windows Defender and disconnected from the internet, cutting background scanning and update activity that competed for CPU time.

  4. USB Device Management: I unplugged every USB device except my mouse and keyboard, which eliminated a lag-inducing hub (a Kingston UH1400P).

  5. CPU Thread Pool Configuration: I limited the CPU thread pool size for the LLM to 1, which reduced memory overhead; with the model fully offloaded to the GPU, extra CPU threads weren't doing useful work anyway (see the configuration sketch after this list).

  6. NVIDIA Driver Settings: I made several key adjustments in the NVIDIA Control Panel (a quick way to confirm the RTX is actually doing the work is shown after this list):

     - Set the preferred graphics processor to the high-performance NVIDIA GPU, preventing the Intel integrated graphics from inadvertently rendering desktop elements.
     - Set the Vulkan/OpenGL present method to prefer the native method, improving compatibility with LM Studio.
     - Turned off
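For step 2, the easiest way to see how much VRAM the desktop itself occupies is to query the driver before loading the model; this sketch assumes nvidia-smi (installed alongside the driver) is on your PATH. Running it once at 4K and once at Full HD shows the difference directly.

```python
# Print name and VRAM usage for each NVIDIA GPU.
# nvidia-smi ships with the driver; this just shells out to it.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=name,memory.used,memory.total",
     "--format=csv,noheader"],
    text=True,
)
print(out.strip())  # e.g. "NVIDIA GeForce RTX 3060, 812 MiB, 12288 MiB"
```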
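For step 5, LM Studio exposes the thread count in its GUI, but since it runs llama.cpp under the hood, the same knobs can be sketched in llama-cpp-python. The model path below is a placeholder and this is an illustration of the settings, not my exact configuration.

```python
# Sketch of the equivalent settings in llama-cpp-python
# (pip install llama-cpp-python, built with CUDA support).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-14b-Q4_K_M.gguf",  # placeholder path
    n_ctx=16384,       # context size used in my setup
    n_threads=1,       # step 5: a single CPU thread
    n_gpu_layers=-1,   # full GPU offload
    flash_attn=True,   # Flash Attention on
)
```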
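And for step 6, a simple way to confirm the RTX 3060 (rather than the integrated GPU) is carrying the load is to watch its utilization and power draw while a prompt is generating; again this just shells out to nvidia-smi.

```python
# Sample GPU utilization and power draw once per second for 10 seconds.
# Run this while the model is generating; near-zero utilization would
# suggest the work is landing on the integrated GPU instead.
import subprocess
import time

for _ in range(10):
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,power.draw",
         "--format=csv,noheader"],
        text=True,
    )
    print(out.strip())  # e.g. "93 %, 148.20 W"
    time.sleep(1)
```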
