How I went from 3 to 30 tok/sec without hardware upgrades
Supercharging Local AI Performance: My Journey from 3 to 30 Tokens per Second
When I started running AI models locally, I was unhappy with the speeds my system delivered. My setup is an LG Gram laptop with an Intel i7-1260P processor, 16 GB of DDR5 RAM, and an external RTX 3060 GPU (Razer Core X connected via Thunderbolt 3), and it was clearly underperforming.
Initial Setup
I was running Windows 11 (version 24H2) with NVIDIA driver 576.02 and LM Studio 0.3.15 using CUDA 12. My model was qwen3-14b (Q4_K_M) with a context size of 16384 and full GPU offload. Despite this capable hardware, I got a disappointing 3 tokens per second at default settings, and enabling Flash Attention only raised that to about 6 tokens per second.
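To get a feel for why video memory matters in this setup, here is a rough sketch of how much memory the key/value cache alone needs at a 16384-token context. The layer count, number of KV heads, and head dimension below are my assumptions about the Qwen3-14B architecture, not values from this post, so check the model's own config before relying on them. The estimate also assumes an unquantized 16-bit cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate key/value cache size for one sequence, in bytes."""
    # Factor 2: each layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# ASSUMED model shape (hypothetical values for illustration, not from the post).
ASSUMED_LAYERS = 40
ASSUMED_KV_HEADS = 8
ASSUMED_HEAD_DIM = 128

cache = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM,
                       context_len=16384)
print(f"KV cache at 16-bit precision: {cache / 2**30:.2f} GiB")
```

Under these assumptions the cache comes to about 2.5 GiB on top of the quantized weights, so anything else that occupies video memory (such as a high-resolution desktop) eats into a fairly tight budget.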
The whole system also felt sluggish in everyday use. Determined to fix this, I made a series of adjustments that raised throughput to 30 tokens per second and made the system noticeably more responsive.
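If you want to compare before and after numbers yourself rather than rely on the figure shown in the UI, a minimal timing helper like the one below may be enough. It is only a sketch: `measure_tokens_per_second` and `fake_stream` are hypothetical names, and the fake stream stands in for whatever streaming output your own client gives you. Note that it starts the clock before the first token arrives, so prompt processing time is included in the rate.

```python
import time


def measure_tokens_per_second(token_stream, clock=time.perf_counter):
    """Consume an iterable of tokens and return (token_count, tokens_per_second)."""
    start = clock()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        # Guard against a zero-length interval on very coarse clocks.
        return count, float("inf")
    return count, count / elapsed


def fake_stream(n_tokens, delay_s):
    """Hypothetical stand-in for a model's streaming output."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


count, rate = measure_tokens_per_second(fake_stream(20, 0.01))
print(f"{count} tokens at {rate:.1f} tok/sec")
```

Running the same prompt with the same settings before and after each change keeps the comparison fair, since throughput also depends on prompt length and context fill.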
Optimization Steps
- Display Connection: I switched from the laptop's HDMI output to a direct DisplayPort connection on the RTX 3060, ensuring optimal bandwidth utilization.
- Resolution Adjustment: Reducing my screen resolution from 4K to Full HD conserved a significant amount of video memory, allowing the GPU to perform more efficiently.
- Security Modifications: I disabled Windows Defender and disconnected from the internet, reducing background processes that could hinder performance.
- USB Device Management: I removed all USB devices except my mouse and keyboard, which eliminated a lag-inducing hub (the Kingston UH1400P).
- CPU Thread Pool Configuration: I limited the CPU thread pool size for the model to 1, which optimized memory usage.
- NVIDIA Driver Settings: I made several key adjustments in the NVIDIA Control Panel:
  - Set the preferred graphics processor to the high-performance NVIDIA option, preventing Intel Graphics from inadvertently rendering desktop elements.
  - Opted for the native Vulkan/OpenGL present method to enhance compatibility with LM Studio.
  - Turned off
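The Resolution Adjustment step above can be sanity-checked with some simple arithmetic, sketched here under the assumption of 4 bytes per pixel. The result is the size of a single uncompressed buffer; the desktop compositor and applications typically hold several such buffers, so treat it as a lower bound on the saving rather than a measurement. The `query_gpu_memory` helper is a hypothetical wrapper around `nvidia-smi` for checking actual usage before loading a model, and it is defined but not called here because it needs the tool on your PATH.

```python
import subprocess


def framebuffer_mib(width, height, bytes_per_pixel=4):
    """Memory for one uncompressed frame buffer, in MiB."""
    return width * height * bytes_per_pixel / 2**20


def parse_memory_line(line):
    """Parse one 'used, total' line (values in MiB) from the query below."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory():
    """Return (used_mib, total_mib) for the first GPU; requires nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_line(out.strip().splitlines()[0])


uhd = framebuffer_mib(3840, 2160)
fhd = framebuffer_mib(1920, 1080)
print(f"One 4K buffer: {uhd:.1f} MiB, one Full HD buffer: {fhd:.1f} MiB")
```

Calling `query_gpu_memory()` once at 4K and once at Full HD, with the model unloaded, shows how much video memory the desktop itself takes on your machine.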