How I went from 3 to 30 tok/sec without hardware upgrades

Supercharging Local AI Performance: My Journey from 3 to 30 Tokens per Second

My local AI workload was running far slower than I expected. My setup consists of an LG Gram laptop with an Intel i7-1260P processor, 16 GB of DDR5 RAM, and an external RTX 3060 in a Razer Core X enclosure connected via Thunderbolt 3, and it seemed to be underperforming for the tasks at hand.

Initial Setup

I was running Windows 11 (version 24H2) with NVIDIA driver 576.02 and LM Studio 0.3.15 on the CUDA 12 runtime. My model of choice was qwen3-14b (Q4_K_M) with a 16384-token context and full GPU offload. Despite these capable specifications, performance was a disappointing 3 tokens per second at default settings, improving to only about 6 tokens per second with Flash Attention enabled.
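To put numbers on it, here is a minimal benchmark sketch against LM Studio's OpenAI-compatible local server. It assumes the server is enabled on its default port (1234) and that "qwen3-14b" is the loaded model's identifier; treating each streamed chunk as one token is only an approximation.

```python
# Rough decode-speed benchmark against LM Studio's local server.
# Assumptions: server at http://localhost:1234/v1 (LM Studio's default),
# model id "qwen3-14b" (check yours with client.models.list()).
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="qwen3-14b",
    messages=[{"role": "user", "content": "Explain Thunderbolt 3 briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # each streamed chunk is roughly one token

elapsed = time.time() - start
print(f"~{tokens / elapsed:.1f} tok/sec over {elapsed:.1f} s")
```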

Worse, the whole system felt sluggish even during everyday use. Determined to fix this, I made a series of targeted adjustments that raised output to roughly 30 tokens per second while also making the machine noticeably more responsive.

Optimization Steps

  1. Display Connection: I moved my monitor from the laptop's HDMI output to a DisplayPort connection directly on the RTX 3060. With the display attached to the laptop, frames rendered on the eGPU have to travel back over the Thunderbolt link, eating into its limited bandwidth; driving the monitor straight from the card avoids that round trip.

  2. Resolution Adjustment: Dropping my screen resolution from 4K to Full HD freed a meaningful amount of video memory, leaving the GPU more headroom for the model (a quick way to check VRAM usage is shown after this list).

  3. Security Modifications: I disabled Windows Defender and disconnected from the internet, cutting background scanning and update activity that competed for CPU time.

  4. USB Device Management: I unplugged every USB device except my mouse and keyboard, which eliminated a lag-inducing hub (a Kingston UH1400P).

  5. CPU Thread Pool Configuration: I limited the CPU thread pool size for the LLM to 1, which reduced memory overhead; with the model fully offloaded to the GPU, extra CPU threads weren't doing useful work anyway (see the configuration sketch after this list).

  6. NVIDIA Driver Settings: I made several key adjustments in the NVIDIA Control Panel (a quick way to confirm the RTX is actually doing the work is shown after this list):

     - Set the preferred graphics processor to the high-performance NVIDIA GPU, preventing the Intel integrated graphics from inadvertently rendering desktop elements.
     - Set the Vulkan/OpenGL present method to prefer the native method, improving compatibility with LM Studio.
     - Turned off
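For step 2, the easiest way to see how much VRAM the desktop itself occupies is to query the driver before loading the model; this sketch assumes nvidia-smi (installed alongside the driver) is on your PATH. Running it once at 4K and once at Full HD shows the difference directly.

```python
# Print name and VRAM usage for each NVIDIA GPU.
# nvidia-smi ships with the driver; this just shells out to it.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=name,memory.used,memory.total",
     "--format=csv,noheader"],
    text=True,
)
print(out.strip())  # e.g. "NVIDIA GeForce RTX 3060, 812 MiB, 12288 MiB"
```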
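For step 5, LM Studio exposes the thread count in its GUI, but since it runs llama.cpp under the hood, the same knobs can be sketched in llama-cpp-python. The model path below is a placeholder and this is an illustration of the settings, not my exact configuration.

```python
# Sketch of the equivalent settings in llama-cpp-python
# (pip install llama-cpp-python, built with CUDA support).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-14b-Q4_K_M.gguf",  # placeholder path
    n_ctx=16384,       # context size used in my setup
    n_threads=1,       # step 5: a single CPU thread
    n_gpu_layers=-1,   # full GPU offload
    flash_attn=True,   # Flash Attention on
)
```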
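And for step 6, a simple way to confirm the RTX 3060 (rather than the integrated GPU) is carrying the load is to watch its utilization and power draw while a prompt is generating; again this just shells out to nvidia-smi.

```python
# Sample GPU utilization and power draw once per second for 10 seconds.
# Run this while the model is generating; near-zero utilization would
# suggest the work is landing on the integrated GPU instead.
import subprocess
import time

for _ in range(10):
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,power.draw",
         "--format=csv,noheader"],
        text=True,
    )
    print(out.strip())  # e.g. "93 %, 148.20 W"
    time.sleep(1)
```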
