Beyond the 3090: Scaling Local AI with Qwen 35B and Gemma 4 on Dual GPUs

When I first started experimenting with local LLM inference, I was content pushing a single NVIDIA RTX 3090 to its absolute limits. I could load Qwen 3.6 27B at Q4 quantization, keep context windows around 100k tokens. The value was undeniable. I had replaced a $250/month cloud AI subscription with a consumer-grade GPU running quietly under my desk.

It worked. But it wasn't enough.

The Problem With One GPU

The 3090 was great at one thing: running one heavy model for one purpose. I used it for coding work. I used it for agentic tasks. But I also wanted to run models locally for my family — for private, everyday use — without dedicating the entire card to a light workload.

With one GPU, there was no room to experiment. No room for a second model. No room for the kind of flexibility that turns a local AI setup into a genuine family-and-work resource.

So I did what any engineer would do: I added a second GPU.

The Hardware: Two Different Cards, One Purpose

My current rig features an AMD Ryzen 9 3950X (16C/32T), 128GB DDR4, and two discrete GPUs:

Bottom slot: NVIDIA RTX 3090 (24 GB GDDR6X) — the workhorse
Top slot: Intel Arc Pro B70 (16 GB GDDR6) — the wildcard

These are completely different architectures. The 3090 runs on CUDA. The Arc B70 runs on Intel's oneAPI/SYCL stack. This isn't a budget compromise. It's intentional.

Why Splitting a Model Across Heterogeneous GPUs Is a Terrible Idea

I considered splitting a single model across both GPUs. I researched it. I understood the theory. Then I understood the reality.

1. Heterogeneous Toolchains. Compiling llama.cpp to coordinate unified memory layers across both CUDA and SYCL runtimes is highly complex and unsupported by mainstream compilers. The ecosystems don't talk to each other at the layer that matters.

2. PCIe Bus Overhead. Pushing tensor outputs back and forth between different GPUs on every layer evaluation introduces latency that completely negates GPU acceleration. You're not accelerating computation; you're bottlenecking it.

3. VRAM Imbalance. The RTX 3090 has a 384-bit memory bus at 936 GB/s. The Intel Arc Pro B70 has a 256-bit bus at GDDR6. Splitting a model would throttle the 3090 to match the slower memory bandwidth of the Arc. You'd be dragging the faster card down to the speed of the slower one.

Decision: I partition the GPUs cleanly instead:

RTX 3090 (CUDA): Runs Qwen3.6-35B-A3B-UD-Q4_K_M.gguf with 128k context, one slot, maximum concurrency depth
Intel Arc Pro B70 (SYCL): Runs gemma-4-E4B-it-UD-Q8_K_XL.gguf with 512k context, four parallel slots for high-concurrency workloads

The Thermal Trade-Off: RTX 3090 in a Bottom x8 Slot

The 3090 is a 350W TDP, triple-slot card with large intake fans on its underside. It eats air.

If I install the 3090 in the top PCIe x16 slot (next to the CPU), those intake fans sit directly above the Intel Arc card with barely any clearance. The 3090 starves for cool air, the Arc bakes, and thermal throttling becomes the default state of affairs.

So I put the 3090 in the bottom PCIe x8 slot instead. The fans face open airflow from the case floor and intake vents. Better cooling. Lower thermals. Longer card life.

The trade-off? Bandwidth drops from x16 (~~32 GB/s) to x8 (~~16 GB/s). But that only affects model loading time — the one-time operation when the server boots or a model is swapped. Once the weights are in VRAM, all inference happens entirely on-card with zero PCIe traffic. Slower boot is a small price for stable operation.

Running Two Models, Two Context Windows, Two Entire Ecosystems

The architecture is asymmetric by design, and that's the point.

The CUDA instance (Qwen-35B on the 3090) is configured for depth. Single parallel slot. Full 128k context per conversation. When I send a massive codebase dump — files, diffs, project maps — the entire available VRAM is devoted to one deep conversation. KV caches are quantized to 4-bit (q4_0), reducing the 128k cache footprint from ~32 GB down to ~8 GB. Flash attention is enabled to minimize memory overhead. --context-shift handles context window overflow by rolling turns without full re-evaluation.

The SYCL instance (Gemma-4 on the Arc B70) is configured for breadth. 512k context pool divided across 4 concurrent slots (128k each). This means up to four parallel queries without queuing — perfect for lighter family use or background agent tasks. LiteLLM routes endpoints into three reasoning tiers: flash (instant, cheap), pro (medium reasoning budget), and pro-extended (unbounded thinking).

Both run full VRAM offload. Both are accessed through a LiteLLM gateway at localhost:4000 that translates standard OpenAI-compatible API calls into GPU-specific requests.

The Full Stack, Locally Hosted

This isn't a raw llama.cpp script in a terminal. It's a production-grade local AI platform:

LibreChat serves as the primary web interface at ai.theclarks.dev
Hermes Agent Dashboard at hermes.theclarks.dev for autonomous AI agent management
Netdata at mon.theclarks.dev for real-time system monitoring
Cockpit at kvm.theclarks.dev for KVM virtualization and server management
GitLab CE at git.theclarks.dev for self-hosted DevOps

Every service is bound to 127.0.0.1 and accessible exclusively through an encrypted Tailscale overlay network. The DNS domains resolve to private Tailscale IPs — the server is unreachable from the public internet. SSL certificates are issued via Let's Encrypt using Cloudflare's DNS-01 challenge.

LibreChat integrates MCP (Model Context Protocol) servers that give the LLMs tool execution capabilities: filesystem access, web fetching, headless browser automation via Playwright, semantic memory via a knowledge graph, and structured chain-of-thought reasoning through the Sequential Thinking MCP.

Reliability Is a Feature

A local AI system that crashes when you need it isn't useful. So the entire stack runs with automatic recovery:

Every service uses Restart=always with RestartSec=10 — if a backend node crashes, the OS restarts it within 10 seconds
All services are enabled at boot in systemd's multi-user target sequence
Critical packages (drivers, CUDA, oneAPI, MongoDB) are held via apt-mark hold to prevent breaking upgrades
Daily offsite backups of databases and configurations sync to Google Drive via rclone, with 3-day local and 30-day cloud retention
Journal logs are capped at 4GB to prevent LLM logging from consuming the root partition

What's Next

The dual-GPU setup solved the real problem: flexibility. The 3090 handles deep, single-threaded coding workloads for work. The Arc B70 handles lighter, multi-threaded tasks for family use and experimentation. Together, they cover the entire spectrum of what I need from a local AI system.

The Intel Arc B70 is still a relatively new player in the GPU space. SYCL support for LLM inference is functional but evolving. I'm monitoring its long-term stability and driver support closely. If it holds up, it's a remarkably cost-effective secondary accelerator. If it doesn't, the 3090 remains a fully capable single-GPU setup.

What originally started as a weekend project to benchmark an RTX 3090 has evolved into a full production-grade local AI infrastructure. We're not just close to cloud-level performance running under our desks. We're already there.

Acknowledgments

Alex Ziskind (@alexziskind1) — His YouTube videos and tuning scripts were instrumental in the decision-making and architectural design of this system, particularly around local LLM inference optimization and server configuration.

Kris Clark | Solutions Architect | Tech Enthusiast | DIY Builder

Beyond the 3090: My Dual-GPU Local AI Architecture for Qwen 35B and Gemma 4