AI@Home – Gemma3 27B IT QAT

Conditions and context

Today I’m looking at the Gemma3 27B IT QAT model, or more precisely Gemma3 27-Billion-Parameter Instruction-Tuned Quantization-Aware Training. What a mouthful. The full model is a hefty 54GB, and the variant I tested was just shy of 20GB. Before we dive any deeper, first the usual disclaimer:

As in all my tests, I use the same prompt, the same hardware, and the same methodology. I’m looking at the same set of metrics across every model: VRAM usage, GPU utilization, CPU load, token throughput, tokens written, and total response time. These matter to me because they reveal whether a model is actually usable on consumer hardware — not just in theory, but in practice.
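
For readers who want to reproduce numbers like these, here is a minimal sketch of how throughput and timing could be derived from the timing fields that Ollama’s `/api/generate` endpoint returns (`eval_count`, `eval_duration`, `total_duration`, with durations in nanoseconds). It is not necessarily how the figures in this review were collected; the helper name `summarize_run` and the sample values are made up for illustration.

```python
def summarize_run(response: dict) -> dict:
    """Derive per-run metrics from an Ollama /api/generate response.

    Ollama reports durations in nanoseconds.
    """
    ns = 1_000_000_000
    tokens = response["eval_count"]                   # tokens generated
    gen_seconds = response["eval_duration"] / ns      # time spent generating
    total_seconds = response["total_duration"] / ns   # load + prompt + generation
    return {
        "tokens_written": tokens,
        "tokens_per_sec": round(tokens / gen_seconds, 2),
        "total_time_sec": round(total_seconds, 1),
    }


# Hypothetical response values, for illustration only (not measured data).
sample = {
    "eval_count": 1000,
    "eval_duration": 125 * 1_000_000_000,
    "total_duration": 130 * 1_000_000_000,
}
print(summarize_run(sample))
```

Note that the throughput here is based on generation time only, while the total time also includes model loading and prompt processing, so the two numbers will not divide out exactly.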

Specs	Value
Linux Distro	Ubuntu Server 24.04.4 LTS
Linux Kernel	6.8.0-101
CPU	Intel Core i7-14700K (14th Gen), Cores: 8P/12E, Threads: 28
Motherboard	MSI PRO B660M-A
RAM	80 GB DDR4 (32+16+32+16)
SSD	Crucial NVME 1TB
GPU	MSI NVidia GeForce RTX 5060Ti Shadow 2X OC PCIe 5.0×8
CUDA Cores	4,608
VRAM	16 GB GDDR7 128-bit 448 GB/s
GPU Driver	NVidia 590.48.01
CUDA version	13.1
Ollama version	0.17.4
Model	Gemma3 27B
Quantization	QAT

The prompt

			
Write a simple Python function that checks if a number is prime.
Explain how it works in plain English, like you're teaching
a beginner.

The results

This is a 27-billion-parameter model, so yeah, it’s a heavyweight for a home computer in 2026. Coming in at well over 19 GB for the QAT variant, this beast put my computer through its paces and stretched the GPU beyond its limits. Once the model was loaded, the prompt took a little over 2 minutes to complete on average. On paper I expected this to work and just barely fit my 16GB GPU, but it was instantly apparent that this model was way too much for my card.

There was a near-immediate overflow of computation from the graphics card onto the CPU. Monitoring the system in a terminal with htop, I could see nearly all of my CPU cores under heavy load, and RAM usage spiked significantly (I have 80 GB, so no issues with paging). That spillover was a bad sign in itself. Any time the GPU has to ask for help, it introduces latency, and when it comes to a 20 GB data beast, latency is never your friend. And it showed. The response was very slow. So slow, in fact, that I was able to follow and read every word as the model was typing.

The overall output quality was great. This is a very strong model: precise, eloquent, a good reasoner. And although I was not a fan of its very spartan output formatting (there is not a shred of Markdown to speak of), I did like the final result.

What lets it down is the abysmal speed. Single-digit tokens per second is just unusable in real life. Even if you asked it to write a long research paper and let it run for half an hour, you would sit there twiddling your thumbs. I’ll get more into this in the Conclusion below. It’s not the model’s fault.

Did I enjoy using it? No. The closest analogy that comes to mind: it was like watching paint dry. Is it a powerful model? Very much so! I wish I had a better card with at least 24 GB of VRAM.
On my specific setup, Gemma ran an order of magnitude slower than GPT OSS (12 times slower). See my other review of that OpenAI model. Spoiler: it’s my favorite.

What puzzled me were the performance metrics. The spillover from GPU to CPU was so severe that the GPU barely ran at a third of its capacity (that was a headscratcher), and its power consumption almost never exceeded 60W during the entire test. It only briefly spiked to its 180W maximum during model loading, and that lasted a mere second or two. Through the entire “thinking” phase the GPU drew between 50 and 55W, while the CPU was getting hammered. The GPU was not overwhelmed. On the contrary, it was starved: it had to hand over some of the tensor computations to the CPU and then kept waiting for the results. Just look at the snippet of htop above. It instantly lit up all 28 threads of my CPU, and htop looked like a Christmas tree. Yikes.
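
GPU utilization, VRAM use and power draw like those quoted above can be sampled with `nvidia-smi` in CSV query mode. The sketch below is one way to do that, not necessarily how these readings were taken; `parse_gpu_sample` and `sample_gpu` are hypothetical helper names, and the sample line is invented for illustration.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,power.draw"


def parse_gpu_sample(line: str) -> dict:
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    util, mem_mib, power_w = (field.strip() for field in line.split(","))
    return {
        "gpu_util_pct": float(util),
        "vram_used_gib": round(float(mem_mib) / 1024, 1),  # nvidia-smi reports MiB
        "power_w": float(power_w),
    }


def sample_gpu() -> dict:
    """Take a single reading from the first GPU (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_sample(out.splitlines()[0])


# Invented sample line, so the parser can be tried without a GPU.
print(parse_gpu_sample("36, 15770, 52.40"))
```

Calling `sample_gpu()` in a loop while a prompt is running would give a simple time series of utilization and power draw to set next to the htop view.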

I am only guessing here as to why a model that overflowed onto the CPU would show GPU utilization below 40%, as I don’t know its detailed inner design. My strong suspicion is that Gemma models were designed to work really well on Google’s proprietary TPUs and may not be fully optimized for the CUDA cores of an ordinary graphics card. It may even be Ollama adding latency.
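
A back-of-the-envelope model can illustrate the starvation effect described above. During token generation, each new token requires reading roughly all of the model weights once, so the time per token is approximately the bytes held on each device divided by that device’s memory bandwidth. The sketch below assumes the GPU and CPU layers run one after the other, uses the 448 GB/s figure from the spec table, and plugs in guesses that are NOT measurements: about 14.5 GB of weights on the GPU, about 5 GB on the CPU, and roughly 50 GB/s for dual-channel DDR4. Treat it as a rough plausibility check, not as an explanation confirmed by profiling.

```python
def estimate_split_inference(gpu_gb, cpu_gb, gpu_bw_gbs, cpu_bw_gbs):
    """Estimate tokens/sec and GPU busy fraction for a GPU/CPU-split model.

    Assumes generation is memory-bandwidth bound and that GPU and CPU
    layers run sequentially for each token (no overlap).
    """
    gpu_time = gpu_gb / gpu_bw_gbs   # seconds per token spent on GPU layers
    cpu_time = cpu_gb / cpu_bw_gbs   # seconds per token spent on CPU layers
    per_token = gpu_time + cpu_time
    return {
        "tokens_per_sec": 1 / per_token,
        "gpu_busy_fraction": gpu_time / per_token,
    }


# Assumed split and CPU bandwidth (guesses); GPU bandwidth from the spec table.
est = estimate_split_inference(gpu_gb=14.5, cpu_gb=5.0,
                               gpu_bw_gbs=448, cpu_bw_gbs=50)
print(f"{est['tokens_per_sec']:.1f} tokens/sec, "
      f"GPU busy {est['gpu_busy_fraction']:.0%} of the time")
```

Under these assumptions the estimate lands in single-digit tokens per second with the GPU idle most of the time, the same ballpark as the measured runs. That suggests the CPU-resident layers alone could account for much of the slowdown, though it does not rule out other factors.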

Model	Run	Response Tokens/sec	Total Time (sec)	Tokens Written	VRAM Used (GB, of 16)	GPU Util.
Gemma3 27B	1	6.91	141	984	15.4	38%
Gemma3 27B	2	9.02	122	1121	15.4	36%
Gemma3 27B	3	7.57	126	965	15.4	32%
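
To summarize the three runs, the averages can be computed directly from the table above. This is just arithmetic on the reported numbers; the variable names are arbitrary.

```python
from statistics import mean

# (tokens/sec, total time in sec, tokens written) for runs 1-3, from the table
runs = [
    (6.91, 141, 984),
    (9.02, 122, 1121),
    (7.57, 126, 965),
]

avg_tps = mean(r[0] for r in runs)
avg_time = mean(r[1] for r in runs)
avg_tokens = mean(r[2] for r in runs)

print(f"Average throughput: {avg_tps:.2f} tokens/sec")
print(f"Average total time: {avg_time:.0f} sec")
print(f"Average tokens written: {avg_tokens:.0f}")
```

That works out to roughly 7.8 tokens per second and a bit over two minutes per response.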

The conclusion – TL;DR

This model is a gorilla for home use: huge and very powerful. And that creates a paradox which lets it down for ordinary home use. The model is precise, an excellent writer, and a very good explainer, but unless you run a multi-thousand-dollar GPU with at least 24GB of VRAM in your PC, it will disappoint you. After all, even with the QAT magic Google did here, squeezing a 27-billion-parameter model into a 16GB card is very hard, however impressive the engineering. I’m not a fan of the lack of formatting in the output, but I’d gladly trade that for accuracy any day. I need to get a better GPU (and I just got this one very recently! Ugh.). This model would do absolute wonders on a high-end 24GB graphics card, where it would have lots more room to breathe and demonstrate its true power at a decent speed.

And power it has. That much is apparent, but for me it is a beautiful but unusable model. With an RTX 5060 Ti and 16 gigs of VRAM? It’s a no-go, unless you run overnight prompts and latency is a non-factor for you.
If you do need speed, it may still be for you, but only with a better card.
On a 24GB RTX 4090 or a 48GB RTX A6000? It’d be terrifying. Fast AND meticulous. That combination would be hard to beat.

The model output (pasted & unedited)

			
def is_prime(number):
  """
  This function checks if a number is prime.
  Args:
    number: The number to check.
  Returns:
    True if the number is prime, False otherwise.
  """
  # Prime numbers are greater than 1.  Numbers less than or equal to 1 are not prime.
  if number <= 1:
    return False
  # 2 is a prime number. We handle this as a special case for efficiency.
  if number == 2:
    return True
  # If the number is even (divisible by 2) and not 2, it's not prime.
  if number % 2 == 0:
    return False
  # Now we check for divisibility by numbers from 3 up to the square root of the number.
  # We only need to check up to the square root because if a number has a divisor larger
  # than its square root, it must also have a divisor smaller than its square root.
  for i in range(3, int(number**0.5) + 1, 2):  # Check odd numbers only
    if number % i == 0:
      return False  # If divisible, it's not prime
  # If we haven't found any divisors, the number is prime.
  return True

		

How it works (in plain English):

Imagine you want to find out if a number, let’s say 17, is a prime number. Prime numbers are special numbers that are only divisible by 1 and themselves.

Here’s what the function does step-by-step:

Numbers less than or equal to 1 aren’t prime: First, it checks if the number is 1 or smaller. If it is, it immediately says “no, it’s not prime” and stops. This is because prime numbers must be bigger than 1.
2 is special: Then it checks if the number is 2. 2 is a prime number, so if it’s 2, the function says “yes, it’s prime” and stops.
Even numbers aren’t prime (except for 2): Next, it checks if the number is even (divisible by 2). If it’s even and it’s not 2, it says “no, it’s not prime” because any even number bigger than 2 can be divided by 2.
Checking for divisibility: Now comes the main part. It tries dividing the number by all the odd numbers starting from 3, and going up to the square root of the number. Why the square root? Well, if a number has a divisor bigger than its square root, it must also have a divisor smaller than its square root. So, we only need to check up to the square root to be sure. We check only odd numbers because we already checked for divisibility by 2.
If it divides evenly, it’s not prime: For each number it tries dividing by, if the remainder is 0 (meaning it divides evenly), it says “no, it’s not prime” and stops.
If it doesn’t divide evenly by anything, it’s prime: If it tries dividing by all the numbers up to the square root and none of them divide evenly, it means the number is only divisible by 1 and itself, so it says “yes, it’s prime”!

Example (with 17):

17 is greater than 1.
17 is not 2.
17 is not even.
It tries dividing 17 by 3, 5, 7. None of them divide evenly.
Therefore, 17 is prime!

The function then returns True because 17 is prime.

I hope this explanation helps! Let me know if you have any other questions.

Archer Dynamics