Conditions & context
Today we are diving into a quick test of Google’s Gemma3, again a QAT-quantized model, but this time the 12B. Let’s hope those 12 billion parameters do a better job than the weights of its “crazy” 4B sibling did. Let’s dive right in!
As in all my tests, I use the same prompt, the same hardware, and the same methodology. I’m looking at the same set of metrics across every model: VRAM usage, GPU utilization, CPU load, token throughput, tokens written, and total response time. These matter to me because they reveal whether a model is actually usable on consumer hardware — not just in theory, but in practice.
| Specs | Value |
|---|---|
| Linux Distro | Ubuntu Server 24.04.4 LTS |
| Linux Kernel | 6.8.0-101 |
| CPU | Intel CORE i7 14th Gen 14700K Cores: 8P/12E Threads: 28 |
| Motherboard | MSI PRO B660M-A |
| RAM | 80 GB DDR4 (32+16+32+16) |
| SSD | Crucial NVME 1TB |
| GPU | MSI NVidia GeForce RTX 5060Ti Shadow 2X OC PCIe 5.0×8 |
| CUDA Cores | 4,608 |
| VRAM | 16 GB GDDR7 128-bit 448 GB/s |
| GPU Driver | NVidia 590.48.01 |
| CUDA version | 13.1 |
| Ollama version | 0.17.4 |
| Model | Gemma3 12B QAT |
| Quantization | QAT |
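The post does not say how the throughput and timing figures were collected. As a minimal sketch of one possible approach, assuming the numbers come from a non-streaming call to Ollama's `/api/generate` endpoint, the response reports `eval_count` (tokens generated) plus `eval_duration` and `total_duration` in nanoseconds, and the metrics follow from those fields. The helper name `summarize_ollama_response` and the sample values are hypothetical, chosen only to show the arithmetic.

```python
def summarize_ollama_response(resp: dict) -> dict:
    """Derive throughput and timing from an Ollama /api/generate response dict."""
    ns = 1e9  # Ollama reports durations in nanoseconds
    eval_seconds = resp["eval_duration"] / ns
    return {
        "tokens_written": resp["eval_count"],
        "tokens_per_sec": resp["eval_count"] / eval_seconds,
        "total_seconds": resp["total_duration"] / ns,
    }


# Hypothetical response values (not measurements from this test).
sample = {
    "eval_count": 800,
    "eval_duration": 20_000_000_000,   # 20 s spent generating tokens
    "total_duration": 21_500_000_000,  # also covers model load and prompt evaluation
}

summary = summarize_ollama_response(sample)
print(summary)
```

Total time is larger than generation time because it also includes loading the model and evaluating the prompt, which is why the two are worth tracking separately.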
The prompt
Write a simple Python function that checks if a number is prime. Explain how it works in plain English, like you're teaching a beginner.
The results
Gemma3 12B QAT is a redemption story after the 4B debacle I ran into while testing its smaller sibling. That model degraded, hallucinated explanations of the Python code it had supposedly written, and, as a cherry on top, lazily handed work off to the CPU, drowning my whole system.
The 12B holds the fort, though!
It felt stable, with a rock-steady 97% GPU utilization across all three runs: no CPU shenanigans, no logic errors, no embarrassing walkthrough bugs. Token output was nearly identical across runs (836, 835, 828). Google clearly did something right scaling this up.
I was surprised. Admittedly, I was half-expecting another slow run, thinking that the model’s preference for tensor cores would let it down again. I was wrong!
| Model | Quant | Run | Tokens/sec | Total Time | Tokens Written | VRAM | GPU Util |
|---|---|---|---|---|---|---|---|
| Gemma3 12B | QAT | 1 | 45.65 | 18s | 836 | 11GB | 97% |
| Gemma3 12B | QAT | 2 | 44.47 | 19s | 835 | 11GB | 97% |
| Gemma3 12B | QAT | 3 | 43.93 | 20s | 828 | 11GB | 97% |
At roughly 44.7 tokens/sec on average, this model sits comfortably in the mid-tier of those I’ve tested: faster than Qwen3 14B, slightly behind Mistral Nemo 12B. The code was correct and clean and included working example usage, and the explanation was accurate. I actually understood it.
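As a quick sanity check on the table, the average throughput and the implied generation time per run can be recomputed from the reported figures. This is only a sketch over the three rows above. The gap between the implied generation time and the reported total is presumably model load and prompt evaluation, which is an assumption on my part rather than something measured here.

```python
# Values copied from the results table (runs 1-3).
runs = [
    {"run": 1, "tokens_per_sec": 45.65, "tokens": 836, "total_s": 18},
    {"run": 2, "tokens_per_sec": 44.47, "tokens": 835, "total_s": 19},
    {"run": 3, "tokens_per_sec": 43.93, "tokens": 828, "total_s": 20},
]

# Arithmetic mean of the per-run throughput.
mean_tps = sum(r["tokens_per_sec"] for r in runs) / len(runs)
print(f"mean throughput: {mean_tps:.2f} tokens/sec")

# Generation time implied by tokens written / tokens per second.
for r in runs:
    gen_s = r["tokens"] / r["tokens_per_sec"]
    print(f"run {r['run']}: ~{gen_s:.1f}s generating, {r['total_s']}s reported total")
```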
At 11GB VRAM it’s also firmly a 5060 Ti model — too big for the 3060 Ti, which narrows its audience considerably.
Gemma 12B QAT reads like a textbook — correct, thorough, the information is all there, but you don’t want to have a conversation with a textbook, you want to feel like you are talking to a colleague. That’s the fundamental difference between Gemma and Mistral’s training philosophy — Google optimized for accuracy, Mistral optimized for conversation. Both produce correct outputs. Only one feels like talking to someone.
The conclusion – TL;DR
While Gemma3 12B QAT is a legitimately useful and very capable model — and a significant improvement over the crazy little 4B — it’s boringly average. Compared directly to Mistral Nemo 12B, it loses not only on speed, but also on personality. If you’re running a 16GB card and want a mid-size model for serious work, Mistral Nemo is the harder model to argue against.
Gemma 12B earns a passing rating in my humble opinion, but not a permanent slot among my actively used models. Why?
Two things:
First: Mistral Nemo runs at 47 tokens a second on 9.2GB of VRAM, while Gemma manages 44 tokens a second and eats up 11GB. Yes, I realize it is not an entirely fair comparison, as Mistral is a full inference model with “no pre-thinking”, but does it really matter at the end of the day? The numbers don’t lie: Mistral has better throughput and uses less VRAM. And those extra 2GB or so could let you keep a second model loaded on a 16GB card. Think about it…
Second: I don’t like overly chatty models that pepper me with flowery sentences I don’t have time to read, but I also don’t want to converse with a model that is about as warm as a DMV waiting room. Maybe you do?
Your mileage may vary.
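To put the first point in numbers, here is a small sketch that compares the two models on throughput per gigabyte of VRAM and on the headroom left on a 16GB card. The figures are the rounded ones quoted above, so treat the output as approximate. "Tokens per second per GB" is just an illustrative ratio I am using here, not a standard benchmark metric.

```python
CARD_VRAM_GB = 16.0  # RTX 5060 Ti from the specs table

# Rounded figures quoted in the conclusion.
models = {
    "Gemma3 12B QAT": {"tokens_per_sec": 44.0, "vram_gb": 11.0},
    "Mistral Nemo 12B": {"tokens_per_sec": 47.0, "vram_gb": 9.2},
}

for name, m in models.items():
    headroom = CARD_VRAM_GB - m["vram_gb"]          # VRAM left for a second model
    per_gb = m["tokens_per_sec"] / m["vram_gb"]     # illustrative efficiency ratio
    print(f"{name}: {per_gb:.2f} tok/s per GB, {headroom:.1f} GB headroom")

# How much more VRAM Gemma needs than Mistral Nemo.
vram_gap = models["Gemma3 12B QAT"]["vram_gb"] - models["Mistral Nemo 12B"]["vram_gb"]
print(f"VRAM gap: {vram_gap:.1f} GB")
```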
The output (pasted & unedited)
Okay, here’s a simple Python function to check if a number is prime, followed by a plain English explanation:
```python
def is_prime(number):
    """
    Checks if a number is prime.

    Args:
        number: The number to check (must be a positive integer).

    Returns:
        True if the number is prime, False otherwise.
    """
    if number <= 1:
        return False  # Numbers less than or equal to 1 are not prime

    # We only need to check divisibility up to the square root of the number.
    # If a number has a divisor greater than its square root, it must also
    # have a divisor smaller than its square root.
    for i in range(2, int(number**0.5) + 1):
        if number % i == 0:
            return False  # If the number is divisible by any number from 2 up to its square root, it's not prime

    return True  # If we've checked all the divisors and haven't found any, the number is prime

# Example Usage
print(is_prime(7))   # Output: True
print(is_prime(12))  # Output: False
print(is_prime(2))   # Output: True
print(is_prime(1))   # Output: False
```
Explanation in Plain English (For Beginners):
Imagine you want to figure out if a number is “prime.” A prime number is a whole number greater than 1 that can only be divided evenly by 1 and itself. Think of numbers like 2, 3, 5, 7, 11…
Here’s how the function works:
- Handle Special Cases:
  - If the number is less than or equal to 1, it’s not prime. So, we immediately return `False`.
- The “Trying Divisors” Loop:
  - We loop through numbers starting from 2, up to the square root of the input number. Why the square root? Because if a number has a divisor larger than its square root, it must also have a divisor smaller than its square root. So, we only need to check up to the square root for efficiency.
  - `number**0.5` calculates the square root of the number. `int()` converts it to a whole number (integer) because we can only divide by whole numbers. We add `+ 1` to make sure we check the square root itself if it’s a whole number.
- The Divisibility Check:
  - Inside the loop, we check if `number` is divisible by `i` (the current number we’re checking). We use the modulo operator `%`. `number % i == 0` means that `number` divided by `i` leaves no remainder (it divides evenly).
  - If we find any number `i` that divides `number` evenly (other than 1 and itself), then `number` is not prime. We immediately return `False`.
- If No Divisors Found:
  - If we get through the entire loop without finding any divisors (numbers that divide evenly), it means the number is only divisible by 1 and itself. Therefore, it is prime! We return `True`.
In short: The function tries to divide the number by all numbers from 2 to its square root. If it finds any number that divides evenly, it knows the number isn’t prime. If it gets through all those checks without finding a divisor, then it’s prime!
Is this explanation clear? Would you like me to go over any part in more detail or give you another example?
