Conditions and context
Here is my brief test of GPT-OSS 20B, OpenAI’s free open-weight model with roughly 20 billion parameters. Read on and judge for yourself whether this model is worth your time. Skip to the conclusion if you’re a TL;DR type.
As in all my tests, I use the same prompt, the same hardware, and the same methodology. I’m looking at the same set of metrics across every model: VRAM usage, GPU utilization, CPU load, token throughput, tokens written, and total response time. These matter to me because they reveal whether a model is actually usable on consumer hardware — not just in theory, but in practice.
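For anyone who wants to collect the throughput numbers themselves, here is a minimal sketch. It assumes the figures are read from the JSON that Ollama’s REST API returns at the end of a generation, where `eval_count` is the number of generated tokens and `eval_duration` and `total_duration` are given in nanoseconds. The helper name `ollama_throughput` and the sample values are made up for illustration and are not taken from the runs below.

```python
def ollama_throughput(resp: dict) -> dict:
    """Derive throughput numbers from one Ollama generate/chat response.

    Assumes the durations are reported in nanoseconds.
    """
    ns = 1_000_000_000
    gen_seconds = resp["eval_duration"] / ns
    return {
        "tokens_written": resp["eval_count"],
        "tokens_per_sec": resp["eval_count"] / gen_seconds,
        "total_seconds": resp["total_duration"] / ns,
    }


# Made-up numbers, only to show the arithmetic (not from my test runs).
sample = {
    "eval_count": 1000,
    "eval_duration": 10_000_000_000,
    "total_duration": 12_500_000_000,
}
stats = ollama_throughput(sample)
print(stats)
```

VRAM usage, GPU utilization and wattage are not part of that response, so they would have to come from a separate GPU monitoring tool.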
| Specs | Value |
|---|---|
| Linux Distro | Ubuntu Server 24.04.4 LTS |
| Linux Kernel | 6.8.0-101 |
| CPU | Intel CORE i7 14th Gen 14700K Cores: 8P/12E Threads: 28 |
| Motherboard | MSI PRO B660M-A |
| RAM | 80 GB DDR4 (32+16+32+16) |
| SSD | Crucial NVME 1TB |
| GPU | MSI NVidia GeForce RTX 5060Ti Shadow 2X OC PCIe 5.0×8 |
| CUDA Cores | 4,608 |
| VRAM | 16 GB GDDR7 128-bit 448 GB/s |
| GPU Driver | NVidia 590.48.01 |
| CUDA version | 13.1 |
| Ollama version | 0.17.4 |
| Model | GPT-OSS 20B |
| Quantization | MXFP4 |
A few words about this model’s MXFP4 quantizer. This one’s interesting. Really interesting.
Co-developed by Microsoft, NVIDIA, AMD and Intel, among others, the MX (microscaling) format stores each weight as a 4-bit floating-point (FP4) number and shares one scaling factor across each small block of weights, so unlike more traditional (naive) 4-bit quantizers it preserves a lot more mathematical precision.
What is quantization?
Well, AI models live in VRAM as huge sets of real numbers (weights), so the calculations inside the model are pure numerical operations. Given the size of a model, you sometimes must make it smaller to make it work with the resources you have (a server farm or a home computer). To get around this you store each weight with fewer bits, which introduces rounding errors and approximations into your data. It is not too dissimilar from a RAW photograph and a JPEG – the more you compress the RAW photo, the more data you lose while trying to preserve the picture quality. Too little compression and the file is too big (the AI model is very accurate, but too heavy and too slow); too much compression and the JPEG is very small, but becomes so fuzzy you can’t even tell the details anymore and have to “imagine” the parts that are now too blurry or pixelated (the model is very fast, but makes errors and hallucinates because too much precision was thrown away).
You get the point.
AI models are typically trained with 16- or 32-bit weights, and quantization is used as a form of compression to make them fit into less memory – most of the models available to us for end-user use at home are 4-bit. So, there definitely is a margin of error as an unwelcome byproduct.
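To put rough numbers on that, here is a back-of-the-envelope sketch of how much memory the weights alone would need at different bit widths. It assumes a nominal 20 billion parameters and counts nothing but the weights; the function name `weight_footprint_gb` is just for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 20e9  # nominal "20B"; an assumption, the exact count may differ

for label, bits in [("32-bit", 32), ("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>7}: {weight_footprint_gb(N_PARAMS, bits):5.1f} GB")
```

At 32 bits this works out to 80 GB, at 4 bits to 10 GB. Treat the 4-bit figure as a floor: the context cache, runtime buffers and any tensors kept at higher precision come on top of it.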
Why does this matter?
Naive quantization applies one coarse scale to a large group of weights, so some weights get rounded far more aggressively than others, resulting in seemingly random errors in “thinking”. MXFP4 scales each small block of weights separately, which keeps the “dumbing down” of the data under tighter control and retains better accuracy in places where it actually matters. Overall this brings a far more accurate result, even if the models are the same size.
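A toy sketch can make the difference visible. It rests on my understanding of the microscaling idea, all of which should be read as assumptions: blocks of 32 weights, one shared power-of-two scale per block, and 4-bit elements limited to the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6. It is not the code Ollama actually runs. The same quantizer is applied twice to weights containing one outlier, once with a single scale for everything and once block by block.

```python
import math
import random

# Magnitudes a 4-bit float (E2M1) can represent; assumed, see the text above.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block using a single shared power-of-two scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Align the largest value with the top of the FP4 range (exponent 2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def quantize(weights, block_size):
    """Split the weights into blocks and quantize each block on its own."""
    out = []
    for i in range(0, len(weights), block_size):
        out.extend(quantize_block(weights[i:i + block_size]))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
weights[100] = 1.5  # a single outlier among many small weights

one_scale = quantize(weights, block_size=len(weights))  # one scale for all
per_block = quantize(weights, block_size=32)            # one scale per block

print("single scale error:", mean_abs_error(weights, one_scale))
print("per-block error:   ", mean_abs_error(weights, per_block))
```

With a single scale, the outlier dictates the grid and most of the small weights collapse to zero. With per-block scales, only the block containing the outlier pays that price, so the average error comes out clearly lower.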
The prompt
Write a simple Python function that checks if a number is prime. Explain how it works in plain English, like you're teaching a beginner.
The results
I’ll admit it right up front: this GPT-OSS 20B model shocked me. The ferocity of its speed caught me off guard at first, bringing a big grin to my face. I don’t care how fast of a reader you are, there is no way any of you could ever read the output as fast as this model produces it. And it doesn’t sit and wait either. It starts firing off the reply within a second or two.
| Model | Run | Response Token/sec | Total Time (sec) | Tokens Written | VRAM Util. (16 GB) | GPU Watts (180W max) | GPU Util. |
|---|---|---|---|---|---|---|---|
| GPT-OSS 20B | 1 | 93.55 | 15 | 1419 | 14GB | 169W | 95% |
| GPT-OSS 20B | 2 | 92.1 | 11 | 956 | 14GB | 149W | 95% |
| GPT-OSS 20B | 3 | 90.96 | 11 | 940 | 14GB | 151W | 95% |
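As a quick arithmetic check on the table, the sketch below averages the throughput and estimates how long pure generation took in each run (tokens written divided by tokens per second). The numbers are copied from the table; the only assumption is that the total time column is rounded to whole seconds.

```python
runs = [
    # (tokens_per_sec, total_time_sec, tokens_written), copied from the table
    (93.55, 15, 1419),
    (92.10, 11, 956),
    (90.96, 11, 940),
]

avg_tps = sum(tps for tps, _, _ in runs) / len(runs)
print(f"average throughput: {avg_tps:.2f} tokens/sec")

for i, (tps, total, tokens) in enumerate(runs, start=1):
    gen_time = tokens / tps  # seconds spent generating tokens
    print(f"run {i}: about {gen_time:.1f} s generating, {total} s total reported")
```

The average lands at roughly 92 tokens/sec, and the estimated generation times sit within a second of the reported totals, so nearly all of each run was spent writing tokens rather than waiting.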
During the test runs, an immediately visible pattern showed up: unlike every other model I’ve tested, power consumption briefly spiked to 150W or more at the start, then rapidly tapered off to near zero while the model was still generating text. What? Yet GPU utilization stayed in the high 90s the entire time. The likely explanation, I suspect, is the Mixture of Experts (MoE) architecture.
GPT-OSS doesn’t activate all 20 billion parameters per token, instead routing each token through only a subset of specialized neural networks (the experts). My guess is that code generation is complex and puts a heavier load on the experts, driving the initial wattage spike; text explanation is simpler, less of the model is working hard, and power drops rapidly, while the routing hamsters keep GPU utilization pegged in the high 90s. This is the fundamental architectural contrast against Gemma 27B, which is a dense model that activates all of its 27 billion parameters for every single token, explaining both its constant high wattage and its molasses-level speed.
This is really fascinating!
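To make the routing idea concrete, here is a minimal sketch of textbook top-k expert routing. Everything in it is an assumption for illustration: the expert count, the value of k, and the random numbers standing in for a learned router. It is not GPT-OSS’s actual implementation.

```python
import math
import random


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))


NUM_EXPERTS = 32  # illustrative values, not read from the model files
TOP_K = 4

random.seed(1)
for token in ["def", "is_prime", "the", "number"]:
    # Random scores stand in for what a trained router would compute.
    logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen = route(logits, TOP_K)
    print(token, [(expert, round(gate, 2)) for expert, gate in chosen])
```

In this textbook form k is fixed, so every token passes through the same number of experts and only the choice of experts changes from token to token. Whether GPT-OSS behaves exactly like this, and how that relates to the power pattern I saw, remains an open question for me.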
The conclusion – TL;DR
GPT-OSS 20B is the standout of every model I’ve tested so far, and it’s not even close. 92 tokens/sec on average? Pure GPU, zero CPU spillover, 14 GB VRAM: textbook perfect utilization on consumer hardware. The Mixture of Experts architecture means it’s only activating a fraction of its parameters at a time despite being a 20B model, which explains both the blistering speed and the surprisingly low power draw. The code output was the best of all models I’ve tested: type hints, even-number optimization, a markdown explanation table, emoji-annotated comments, and a visible reasoning block showing its thinking before it wrote a single line. This is a reasoning model hiding in plain sight, and it went well above and beyond what I asked it to do.
This is historically significant too: if I’m not mistaken, this is the first open-weight language model release from OpenAI since GPT-2. You can run it in your house, on your own hardware, with zero cloud dependency and zero subscription. All you need is a graphics card with 16 GB of VRAM. That combination is why this is, so far, the model to beat.
Two thumbs up!
And toes too!
The Output (Unedited)
One‑liner “prime check” (with a little helper for readability)
    def is_prime(n: int) -> bool:
        """
        Return True if n is a prime number, otherwise False.

        A prime number is an integer > 1 that has no positive divisors
        other than 1 and itself.
        """
        if n <= 1:                  # 0 and 1 are not prime
            return False
        if n == 2:                  # 2 is the only even prime
            return True
        if n % 2 == 0:              # any other even number is not prime
            return False

        # Test only odd divisors up to the square root of n
        limit = int(n ** 0.5) + 1
        for d in range(3, limit, 2):    # step 2 → only odd numbers
            if n % d == 0:              # divisor found → composite
                return False
        return True                 # no divisors found → prime
Plain‑English explanation
1. Numbers ≤ 1
   - Prime numbers must be greater than 1.
   - If `n` is 0 or 1, the function returns `False` right away.
2. Number 2
   - 2 is the only even prime.
   - If `n` equals 2, we immediately return `True`.
3. Other even numbers
   - Any even number greater than 2 can be divided by 2 with no remainder, so it isn’t prime.
   - The `if n % 2 == 0:` line catches this and returns `False`.
4. Why only odd divisors?
   - After step 3 we know `n` is odd.
   - If a number has a divisor bigger than its square root, it must also have a smaller paired divisor (because `a × b = n`).
   - Therefore, if we look for divisors only up to `√n`, we’ll find one if it exists.
   - We skip even divisors (they’re already handled), so we test only 3, 5, 7, ….
5. The loop
   - `limit` is the integer part of `√n` plus 1 (so the loop covers the whole needed range).
   - `for d in range(3, limit, 2):` goes through every odd candidate `d`.
   - `if n % d == 0:` checks whether `d` divides `n`.
   - If the remainder is 0, we found a divisor → `n` is composite → return `False`.
   - If the loop finishes without finding any divisor, the only divisors are 1 and `n` itself → `n` is prime → return `True`.
Quick tests
    >>> is_prime(2)
    True
    >>> is_prime(3)
    True
    >>> is_prime(4)
    False
    >>> is_prime(17)
    True
    >>> is_prime(18)
    False
    >>> is_prime(97)
    True
How fast is this?
The function checks at most about √n / 2 numbers.
- Time complexity: O(√n) – good enough for everyday use (checking numbers up to millions or even billions on a normal laptop).
- Memory usage: Only a few integers and a small loop – negligible.
Feel free to drop the function into any script, or call it from a notebook to quickly test whether a number is prime. Happy coding!
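Stepping outside the model’s output for a moment: a small sketch for sanity-checking the function it wrote. The `is_prime` logic is repeated so the snippet runs on its own, and `is_prime_naive` is a hypothetical reference helper that simply tries every possible divisor. The two are compared over a range of small numbers, including negatives, zero and perfect squares.

```python
def is_prime(n: int) -> bool:
    # Same logic as the model's function, repeated so this runs standalone.
    if n <= 1:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    limit = int(n ** 0.5) + 1
    for d in range(3, limit, 2):
        if n % d == 0:
            return False
    return True


def is_prime_naive(n: int) -> bool:
    """Slow reference check: try every divisor from 2 to n - 1."""
    return n > 1 and all(n % d != 0 for d in range(2, n))


# Collect every number on which the two functions disagree.
mismatches = [n for n in range(-5, 2000) if is_prime(n) != is_prime_naive(n)]
print("mismatches:", mismatches)
```

An empty list means the fast version agrees with the brute-force one everywhere in the tested range, which is a reasonable spot check but not a proof for very large numbers.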
