AI@Home – GPT OSS 20B

Conditions and context

Here is my brief test of OpenAI’s free model GPT-OSS with 20 billion parameters. Read on and judge for yourself whether this model is worth your time. Skip to the conclusion if you’re a TL;DR type.

As in all my tests, I use the same prompt, the same hardware, and the same methodology. I’m looking at the same set of metrics across every model: VRAM usage, GPU utilization, CPU load, token throughput, tokens written, and total response time. These matter to me because they reveal whether a model is actually usable on consumer hardware — not just in theory, but in practice.

Specs | Value
Linux Distro | Ubuntu Server 24.04.4 LTS
Linux Kernel | 6.8.0-101
CPU | Intel CORE i7 14th Gen 14700K, Cores: 8P/12E, Threads: 28
Motherboard | MSI PRO B660M-A
RAM | 80 GB DDR4 (32+16+32+16)
SSD | Crucial NVME 1TB
GPU | MSI NVidia GeForce RTX 5060Ti Shadow 2X OC, PCIe 5.0×8
CUDA Cores | 4,608
VRAM | 16 GB GDDR7, 128-bit, 448 GB/s
GPU Driver | NVidia 590.48.01
CUDA version | 13.1
Ollama version | 0.17.4
Model | GPT-OSS 20B
Quantization | MXFP4


A few words about this model’s MXFP4 quantization format. This one’s interesting. Really interesting.
Co-developed by Microsoft, NVIDIA, AMD, and Intel, among others, this MX (microscaling) format stores each weight as a 4-bit floating-point (FP) number and applies a shared scaling factor to each small block of weights, so unlike more traditional (naive) 4-bit quantizers it preserves a lot more mathematical precision.

What is quantization?

Well, AI models live in VRAM as a huge collection of real numbers (weights), and everything the model does is pure numerical calculation on those weights. Given the size of a model, you sometimes must make it smaller so it works with the resources you have, whether that is a server farm or a home computer. To get around this, you store each weight with fewer bits, which introduces rounding errors into the model. It is not too dissimilar from a RAW photograph and a JPEG: the more you compress the RAW photo, the more data you lose while trying to preserve the picture quality. Too little compression and the file is too big (the AI model is very accurate, but too heavy and too slow). Too much compression and the JPEG is very small, but becomes so fuzzy that you can’t make out the details anymore and have to “imagine” the parts that are now blurry or pixelated (the model is very fast, but makes errors and hallucinates because too much precision was thrown away).
You get the point.

Model weights are usually stored as 16-bit or 32-bit numbers, and quantization is used as a form of compression to make them fit into less memory. Most of the models offered to us for end-user use at home are 4-bit. So there definitely is a margin of error as an unwelcome byproduct.
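To put rough numbers on that, here is a back-of-the-envelope sketch of how much memory the weights alone would need at different bit widths. The function name is made up for illustration, and the calculation ignores the context cache and runtime buffers, so real VRAM usage will be higher than these figures.

```python
def weight_memory_gb(parameters: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes).

    Counts weights only; context cache and runtime overhead are not included.
    """
    return parameters * bits_per_weight / 8 / 1e9


# 20 billion parameters at a few common bit widths
for label, bits in [("32-bit", 32), ("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: about {weight_memory_gb(20e9, bits):.0f} GB")
```

The gap between the first and last line is the whole reason a 20B model can fit on a 16 GB card at all.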

Why does this matter?

Naive quantization applies one coarse rounding scheme across large groups of weights, so some weights get rounded far more aggressively than others, resulting in unpredictable errors in “thinking”. MXFP4 gives every small block of weights its own scaling factor, so the rounding adapts to the values in that block and accuracy is kept in the places where it actually matters. The overall result is more accurate, even when the models are the same size.
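The difference is easier to see with a toy example. The sketch below is a simplified illustration of block-scaled 4-bit rounding in the spirit of MXFP4, not the real implementation: the 4-bit value grid, the power-of-two scale, and the block size of 32 are assumptions based on the public description of the MX formats, and all function names are hypothetical. It compares one scale for the whole list of weights against one scale per block.

```python
import math

# Magnitudes a 4-bit float (1 sign, 2 exponent, 1 mantissa bit) can represent.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Round every value in a block to the FP4 grid using one shared scale."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0:
        return [0.0] * len(block)
    # Power-of-two scale chosen so the largest value lands near the top of the grid.
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - 2)
    result = []
    for x in block:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        result.append(math.copysign(nearest * scale, x))
    return result


def quantize(weights, block_size):
    """Quantize a list of weights block by block."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(quantize_block(weights[start:start + block_size]))
    return out


def mean_abs_error(original, rounded):
    return sum(abs(a - b) for a, b in zip(original, rounded)) / len(original)


# Toy data: 32 small weights followed by 32 much larger ones.
weights = [0.01 * (i + 1) for i in range(32)] + [0.25 * (i + 1) for i in range(32)]

one_scale = quantize(weights, block_size=len(weights))  # a single scale for everything
per_block = quantize(weights, block_size=32)            # one scale per block of 32

print("error with one scale:   ", mean_abs_error(weights, one_scale))
print("error with block scales:", mean_abs_error(weights, per_block))
```

With a single scale, the large weights dictate the rounding step and the small weights collapse to zero. With a scale per block, the small weights get a step size that suits them, which is the effect described above.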

The prompt

Write a simple Python function that checks if a number is prime.
Explain how it works in plain English, like you're teaching
a beginner.

The results

I’ll admit it right up front: this GPT-OSS 20B model shocked me. The ferocity of its speed caught me off guard at first and put a big grin on my face. I don’t care how fast a reader you are, there is no way any of you could read the output as fast as this model produces it. And it doesn’t sit and wait either. It starts firing off the reply within a second or two.

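Model | Run | Response (tokens/sec) | Total time (sec) | Tokens written | VRAM util. (16 GB) | GPU watts (180 W max) | GPU util.
GPT-OSS 20B | 1 | 93.55 | 15 | 1419 | 14 GB | 169 W | 95%
GPT-OSS 20B | 2 | 92.11 | 1956 [total time and tokens written run together] | 14 GB | 149 W | 95%
GPT-OSS 20B | 3 | 90.96 | 11 | 940 | 14 GB | 151 W | 95%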
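For anyone who wants to collect comparable throughput numbers, one option is to ask Ollama's local HTTP API for them. This is a minimal sketch rather than the method used for the table above: it assumes Ollama is listening on its default port 11434 and that the model is installed under the tag `gpt-oss:20b`. The helper names are made up for illustration. The `eval_count`, `eval_duration` and `total_duration` fields come from the `/api/generate` response, with durations reported in nanoseconds.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation throughput: tokens written divided by generation time."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def total_seconds(response: dict) -> float:
    """Wall-clock time for the whole request, including model load."""
    return response["total_duration"] / 1e9


def run_prompt(model: str, prompt: str) -> dict:
    """Send one non-streaming prompt to a local Ollama server and return its JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Calling `run_prompt("gpt-oss:20b", prompt)` and passing the result to `tokens_per_second` and `total_seconds` gives numbers that correspond to the throughput and total time columns.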

During the test runs, a pattern showed up immediately: unlike every other model I’ve tested, power consumption briefly spiked to about 150 W at the start, then rapidly tapered off to near zero while the model was still generating text. What? Yet GPU utilization stayed in the high 90s the entire time. The explanation I suspect is the Mixture of Experts (MoE) architecture.
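A power curve like this is easy to miss with a single reading, so it helps to sample the GPU repeatedly while the model is generating. The sketch below is one way to do that, assuming `nvidia-smi` is on the PATH and the machine has a single GPU. The function names are hypothetical and this is not necessarily how the readings in this post were taken.

```python
import subprocess
import time

QUERY = "memory.used,utilization.gpu,power.draw"


def parse_sample(line: str) -> dict:
    """Turn one CSV line from nvidia-smi into named numbers (MiB, percent, watts)."""
    memory, utilization, power = (field.strip() for field in line.split(","))
    return {
        "vram_mib": float(memory),
        "gpu_util_pct": float(utilization),
        "power_w": float(power),
    }


def read_gpu() -> dict:
    """Take a single reading from the first GPU."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(output.splitlines()[0])


def sample_gpu(duration_s: float, interval_s: float = 0.5) -> list:
    """Collect readings for duration_s seconds, one every interval_s seconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(read_gpu())
        time.sleep(interval_s)
    return samples
```

Running `sample_gpu(20)` in a second terminal during a test run yields a time series in which power draw and utilization can be compared side by side.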

GPT-OSS doesn’t activate all 20 billion parameters per token. Instead, a router sends each token through only a small subset of specialized sub-networks (experts). My guess, and it is only a guess, is that the code-generation part of the answer is the more demanding phase and drives the initial wattage spike, while the plain-text explanation is lighter work, so power drops off rapidly while the routing hamsters keep GPU utilization pegged in the high 90s. This is the fundamental architectural contrast with Gemma 27B, a dense model that activates all of its 27 billion parameters for every single token, which explains both its constant high wattage and its molasses-level speed.
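The routing idea can be sketched in a few lines. This is a toy illustration of top-k expert routing, not GPT-OSS's actual code: the expert count, the value of k, the router scores and the stand-in "experts" are all made up for the example.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and give each a mixing weight."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router_scores, k):
    """Run only the selected experts and blend their outputs."""
    output = 0.0
    calls = 0
    for index, weight in route_top_k(router_scores, k):
        output += weight * experts[index](x)
        calls += 1
    return output, calls


# Eight toy "experts": each one just multiplies its input by a different factor.
experts = [lambda x, factor=i + 1: x * factor for i in range(8)]
scores = [0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.9]  # hypothetical router output for one token

output, calls = moe_forward(1.0, experts, scores, k=2)
print(f"experts run: {calls} of {len(experts)}")
```

A dense model is the equivalent of running every expert for every token, which is where the extra work and the extra watts go.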

This is really fascinating!

The conclusion – TL;DR

GPT-OSS 20B is the standout of every model I’ve tested so far, and it’s not even close. 92 tokens/sec on average? Pure GPU, zero CPU spillover, 14 GB VRAM: textbook perfect utilization on consumer hardware. The Mixture of Experts architecture means it’s only activating a fraction of its parameters at a time despite being a 20B model, which explains both the blistering speed and the surprisingly low power draw. The code output was the best of all the models I’ve tested: type hints, even-number optimization, a markdown explanation table, emoji-annotated comments, and a visible reasoning block showing its thinking before it wrote a single line. This is a reasoning model hiding in plain sight, and it went well above and beyond what I asked it to do.

This is historically significant too. If I’m not mistaken, this is the first open-weight language model release from OpenAI since GPT-2. You can run it in your house, on your own hardware, with zero cloud dependency and zero subscription. All you need is a graphics card with 16 GB of VRAM. That combination is why this is, so far, the model to beat.

Two thumbs up!
And toes too!

The Output (Unedited)

One‑liner “prime check” (with a little helper for readability)

def is_prime(n: int) -> bool:
    """
    Return True if n is a prime number, otherwise False.
    A prime number is an integer > 1 that has no positive divisors other than 1 and itself.
    """
    if n <= 1: # 0 and 1 are not prime
        return False
    if n == 2: # 2 is the only even prime
        return True
    if n % 2 == 0: # any other even number is not prime
        return False
    # Test only odd divisors up to the square root of n
    limit = int(n ** 0.5) + 1
    for d in range(3, limit, 2): # step 2 → only odd numbers
        if n % d == 0: # divisor found → composite
            return False
    return True # no divisors found → prime

Plain‑English explanation

  1. Numbers ≤ 1
    • Prime numbers must be greater than 1.
    • If n is 0 or 1, the function returns False right away.
  2. Number 2
    • 2 is the only even prime.
    • If n equals 2, we immediately return True.
  3. Other even numbers
    • Any even number greater than 2 can be divided by 2 with no remainder, so it isn’t prime.
    • The if n % 2 == 0: line catches this and returns False.
  4. Why only odd divisors?
    • After step 3 we know n is odd.
    • If a number has a divisor bigger than its square root, it must also have a smaller paired divisor (because a × b = n).
    • Therefore, if we look for divisors only up to √n, we’ll find one if it exists.
    • We skip even divisors (they’re already handled), so we test only 3, 5, 7, ….
  5. The loop
    • limit is the integer part of √n plus 1 (so the loop covers the whole needed range).
    • for d in range(3, limit, 2): goes through every odd candidate d.
    • if n % d == 0: checks whether d divides n.
    • If the remainder is 0, we found a divisor → n is composite → return False.
    • If the loop finishes without finding any divisor, the only divisors are 1 and n itself → n is prime → return True.

Quick tests

>>> is_prime(2)
True
>>> is_prime(3)
True
>>> is_prime(4)
False
>>> is_prime(17)
True
>>> is_prime(18)
False
>>> is_prime(97)
True

How fast is this?

The function checks at most about √n / 2 numbers.

Feel free to drop the function into any script, or call it from a notebook to quickly test whether a number is prime. Happy coding!

Need Help?

Need help streamlining your processes or solving tricky business problems? I offer one-on-one consultations to get you unstuck fast. Book a free consultation with me today at goarcherdynamics.com.

Want more practical tips and workflow hacks? I publish them regularly on my blog — check it out and subscribe for newsletter updates: goarcherdynamics.com

Posted by Jiri Krecek
