AI@Home – GPT OSS 20B – Archer Dynamics

Conditions and context

Here is my brief test of GPT-OSS, OpenAI’s free open-weight model, in its 20-billion-parameter version. Read on and judge for yourself whether this model is worth your time. Skip to the conclusion if you’re a TL;DR type.

As in all my tests, I use the same prompt, the same hardware, and the same methodology. I’m looking at the same set of metrics across every model: VRAM usage, GPU utilization, CPU load, token throughput, tokens written, and total response time. These matter to me because they reveal whether a model is actually usable on consumer hardware — not just in theory, but in practice.

Specs	Value
Linux Distro	Ubuntu Server 24.04.4 LTS
Linux Kernel	6.8.0-101
CPU	Intel Core i7-14700K (14th Gen), Cores: 8P/12E, Threads: 28
Motherboard	MSI PRO B660M-A
RAM	80 GB DDR4 (32+16+32+16)
SSD	Crucial NVME 1TB
GPU	MSI NVIDIA GeForce RTX 5060 Ti Shadow 2X OC, PCIe 5.0 ×8
CUDA Cores	4,608
VRAM	16 GB GDDR7 128-bit 448 GB/s
GPU Driver	NVIDIA 590.48.01
CUDA version	13.1
Ollama version	0.17.4
Model	GPT-OSS 20B
Quantization	MXFP4

A few words about this model’s MXFP4 quantization. This one’s interesting. Really interesting.
MXFP4 is one of the MX (microscaling) formats co-developed by a group of vendors that includes Microsoft, NVIDIA, AMD and Intel. Each weight is stored as a 4-bit floating point (FP) value, and every small block of weights shares its own scaling factor, so unlike more traditional (naive) 4-bit quantizers it preserves a lot more mathematical precision.

What is quantization?

Well, AI models live in VRAM as huge collections of real numbers (weights), and everything the model computes is pure arithmetic on those numbers. Given the size of a model, you sometimes must make it smaller so it works with the resources you have (a server farm or a home computer). To get around this, you store each weight with fewer bits, which introduces rounding errors and approximations into your data. It is not too dissimilar from a RAW photograph and a JPEG: the more you compress the RAW photo, the more data you lose while trying to preserve the picture quality. Too little compression and the file is too big (the AI model is very accurate, but too heavy and too slow). Too much compression and the JPEG is very small, but so fuzzy that you can’t make out the details anymore and have to “imagine” the parts that are now blurry or pixelated (the model is very fast, but makes errors and hallucinates because too much was rounded away).
You get the point.
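To put rough numbers on the JPEG analogy, here is a minimal sketch of plain uniform rounding at different bit widths. The weights are made-up values, and this is a generic illustration rather than the scheme any particular model uses.

```python
def quantize_roundtrip(weights, bits):
    """Snap each weight to one of 2**bits evenly spaced levels, then map it back."""
    steps = 2 ** bits - 1                      # intervals between lowest and highest level
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / steps
    return [lo + round((w - lo) / step) * step for w in weights]


def max_error(weights, bits):
    """Worst-case difference between a weight and its rounded version."""
    restored = quantize_roundtrip(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weights, chosen only for illustration.
weights = [-0.82, -0.31, -0.05, 0.0, 0.07, 0.44, 0.91]

for bits in (8, 4, 2):
    print(f"{bits}-bit: worst rounding error = {max_error(weights, bits):.4f}")
```

Fewer bits means fewer levels to snap to, so the worst-case error grows as the bit width shrinks.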

AI models are typically trained and stored at 16- or 32-bit precision, and quantization is used as a form of compression to make them fit into less memory. Most of the models available to us for end-user use are 4-bit, so there definitely is a margin of error as an unwelcome byproduct.
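The arithmetic behind that is simple enough to sketch. The snippet below assumes a nominal 20 billion parameters (taken from the model name) and counts the weights only; actual VRAM use will be higher, since the runtime also keeps the context cache and other buffers on the GPU.

```python
def weights_size_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 20e9  # nominal count from the model name, an assumption for this estimate

for label, bits in [("32-bit", 32), ("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weights_size_gb(PARAMS, bits):.0f} GB")
```

Only the 4-bit row lands comfortably under a 16 GB card, which is why 4-bit builds dominate on consumer hardware.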

Why does this matter?

Naive quantization applies one scaling factor to a large group of weights, so a handful of unusually large values forces everything else onto a coarse grid. Many small weights then get rounded far more aggressively than they should, resulting in errors in “thinking”. MXFP4 scales each small block on its own, so the “dumbing down” of data stays local and accuracy is kept in the places where it actually matters. The overall result is more accurate, even when the models are the same size.
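Here is a toy sketch of why per-block scaling helps. It is not the real MXFP4 encoding: it uses a signed integer grid, a block size of 4, and made-up weights with one deliberate outlier, all chosen just to make the effect visible.

```python
def quantize_shared_scale(values, levels=7):
    """Round values onto a signed grid (-levels..levels) that shares one scale."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / levels
    return [round(v / scale) * scale for v in values]


def quantize_blockwise(values, block_size):
    """Give every block of `block_size` values its own scale."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_shared_scale(values[i:i + block_size]))
    return out


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical weights: mostly small values plus one large outlier.
weights = [0.02, -0.03, 0.01, 0.04,
           0.03, -0.02, 0.05, -0.01,
           0.01, 0.02, -0.04, 6.00]

tensor_error = mean_abs_error(weights, quantize_shared_scale(weights))
block_error = mean_abs_error(weights, quantize_blockwise(weights, block_size=4))

print(f"one scale for everything: mean error = {tensor_error:.4f}")
print(f"one scale per block of 4: mean error = {block_error:.4f}")
```

With a single scale, the outlier sets the grid spacing and every small weight collapses to zero. With per-block scales, only the block that contains the outlier pays that price.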

The prompt

Write a simple Python function that checks if a number is prime.
Explain how it works in plain English, like you're teaching
a beginner.

The results

I’ll admit it right up front: this GPT-OSS 20B model shocked me. The ferocity of its speed caught me off guard at first and put a big grin on my face. I don’t care how fast a reader you are, there is no way any of you could ever read the output as fast as this model produces it. And it doesn’t sit and wait either. It starts firing off the reply within a second or two.

Model	Run	Response Token/sec	Total Time (sec)	Tokens Written	VRAM Util. (16 GB)	GPU Watts (180W max)	GPU Util.
GPT-OSS 20B	1	93.55	15	1419	14GB	169W	95%
GPT-OSS 20B	2	92.1	11	956	14GB	149W	95%
GPT-OSS 20B	3	90.96	11	940	14GB	151W	95%
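As a quick sanity check on the table above, the sketch below recomputes the average throughput and the generation time implied by each run. The figures are copied from the table; the only assumption is that the "Total Time" column is rounded to whole seconds.

```python
runs = [
    # (tokens/sec, tokens written, total time in seconds), copied from the table
    (93.55, 1419, 15),
    (92.10, 956, 11),
    (90.96, 940, 11),
]

mean_rate = sum(rate for rate, _, _ in runs) / len(runs)
print(f"mean throughput: {mean_rate:.2f} tokens/sec")

for rate, tokens, total in runs:
    generation = tokens / rate  # seconds implied by throughput alone
    print(f"{tokens} tokens: {generation:.1f} s implied, {total} s reported")
```

The implied and reported times agree to within about a second in every run, which suggests almost all of the wall time went into generating tokens rather than waiting.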

During the test runs, one pattern showed up immediately: unlike every other model I tested, power consumption briefly spiked to about 150W at the start, then rapidly tapered off to near zero while the model was still generating text. Yet GPU utilization stayed in the high 90s the entire time. The likely explanation, I suspect, is the Mixture of Experts (MoE) architecture.

GPT-OSS doesn’t activate all 20 billion parameters per token; it routes each token through only a small subset of specialized sub-networks (experts). My guess is that code generation is the heavier part of the job and drives the initial wattage spike, while the plain-text explanation is lighter and lets power drop rapidly, all while the routing hamsters keep GPU utilization pegged in the high 90s. Treat that as a hypothesis: MoE routers normally pick a fixed number of experts per token, so the power curve may well have other causes. Either way, this is the fundamental architectural contrast with Gemma 27B, a dense model that activates all of its 27 billion parameters for every single token, which explains both its constant high wattage and its molasses-level speed.
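For readers who want to see the routing idea in code, here is a minimal sketch of top-k expert selection. Everything in it is a toy assumption: 8 experts, 2 active per token, and random numbers standing in for a learned router. It does not reflect the real model's configuration.

```python
import math
import random

NUM_EXPERTS = 8  # toy value, not the real model's configuration
TOP_K = 2        # experts activated per token, also a toy value


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k=TOP_K):
    """Return (expert index, mixing weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


rng = random.Random(0)
for token in ["def", "is_prime", "(", "n", ")"]:
    # Random scores stand in for what a trained router would produce.
    scores = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]
    picked = ", ".join(f"expert {i} ({w:.2f})" for i, w in route(scores))
    print(f"{token!r:>12} -> {picked}")

print(f"experts used per token: {TOP_K} of {NUM_EXPERTS}")
```

Note that every token picks exactly `TOP_K` experts here, whether it belongs to code or to prose. In this kind of routing, which experts fire changes from token to token, but how many fire does not.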

This is really fascinating!

The conclusion – TL;DR

GPT-OSS 20B is the standout of every model I’ve tested so far, and it’s not even close. 92 tokens/sec on average, pure GPU, zero CPU spillover, 14 GB VRAM: textbook perfect utilization on consumer hardware. The Mixture of Experts architecture means it activates only a fraction of its parameters at a time despite being a 20B model, which explains both the blistering speed and the surprisingly low power draw. The code output was the best of all the models I’ve tested: type hints, even-number optimization, a markdown explanation table, emoji-annotated comments, and a visible reasoning block showing its thinking before it wrote a single line. This is a reasoning model hiding in plain sight, and it went well above and beyond what I asked it to do.

This is historically significant too: if I’m not mistaken, this is OpenAI’s first open-weight language model release since GPT-2. You can run it in your house, on your own hardware, with zero cloud dependency and zero subscription. All you need is a graphics card with 16 GB of VRAM. That combination is why this is, so far, the model to beat.

Two thumbs up!
And toes too!

The Output (Unedited)

One‑liner “prime check” (with a little helper for readability)

def is_prime(n: int) -> bool:
    """
    Return True if n is a prime number, otherwise False.
    A prime number is an integer > 1 that has no positive divisors other than 1 and itself.
    """
    if n <= 1:          # 0 and 1 are not prime
        return False
    if n == 2:          # 2 is the only even prime
        return True
    if n % 2 == 0:      # any other even number is not prime
        return False
    # Test only odd divisors up to the square root of n
    limit = int(n ** 0.5) + 1
    for d in range(3, limit, 2):   # step 2 → only odd numbers
        if n % d == 0:             # divisor found → composite
            return False
    return True                     # no divisors found → prime


Plain‑English explanation

1. Numbers ≤ 1
- Prime numbers must be greater than 1.
- If n is 0 or 1, the function returns False right away.
2. Number 2
- 2 is the only even prime.
- If n equals 2, we immediately return True.
3. Other even numbers
- Any even number greater than 2 can be divided by 2 with no remainder, so it isn’t prime.
- The if n % 2 == 0: line catches this and returns False.
4. Why only odd divisors?
- After step 3 we know n is odd.
- If a number has a divisor bigger than its square root, it must also have a smaller paired divisor (because a × b = n).
- Therefore, if we look for divisors only up to √n, we’ll find one if it exists.
- We skip even divisors (they’re already handled), so we test only 3, 5, 7, ….
5. The loop
- limit is the integer part of √n plus 1 (so the loop covers the whole needed range).
- for d in range(3, limit, 2): goes through every odd candidate d.
- if n % d == 0: checks whether d divides n.
- If the remainder is 0, we found a divisor → n is composite → return False.
- If the loop finishes without finding any divisor, the only divisors are 1 and n itself → n is prime → return True.

Quick tests

>>> is_prime(2)
True
>>> is_prime(3)
True
>>> is_prime(4)
False
>>> is_prime(17)
True
>>> is_prime(18)
False
>>> is_prime(97)
True


How fast is this?

The function checks at most about √n / 2 numbers.

Time complexity: O(√n) – good enough for everyday use (checking numbers up to millions or even billions on a normal laptop).
Memory usage: Only a few integers and a small loop – negligible.

Feel free to drop the function into any script, or call it from a notebook to quickly test whether a number is prime. Happy coding!
