AI@Home – Mistral Small 3.1 24B Review

Conditions & Context

Today we are going back to France! On the table is Mistral Small 3.1, weighing in at a decent 24B parameters. It will be a tight squeeze onto a 16GB GPU, so I expect some CPU cores to light up, but let’s see whether it fares as badly as Qwen did. Let’s dive right in!

I picked a very simple prompt that mixes code generation, some reasoning, and writing prowess. I’m tracking a number of metrics: how much VRAM the model uses, GPU utilization, GPU wattage and temperature, CPU utilization, token throughput, total number of tokens written, and total time to response. These matter to me because they show which model fits my hardware best and which one gives me the best experience as an end user. I also judge the overall quality of the answer, and above all the reasoning and explanation as a novice in the field would see them. My goal here is to show whether the model is usable and good enough to help someone who is learning how to code.
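
For anyone who wants to collect the throughput numbers themselves, here is a minimal sketch and not necessarily how the figures in this post were gathered. It assumes Ollama is listening on its default local port and relies on the timing fields the `/api/generate` endpoint returns in nanoseconds. The model tag `mistral-small3.1:24b` and the helper names `summarize_run` and `run_prompt` are my own assumptions, so check `ollama list` for the tag on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "mistral-small3.1:24b"  # assumed tag; verify with `ollama list`


def summarize_run(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into the columns I report."""
    eval_seconds = response["eval_duration"] / 1e9
    return {
        "tokens_written": response["eval_count"],
        "tokens_per_s": round(response["eval_count"] / eval_seconds, 2),
        "total_time_s": round(response["total_duration"] / 1e9, 1),
    }


def run_prompt(prompt: str, model: str = MODEL_TAG) -> dict:
    """Send one non-streaming request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Calling `summarize_run(run_prompt("..."))` once per run gives one table row. Note that `total_duration` includes model load time, so the first run after a cold start will look slower than the rest.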

Specs	Value
Linux Distro	Ubuntu Server 24.04.4 LTS
Linux Kernel	6.8.0-101
CPU	Intel CORE i7 14th Gen 14700K Cores: 8P/12E Threads: 28
Motherboard	MSI PRO B660M-A
RAM	80 GB DDR4 (32+16+32+16)
SSD	Crucial NVME 1TB
GPU	MSI NVidia GeForce RTX 5060Ti Shadow 2X OC PCIe 5.0×8
CUDA Cores	4,608
VRAM	16 GB GDDR7 128-bit 448 GB/s
GPU Driver	NVidia 590.48.01
CUDA version	13.1
Ollama version	0.17.4
Model	Mistral Small 3.1 24B
Quantization	Q4_K_M

The Prompt

			
Write a simple Python function that checks if a number is prime.
Explain how it works in plain English, like you're teaching
a beginner.

The Results

Let’s face it: a nearly 15GB model on a 16GB graphics card is just about pushing the limits, and then some. Not quite thrown over the fence yet, but gingerly teetering on top of it, unable to decide which way to fall. This model is a chugger on my 5060Ti. No bones about it. It barely ekes out 20 tokens a second, while GPT-OSS 20B would push nearly five times as many. But… there is a big “but”.
This model is GOOD!

Model	Quant	Run	Tokens/s	Total Time (s)	Tokens Written	VRAM (GB)	GPU Util
Mistral Small 3.1	Q4_K_M	1	20.17	29	593	14.4	68%
Mistral Small 3.1	Q4_K_M	2	20.06	33	656	14.4	68%
Mistral Small 3.1	Q4_K_M	3	19.91	31	614	14.4	68%
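
The VRAM and GPU Util columns can be sampled with `nvidia-smi` while the model is generating. The sketch below is one way to do it, not the definitive method: it assumes the NVIDIA driver is installed and that the card is the first GPU in the system. The function names are hypothetical. `nvidia-smi` reports memory in MiB, so the parser converts to GiB.

```python
import subprocess

# Standard nvidia-smi query fields: memory in MiB, utilization in %,
# power in watts, temperature in degrees Celsius.
QUERY_FIELDS = "memory.used,utilization.gpu,power.draw,temperature.gpu"


def parse_gpu_sample(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output."""
    memory_mib, util_pct, power_w, temp_c = (float(x) for x in csv_line.split(","))
    return {
        "vram_gib": round(memory_mib / 1024, 1),
        "gpu_util_pct": util_pct,
        "power_w": power_w,
        "temp_c": temp_c,
    }


def sample_gpu() -> dict:
    """Take a single reading from the first GPU listed by nvidia-smi."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_sample(output.splitlines()[0])
```

Calling `sample_gpu()` in a loop during generation and averaging the readings gives a steadier figure than a single glance at the terminal.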

14.4GB VRAM, locked and immovable across all three runs. On a 16GB card, that leaves exactly 1.6GB of breathing room for active context and KV cache. And let’s face it: you will want to stick to Open WebUI’s (OWUI) 2k default context window size. Anything bigger and you are toast, jetting off to CPU land and going for a coffee break. The model ran at a steady 68% GPU utilization, notably below the 97% rail-pinned behavior I saw from Ministral 8B. That’s not laziness on the GPU’s part. It points to memory bandwidth, not compute, as the real bottleneck. Token throughput held rock solid too: 20.17, 20.06, 19.91. No warmup lag, no degradation. You get exactly what it says on the label, every single run. That makes it predictable. That is good news.
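
A quick back-of-envelope check of the bandwidth argument, using only numbers from the spec and results tables. This is a rough sketch resting on two assumptions of mine: that generating one token means reading roughly everything resident in VRAM once, and that the whole model sits on the GPU with nothing offloaded to the CPU. Under those assumptions, memory bandwidth divided by bytes read per token gives a ceiling on tokens per second.

```python
VRAM_TOTAL_GB = 16.0       # from the spec table
VRAM_USED_GB = 14.4        # measured in all three runs
BANDWIDTH_GB_S = 448.0     # from the spec table
MEASURED_TOKENS_S = 20.17  # run 1


def headroom_gb(total_gb: float, used_gb: float) -> float:
    """VRAM left over for a larger context and KV cache."""
    return total_gb - used_gb


def bandwidth_ceiling_tokens_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if every token reads gb_read_per_token from VRAM."""
    return bandwidth_gb_s / gb_read_per_token


headroom = headroom_gb(VRAM_TOTAL_GB, VRAM_USED_GB)
# Assumption: the 14.4 GB resident in VRAM stands in for the bytes read per token.
ceiling = bandwidth_ceiling_tokens_s(BANDWIDTH_GB_S, VRAM_USED_GB)

print(f"headroom: {headroom:.1f} GB")
print(f"bandwidth ceiling: {ceiling:.1f} tokens/s")
print(f"measured vs ceiling: {MEASURED_TOKENS_S / ceiling:.0%}")
```

If the measured speed lands within sight of that ceiling, the card is waiting on memory rather than on its CUDA cores, which fits a GPU utilization well below 100%.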

Token counts were equally steady: 593 → 656 → 614. No progressive wordiness, no warmup effect like Ministral 8B’s climbing runs. This model finds its output level and parks there. The shocker is the quality. Clean code, structured walkthrough, accurate step-by-step explanation aimed squarely at a beginner. For a model pressed this close to the VRAM ceiling, it has no business being this competent on my kind of GPU. Speed is not Mistral Small 3.1’s strength on a measly 16GB card, but the quality is!
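
To put a number on "steady", here is a small sketch over the three rows of the results table. The `relative_spread` helper is my own choice of metric (range divided by mean), not a standard benchmark figure. The last step cross-checks the table by dividing tokens written by tokens per second, which should land close to the reported total time.

```python
from statistics import mean

# Values copied from the results table above.
runs = [
    {"tokens_per_s": 20.17, "total_time_s": 29, "tokens_written": 593},
    {"tokens_per_s": 20.06, "total_time_s": 33, "tokens_written": 656},
    {"tokens_per_s": 19.91, "total_time_s": 31, "tokens_written": 614},
]


def relative_spread(values: list) -> float:
    """Range of the values divided by their mean (0.0 means identical runs)."""
    return (max(values) - min(values)) / mean(values)


throughput_spread = relative_spread([r["tokens_per_s"] for r in runs])
token_spread = relative_spread([r["tokens_written"] for r in runs])

# Generation time implied by the other two columns, in seconds.
implied_times = [r["tokens_written"] / r["tokens_per_s"] for r in runs]

print(f"throughput spread: {throughput_spread:.1%}")
print(f"token count spread: {token_spread:.1%}")
for run, implied in zip(runs, implied_times):
    print(f"reported {run['total_time_s']} s, implied {implied:.1f} s")
```

Throughput is the tighter of the two: the answer length wobbles more from run to run than the speed does.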

The Conclusion – TL;DR

Speed is not everything. Not always. There are times when a model is rather slow on your hardware and its replies feel meh. They just don’t light your soul on fire. So, you just sudo rm [fill-in-the-blank] and move on to better things.
But then there are models which, albeit slower, have pizzazz. They sound good, they are accurate, they go beyond what other models of this size usually bring to the table, and in some situations they can sound as convincing as a 600B frontier model you are used to chatting with.
Is this such a unicorn? I don’t know, but I will try to tweak this Mistral with a system prompt and see what it has to offer.

After all, humans are mostly comfortable reading text at about 15 tokens/sec, so 20 is slow for a GPU, but still perfectly in the vicinity of comfort. Would I want this model to give me long answers? Hell no!
That would take a coffee break.

But as a daily driver for short queries, knowing it is surprisingly accurate and capable? That is what I’m setting out to discover. I sense a sequel coming up on this one…

P.S.:
This model reminded me in a heavy-handed way of the age-old conundrum we local model runners all face:

			
The moment you unbox your brand-spanking new GPU you wish
you had bought one with more VRAM! 
This never ever ends, does it?

The Output (unedited)

Sure! Let’s write a simple Python function to check if a number is prime. A prime number is a number that is greater than 1 and has no divisors other than 1 and itself.

Here’s the function:

			
def is_prime(n):
    """Check if a number is prime."""
    if n <= 1:
        return False
    if n == 2:
        return True  # 2 is the only even prime number
    if n % 2 == 0:
        return False  # Any other even number is not prime
    # Check for factors from 3 to the square root of n
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

		

Explanation in Plain English:

Check if the number is less than or equal to 1:
- If n is 1 or less, it’s not a prime number. So, we return False.
Check if the number is 2:
- If n is 2, it’s a prime number. So, we return True.
Check if the number is even and greater than 2:
- If n is an even number and greater than 2, it’s not a prime number. So, we return False.
Check for factors from 3 to the square root of n:
- We only need to check for factors up to the square root of n because if n has a factor larger than its square root, it must also have a corresponding factor smaller than its square root.
- We start checking from 3 and skip even numbers by incrementing by 2 each time (range(3, int(n**0.5) + 1, 2)).
- If we find any number i that divides n evenly (i.e., n % i == 0), then n is not a prime number. So, we return False.
If no factors are found:
- If we didn’t find any factors, then n is a prime number. So, we return True.

Example Usage:

			
print(is_prime(11))  # Output: True (11 is a prime number)
print(is_prime(4))   # Output: False (4 is not a prime number)
print(is_prime(2))   # Output: True (2 is a prime number)
print(is_prime(1))   # Output: False (1 is not a prime number)

This function should work well for checking if a number is prime in a straightforward and efficient manner.


Archer Dynamics