AI@Home – Mistral Small 3.1 24B Review

Conditions & Context

Today we are going back to France! On the table is Mistral Small 3.1 with a decent 24B parameters. It will be a tight squeeze onto a 16GB GPU, so I expect some CPU cores to light up, but let’s see whether it does as badly as Qwen. Let’s dive right in!

I picked a very simple prompt that mixes code generation, some reasoning logic, and writing prowess. I am tracking a number of metrics: how much VRAM the model uses, GPU utilization, GPU wattage and temperature, CPU utilization, token throughput, total number of tokens written, and total time to response. All of these matter to me because they not only help me match the best model to my hardware, but also determine the quality of the experience for me as an end user. I focus on the overall quality of the answer, but most importantly on the reasoning and the explanation aimed at a novice in the field. My goal here is to show whether the model is usable and good enough to help someone who is learning how to code.
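If you want to collect the GPU-side numbers yourself, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and that every queried field returns a number (some cards report `[N/A]` for power draw). The helper names `parse_gpu_sample` and `read_gpu_sample` are made up for illustration, and this is not necessarily how the figures in this post were gathered.

```python
import subprocess

# Fields understood by `nvidia-smi --query-gpu`; output order matches this list.
# With "nounits" the values are MiB, percent, watts, and degrees Celsius.
QUERY_FIELDS = ["memory.used", "utilization.gpu", "power.draw", "temperature.gpu"]


def parse_gpu_sample(csv_line):
    """Turn one 'csv,noheader,nounits' line into a dict of floats."""
    values = [float(part.strip()) for part in csv_line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    return dict(zip(QUERY_FIELDS, values))


def read_gpu_sample():
    """Query the first GPU once (needs an NVIDIA driver and nvidia-smi)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_sample(result.stdout.splitlines()[0])
```

Calling `read_gpu_sample()` in a loop while the model is generating gives a rough picture of VRAM, utilization, wattage, and temperature over time.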

| Specs | Value |
|---|---|
| Linux Distro | Ubuntu Server 24.04.4 LTS |
| Linux Kernel | 6.8.0-101 |
| CPU | Intel Core i7 14th Gen 14700K, Cores: 8P/12E, Threads: 28 |
| Motherboard | MSI PRO B660M-A |
| RAM | 80 GB DDR4 (32+16+32+16) |
| SSD | Crucial NVMe 1TB |
| GPU | MSI NVIDIA GeForce RTX 5060 Ti Shadow 2X OC, PCIe 5.0 ×8 |
| CUDA Cores | 4,608 |
| VRAM | 16 GB GDDR7, 128-bit, 448 GB/s |
| GPU Driver | NVIDIA 590.48.01 |
| CUDA version | 13.1 |
| Ollama version | 0.17.4 |
| Model | Mistral Small 3.1 24B |
| Quantization | Q4_K_M |

The Prompt

Write a simple Python function that checks if a number is prime.
Explain how it works in plain English, like you're teaching
a beginner.
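If you would rather time this prompt from a script than from a chat UI, here is a minimal sketch against Ollama's REST API. It assumes Ollama is listening on its default port 11434 and that the model was pulled under the tag `mistral-small3.1:24b`; both are assumptions, so adjust them to your setup. The function names are hypothetical.

```python
import json
import urllib.request

PROMPT = (
    "Write a simple Python function that checks if a number is prime. "
    "Explain how it works in plain English, like you're teaching a beginner."
)


def tokens_per_second(stats):
    """Generation throughput from Ollama's eval_count and eval_duration (ns)."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def run_prompt(model="mistral-small3.1:24b", host="http://localhost:11434"):
    """Send PROMPT once, non-streaming, and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": PROMPT, "stream": False}).encode()
    request = urllib.request.Request(
        host + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())
```

With a running server, `tokens_per_second(run_prompt())` gives the generation speed for one run, and the reply's `eval_count` field is the number of tokens written.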

The Results

Let’s face it: a nearly 15GB model on a 16GB graphics card is just about pushing the limits, and then some. Not quite thrown over the fence yet, but gingerly teetering on top of it, unable to decide which way to fall. This model is a chugger on my 5060 Ti. No bones about it. It barely ekes out 20 tokens a second, while GPT-OSS 20B would manage nearly five times as many. But… there is a big “but”.
This model is GOOD!

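| Model | Quant | Run | Tokens/s | Total Time (s) | Tokens Written | VRAM (GB) | GPU Util |
|---|---|---|---|---|---|---|---|
| Mistral Small 3.1 | Q4_K_M | 1 | 20.17 | 29 | 593 | 14.4 | 68% |
| Mistral Small 3.1 | Q4_K_M | 2 | 20.06 | 33 | 656 | 14.4 | 68% |
| Mistral Small 3.1 | Q4_K_M | 3 | 19.91 | 31 | 614 | 14.4 | 68% |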

14.4GB VRAM: locked and immovable across all three runs. On a 16GB card, that leaves exactly 1.6GB of breathing room for active context and KV cache. And let’s face it: you will want to stick to Open WebUI’s (OWUI) 2k default context window size. Anything bigger and you are toast, jetting off to CPU land and going for a coffee break. The model ran at a steady 68% GPU utilization, notably below the 97% rail-pinned behavior I saw from Ministral 8B. That’s not laziness on the GPU’s part. That’s memory bandwidth being the real bottleneck, not compute. Token throughput held rock solid too: 20.17, 20.06, 19.91. No warmup lag, no degradation. You get exactly what it says on the label, every single run. That makes it predictable. That is good news.
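A back-of-the-envelope check of the bandwidth argument, using only numbers from the specs and results tables. The key assumption, which is a rule of thumb and not a measurement, is that generating each token streams roughly the whole resident footprint through the memory bus once.

```python
VRAM_TOTAL_GB = 16.0      # card capacity, from the specs table
VRAM_USED_GB = 14.4       # measured in all three runs
BANDWIDTH_GB_S = 448.0    # memory bandwidth, from the specs table
MEASURED_TOK_S = [20.17, 20.06, 19.91]

# Assumption: one full pass over the resident footprint per generated token.
ceiling_tok_s = BANDWIDTH_GB_S / VRAM_USED_GB
headroom_gb = VRAM_TOTAL_GB - VRAM_USED_GB
mean_tok_s = sum(MEASURED_TOK_S) / len(MEASURED_TOK_S)

print(f"headroom: {headroom_gb:.1f} GB")
print(f"bandwidth ceiling: {ceiling_tok_s:.1f} tokens/s")
print(f"measured: {mean_tok_s:.2f} tokens/s ({mean_tok_s / ceiling_tok_s:.0%} of ceiling)")
```

Under that assumption the ceiling comes out at about 31 tokens/s, and the measured 20 tokens/s is roughly 64% of it. That is at least consistent with a card that spends its time waiting on memory instead of crunching numbers.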

Token counts were equally steady: 593 → 656 → 614. No progressive wordiness, no warmup effect like Ministral 8B’s climbing runs. This model finds its output level and parks there. The shocker is the quality. Clean code, structured walkthrough, accurate step-by-step explanation aimed squarely at a beginner. For a model pressed this close to the VRAM ceiling, it has no business being this competent on my kind of GPU. Speed is not Mistral Small 3.1’s strength on a measly 16GB card — but the quality is!
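To put a number on "steady", here is a small sketch over the three runs from the results table. The `spread_pct` helper is a made-up name, and (max - min) / mean is just one simple way to measure spread.

```python
from statistics import mean

# (tokens/s, total seconds, tokens written) for runs 1-3, from the results table.
RUNS = [(20.17, 29, 593), (20.06, 33, 656), (19.91, 31, 614)]

rates = [run[0] for run in RUNS]
tokens = [run[2] for run in RUNS]


def spread_pct(values):
    """(max - min) relative to the mean, in percent."""
    return (max(values) - min(values)) / mean(values) * 100


print(f"throughput spread: {spread_pct(rates):.1f}%")
print(f"token-count spread: {spread_pct(tokens):.1f}%")

# Cross-check: tokens written / tokens per second should land near the
# reported total time for each run.
for rate, seconds, count in RUNS:
    print(f"{count} tokens / {rate} tok/s = {count / rate:.1f}s (reported {seconds}s)")
```

Throughput varies by about 1.3% across runs and token counts by about 10%, with no upward trend, and the computed times land within a second of the reported ones.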

The Conclusion – TL;DR

Speed is not everything. Not always. There are times when you have a rather slow model for your hardware and its replies feel meh. They just don’t light your soul on fire. So, you just sudo rm [fill-in-the-blank] and move on to better things.
But then there may be models which, albeit slower, have pizzazz. They sound good, they are accurate, they go beyond what other models of this size usually bring to the table, and in some situations, they can sound as convincing as a 600B frontier model you are used to chatting with.
Is this such a unicorn? I don’t know, but I will try to tweak this Mistral with a system prompt and see what it has to offer.

After all, humans are mostly comfortable reading text at about 15 tokens/sec, so 20 is slow for a model, but still perfectly in the vicinity of comfort. Would I want this model to give me long answers? Hell no!
That would take a coffee break.
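How long is a coffee break, exactly? A quick sketch using the roughly 20 tokens/s measured above. The answer lengths are hypothetical examples, and the estimate ignores model load and prompt processing time.

```python
TOKENS_PER_SECOND = 20.0  # rounded from the three measured runs


def seconds_to_generate(token_count, rate=TOKENS_PER_SECOND):
    """Generation time only; ignores model load and prompt processing."""
    return token_count / rate


for token_count in (600, 2000, 8000):  # hypothetical answer lengths
    secs = seconds_to_generate(token_count)
    print(f"{token_count:>5} tokens -> {secs:>4.0f} s ({secs / 60:.1f} min)")
```

An answer the size of the ones in this test takes about half a minute, while an 8,000-token essay would keep you waiting for nearly seven minutes.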

But as a daily driver for short queries, knowing it is surprisingly accurate and capable? That is what I’m setting out to discover. I sense a sequel coming up on this one…

P.S.:
This model reminded me in a heavy-handed way of the age-old conundrum we local model runners all face:

The moment you unbox your brand-spanking new GPU you wish
you had bought one with more VRAM!
This never ever ends, does it?

The Output (unedited)

Sure! Let’s write a simple Python function to check if a number is prime. A prime number is a number that is greater than 1 and has no divisors other than 1 and itself.

Here’s the function:

def is_prime(n):
    """Check if a number is prime."""
    if n <= 1:
        return False
    if n == 2:
        return True  # 2 is the only even prime number
    if n % 2 == 0:
        return False  # Any other even number is not prime
    # Check for factors from 3 to the square root of n
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

Explanation in Plain English:

  1. Check if the number is less than or equal to 1:
    • If n is 1 or less, it’s not a prime number. So, we return False.
  2. Check if the number is 2:
    • If n is 2, it’s a prime number. So, we return True.
  3. Check if the number is even and greater than 2:
    • If n is an even number and greater than 2, it’s not a prime number. So, we return False.
  4. Check for factors from 3 to the square root of n:
    • We only need to check for factors up to the square root of n because if n has a factor larger than its square root, it must also have a corresponding factor smaller than its square root.
    • We start checking from 3 and skip even numbers by incrementing by 2 each time (range(3, int(n**0.5) + 1, 2)).
    • If we find any number i that divides n evenly (i.e., n % i == 0), then n is not a prime number. So, we return False.
  5. If no factors are found:
    • If we didn’t find any factors, then n is a prime number. So, we return True.

Example Usage:

print(is_prime(11)) # Output: True (11 is a prime number)
print(is_prime(4)) # Output: False (4 is not a prime number)
print(is_prime(2)) # Output: True (2 is a prime number)
print(is_prime(1)) # Output: False (1 is not a prime number)

This function should work well for checking if a number is prime in a straightforward and efficient manner.
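The model's output ends here. As a quick sanity check on the "accurate" claim, here is a sketch that compares the model's function against a deliberately naive reference for every integer from -5 to 1999. The reference `is_prime_naive` and the test range are my own choices for illustration.

```python
def is_prime(n):
    """The model's function, copied from the output above."""
    if n <= 1:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True


def is_prime_naive(n):
    """Reference: n is prime if nothing from 2 to n - 1 divides it."""
    return n > 1 and all(n % d for d in range(2, n))


mismatches = [n for n in range(-5, 2000) if is_prime(n) != is_prime_naive(n)]
print(mismatches)  # → []
```

An empty list means the two functions agree on every number in the tested range.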


Posted by Jiri Krecek
