Small LLM Performance Benchmark – Research Report

 

This report presents the results of a systematic evaluation of 22 quantized open-source language models across description generation tasks, measuring quality, JSON reliability, and inference efficiency.

When you run thousands of small, focused tasks through a cloud LLM API (classifying a document, tagging a record, generating a description, extracting a field), the per-call fee is negligible. Multiply it by millions of daily operations across a production pipeline, however, and the bill becomes a boardroom conversation.

For real-time, customer-facing interactions, that premium is often justified. But most automated backend tasks don’t need real-time response. They need accuracy, reliability, and throughput at a cost that doesn’t erode your margin.

Smaller, locally deployed language models are a proven answer to this problem. Models in the 1B–14B parameter range, running on-premises or on modest cloud GPU instances under GGUF quantization, can handle the long tail of focused automation tasks (description generation, classification, summarization, structured data extraction) at a fraction of the API cost, with zero per-call billing.

The engineering barrier to deploying them has dropped significantly. The question is no longer "can you run them?" but "which one should you run?"

AscentCore evaluated 22 quantized configurations of 11 open-source models, from 1B to 14B parameters, under Q4 and Q8 quantization, across six description generation task variants.

We measured quality (ROUGE, factual consistency), structured output reliability (JSON parse rate, schema compliance), and inference efficiency (latency, tokens per second).

The results cut through the noise: Mistral 7B leads on raw quality, Qwen 2.5 7B delivers the best quality-to-speed ratio, and Llama 3.1 8B is the safest choice when your pipeline demands strict JSON schema compliance. For edge or mobile workloads, Qwen 2.5 1.5B holds up remarkably well at 167 tokens per second.

 

MODELS TESTED: 22 (Q4 and Q8 variants)
TASK VARIANTS: 6 (brief / mid / strong × text / JSON)
INFERENCE RUNS: 1,012 (46 per model)
TOP ROUGE-L: 0.509 (mistral:7b-q8_0)
BEST JSON SCHEMA: 95.7% (llama3.1:8b-q8_0)
FASTEST MODEL: 226.5 TPS (llama3.2:1b-q4_K_M)

 

Experimental Setup

We evaluated 11 base models spanning two parameter tiers, each tested under Q4_K_M and Q8_0 GGUF quantization. All local models were served via Ollama for standardized inference.
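As a concrete illustration, a minimal Python sketch of one non-streaming inference call against a local Ollama server. The endpoint and request fields follow Ollama's REST API; the helper names are ours, and the model tag shown is one of the tested configurations.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_request(model: str, prompt: str) -> dict:
    """Payload for a single non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str) -> dict:
    """Send one request and return Ollama's JSON response, which includes
    the generated text ("response") plus timing counters such as
    eval_count and eval_duration, usable for throughput metrics."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example (requires a running Ollama instance with the model pulled):
# result = generate("mistral:7b-q8_0", "Describe this product in one sentence.")
```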

Tier A (1–3.8B, edge/mobile candidates):

  • Gemma 3 1B
  • Llama 3.2 1B
  • Llama 3.2 3B
  • Qwen 2.5 1.5B
  • Phi-3.5 Mini (3.8B)
  • SmolLM2 1.7B

Tier B (4–14B, server-light/desktop candidates):

  • Gemma 3 4B
  • Gemma 3 12B
  • Llama 3.1 8B
  • Mistral 7B v0.3
  • Qwen 2.5 7B
  • Phi-3 Medium (14B)

Task Definition

Three description complexity levels were defined: 

  • Brief (1–2 sentences, max 50 words), targeting tooltips and search snippets
  • Mid-level (3–5 sentences, 75–150 words), targeting product pages and executive summaries
  • Strong (2–4 paragraphs, 200–400 words), targeting reports and in-depth documentation

Each level was tested with both a plain text output prompt and a structured JSON output prompt requiring specific schema adherence, yielding six task variants total.
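The six-variant grid can be enumerated directly. The word limits below are taken from the task definition above; the dictionary keys and variant naming are illustrative, not the benchmark's internal identifiers.

```python
from itertools import product

# Complexity levels as described in the task definition.
LEVELS = {
    "brief":  {"sentences": "1-2", "max_words": 50},
    "mid":    {"sentences": "3-5", "word_range": (75, 150)},
    "strong": {"paragraphs": "2-4", "word_range": (200, 400)},
}

# Each level is tested with plain-text and structured-JSON output prompts.
FORMATS = ("text", "json")

# The full 3 x 2 grid of task variants.
VARIANTS = [f"{level}/{fmt}" for level, fmt in product(LEVELS, FORMATS)]
```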

Metrics

Quality was measured via ROUGE-1, ROUGE-2, and ROUGE-L (lexical overlap with reference), factual consistency (NLI-based alignment scoring), and length compliance.

For JSON tasks, we additionally measured JSON parse rate (syntactic validity), schema compliance rate (structural correctness via jsonschema validation), field completeness, and extraneous output rate. Efficiency was captured through time to first token (TTFT), total latency, and tokens per second (TPS).
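A minimal sketch of how the three efficiency metrics relate, assuming per-token wall-clock timestamps are available from a streamed run (the function and field names are illustrative, not the benchmark harness):

```python
def efficiency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT, total latency, and TPS from wall-clock timestamps
    (in seconds) of each token emitted during one streamed inference run."""
    ttft_ms = (token_times[0] - request_start) * 1000     # time to first token
    latency_ms = (token_times[-1] - request_start) * 1000  # total latency
    tps = len(token_times) / (token_times[-1] - request_start)
    return {"ttft_ms": ttft_ms, "latency_ms": latency_ms, "tps": tps}
```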

Full Results Table

| Model | Tier | ROUGE-1 | ROUGE-2 | ROUGE-L | Factual | TPS | Latency (ms) | TTFT (ms) | JSON Parse % | Schema % | Len Comply % | Repetition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gemma3:1b-q4_K_M | A | 0.545 | 0.266 | 0.388 | 0.587 | 163.5 | 1,968 | 234 | 82.6% | 4.3% | 78.3% | 0.003 |
| gemma3:1b-q8_0 | A | 0.557 | 0.276 | 0.396 | 0.642 | 146.7 | 2,143 | 255 | 95.7% | 4.3% | 73.9% | 0.003 |
| llama3.2:1b-q4_K_M | A | 0.537 | 0.278 | 0.390 | 0.584 | 226.5 | 1,088 | 171 | 65.2% | 30.4% | 82.6% | 0.008 |
| llama3.2:1b-q8_0 | A | 0.521 | 0.270 | 0.378 | 0.699 | 160.2 | 1,280 | 190 | 56.5% | 30.4% | 95.7% | 0.005 |
| qwen2.5:1.5b-q4_K_M | A | 0.561 | 0.271 | 0.398 | 0.593 | 167.5 | 1,414 | 207 | 95.7% | 47.8% | 87.0% | 0.006 |
| qwen2.5:1.5b-q8_0 | A | 0.583 | 0.293 | 0.421 | 0.716 | 121.9 | 1,803 | 205 | 95.7% | 56.5% | 95.7% | 0.006 |
| smollm2:1.7b-q4_K_M | A | 0.561 | 0.309 | 0.408 | 0.728 | 157.3 | 1,480 | 192 | 26.1% | 4.3% | 87.0% | 0.003 |
| smollm2:1.7b-q8_0 | A | 0.552 | 0.299 | 0.407 | 0.682 | 109.6 | 2,059 | 204 | 56.5% | 26.1% | 82.6% | 0.004 |
| llama3.2:3b-q4_K_M | A | 0.584 | 0.316 | 0.436 | 0.674 | 98.7 | 2,031 | 316 | 47.8% | 34.8% | 100.0% | 0.004 |
| llama3.2:3b-q8_0 | A | 0.615 | 0.345 | 0.464 | 0.650 | 65.2 | 2,885 | 345 | 56.5% | 52.2% | 100.0% | 0.002 |
| phi3.5:3.8b-q4_K_M | A | 0.536 | 0.226 | 0.353 | 0.749 | 69.3 | 5,824 | 516 | 87.0% | 65.2% | 30.4% | 0.052 |
| phi3.5:3.8b-q8_0 | A | 0.546 | 0.233 | 0.364 | 0.664 | 50.7 | 6,580 | 526 | 78.3% | 56.5% | 47.8% | 0.003 |
| gemma3:4b-q4_K_M | B | 0.578 | 0.295 | 0.427 | 0.697 | 71.6 | 3,927 | 452 | 100.0% | 87.0% | 78.3% | 0.002 |
| gemma3:4b-q8_0 | B | 0.574 | 0.281 | 0.410 | 0.715 | 49.1 | 5,552 | 465 | 100.0% | 69.6% | 78.3% | 0.001 |
| qwen2.5:7b-q4_K_M | B | 0.612 | 0.336 | 0.462 | 0.705 | 48.1 | 4,270 | 606 | 95.7% | 73.9% | 95.7% | 0.003 |
| qwen2.5:7b-q8_0 | B | 0.627 | 0.348 | 0.474 | 0.734 | 31.0 | 5,977 | 679 | 95.7% | 65.2% | 95.7% | 0.001 |
| mistral:7b-q4_K_M | B | 0.650 | 0.375 | 0.496 | 0.762 | 49.0 | 5,900 | 657 | 95.7% | 39.1% | 82.6% | 0.009 |
| mistral:7b-q8_0 | B | 0.653 | 0.377 | 0.509 | 0.741 | 31.2 | 8,760 | 731 | 100.0% | 47.8% | 91.3% | 0.015 |
| llama3.1:8b-q4_K_M | B | 0.601 | 0.332 | 0.447 | 0.688 | 46.6 | 4,506 | 641 | 91.3% | 91.3% | 100.0% | 0.006 |
| llama3.1:8b-q8_0 | B | 0.609 | 0.342 | 0.454 | 0.662 | 29.3 | 6,788 | 724 | 100.0% | 95.7% | 100.0% | 0.005 |
| gemma3:12b-q4_K_M | B | 0.598 | 0.311 | 0.435 | 0.699 | 27.3 | 10,545 | 1,171 | 100.0% | 43.5% | 78.3% | 0.001 |
| phi3:14b-q4_K_M | B | 0.612 | 0.314 | 0.433 | 0.731 | 23.6 | 12,882 | 1,402 | 95.7% | 78.3% | 65.2% | 0.004 |

 

Visual Comparisons

Charts in the original report (figures not reproduced here):

  • ROUGE-L by Model (all tasks avg)
  • Factual Consistency by Model
  • Tokens per Second (log scale)
  • BERTScore by Task Variant — Mistral 7B vs Llama 3.1 8B

 

JSON Reliability

JSON output mode introduces additional failure modes beyond text quality. The gap between parse rate and schema compliance is critical: a model that can produce valid JSON may still fail to match the required field structure.
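The divergence can be sketched with a hypothetical two-field schema (the benchmark's actual schema is not reproduced here): an output can pass `json.loads` yet still miss required fields, which is exactly the gap between the two columns.

```python
import json

# Illustrative required fields; a stand-in for the benchmark's real schema.
REQUIRED_FIELDS = {"title", "description"}


def check_output(raw: str) -> tuple[bool, bool]:
    """Return (parses, complies): whether the raw output is valid JSON,
    and whether it is also a JSON object carrying the required fields.
    Counting each across runs yields parse rate vs schema compliance."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, False
    complies = isinstance(obj, dict) and REQUIRED_FIELDS.issubset(obj)
    return True, complies
```

A full harness would use jsonschema validation, as the metrics section notes; this stand-in only checks top-level key presence.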

| Model | Tier | JSON Parse % | Schema Comply % | Field Completeness | Extraneous Output % | Key Accuracy |
|---|---|---|---|---|---|---|
| gemma3:1b-q4_K_M | A | 82.6% | 4.3% | 0.826 | 100.0% | 0.826 |
| gemma3:1b-q8_0 | A | 95.7% | 4.3% | 0.957 | 100.0% | 0.957 |
| llama3.2:1b-q4_K_M | A | 65.2% | 30.4% | 0.652 | 4.3% | 0.652 |
| llama3.2:1b-q8_0 | A | 56.5% | 30.4% | 0.565 | 0.0% | 0.565 |
| qwen2.5:1.5b-q4_K_M | A | 95.7% | 47.8% | 0.957 | 4.3% | 0.957 |
| qwen2.5:1.5b-q8_0 | A | 95.7% | 56.5% | 0.957 | 4.3% | 0.957 |
| smollm2:1.7b-q4_K_M | A | 26.1% | 4.3% | 0.217 | 0.0% | 0.261 |
| smollm2:1.7b-q8_0 | A | 56.5% | 26.1% | 0.533 | 0.0% | 0.538 |
| llama3.2:3b-q4_K_M | A | 47.8% | 34.8% | 0.478 | 0.0% | 0.478 |
| llama3.2:3b-q8_0 | A | 56.5% | 52.2% | 0.565 | 0.0% | 0.565 |
| phi3.5:3.8b-q4_K_M | A | 87.0% | 65.2% | 0.870 | 65.2% | 0.870 |
| phi3.5:3.8b-q8_0 | A | 78.3% | 56.5% | 0.783 | 30.4% | 0.783 |
| gemma3:4b-q4_K_M | B | 100.0% | 87.0% | 1.000 | 100.0% | 1.000 |
| gemma3:4b-q8_0 | B | 100.0% | 69.6% | 1.000 | 100.0% | 1.000 |
| qwen2.5:7b-q4_K_M | B | 95.7% | 73.9% | 0.957 | 0.0% | 0.957 |
| qwen2.5:7b-q8_0 | B | 95.7% | 65.2% | 0.957 | 0.0% | 0.957 |
| mistral:7b-q4_K_M | B | 95.7% | 39.1% | 0.957 | 0.0% | 0.957 |
| mistral:7b-q8_0 | B | 100.0% | 47.8% | 1.000 | 0.0% | 1.000 |
| llama3.1:8b-q4_K_M | B | 91.3% | 91.3% | 0.913 | 0.0% | 0.913 |
| llama3.1:8b-q8_0 | B | 100.0% | 95.7% | 1.000 | 0.0% | 1.000 |
| gemma3:12b-q4_K_M | B | 100.0% | 43.5% | 1.000 | 100.0% | 1.000 |
| phi3:14b-q4_K_M | B | 95.7% | 78.3% | 0.957 | 34.8% | 0.957 |

(Chart: JSON Parse Rate vs Schema Compliance — All Models)

Inference Efficiency

Latency and throughput matter as much as quality for production deployments. The efficiency-quality frontier shows the real trade-offs between tiers.

| Model | Tier | Tokens/sec | Avg Latency (ms) | Avg TTFT (ms) | ROUGE-L | Efficiency Score* |
|---|---|---|---|---|---|---|
| llama3.2:1b-q4_K_M | A | 226.5 | 1,088 | 171 | 0.390 | 88.3 |
| gemma3:1b-q4_K_M | A | 163.5 | 1,968 | 234 | 0.388 | 63.5 |
| smollm2:1.7b-q4_K_M | A | 157.3 | 1,480 | 192 | 0.408 | 64.2 |
| qwen2.5:1.5b-q4_K_M | A | 167.5 | 1,414 | 207 | 0.398 | 66.7 |
| llama3.2:3b-q4_K_M | A | 98.7 | 2,031 | 316 | 0.436 | 43.1 |
| gemma3:4b-q4_K_M | B | 71.6 | 3,927 | 452 | 0.427 | 30.6 |
| qwen2.5:7b-q4_K_M | B | 48.1 | 4,270 | 606 | 0.462 | 22.2 |
| mistral:7b-q4_K_M | B | 49.0 | 5,900 | 657 | 0.496 | 24.3 |
| llama3.1:8b-q4_K_M | B | 46.6 | 4,506 | 641 | 0.447 | 20.9 |
| mistral:7b-q8_0 | B | 31.2 | 8,760 | 731 | 0.509 | 15.9 |
| llama3.1:8b-q8_0 | B | 29.3 | 6,788 | 724 | 0.454 | 13.3 |
| gemma3:12b-q4_K_M | B | 27.3 | 10,545 | 1,171 | 0.435 | 11.9 |
| phi3:14b-q4_K_M | B | 23.6 | 12,882 | 1,402 | 0.433 | 10.2 |
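The starred Efficiency Score is not defined in the source, but every row of the table is consistent with TPS × ROUGE-L rounded to one decimal place. A hedged reconstruction, under that assumption:

```python
def efficiency_score(tps: float, rouge_l: float) -> float:
    """Assumed reconstruction of the table's starred Efficiency Score:
    throughput weighted by quality (TPS * ROUGE-L), one decimal place.
    This formula is inferred from the table values, not stated in the report."""
    return round(tps * rouge_l, 1)
```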

 

Key Findings

Mistral 7B leads on text quality

Both the Q4_K_M (ROUGE-L 0.496) and Q8_0 (0.509) variants top all other models on lexical fidelity, and the Q4_K_M variant posts the highest factual consistency in the benchmark (0.762). Mistral is especially strong on mid and strong description tasks.

Llama 3.1 8B is the JSON champion

Q8_0 achieves 100% parse rate and 95.7% schema compliance — the highest of any model — while producing zero extraneous output. Q4_K_M is a close second at 91.3% schema compliance.

Qwen 2.5 1.5B is the best sub-3B model

At Q8_0, Qwen 2.5 1.5B delivers ROUGE-L 0.421 and 95.7% JSON parse rate — competitive with some 7B models — while running at 121.9 TPS. The best choice when memory is the primary constraint.

Quantization impact varies by model family

Q8_0 vs Q4_K_M ROUGE-L deltas range from +0.013 (Mistral 7B) down to slightly negative (Gemma 3 4B, −0.017). For speed-sensitive applications, Q4_K_M incurs minimal quality loss while delivering 40–60% higher TPS in the 7–8B tier.

SmolLM2 fails JSON tasks at Q4_K_M

SmolLM2 1.7B Q4_K_M achieves only 26.1% JSON parse rate and 4.3% schema compliance — unusable for structured output. Q8_0 improves to 56.5%/26.1% but remains unreliable. Avoid for JSON pipelines.

Llama 3.2 3B struggles with JSON format

Despite reasonable text quality (ROUGE-L 0.436–0.464), Llama 3.2 3B achieves only 47.8–56.5% JSON parse rate — a notable regression vs Llama 3.1 8B. The 3B scale appears insufficient for reliable structured output.

Phi-3.5 Mini has the highest repetition rate

Phi-3.5 3.8B Q4_K_M shows a repetition rate of 0.052 — 5–50× higher than all other models — and the worst length compliance at 30.4%. Despite high factual consistency, it is not reliable for production text generation.

Gemma 3 4B: perfect JSON parse rate, but noisy output

Gemma 3 4B achieves a 100% JSON parse rate in both quantizations, though schema compliance is 87.0% at Q4_K_M and 69.6% at Q8_0. Notably, it adds extraneous commentary around the JSON in 100% of runs, so production use requires output stripping.
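One way to do that stripping, sketched with the Python standard library (this is our illustration of the failure mode, not the benchmark's harness): scan for the first parseable JSON object and discard any surrounding commentary or markdown fences.

```python
import json
import re


def extract_json(raw: str):
    """Recover the first valid JSON object embedded in a model response,
    ignoring commentary or code fences around it. Returns None if no
    parseable object is found."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", raw):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever trails it.
            obj, _ = decoder.raw_decode(raw, match.start())
            return obj
        except json.JSONDecodeError:
            continue
    return None
```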

Large models don't justify the cost at this scale

Gemma 3 12B and Phi-3 14B offer ROUGE-L scores (0.433–0.435) below Llama 3.1 8B (0.454) while running at under 28 TPS with 10–13 second average latency. No quality advantage over 7–8B models was found.

All models show 100% non-refusal rate

No model refused benign synthetic inputs. Repetition rates are low across the board (<0.01) except Phi-3.5 Mini. Language consistency was 100% for all models — no unexpected code-switching observed.

 

Recommendations

| Use Case | Recommended Model | Reason |
|---|---|---|
| Edge / mobile, text output | qwen2.5:1.5b-q8_0 | Best quality-per-parameter in the sub-3B range. 121.9 TPS, ROUGE-L 0.421, low repetition. |
| Edge / mobile, JSON output | qwen2.5:1.5b-q8_0 | 95.7% parse rate at only 1.5B params. Significantly more reliable than Llama 3.2 3B for structured tasks. |
| Max throughput, acceptable quality | llama3.2:1b-q4_K_M | 226.5 TPS, the fastest in the benchmark. Use for high-volume brief description tasks where latency is critical. |
| Server-light, text quality priority | mistral:7b-q4_K_M | Top ROUGE and factual scores at ~49 TPS. Best balance of quality and speed in Tier B for text tasks. |
| Server-light, JSON reliability priority | llama3.1:8b-q4_K_M | 91.3% schema compliance, zero extraneous output, 100% length compliance. The most reliable structured-output model. |
| Best overall quality (unconstrained) | mistral:7b-q8_0 | Highest ROUGE-L (0.509) of all models, with strong factual consistency (0.741). Cost: ~31 TPS and 8.7 s average latency. |
| Avoid for JSON pipelines | smollm2:1.7b, llama3.2:3b | Both fail to reliably produce valid JSON. Deploy for text-only tasks only. |
| Avoid for any production use | phi3.5:3.8b-q4_K_M | Highest repetition rate (0.052), worst length compliance (30.4%), high extraneous output (65.2%). Not production-ready. |
