The perspectives below come from a mathematician’s vantage point. They are not the product of formal training in behavioral psychology, and any remarks about human behavior may therefore be incomplete. The aim is pragmatic clarity rather than exhaustive theory.
tl;dr
Modern language models (LLMs) and humans both produce mistakes that look similar—fabricated facts, misplaced confidence, selective omissions—but they arise from fundamentally different substrates. Human errors emerge from attention, memory, motivation, and social context; LLM errors emerge from statistical generalization, context window limits, decoding choices, safety policies, and retrieval failures. Treating them as identical leads to misdiagnosis and weak safeguards. Treating them as cousins—phenomenologically similar but mechanistically distinct—makes design, governance, and remediation vastly more effective.
This essay proposes a simple taxonomy of “errors,” offers an expanded summary table you can reuse in practice, and then develops concrete guidance: diagnostics, controls, and operating patterns that reduce harm without hobbling usefulness.
What Counts as an “Error”?
“Error” is not a single thing. It’s a family of failure modes that differ in cause, intention, and risk profile.
- Unconscious errors / hallucinations. Plausible but false continuations. Humans confabulate; LLMs pattern-complete when evidence is thin.
- Deliberate omission and secrecy. Withholding information intentionally (humans) or by policy/architecture (LLMs).
- Lying. Intention matters for humans. LLMs lack intention but can still produce falsehoods when asked to invent or when miscalibrated.
- Remembering and forgetting. Retrieval dynamics (cues, context) versus context-window mechanics and access permissions.
- Deliberate errors. Pedagogical or stylistic mistakes introduced on purpose—by teachers, by writers, or by instruction to a model.
- “Human-like” appearance. The surface may mislead: similar outputs, dissimilar roots.
A system that treats all of these as one bucket will apply the wrong remedy. The chosen mitigation must match the mechanism.
Comparative Summary Table
The table collects the categories above and folds in sub-mechanisms—along with a pragmatic Diagnostics & Controls column meant for real deployments.
Why the Similarity Is Deceptive
A fabricated book title from a person and a fabricated book title from an LLM look identical in a spreadsheet of errors. But the remediation differs:
- A person might have status motives or memory gaps. Correction involves incentives, training, or better note-taking.
- An LLM typically pattern-completes in a weak-evidence regime or after context loss. Correction involves retrieval design, decoding parameters, or grounding.
Conflating the two makes governance performative: heavy policies on the wrong levers and lightweight policies on the crucial ones.
From Mechanism to Risk: A Practical Risk Matrix
Different “error” families have different consequence profiles.
- Low-risk, high-frequency: minor hallucinations in creative drafting; harmless typos; role-played fabrication in a labeled sandbox.
Control: labeling, low-temperature toggles, and easy rollback.
- Medium-risk: factual claims in internal reports; policy-gated omissions that confuse users; retrieval staleness in dashboards.
Control: require evidence objects, data freshness checks, and explainability notes when content is withheld.
- High-risk: medical, legal, financial claims; safety-critical instructions; identity/PII handling; regulatory submissions.
Control: human-in-the-loop sign-off; verified retrieval only; strict “no-answer if uncertain” policies; audit logs; red-team tests.
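The three tiers above can be sketched as a small routing table. A minimal sketch, assuming a deployment where requests carry a domain tag; every name and policy field below is illustrative, not a real API:

```python
# Illustrative risk-tier routing: map a request's domain to the control
# policy for that tier. Tier names mirror the matrix above; the domain
# list and policy fields are assumptions for this sketch.

RISK_POLICIES = {
    "low":    {"require_evidence": False, "human_review": False, "allow_abstain": False},
    "medium": {"require_evidence": True,  "human_review": False, "allow_abstain": True},
    "high":   {"require_evidence": True,  "human_review": True,  "allow_abstain": True},
}

HIGH_RISK_DOMAINS = {"medical", "legal", "financial", "safety", "pii", "regulatory"}

def policy_for(domain: str, is_internal_report: bool = False) -> dict:
    """Pick the control policy for a request based on its domain tag."""
    if domain in HIGH_RISK_DOMAINS:
        return RISK_POLICIES["high"]
    if is_internal_report:
        return RISK_POLICIES["medium"]
    return RISK_POLICIES["low"]
```

The point of the sketch is that the policy choice is a lookup, not a judgment call made per request at generation time.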
Diagnostics That Actually Move the Needle
A few diagnostics repeatedly prove their worth:
- Evidence Objects. Attach the data used: snippets, citations, vector hits, tool results. Machine-readable and human-legible.
- Uncertainty Surface. Calibrated confidence and abstention rates matter more than raw accuracy in safety-critical flows.
- Decoding Profiles. Maintain named profiles (e.g., creative, factual, forensic) with locked temperature/top-p/beam constraints.
- Context Telemetry. Log input length, tokens retained vs. dropped, retrieval hit quality, and duplication. Most “hallucinations” correlate with context starvation or noisy retrieval.
- Role Separation. Separate drafting from verifying. Different prompts, different tool access, different parameters.
- Regression Packs. Curate failure exemplars by category (fabricated citations, made-up params, omission under policy) and run them before each model or prompt change.
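The "named decoding profiles" diagnostic can be made concrete as a small read-only registry. A minimal sketch; the parameter values are assumptions to be tuned and locked per deployment:

```python
from types import MappingProxyType

# Illustrative named decoding profiles, as described above. The numbers
# are placeholder assumptions, not recommended settings.
DECODING_PROFILES = MappingProxyType({
    "creative": {"temperature": 0.9, "top_p": 0.95},
    "factual":  {"temperature": 0.2, "top_p": 0.90},
    "forensic": {"temperature": 0.0, "top_p": 1.00},
})

def decoding_params(profile: str) -> dict:
    """Return a copy of a locked profile so callers cannot mutate it."""
    if profile not in DECODING_PROFILES:
        raise ValueError(f"unknown decoding profile: {profile}")
    return dict(DECODING_PROFILES[profile])
```

Publishing the registry lets product and compliance teams reason about risk per named mode rather than per ad-hoc parameter choice.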
Controls That Prevent Predictable Pain
Design controls map neatly to the mechanisms in the table.
- Retrieve-then-Generate. For factual tasks, search or query first; synthesize second. Penalize answers with zero cited inputs.
- Low-Temperature Defaults. Treat temperatures in the 0.1–0.3 range as the default for assertions; allow higher values only behind explicit, clearly labeled toggles.
- Schema-First Outputs. Ask for JSON or a table with fields for claim, source, confidence, and limitations. Free-form prose comes after the structured record.
- Explain Omissions. If safety, privacy, or access constraints hide content, say so. “Redacted due to policy: [category].” This transforms confusion into trust.
- Grounding Contracts. For high-stakes domains, require that every claim maps to a source in a whitelisted corpus with recorded freshness.
- Stop-Words for Safety. Just as offensive terms can be blocked, so can phrases that typically signal made-up sourcing: “as everyone knows,” “experts agree,” “according to studies” with no citations.
- Human-in-the-Loop Where It Counts. For critical flows, prefer no answer over fast fiction. Align rewards accordingly.
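The stop-phrase control above can be sketched as a simple regex filter. The phrase list below is an example only, not an exhaustive policy:

```python
import re

# Hypothetical filter for phrases that often signal fabricated sourcing
# when no citation accompanies them. Extend the list per deployment.
UNSOURCED_PHRASES = [
    r"as everyone knows",
    r"experts agree",
    r"according to (?:studies|research)",
]
_PATTERN = re.compile("|".join(UNSOURCED_PHRASES), re.IGNORECASE)

def flags_unsourced_claims(text: str, has_citations: bool) -> bool:
    """True if the text uses sourcing language without actual citations."""
    return not has_citations and _PATTERN.search(text) is not None
```

In practice a flag like this would block or escalate the output rather than silently rewrite it, so the reviewer sees the original failure.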
The Human Side: Why People Err Differently
Human cognition is not a glitchy emulator; it is optimized for survival in limited information environments. A few contrasts illuminate why human and LLM mistakes look alike yet diverge in cause:
- Reconstructive Memory vs. Token Prediction. Humans rebuild scenes from cues and schemas; LLMs extend a string by statistical likelihoods. Confabulation and hallucination rhyme but are not the same music.
- Motivation and Intention. People sometimes lie, omit, or rationalize for social and instrumental reasons. LLMs lack intention; their “deception” is reward- or instruction-shaped behavior, not volition.
- Forgetting. Humans lose access to traces via interference and decay; LLMs “forget” because prior tokens fell out of the window or because the relevant store was never queried.
- Learning Dynamics. People consolidate across time and sleep; LLMs update only when retrained or when their retrieval corpus changes. “Continual learning” failures (catastrophic forgetting) are engineering, not psychology.
These contrasts matter because they point to distinct remedies: incentives and training for humans; retrieval, decoding, and policy design for models.
Measurement: From Error Counts to Useful Metrics
Counting errors is easy; controlling them is hard. Better to measure what improves decisions:
- Calibration Error. The gap between stated confidence and empirical correctness. Good systems minimize this gap so that users can weight answers rationally.
- Abstention Quality. How often the system says, “I don’t know,” and how often abstentions are justified.
- Evidence Coverage. Fraction of claims backed by evidence objects. Aim near 100% in factual and regulated flows.
- Freshness. Median age of sources actually used. Stale evidence is a vector for silent error.
- Cost of Correction. Time and effort for a human reviewer to verify and fix. Interface design matters here: clear presentation of evidence, formatting, and uncertainty saves hours.
Operating Patterns for Real Teams
To make the table actionable, here are operating patterns that organizations can adopt without boiling the ocean:
- Two-Pass Fact Production.
Pass A: Generate a structured list of claims with proposed sources and confidence.
Pass B: Compose narrative only from claims that pass threshold checks.
This neatly suppresses hallucinations and concentrates human review where it matters.
- Retriever Health Dashboard.
Track hit-rate, duplication rate, and average citation age per request. Alert on low-hit regimes that correlate with fabrication.
- Named Decoding Modes.
Define, test, and publish the exact decoding parameters and safety rules for each mode. Product and compliance teams can then reason about risk per mode.
- Redaction Reason Codes.
When content is withheld, return a short code (“P1 – PII policy”, “S2 – Safety: dual-use”) with one line of plain-language rationale. Confusion drops; trust rises.
- Deliberate Error Sandboxes.
Creative and pedagogical use-cases can invite intentional mistakes—but only inside clearly labeled sandboxes with unit tests that ensure the errors stay inside the sandbox.
- Failure Case Library.
Maintain a living catalog of your system’s worst blunders, tagged by category in the table. Use it for training, onboarding, and regression testing.
- “Silence Is Golden” Rewards.
Many teams unconsciously punish abstention and reward verbose answers. In critical flows, invert that: a paid bounty for correct abstention beats a flashy wrong answer.
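The Pass B gate of the two-pass pattern can be sketched as a threshold filter over structured claims. The claim fields and the threshold value are illustrative assumptions:

```python
# Two-pass fact production, Pass B: keep only claims that are sourced
# and meet a confidence threshold, then compose narrative from them.
# Field names ("claim", "sources", "confidence") are example schema.

def filter_claims(claims, min_confidence=0.8):
    """Gate: a claim survives only with at least one source and enough confidence."""
    return [
        c for c in claims
        if c.get("sources") and c.get("confidence", 0.0) >= min_confidence
    ]

def compose_narrative(claims):
    """Stub renderer: real systems would prompt the model with vetted claims only."""
    return " ".join(c["claim"] for c in claims)
```

Example: a confident but unsourced claim and a sourced but low-confidence claim are both dropped before any prose is written, which is exactly where human review effort is saved.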
Frequently Confused Pairs
A few pairs cause recurring confusion in reviews:
- Hallucination vs. Omission.
The first invents; the second withholds. Remedies differ: retrieval and decoding for the former; transparency and access-control UI for the latter.
- Lying vs. Role-Played Invention.
If the system is explicitly asked to invent and the output is labeled as such, the behavior is aligned—even if the content is false. The error is not deception but labeling failure or scope creep if that output leaks into factual channels.
- Forgetting vs. Truncation.
If early context is gone, it’s not “forgetting” in a psychological sense; it is context management failure. Add summaries or memory pinning.
- Human-like vs. Human.
Fluency and narrative coherence can mimic human voice. Humanness in output does not imply human cognition in cause. Keep the substrate in view.
Putting It All Together
The temptation is to demand “no hallucinations.” A stronger posture is to articulate which failure modes matter for your workload and why, then map precise controls to them. For creative ideation, tolerate (even encourage) synthetic leaps. For compliance reporting, enforce retrieval-first pipelines, confidence thresholds, and abstention. For education, use deliberate errors but fence them with tests and labels.
The summary table is a compact heuristic for choosing the right lever:
- When outputs are plausible but wrong, lower temperature, improve retrieval, and require evidence.
- When content is missing, explain redaction and review access rather than tuning sampling.
- When falsity appears in a high-stakes context, reward abstention and split drafting from verification.
- When earlier details vanish, add summaries or extend memory instead of blaming “forgetfulness.”
The surface of error may be shared; its anatomy is not. Effective practice starts where mechanisms live.
Appendix: Using the Table in the Wild
- Design reviews. Walk feature specs against the table and check each relevant row: have the right diagnostics and controls been chosen?
- Incident response. When a bad output lands, classify it quickly using the categories. Mechanism-matched remediation saves weeks.
- Training and onboarding. Teach product owners and reviewers the distinction between phenomenology (what it looks like) and substrate (why it happened).
- Vendor evaluation. Ask vendors to demonstrate controls row-by-row for your critical categories. Prefer observable telemetry over promises.
Acknowledgment. This synthesis stands on the shoulders of work in cognitive science and machine learning, but it remains a field guide rather than a treatise. It privileges practical levers—how to detect, explain, and reduce harm—over theory for its own sake. If it helps one team design for truthfulness and restraint where it matters, and for creative exploration where it does not, it has done its job.