TECHNICAL / FEB 5, 2025

Confidence Thresholds

The math behind guaranteed accuracy in retrieval-augmented generation.

Figure: Confidence Gating Visualized

This article provides a technical deep-dive into confidence thresholds—one component of our Self-Healing RAG architecture. If you haven't read the architecture overview, start there for context on how confidence gating fits into the broader reliability system.

In standard RAG systems, "retrieval" is a simple matter of cosine similarity. The system looks for chunks of text that are geometrically similar to the user's query and feeds them to the LLM. The problem? Geometric similarity is not the same as factual relevance.

The Geometry of Retrieval

When you convert text into a vector (a list of numbers), you are mapping its semantic meaning into a multi-dimensional space. "How do I report a policy violation?" and "Policy violation reporting procedures" will be close together in this space. But "How do I report a policy violation?" and "How do I commit a policy violation?" might also be dangerously close.
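To make the problem concrete, here is a small sketch using an off-the-shelf sentence-embedding model (the model name is illustrative, and exact scores vary by model). The point is that the opposite-intent query can land uncomfortably close to the relevant one:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Model choice is illustrative; any general-purpose embedding model behaves similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I report a policy violation?"
candidates = [
    "Policy violation reporting procedures",   # relevant
    "How do I commit a policy violation?",     # opposite intent, lexically close
]

q_emb = model.encode(query, convert_to_tensor=True)
c_embs = model.encode(candidates, convert_to_tensor=True)

for text, score in zip(candidates, util.cos_sim(q_emb, c_embs)[0]):
    print(f"{score.item():.2f}  {text}")
# Both candidates typically fall in a similar band of cosine similarity,
# even though only the first one answers the user's question.
```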

Standard RAG doesn't care about the difference. It just returns the top k results. If the best result has a similarity score of 0.65, the system still uses it, even though 0.65 is essentially a guess.
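A minimal sketch of that standard behavior, assuming pre-computed chunk embeddings in a NumPy array: the function returns the top k matches and never asks whether the best one is actually good enough.

```python
import numpy as np

def naive_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3):
    """Standard RAG retrieval: return the k most similar chunks, however weak."""
    # Cosine similarity between the query and every chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top = np.argsort(sims)[::-1][:k]
    # No floor here: a best match of 0.65 is handed to the LLM
    # just as eagerly as a best match of 0.95.
    return [(int(i), float(sims[i])) for i in top]
```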

Scoring Truth, Not Similarity

At Episteca, we move beyond simple similarity. Our scoring algorithm uses a multi-faceted approach to quantify uncertainty:

  1. Vector Distance Calibration: We normalize raw similarity scores based on the specific density of the local vector neighborhood.
  2. Semantic Entailment: We use a separate, smaller model to verify if the retrieved chunk actually "entails" (logically supports) the user's intent.
  3. Query Complexity Analysis: We score the query itself. If it contains multiple negations or complex conditionals, the confidence requirement increases. (A simplified sketch of how these signals combine follows this list.)
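
As a sketch only (the weights, signal names, and penalty below are invented for illustration and are not our production formula), the three signals might be aggregated like this:

```python
def aggregate_confidence(
    calibrated_sim: float,      # neighborhood-normalized similarity, in [0, 1]
    entailment_prob: float,     # P(chunk entails the query intent) from an NLI model, in [0, 1]
    complexity_penalty: float,  # 0 for simple queries, larger for negations/conditionals
) -> float:
    """Illustrative aggregation: weight the signals, then discount for query complexity."""
    base = 0.6 * calibrated_sim + 0.4 * entailment_prob  # weights chosen purely for illustration
    return max(0.0, base - complexity_penalty)

# A lexically close chunk with weak entailment and a negated query
# ends up well below a strict refusal threshold:
print(aggregate_confidence(0.78, 0.40, 0.10))  # ~0.53
```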

Calibration and Reliability

A score is meaningless without calibration. If a system says it's "80% sure" but is only right 60% of the time, the score is a lie. We use Expected Calibration Error (ECE) as our primary metric. We calibrate our models so that a confidence score of 0.90 means that if we ran that query 100 times, we would be factually correct at least 90 times.
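For readers who want the metric in code, a standard equal-width-bin ECE looks like the following sketch (after Naeini et al., 2015 and Guo et al., 2017). The toy data models a system that reports 0.8 confidence but is right only 60% of the time:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: per-bin |accuracy - confidence|, weighted by the fraction of samples in the bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        bin_conf = confidences[mask].mean()   # average reported confidence in this bin
        bin_acc = correct[mask].mean()        # empirical accuracy in this bin
        ece += mask.mean() * abs(bin_acc - bin_conf)
    return ece

conf = [0.8] * 10
hits = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # 60% accuracy
print(expected_calibration_error(conf, hits))  # 0.2 -- a large calibration gap
```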

In our compliance deployments, we set the refusal threshold at 0.85. If the aggregate confidence score falls below this line, the Knowledge Firewall triggers a deterministic refusal.
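A minimal sketch of what such a gate looks like; the function name and refusal wording are illustrative, not the Knowledge Firewall's actual behavior:

```python
REFUSAL_THRESHOLD = 0.85  # the compliance setting described above

def answer_or_refuse(confidence: float, draft_answer: str) -> str:
    """Deterministic gate: below the threshold, refuse rather than guess."""
    if confidence < REFUSAL_THRESHOLD:
        return ("I can't answer that reliably from the sources available to me. "
                "Please consult the referenced policy documents directly.")
    return draft_answer

print(answer_or_refuse(0.53, "Report violations via the compliance portal."))
```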

"True intelligence is knowing the precise boundaries of your own knowledge. Math allows us to define those boundaries."

The Bigger Picture

Confidence thresholds enable our zero-hallucination guarantee—the promise that every response is either verified or transparently uncertain.

By quantifying what the AI doesn't know, we make it safe for the enterprise. You don't need an AI that knows everything. You need an AI that is never wrong about what it claims to know.

Explore the Math of Accuracy

Download our whitepaper on confidence scoring and RAG calibration.

Request Whitepaper

References

  1. Guo, C., et al. (2017). "On Calibration of Modern Neural Networks." ICML.
  2. Naeini, M. P., et al. (2015). "Obtaining Well-Calibrated Probabilities Using Bayesian Binning into Quantiles." AAAI.
  3. Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP.
  4. Lakshminarayanan, B., et al. (2017). "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles." NeurIPS.