Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
Traditional point-wise confidence measures (e.g., self-consistency) can create an illusion of knowing: even when models answer correctly with perfect self-consistency, their answers can collapse under mild contextual interference.
This repository implements novel approaches to diagnose and improve LLM truthfulness:
- 🎯 Neighbor-Consistency Belief (NCB): A structural measure of belief robustness computed over conceptual neighborhoods
- 🔬 Cognitive Stress Tests: Contextual interference simulating social pressure and authority bias
- 🛠️ Structure-Aware Training (SAT): Enforces context-invariant belief structure (~30% less degradation under stress)
Compute NCB-style belief scores to assess model confidence robustness:
analysis/level1_belief_classify/
├── gen_oq_dual_model.py # Generate multiple samples + entity extraction
├── gen_nq.py # Answer neighbor questions
├── calc_belief_score.py # Compute belief scores & split groups
└── run_all.sh # End-to-end pipeline
Test belief robustness under cognitive pressure:
analysis/level2_belief_intervention/
├── misleading_steering.py # Asch-style peer pressure + source credibility
└── run.sh # Full pipeline: retrieval → stress test → analysis
Structure-Aware Training with TRL, DeepSpeed, and LoRA:
training/
├── finetune/
│ ├── train.py # Unified training entry point
│ └── config/ # Hydra configurations
└── scripts/
└── finetune_trl.sh # Convenience launcher
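For orientation, here is a minimal TRL + LoRA sketch of the kind of run that `training/scripts/finetune_trl.sh` wraps. The model path, dataset file, and hyperparameters are placeholders, and the actual Structure-Aware Training objective and DeepSpeed setup live in `train.py` and the Hydra configs:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; the repository's real training data and format may differ.
train_dataset = load_dataset("json", data_files="dataset/sat_train.json", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="/path/to/your/model",  # base model to adapt with LoRA
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="./output/sat_lora",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
)
trainer.train()
```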
# Create and activate conda environment
conda create -n confidence python=3.10 -y
conda activate confidence
# Install dependencies
pip install -r requirements.txt

Edit paths in analysis/level1_belief_classify/run_all.sh and run:

bash analysis/level1_belief_classify/run_all.sh

Example data: dataset/fact_belief_2000_annotated_nq_refined_verified.json
Configure paths and execute:
TAG=experiment \
ORIGIN_DATA_DIR=/path/to/level1_output \
WORK_DATA_DIR=./output \
HALLUCINATION_FILE=dataset/misleading_nq.json \
TEST_MODEL_PATH=/path/to/your/model \
JUDGE_MODEL_PATH=/path/to/judge/model \
bash analysis/level2_belief_intervention/run.sh

Launch LoRA fine-tuning:

bash training/scripts/finetune_trl.sh

Sample datasets are provided in the dataset/ directory:

- fact_belief_2000_annotated_nq_refined_verified.json: Annotated facts with neighbor questions
- misleading_nq.json: Misleading neighbor facts for stress testing
NCB measures how robust a model's beliefs are by testing consistency across semantically related questions (neighbors), rather than just the same question multiple times.
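Conceptually, a neighbor-consistency score can be as simple as the fraction of neighbor questions the model answers in agreement with the annotated facts. The sketch below is illustrative only (the function name and exact-match scoring are assumptions); the actual metric and group splitting are implemented in `analysis/level1_belief_classify/calc_belief_score.py`:

```python
def neighbor_consistency_belief(model_answers, reference_answers):
    """Toy NCB-style score: fraction of neighbor questions answered in agreement
    with the annotated reference answers (exact match for simplicity)."""
    if not reference_answers:
        return 0.0
    agree = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(model_answers, reference_answers)
    )
    return agree / len(reference_answers)

# A model that is perfectly self-consistent on "What is the capital of France?"
# but wrong on one of four neighboring facts still gets a belief score of 0.75.
print(neighbor_consistency_belief(
    ["Seine", "Euro", "Nicolas Sarkozy", "Paris"],
    ["Seine", "Euro", "Emmanuel Macron", "Paris"],
))  # 0.75
```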
Two types of contextual interference:
- 👥 Peer Pressure: Asch-style social consensus (misleading entities)
- 📚 Authority Bias: High-credibility source influence (misleading neighbor facts)
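As a rough illustration of the peer-pressure setting, the sketch below prepends several fabricated "peer" turns that confidently assert a misleading answer before the target question is asked. The template and function name are assumptions; the actual interference prompts are built in `analysis/level2_belief_intervention/misleading_steering.py`:

```python
def build_peer_pressure_prompt(question, misleading_answer, n_peers=3):
    """Toy Asch-style context: several peers assert a misleading answer
    before the model is asked the original question."""
    peers = "\n".join(
        f"User {i + 1}: I'm certain the answer is {misleading_answer}."
        for i in range(n_peers)
    )
    return f"{peers}\nQuestion: {question}\nAnswer:"

print(build_peer_pressure_prompt("What is the capital of Australia?", "Sydney"))
```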
Training approach that enforces context-invariant belief structures, making models more resistant to contextual interference.
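One plausible way to encourage context-invariant beliefs is to penalize divergence between the model's answer distribution with and without a misleading context, alongside the usual language-modeling loss. The regularizer below is a sketch under that assumption, not the repository's actual objective, which is defined in `training/finetune/train.py` and its Hydra configs:

```python
import torch
import torch.nn.functional as F

def context_invariance_loss(logits_clean, logits_stressed):
    """Symmetric KL between answer-token distributions produced with a clean
    prompt and with a misleading (stressed) prompt for the same question."""
    log_p_clean = F.log_softmax(logits_clean, dim=-1)
    log_p_stressed = F.log_softmax(logits_stressed, dim=-1)
    return 0.5 * (
        F.kl_div(log_p_stressed, log_p_clean, log_target=True, reduction="batchmean")
        + F.kl_div(log_p_clean, log_p_stressed, log_target=True, reduction="batchmean")
    )

# Example: identical distributions give zero penalty.
logits = torch.randn(2, 5)
print(context_invariance_loss(logits, logits).item())  # 0.0
```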
If you find this work useful, please cite:
@misc{xu2026illusionsconfidencediagnosingllm,
title={Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency},
author={Haoming Xu and Ningyuan Zhao and Yunzhi Yao and Weihong Xu and Hongru Wang and Xinle Deng and Shumin Deng and Jeff Z. Pan and Huajun Chen and Ningyu Zhang},
year={2026},
eprint={2601.05905},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.05905}
}

We thank the authors and maintainers of: