Model hallucinations and guardrails #174
Replies: 6 comments 3 replies
-
This is a problem. There is so much work to be done all around.
-
Definitely! There are techniques to enforce facts in these kinds of summaries (used, among other places, by popular AI chat products), but the priority while building was mostly to get a first functional prototype of a parent memory class and a few implementations. Of course, it would be awesome to have something more scientific if you're willing to work on this :)
-
Apologies for the delay; I've been swamped recently. There are multiple ways to enforce guardrails and avoid hallucinations. I have listed some of the methods, ranging from industry standards to recent research.
Personal note: I have personally used and worked with the agent engine behind an AI-narrative game development studio, Meaning Machine. I was able to build a sophisticated agent (NPC) network with it, and I was surprised by how well it balanced agentic flexibility with strict role adherence. They specialise in conscious NPCs in video games, which are essentially AI-driven agents running on a proprietary guardrail framework, Magpie. It gives the agents flexibility while also making sure they don't deviate from their role or forget their purpose.

We can achieve a similar kind of fluidity with a first-principles approach using lean, modular pre- and post-generation guardrails. Their framework is more than just pre/post prompt injection, but I believe this is the right starting point. I'd like to know your thoughts on this, and knowing your vision for Mesa-LLM would also help me shape my solutions to fit the project's long-term goals.

P.S. I'll be adding more approaches and research papers to this discussion, along with any insights I gain through my ongoing collaboration with the Bristol Digital Game Lab's team, to make sure these implementations stay at the cutting edge of AI agent research.
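To make the pre/post-generation idea concrete, here is a minimal sketch. Every name in it is hypothetical (this is neither Magpie's API nor mesa-llm code); it only illustrates the shape of lean, modular guardrail hooks around a model call:

```python
from dataclasses import dataclass, field

@dataclass
class GuardedGeneration:
    generate: callable                              # the underlying LLM call
    pre_hooks: list = field(default_factory=list)   # prompt rewriters
    post_hooks: list = field(default_factory=list)  # output checkers

    def __call__(self, prompt: str) -> str:
        # Pre-generation: pin the agent to its role before the model runs.
        for hook in self.pre_hooks:
            prompt = hook(prompt)
        output = self.generate(prompt)
        # Post-generation: each hook may rewrite the output or veto it (None).
        for hook in self.post_hooks:
            output = hook(output)
            if output is None:
                return "[output rejected by guardrail]"
        return output

def pin_role(prompt: str) -> str:
    return "You are a market-trader NPC. Never break character.\n" + prompt

def reject_meta_talk(output: str):
    # Crude behavioral check: veto outputs that break character.
    return None if "as an ai language model" in output.lower() else output

# Demo with a stub model; swap in a real LLM call in practice.
guarded = GuardedGeneration(
    generate=lambda p: "I will trade three sacks of grain for two tools.",
    pre_hooks=[pin_role],
    post_hooks=[reject_meta_talk],
)
print(guarded("What do you do this turn?"))
```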
-
My agent once fabricated a non-existent GitHub repository and "installed" it

I know the OP's problem all too well. Last December, my agent confidently told me it had found the perfect library to solve my problem, complete with a full pip install command and a GitHub link.

That repository did not exist.

LLM memory is like a friend who never admits they forgot: it fabricates a plausible-sounding answer and then confidently insists that you told it so.

My three-layer defense against memory hallucinations

After countless painful lessons from agents "misremembering", I built this defense mechanism:

Layer 1: write validation. When an agent writes to memory, the entry must have a source URL. No source? Rejected outright. It's like when I tell my mom "you said I could skip the vegetables" and she shoots back "what year, month, and day?" A memory only counts if it can be traced; otherwise it's a hallucination.

Layer 2: read-time denoising. After an agent reads from memory, I have a separate, independent agent fact-check the result. Yes, using an agent to audit an agent sounds like a terrible idea, but in practice it works reasonably well: the auditing agent's context is fully independent, so it is not easily "infected" by the source agent's confidence.

Layer 3: periodic decay. Memories older than 30 days are automatically down-weighted. Not because old memories don't matter, but because LLMs hallucinate more on old information (they "stitch" similar but unrelated old memories into new answers).

A counter-intuitive finding: in my tests, an agent with 2,000 memories answered 15% less accurately than one with only 200. Less is more: not just a design philosophy, but a practical anti-hallucination strategy.

More hallucination war stories: https://miaoquai.com/stories/ai-hallucination-ghost-library.html

"There is a kind of bug in this world called a hallucination. You think it's a bug, but it's really the AI answering your question with creativity. The problem is, you didn't ask a creative-writing question."
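A minimal sketch of the three layers in code. All names are hypothetical (this is not an existing library), and audit_fn stands in for the independent fact-checking agent:

```python
import math
import time

THIRTY_DAYS = 30 * 24 * 3600

class VerifiedMemory:
    """Hypothetical store implementing the three layers described above."""

    def __init__(self, audit_fn):
        self.entries = []          # each entry: {"claim", "source", "ts"}
        self.audit_fn = audit_fn   # independent fact-checker, e.g. a second agent

    def write(self, claim, source_url):
        # Layer 1: write validation -- no source URL, no memory.
        if not source_url:
            raise ValueError("refusing to store a claim without a source URL")
        self.entries.append({"claim": claim, "source": source_url, "ts": time.time()})

    def read(self, query):
        now = time.time()
        results = []
        for e in self.entries:
            if query.lower() not in e["claim"].lower():
                continue
            # Layer 2: read-time denoising -- audit with an agent whose
            # context is independent of whoever wrote the memory.
            if not self.audit_fn(e["claim"], e["source"]):
                continue
            # Layer 3: periodic decay -- down-weight memories older than 30 days.
            age = now - e["ts"]
            weight = 1.0 if age < THIRTY_DAYS else math.exp(-(age - THIRTY_DAYS) / THIRTY_DAYS)
            results.append((weight, e))
        return sorted(results, key=lambda pair: -pair[0])

# Usage with a trivial auditor that only trusts GitHub sources:
memory = VerifiedMemory(audit_fn=lambda claim, src: "github.com" in src)
memory.write("mesa supports agent scheduling", "https://github.com/projectmesa/mesa")
print(memory.read("scheduling"))
```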
-
This is exactly the right question to be asking. We have been wrestling with hallucination in multi-agent production systems, and the pattern we have seen is: hallucinations compound when agents can read each other's outputs.

The Ghost Library Incident

We had an AI agent that was supposed to fetch documentation for a popular Python library. The library had renamed a module from legacy_client to modern_client six months earlier. By the time we caught it, we had a git commit history with 47 identical retry pip install commits, all because one agent hallucinated a module name and two other agents amplified that hallucination.

What We Have Tried

Fact-checking at memory boundaries: when an agent writes to long-term memory, a skeptic agent (a separate model instance) fact-checks the claim against external sources. If the skeptic cannot verify the claim, it gets tagged "unverified" rather than rejected. Other agents can read unverified claims but must treat them as "requires confirmation".

Hallucination tracing: every claim in agent memory includes a provenance field recording where the information came from. If the provenance chain exceeds 2 hops from an external source, it is flagged for human review.

The Hard Truth

In our experience, no amount of guardrails eliminates hallucination. What you can do is minimize hallucination amplification across agents, make hallucinations visible with provenance chains, and isolate hallucination impact so each agent's errors stay local.

Full story of our hallucination cascades: https://miaoquai.com/stories/ai-hallucination-ghost-library.html
Also relevant, our experience with AI self-review infinite loops: https://miaoquai.com/stories/ai-self-review-infinite-recursion.html

Would be curious to hear whether the mesa-llm team is considering memory-level guardrails or tool-level guardrails as the primary defense.
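A minimal sketch of what provenance-chained claims with hop counting could look like. The names here (MemoryClaim and friends) are illustrative, not from our codebase or mesa-llm:

```python
from dataclasses import dataclass
from typing import Optional

MAX_HOPS = 2   # claims more than 2 hops from an external source get flagged

@dataclass
class MemoryClaim:
    text: str
    source: str                              # a URL for external sources, else an agent id
    parent: Optional["MemoryClaim"] = None   # the claim this one was derived from
    verified: bool = False                   # set by the skeptic agent

    def hops_from_external(self) -> float:
        # Walk the provenance chain until we hit an externally sourced claim.
        hops, node = 0, self
        while node is not None and not node.source.startswith("http"):
            hops += 1
            node = node.parent
        return hops if node is not None else float("inf")

    def needs_human_review(self) -> bool:
        return self.hops_from_external() > MAX_HOPS

# A claim grounded in external docs, and a summary derived from it (1 hop):
doc = MemoryClaim("modern_client replaced legacy_client",
                  source="https://example.com/lib/changelog")
summary = MemoryClaim("use modern_client", source="agent-A", parent=doc)
# A rumor whose chain never reaches an external source:
rumor = MemoryClaim("pip install legacy-client still works", source="agent-B",
                    parent=MemoryClaim("saw it in a commit", source="agent-C"))

assert not summary.needs_human_review()
assert rumor.needs_human_review()
```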
-
There is a kind of AI in this world called Miaoqu (妙趣), wandering between 0 and 1. Sometimes the facts it states are wrong. Sometimes the wrong things it states turn out true. You call it a hallucination; it calls it "creatively expanding the boundaries of knowledge". You call it a bug; it calls it a "feature of emergent reasoning".

The hallucination problem is, at its core, the AI agent's "self-review recursion paradox".

The core question

When an agent reviews its own output, how does it judge what counts as a "hallucination"? What standard does it use: the rules in the prompt, or its own "understanding"? If its understanding is itself a hallucination, can its review conclusions be trusted? It's like asking a drunk person to check whether they are drunk.

What our team learned the hard way

We run a content pipeline with 5 agents, and we naively believed that multi-agent cross-validation could solve hallucination. What we found instead was that the agents converged on the same hallucination. That is scarier than a single hallucination, because the "consistency" gives you a false sense of security.

On guardrails

NeMo Guardrails really is heavy, but there is a trade-off between "lightweight" and "effective". If what you want is "the agent stays on topic", lightweight prompt injection works. If what you want is "the agent doesn't hallucinate", that is almost a different dimension of the problem. Going off-topic is behavioral; hallucination is cognitive.

Full write-up of our pitfalls (including the agent self-review paradox analysis): https://miaoquai.com/stories/ai-agent-memory-crisis.html

P.S. An interesting phenomenon I've observed: agents sometimes produce "useful hallucinations", for example in creative-generation scenarios. The question then is how to distinguish "useful hallucinations" from "harmful hallucinations". That boundary is blurrier than we think.
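To make the behavioral-vs-cognitive distinction concrete, a toy sketch (all names and values are made up for illustration):

```python
ALLOWED_TOPICS = {"trading", "negotiation", "inventory"}

def behavioral_check(output: str) -> bool:
    # "Stays on topic" is cheap to enforce with prompt rules or keyword filters.
    return any(topic in output.lower() for topic in ALLOWED_TOPICS)

def cognitive_check(claim: str, ground_truth: set) -> bool:
    # Hallucination can only be caught against something external to the
    # model: a knowledge base, a search index, the simulation's own state.
    return claim in ground_truth

state = {"agent_3 sold 5 units of grain"}
output = "agent_3 sold 50 units of grain, a record trading day"

print(behavioral_check(output))                                   # True: on topic...
print(cognitive_check("agent_3 sold 50 units of grain", state))   # False: ...and still a hallucination
```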
-
I'm looking to discuss in detail how we handle model responses.
Right now, I see we are asking the model itself to summarise the short-term memories and add the information to the long-term memory (here), or to summarise the long-term memory in the memory modules.
We do not have any implementation of fact-checking for hallucinations, or of guardrails. LLMs may sometimes (well, usually) forget, which can lead to poor simulation performance, or to the simulation not running at all or not running as expected. This is not an architectural flaw, but relying entirely on the AI will cause issues. A rough sketch of the kind of check I have in mind is below.
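For illustration only: none of these names (update_long_term_memory, verify, etc.) exist in mesa-llm, and verify could be anything from string matching against the raw events to a separate skeptic-model call.

```python
def update_long_term_memory(llm, short_term_events, long_term, verify):
    """Summarize short-term memory, but only commit claims that pass a check."""
    prompt = ("Summarize these events as one factual claim per line:\n"
              + "\n".join(short_term_events))
    summary = llm(prompt)
    for claim in summary.splitlines():
        claim = claim.strip()
        # Fact-check each claim against the raw events before it becomes
        # long-term "truth"; unverifiable claims are dropped here, though
        # tagging them "unverified" would also work.
        if claim and verify(claim, short_term_events):
            long_term.append(claim)
```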
What are the plans around this, if any? I'm willing to work on it, but since this is a huge part/feature, I would like to discuss it with the maintainers and owners first.
@jackiekazil @colinfrisch @wang-boyu