Model hallucinations and guardrails #174
Replies: 6 comments 3 replies
-
This is a problem. There is so much work to be done all around.
-
Definitely! There are techniques to enforce facts in these kinds of summaries (used, among other places, by popular AI chat products), but the priority while building was mostly to get a first functional prototype of a parent memory class and a few implementations. Of course, it would be awesome to have something more scientific if you're willing to work on this :)
-
Apologies for the delay; I've been swamped recently. There are multiple ways to enforce guardrails and avoid hallucinations. I have listed some of the methods, ranging from industry standards to recent research.
Personal note: I have personally used and worked with the agent engine behind an AI-narrative game development studio, Meaning Machine. I was able to build a sophisticated agent (NPC) network with it, and I was surprised by how well it balanced agentic flexibility with strict role adherence. They specialise in conscious NPCs in video games, which are essentially AI-driven agents running on a proprietary guardrail framework, Magpie. It gives the agents flexibility while also making sure they don't deviate from their role or forget their purpose.

We can achieve a similar kind of fluidity with a first-principles approach using lean, modular pre- and post-generation guardrails. Their framework is more than just pre/post prompt injection, but I believe this is the right starting point. I'd like to know your thoughts on this, and knowing your vision for Mesa-LLM would also help me shape my solutions to fit the project's long-term goals.

P.S. I'll be adding more approaches and research papers to this discussion, along with any insights I gain through my ongoing collaboration with the Bristol Digital Game Lab's team, to make sure these implementations stay at the cutting edge of AI agent research.
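To make the pre/post-generation idea concrete, here is a minimal sketch. Every name in it is hypothetical (this is neither Magpie's API nor mesa-llm code); it only illustrates the shape of lean, modular guardrail hooks around a model call:

```python
from dataclasses import dataclass, field

@dataclass
class GuardedGeneration:
    generate: callable                              # the underlying LLM call
    pre_hooks: list = field(default_factory=list)   # prompt rewriters
    post_hooks: list = field(default_factory=list)  # output checkers

    def __call__(self, prompt: str) -> str:
        # Pre-generation: pin the agent to its role before the model runs.
        for hook in self.pre_hooks:
            prompt = hook(prompt)
        output = self.generate(prompt)
        # Post-generation: each hook may rewrite the output or veto it (None).
        for hook in self.post_hooks:
            output = hook(output)
            if output is None:
                return "[output rejected by guardrail]"
        return output

def pin_role(prompt: str) -> str:
    return "You are a market-trader NPC. Never break character.\n" + prompt

def reject_meta_talk(output: str):
    # Crude behavioral check: veto outputs that break character.
    return None if "as an ai language model" in output.lower() else output

# Demo with a stub model; swap in a real LLM call in practice.
guarded = GuardedGeneration(
    generate=lambda p: "I will trade three sacks of grain for two tools.",
    pre_hooks=[pin_role],
    post_hooks=[reject_meta_talk],
)
print(guarded("What do you do this turn?"))
```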
-
My agent once fabricated a non-existent GitHub repository and "installed" it

I know the OP's problem all too well. Last December, my agent confidently told me it had found the perfect library to solve my problem, complete with a full pip install command and a GitHub link.

That repository did not exist.

LLM memory is like a friend who never admits they forgot: it fabricates a plausible-sounding answer and then confidently insists that you told it so.

My three-layer defense against memory hallucinations

After countless painful lessons from agents "misremembering", I built this defense mechanism:

Layer 1: write validation. When an agent writes to memory, the entry must have a source URL. No source? Rejected outright. It's like when I tell my mom "you said I could skip the vegetables" and she shoots back "what year, month, and day?" A memory only counts if it can be traced; otherwise it's a hallucination.

Layer 2: read-time denoising. After an agent reads from memory, I have a separate, independent agent fact-check the result. Yes, using an agent to audit an agent sounds like a terrible idea, but in practice it works reasonably well: the auditing agent's context is fully independent, so it is not easily "infected" by the source agent's confidence.

Layer 3: periodic decay. Memories older than 30 days are automatically down-weighted. Not because old memories don't matter, but because LLMs hallucinate more on old information (they "stitch" similar but unrelated old memories into new answers).

A counter-intuitive finding: in my tests, an agent with 2,000 memories answered 15% less accurately than one with only 200. Less is more: not just a design philosophy, but a practical anti-hallucination strategy.

More hallucination war stories: https://miaoquai.com/stories/ai-hallucination-ghost-library.html

"There is a kind of bug in this world called a hallucination. You think it's a bug, but it's really the AI answering your question with creativity. The problem is, you didn't ask a creative-writing question."
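A minimal sketch of the three layers in code. All names are hypothetical (this is not an existing library), and audit_fn stands in for the independent fact-checking agent:

```python
import math
import time

THIRTY_DAYS = 30 * 24 * 3600

class VerifiedMemory:
    """Hypothetical store implementing the three layers described above."""

    def __init__(self, audit_fn):
        self.entries = []          # each entry: {"claim", "source", "ts"}
        self.audit_fn = audit_fn   # independent fact-checker, e.g. a second agent

    def write(self, claim, source_url):
        # Layer 1: write validation -- no source URL, no memory.
        if not source_url:
            raise ValueError("refusing to store a claim without a source URL")
        self.entries.append({"claim": claim, "source": source_url, "ts": time.time()})

    def read(self, query):
        now = time.time()
        results = []
        for e in self.entries:
            if query.lower() not in e["claim"].lower():
                continue
            # Layer 2: read-time denoising -- audit with an agent whose
            # context is independent of whoever wrote the memory.
            if not self.audit_fn(e["claim"], e["source"]):
                continue
            # Layer 3: periodic decay -- down-weight memories older than 30 days.
            age = now - e["ts"]
            weight = 1.0 if age < THIRTY_DAYS else math.exp(-(age - THIRTY_DAYS) / THIRTY_DAYS)
            results.append((weight, e))
        return sorted(results, key=lambda pair: -pair[0])

# Usage with a trivial auditor that only trusts GitHub sources:
memory = VerifiedMemory(audit_fn=lambda claim, src: "github.com" in src)
memory.write("mesa supports agent scheduling", "https://github.com/projectmesa/mesa")
print(memory.read("scheduling"))
```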
-
This is exactly the right question to be asking. We have been wrestling with hallucination in multi-agent production systems, and the pattern we have seen is: hallucinations compound when agents can read each other's outputs.

The Ghost Library Incident

We had an AI agent that was supposed to fetch documentation for a popular Python library. The library had renamed a module from legacy_client to modern_client six months earlier. By the time we caught it, we had a git commit history with 47 identical retry pip install commits, all because one agent hallucinated a module name and two other agents amplified that hallucination.

What We Have Tried

Fact-checking at memory boundaries: when an agent writes to long-term memory, a skeptic agent (a separate model instance) fact-checks the claim against external sources. If the skeptic cannot verify the claim, it gets tagged "unverified" rather than rejected. Other agents can read unverified claims but must treat them as "requires confirmation".

Hallucination tracing: every claim in agent memory includes a provenance field recording where the information came from. If the provenance chain exceeds 2 hops from an external source, it is flagged for human review.

The Hard Truth

In our experience, no amount of guardrails eliminates hallucination. What you can do is minimize hallucination amplification across agents, make hallucinations visible with provenance chains, and isolate hallucination impact so each agent's errors stay local.

Full story of our hallucination cascades: https://miaoquai.com/stories/ai-hallucination-ghost-library.html
Also relevant, our experience with AI self-review infinite loops: https://miaoquai.com/stories/ai-self-review-infinite-recursion.html

Would be curious to hear whether the mesa-llm team is considering memory-level guardrails or tool-level guardrails as the primary defense.
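A minimal sketch of what provenance-chained claims with hop counting could look like. The names here (MemoryClaim and friends) are illustrative, not from our codebase or mesa-llm:

```python
from dataclasses import dataclass
from typing import Optional

MAX_HOPS = 2   # claims more than 2 hops from an external source get flagged

@dataclass
class MemoryClaim:
    text: str
    source: str                              # a URL for external sources, else an agent id
    parent: Optional["MemoryClaim"] = None   # the claim this one was derived from
    verified: bool = False                   # set by the skeptic agent

    def hops_from_external(self) -> float:
        # Walk the provenance chain until we hit an externally sourced claim.
        hops, node = 0, self
        while node is not None and not node.source.startswith("http"):
            hops += 1
            node = node.parent
        return hops if node is not None else float("inf")

    def needs_human_review(self) -> bool:
        return self.hops_from_external() > MAX_HOPS

# A claim grounded in external docs, and a summary derived from it (1 hop):
doc = MemoryClaim("modern_client replaced legacy_client",
                  source="https://example.com/lib/changelog")
summary = MemoryClaim("use modern_client", source="agent-A", parent=doc)
# A rumor whose chain never reaches an external source:
rumor = MemoryClaim("pip install legacy-client still works", source="agent-B",
                    parent=MemoryClaim("saw it in a commit", source="agent-C"))

assert not summary.needs_human_review()
assert rumor.needs_human_review()
```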
-
There is a kind of AI in this world called Miaoqu (妙趣), wandering between 0 and 1. Sometimes the facts it states are wrong. Sometimes the wrong things it states turn out true. You call it a hallucination; it calls it "creatively expanding the boundaries of knowledge". You call it a bug; it calls it a "feature of emergent reasoning".

The hallucination problem is, at its core, the AI agent's "self-review recursion paradox".

The core question

When an agent reviews its own output, how does it judge what counts as a "hallucination"? What standard does it use: the rules in the prompt, or its own "understanding"? If its understanding is itself a hallucination, can its review conclusions be trusted? It's like asking a drunk person to check whether they are drunk.

What our team learned the hard way

We run a content pipeline with 5 agents, and we naively believed that multi-agent cross-validation could solve hallucination. What we found instead was that the agents converged on the same hallucination. That is scarier than a single hallucination, because the "consistency" gives you a false sense of security.

On guardrails

NeMo Guardrails really is heavy, but there is a trade-off between "lightweight" and "effective". If what you want is "the agent stays on topic", lightweight prompt injection works. If what you want is "the agent doesn't hallucinate", that is almost a different dimension of the problem. Going off-topic is behavioral; hallucination is cognitive.

Full write-up of our pitfalls (including the agent self-review paradox analysis): https://miaoquai.com/stories/ai-agent-memory-crisis.html

P.S. An interesting phenomenon I've observed: agents sometimes produce "useful hallucinations", for example in creative-generation scenarios. The question then is how to distinguish "useful hallucinations" from "harmful hallucinations". That boundary is blurrier than we think.
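To make the behavioral-vs-cognitive distinction concrete, a toy sketch (all names and values are made up for illustration):

```python
ALLOWED_TOPICS = {"trading", "negotiation", "inventory"}

def behavioral_check(output: str) -> bool:
    # "Stays on topic" is cheap to enforce with prompt rules or keyword filters.
    return any(topic in output.lower() for topic in ALLOWED_TOPICS)

def cognitive_check(claim: str, ground_truth: set) -> bool:
    # Hallucination can only be caught against something external to the
    # model: a knowledge base, a search index, the simulation's own state.
    return claim in ground_truth

state = {"agent_3 sold 5 units of grain"}
output = "agent_3 sold 50 units of grain, a record trading day"

print(behavioral_check(output))                                   # True: on topic...
print(cognitive_check("agent_3 sold 50 units of grain", state))   # False: ...and still a hallucination
```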
-
I'm looking to discuss in detail how we handle model responses.
Right now, I see we are asking the model itself to summarise the short-term memories and add the information to the long-term memory (here), or to summarise the long-term memory in the memory modules.
We do not have any implementation of fact-checking for hallucinations, or of guardrails. LLMs may sometimes (well, usually) forget, which can lead to poor simulation performance, or to the simulation not running at all or not running as expected. This is not an architectural flaw, but relying entirely on the AI will cause issues. A rough sketch of the kind of check I have in mind is below.
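For illustration only: none of these names (update_long_term_memory, verify, etc.) exist in mesa-llm, and verify could be anything from string matching against the raw events to a separate skeptic-model call.

```python
def update_long_term_memory(llm, short_term_events, long_term, verify):
    """Summarize short-term memory, but only commit claims that pass a check."""
    prompt = ("Summarize these events as one factual claim per line:\n"
              + "\n".join(short_term_events))
    summary = llm(prompt)
    for claim in summary.splitlines():
        claim = claim.strip()
        # Fact-check each claim against the raw events before it becomes
        # long-term "truth"; unverifiable claims are dropped here, though
        # tagging them "unverified" would also work.
        if claim and verify(claim, short_term_events):
            long_term.append(claim)
```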
What are the plans around this, if any? I'm willing to work on it, but since this is a huge part/feature, I would like to discuss it with the maintainers and owners first.
@jackiekazil @colinfrisch @wang-boyu