Hallucinations have recently become one of the most critical topics for companies seeking to improve their use of large language models in AI systems such as ChatGPT. Part of the problem is that there are numerous ways to frame the issue. The obvious way to evaluate hallucinations is to ask people to review outputs, but human review does not scale well and is costly.
Over the past six months, several providers have developed metrics and processes to help automate this evaluation. Let’s take a closer look at TruEra, a leader in machine learning monitoring, testing, and quality assurance. In March, it introduced new techniques to detect and mitigate hallucinations as part of a broader open framework for language models. This came before the hallucination metrics introduced more recently by Galileo and Vectara.
TruEra defines hallucinations slightly differently from other providers. Its approach combines evaluation components, deep monitoring, event logging, and the ability to scale continuously to larger datasets. Shaike Sen, the founder and CTO of TruEra, claims that some competing approaches to managing hallucinations address only individual parts of the problem. This broader approach also shapes how the company thinks about hallucinations. Shaike Sen explains:
“A widely accepted definition of a hallucination is that it occurs when a language model produces inaccurate output. However, without a source of truth, this definition is impossible to apply. We promote a stricter definition: the output of a language system is a hallucination if it responds to a query in a way that does not verifiably and accurately represent the source of truth.”
One consequence of this definition is that using ChatGPT (or any other language model) as a question-answering system counts as hallucinating by default. Its objective is not to present facts but to generate plausible text, which may be correct or incorrect. That it is often correct is incidental. This may seem like a somewhat radical viewpoint, but the tendency of generative models to hallucinate should be viewed as a feature rather than a bug.
An Approach through Processes
Sen argues that one way to build systems that faithfully represent a source of truth is to structure interactions with language models around retrieval-augmented generation (RAG). In a RAG architecture, the task of the language model is not to produce facts but to summarize information retrieved from databases or APIs. In this context, a system can be checked for hallucinations by answering three questions:
Relevance to the query: Is the response relevant to the query?
Context relevance: Is the pulled context relevant to the query?
Verifiability: Is the response supported by the context?
If the answer to any of these questions is “No,” the system’s output could be misleading or irrelevant. In TruEra’s TruLens framework, each hallucination metric captures a different failure mode of a language model-based system.
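The three questions above can be sketched in a few lines of Python. This is only an illustration, not TruLens code: the `overlap` heuristic, the `TriadResult` class, and the 0.5 threshold are assumptions made for this example, and simple word overlap stands in for the model-based scoring a real evaluation tool would use.

```python
import re
from dataclasses import dataclass


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for semantic comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Fraction of the tokens in `a` that also appear in `b` (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta) if ta else 0.0


@dataclass
class TriadResult:
    answer_relevance: float   # Is the response relevant to the query?
    context_relevance: float  # Is the retrieved context relevant to the query?
    verifiability: float      # Is the response supported by the context?

    def failed_checks(self, threshold: float = 0.5) -> list[str]:
        """Names of the checks scoring below the (assumed) threshold."""
        return [name for name, score in vars(self).items() if score < threshold]


def evaluate(query: str, context: str, response: str) -> TriadResult:
    """Score one query/context/response triple on all three questions."""
    return TriadResult(
        answer_relevance=overlap(query, response),
        context_relevance=overlap(query, context),
        verifiability=overlap(response, context),
    )


# The response answers the question, but the retrieved context is unrelated,
# so the answer is neither grounded in nor supported by the context.
result = evaluate(
    query="When was the Eiffel Tower completed?",
    context="Bananas are rich in potassium.",
    response="The Eiffel Tower was completed in 1889.",
)
print(result.failed_checks())  # ['context_relevance', 'verifiability']
```

Keeping the three scores separate, rather than collapsing them into one number, is what makes it possible to tell a retrieval failure apart from an unsupported answer.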
Identifying the worst-performing metric helps teams focus on the part of the system that most needs improvement. For example, if an implementation hallucinates because it frequently uses irrelevant context, developers can prioritize improving the retrieval of relevant information.
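As a rough illustration of that prioritization step, the sketch below averages each metric over a batch of evaluated interactions and reports the lowest one. The function name `weakest_metric`, the metric names, and the scores are all hypothetical values chosen for this example.

```python
from statistics import mean


def weakest_metric(records: list[dict[str, float]]) -> tuple[str, float]:
    """Return the (metric name, mean score) with the lowest mean across records."""
    if not records:
        raise ValueError("at least one evaluation record is required")
    averages = {
        name: mean(record[name] for record in records)
        for name in records[0]
    }
    worst = min(averages, key=averages.get)
    return worst, averages[worst]


# Hypothetical per-interaction scores between 0.0 and 1.0.
records = [
    {"answer_relevance": 0.9, "context_relevance": 0.25, "verifiability": 0.7},
    {"answer_relevance": 0.8, "context_relevance": 0.5, "verifiability": 0.6},
]
print(weakest_metric(records))  # ('context_relevance', 0.375)
```

Here the low context relevance would point the team toward the retrieval step rather than the model or the prompt.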
It is also important to have a robust system that tracks performance changes over time as teams experiment with different system configurations. Evaluation and monitoring matter throughout the entire application lifecycle, which reduces the risk of focusing too narrowly on fixing individual instances while ignoring broader system quality.
Companies can apply these evaluations throughout the entire lifecycle of their systems, from development to production, in order to:
Build confidence that fundamental edge cases are covered before deployment.
Use evaluations to guide improvements of their systems by prioritizing the root causes of hallucinations.
Monitor performance over time to quickly detect and resolve regressions.
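The monitoring step in the list above could look something like the following sketch, which compares the latest evaluation scores against a stored baseline and flags any metric that has dropped noticeably. The function name `find_regressions`, the 0.05 tolerance, and the scores are assumptions for illustration only.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Map each metric that dropped by more than `tolerance` to the size of its drop."""
    return {
        name: baseline[name] - current[name]
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


# Hypothetical scores before and after a configuration change.
baseline = {"answer_relevance": 0.90, "context_relevance": 0.80, "verifiability": 0.85}
current = {"answer_relevance": 0.90, "context_relevance": 0.60, "verifiability": 0.84}

regressions = find_regressions(baseline, current)
print(sorted(regressions))  # ['context_relevance']
```

A check like this can run after every configuration change, so that a fix for one failing example does not quietly lower quality elsewhere.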
Understanding the root causes of the problem helps create a feedback loop that determines which fixes to apply. For example, if the system performs poorly at retrieving relevant context, then improving retrieval matters most. Conversely, if verifiability is the key issue, fine-tuning and prompt engineering to get good results on domain-specific data are likely to be the most effective approach. In either case, it is important to test the system systematically and track improvements.
One significant limitation of hallucination research is scale. Because most evaluations themselves rely on language models, they are challenging to apply in production. Shaike Sen said that future research and development will focus on ways to scale the evaluation of language model hallucinations algorithmically so that it can be applied efficiently to a larger number of use cases.
Additionally, more work will be needed to mitigate hallucinations that involve combining text, code, audio, video, and other types of data. Shaike Sen explains:
“As models grow larger and more diverse, the focus is increasingly shifting toward multimodal models, which means evaluations need to target multimodal scenarios. We need new tools to understand what hallucinations mean in a multimodal environment.”
Researchers and providers alike are currently working to find the most effective ways to measure and reduce hallucinations in AI. This will be crucial for scaling new generative AI models in enterprises. Moreover, some of the best hallucination metrics may turn out to be standalone, one-off techniques.
TruEra approaches the problem as part of a broader solution for streamlining the AI development lifecycle. Competing AI development tools will likely introduce similar capabilities soon. These could be offered directly in the tools or through third-party add-ons and platforms, much as quality assurance and testing are integrated into development environments today.
Additionally, many current hallucination metrics focus on the kind of conversation one might have with a chatbot. Different approaches are needed to improve the quality of code suggestions and other types of recommendations in the rapidly evolving world of AI-powered assistants and copilot applications.