Recently, OpenAI's transcription tool Whisper was reported to suffer from a serious "hallucination" problem: during transcription, the tool generates long passages, sometimes entire sentences, of false content out of thin air. These hallucinations can include racial slurs, violent language, and even fabricated medical advice.
According to the engineers and researchers involved, hallucinations appeared in about half of more than 100 hours of transcriptions produced with Whisper, and some developers found hallucinations in nearly every one of the 26,000 transcripts they created with the tool.
OpenAI has warned against using Whisper in high-risk areas such as medical decision-making, and the company says it will continue to study ways to reduce hallucinations and to incorporate feedback mechanisms into model updates. In short, there is a real risk that Whisper will hallucinate, especially when its output is used without rigorous validation or in high-stakes applications.
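As a first line of defense, the segment-level statistics returned by the open-source whisper package (average log-probability, no-speech probability, compression ratio) can be used to flag segments that deserve human review. A minimal sketch follows; the threshold values are illustrative assumptions, not recommendations from OpenAI, and meeting.wav is a hypothetical file.

```python
# A minimal sketch of post-hoc validation for Whisper output.
# Requires the open-source `openai-whisper` package; the thresholds
# below are illustrative assumptions, not official recommendations.
import whisper

model = whisper.load_model("base")
result = model.transcribe("meeting.wav")  # hypothetical audio file

suspect = []
for seg in result["segments"]:
    low_confidence = seg["avg_logprob"] < -1.0     # decoder was unsure of the words
    likely_silence = seg["no_speech_prob"] > 0.6   # text may have been invented over silence
    repetitive = seg["compression_ratio"] > 2.4    # looping/repeated text is a hallucination sign
    if low_confidence or likely_silence or repetitive:
        suspect.append(seg)

for seg in suspect:
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}s] review: {seg['text']!r}")
```

Segments flagged this way are not necessarily hallucinated, but they are the natural places to start a manual check before the transcript is used for anything consequential.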
There are several reasons why the OpenAI tool Whisper hallucinates.
Biased training data. Whisper was trained on a dataset that contains biases, so the model learns spurious patterns and associations, which show up during transcription as content that does not match the actual audio.
Limitations of the model architecture. Whisper's architecture has weaknesses that make it more likely to hallucinate when the audio contains pauses, background noise, or music.
Lack of common-sense reasoning. Whisper does not have human-like common-sense reasoning and cannot judge from context whether information is true, so when it encounters ambiguous audio it tends to invent content to fill the gaps.
Effects on specific groups of speakers. Research has shown that Whisper is more prone to errors when analyzing speech that contains long pauses, so the problem weighs most heavily on speakers whose speech naturally includes them, and the resulting inaccurate transcripts create safety and ethical risks.
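Since long pauses are one known trigger, audio can be screened for them before transcription so that the corresponding parts of the transcript get extra scrutiny. The sketch below uses the whisper package's audio loader and a simple energy threshold; the frame size, silence threshold, minimum pause length, and interview.wav are all assumptions made for illustration.

```python
# A minimal sketch that screens audio for long pauses before transcription.
# Uses whisper's audio loader (16 kHz mono float32); the frame length,
# energy threshold, and minimum pause duration are illustrative assumptions.
import numpy as np
import whisper

SAMPLE_RATE = 16000    # whisper.load_audio always resamples to 16 kHz
FRAME = 400            # 25 ms frames
SILENCE_RMS = 0.01     # assumed energy threshold for "silence"
MIN_PAUSE_SEC = 2.0    # flag pauses longer than this

audio = whisper.load_audio("interview.wav")  # hypothetical file

# RMS energy per frame
n_frames = len(audio) // FRAME
frames = audio[: n_frames * FRAME].reshape(n_frames, FRAME)
rms = np.sqrt((frames ** 2).mean(axis=1))
silent = rms < SILENCE_RMS

# Collect runs of consecutive silent frames longer than the minimum pause
pauses, start = [], None
for i, is_silent in enumerate(silent):
    if is_silent and start is None:
        start = i
    elif not is_silent and start is not None:
        duration = (i - start) * FRAME / SAMPLE_RATE
        if duration >= MIN_PAUSE_SEC:
            pauses.append((start * FRAME / SAMPLE_RATE, duration))
        start = None
if start is not None:
    duration = (n_frames - start) * FRAME / SAMPLE_RATE
    if duration >= MIN_PAUSE_SEC:
        pauses.append((start * FRAME / SAMPLE_RATE, duration))

for t, duration in pauses:
    print(f"long pause at {t:.1f}s ({duration:.1f}s): check the transcript here for invented text")
```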
To mitigate hallucinations, several technical approaches exist. One is Mind's Mirror, which addresses the problem by integrating the self-assessment ability of a large language model into a smaller model through comprehensive knowledge transfer.
DRESS uses conditional reinforcement learning to let models incorporate natural language feedback into their responses, improving their alignment with human preferences and their interaction ability. MixAlign uses a language model to automatically align a user's question with stored knowledge, and generates clarifying questions for the user when the evidence is uncertain or unclear. Knowledge graphs (KGs) provide detailed, linked information about entities, such as specific diagnoses, that can support complex reasoning, data analysis, and information retrieval, and can thereby help alleviate hallucinations.
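To illustrate the knowledge-graph idea in the context of transcription, condition-treatment pairs pulled from a transcript can be checked against a graph of known relations, and anything that cannot be grounded is flagged for review. The graph contents, term pairs, and matching logic below are toy assumptions built with networkx; a real system would combine entity recognition with a full medical knowledge graph.

```python
# A toy sketch of grounding transcribed medical claims in a knowledge graph.
# The graph contents, the claim list, and the matching are illustrative
# assumptions; a real system would use NER plus a complete medical KG.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("hypertension", "lisinopril", relation="treated_by")
kg.add_edge("type 2 diabetes", "metformin", relation="treated_by")

def check_claim(condition: str, treatment: str) -> str:
    """Report whether a (condition, treatment) pair is grounded in the KG."""
    if not kg.has_node(condition):
        return f"unknown condition {condition!r}: flag for human review"
    if kg.has_edge(condition, treatment):
        return "supported by the knowledge graph"
    return f"no relation {condition!r} -> {treatment!r}: flag for human review"

# Pairs hypothetically extracted from a Whisper transcript
claims = [
    ("hypertension", "lisinopril"),          # grounded
    ("hypertension", "hydroxychloroquine"),  # possibly hallucinated advice
]
for condition, treatment in claims:
    print(condition, "+", treatment, "->", check_claim(condition, treatment))
```

The point of the design is that the graph only confirms what it already knows; unsupported pairs are not rejected outright but routed to a human, which is the conservative behavior wanted in medical settings.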