When converting spoken language into written text, transcription quality is crucial. Many AI-based transcription tools are available, each with a different level of performance. This blog post breaks down the key metrics that determine transcription quality, drawing on a recent benchmark study.
Key Metrics for Transcription Quality
Word Accuracy Rate
Definition: This metric measures the percentage of spoken words that a model transcribes correctly.
Performance: Reveal correctly transcribes a higher percentage of words than competitors such as OpenAI, Microsoft, and Google.
Language | Reveal | OpenAI | Microsoft | Google
---|---|---|---|---
English | 92.7% | 91.6% | 90.6% | 86.4%
Spanish | 95.2% | 94.0% | 92.8% | 91.3%
German | 92.5% | 91.6% | 91.1% | 87.3%
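A quick sanity check on these tables: each model's word accuracy and its WER in the next section sum to exactly 100%, so accuracy here is effectively 100% minus the WER. For illustration, here is a minimal sketch of one way to compute a word-level accuracy for a single reference/hypothesis pair; the difflib-based alignment and the function name are assumptions for this example, not the study's stated methodology.

```python
from difflib import SequenceMatcher

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference words that the hypothesis reproduces,
    using a longest-common-subsequence style word alignment."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = SequenceMatcher(None, ref_words, hyp_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)

# Example: one substitution ("meeting" -> "meaning") in a five-word reference.
print(word_accuracy("please join the meeting today",
                    "please join the meaning today"))  # 0.8
```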
Word Error Rate (WER)
Definition: WER measures the proportion of words transcribed incorrectly, counting insertions, deletions, and substitutions against the number of words in the reference transcript.
Performance: Lower WER indicates better transcription quality. Reveal’s model achieves a WER of 7.3% in English, 4.8% in Spanish, and 7.5% in German, lower than any of the other models tested.
Language | Reveal | OpenAI | Microsoft | Google
---|---|---|---|---
English | 7.3% | 8.4% | 9.4% | 13.6%
Spanish | 4.8% | 6.0% | 7.2% | 8.7%
German | 7.5% | 8.4% | 8.9% | 12.7%
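In formula terms, WER = (S + D + I) / N, where S, D, and I are the substitution, deletion, and insertion counts and N is the number of words in the reference. Below is a minimal, self-contained sketch using word-level edit distance; the lowercase normalization is a simplifying assumption, since the study's text preprocessing isn't described.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j]: minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference -> WER of 0.2 (20%).
print(word_error_rate("please join the meeting today",
                      "please join the meaning today"))  # 0.2
```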
Consecutive Error Types
Definition: This metric examines specific types of errors over long stretches of audio: fabrications (words incorrectly added), omissions (words missed), and hallucinations (strings of consecutive errors).
Performance: Reveal shows a 30% reduction in hallucination rate compared to Whisper Large-v3 (12.9% vs. 18.4%), along with lower rates of fabrications (5.2% vs. 8.8%) and omissions (5.6% vs. 7.1%).
Error Type | Reveal | Whisper Large-v3
---|---|---
Fabrications | 5.2% | 8.8%
Omissions | 5.6% | 7.1%
Hallucinations | 12.9% | 18.4%
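The study's exact operational definitions aren't given, but one plausible way to compute such a profile is to classify word-alignment operations: insertions as fabrications, deletions as omissions, and any sufficiently long run of consecutive errored words as a hallucination. In the sketch below, the three-word run threshold and the handling of replacements are illustrative assumptions.

```python
from difflib import SequenceMatcher

def error_profile(reference: str, hypothesis: str, min_run: int = 3):
    """Classify alignment errors: extra words as fabrications, missing
    words as omissions, and any stretch of min_run or more consecutive
    errored words as a hallucination run."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    fabrications = omissions = current = 0
    runs = []  # lengths of consecutive-error stretches
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "equal":
            if current:
                runs.append(current)
            current = 0
            continue
        if tag == "insert":
            fabrications += j2 - j1
        elif tag == "delete":
            omissions += i2 - i1
        else:  # "replace": surplus words on either side are added/missing
            fabrications += max(0, (j2 - j1) - (i2 - i1))
            omissions += max(0, (i2 - i1) - (j2 - j1))
        current += max(i2 - i1, j2 - j1)
    if current:
        runs.append(current)
    return {"fabrications": fabrications,
            "omissions": omissions,
            "hallucination_runs": [r for r in runs if r >= min_run]}

# A fabricated tail reads like a classic hallucination:
print(error_profile("the quarterly results were strong",
                    "the results were strong thank you for watching"))
# {'fabrications': 4, 'omissions': 1, 'hallucination_runs': [4]}
```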
Importance of Accurate Transcriptions
Accurate transcriptions are vital for various applications, including:
- Summaries: Creating accurate summaries of meetings, interviews, and conferences.
- Customer Insights: Analyzing customer calls to gather insights and improve service.
- Metadata Tagging: Adding accurate tags to audio content for better search and organization.
- Qualitative Research: Synthesizing qualitative research data, such as user and market research, to derive meaningful insights.
How Benchmarks Are Conducted
The benchmark study used over 250 hours of audio from various sources, including public datasets and in-house recordings, covering a wide range of speech types such as phone calls, podcasts, and broadcasts. Running every model on the same datasets ensured a comprehensive, like-for-like evaluation of each model’s performance.
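To make the setup concrete, here is a hypothetical sketch of such a harness; the models mapping, the transcribe callables, and the clip.wav/clip.txt dataset layout are all assumptions, and word_error_rate refers to the earlier sketch.

```python
from pathlib import Path
from statistics import mean

def run_benchmark(models: dict, dataset_dir: str) -> dict:
    """Score every model on every (audio, reference) pair and return the
    mean WER per model. Assumes each clip.wav has a sibling clip.txt."""
    results = {}
    for name, transcribe in models.items():  # transcribe: Path -> str
        wers = [
            word_error_rate(audio.with_suffix(".txt").read_text(),
                            transcribe(audio))
            for audio in sorted(Path(dataset_dir).glob("*.wav"))
        ]
        results[name] = mean(wers)
    return results

# Hypothetical usage, plugging in each vendor's SDK call:
# scores = run_benchmark({"reveal": reveal_sdk.transcribe,
#                         "whisper": run_whisper}, "datasets/english")
```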