April 30, 2026 · 9 min read

Five platforms, one harness: a head-to-head live translation benchmark

We put five live translation platforms head-to-head. Here's the methodology behind the comparison, and where our system lands in the industry.

Have you ever used an end-to-end live translation system, or seen one in action at an event? Ever wondered how anyone decides which one is best? Here’s how we evaluate ours, and how it stacks up against four of the competitors we hear about most often: Palabra, LiveVoice Pro, OpenAI GPT-Realtime-Translate, and Google Meet.

What we score, and how

Two scores capture what listeners actually experience: how accurate the translation is, and how long they wait for it. Both line up with how a careful human listener would grade the same audio.

GEMBA-MQM v2

GEMBA-MQM v2 uses a large language model to read what the speaker said and what the system translated side by side, and returns a list of the mistakes the system made. Each mistake is tagged by type (mistranslation, omission, hallucination, and so on) and severity. Severity sets the weight each error contributes to the session’s final score:

| Severity | Weight | Meaning |
| --- | --- | --- |
| CRITICAL | 25 | Comprehension-blocking: the listener cannot understand what was meant |
| MAJOR | 5 | Disrupts flow: the meaning is recoverable but with effort |
| MINOR | 1 | Awkward but understandable |
| PUNCTUATION | 0.1 | Minor formatting (subset of minor) |

The session score is the negated sum of those weights, so closer to zero is better. A worked example, from a Spanish science talk:

Speaker (Spanish): “…si pudiéramos vivir sin agua…”

System’s translation: “…if we could live with water…”

GEMBA returns this:

| Error | Type | Severity | Weight |
| --- | --- | --- | --- |
| “sin agua” (“without water”) rendered as “with water”, reversing the meaning entirely | mistranslation | CRITICAL | 25 |

Score: −25. A single LLM judgment is noisy, so we run the scoring 10 times and aggregate the results. For the full methodology, see our GEMBA-MQM v2 deep-dive.
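The scoring rule above is small enough to sketch in a few lines. This is a minimal illustration, not our production scorer: the function names are hypothetical, and the use of a plain mean to combine repeated runs is an assumption, since the aggregation method is only covered in the deep-dive.

```python
from statistics import mean

# Severity weights copied from the table above.
WEIGHTS = {"CRITICAL": 25, "MAJOR": 5, "MINOR": 1, "PUNCTUATION": 0.1}


def session_score(severities):
    """Negated sum of severity weights; 0 means no errors were flagged."""
    return -sum(WEIGHTS[s] for s in severities)


def aggregate(run_scores):
    """ASSUMPTION: combine repeated judge runs with a plain mean."""
    return mean(run_scores)


# The worked example: a single CRITICAL mistranslation.
print(session_score(["CRITICAL"]))  # -25
```

A session with one MAJOR and one MINOR error would score −6 under the same rule.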

Ear-Voice Span

For each phrase the speaker says, we find the matching translated phrase, then measure the time gap between when the source phrase started and when the translation started. That gap, in seconds, is the Ear-Voice Span (EVS) for that phrase.

Diagram titled 'Ear-Voice Span: time from source phrase to its translation'. Two stacked waveforms: the top track is the French speaker's audio with the phrase 'un univers ponctuel' highlighted; the bottom track is the system's English audio with 'a punctual universe' highlighted. A labeled measurement line between them reads 'EVS = 3.4 seconds'.

Across a session we report the median (the typical delay) and the 90th percentile (the delay that only the slowest 10% of phrases exceed). For the full methodology, see our EVS deep-dive.
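As a rough sketch of the arithmetic, assuming the phrase alignment has already been done and each phrase is reduced to a pair of start times: the timestamps below are made up for illustration, the function names are hypothetical, and the nearest-rank percentile is one of several common definitions, not necessarily the one our harness uses.

```python
import math
from statistics import median


def ear_voice_spans(aligned):
    """aligned: list of (source_start_s, translation_start_s) pairs."""
    return [t_start - s_start for s_start, t_start in aligned]


def percentile_nearest_rank(values, p):
    """ASSUMPTION: nearest-rank percentile; other definitions interpolate."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical timestamps in seconds, for illustration only.
aligned = [(0.0, 3.0), (4.0, 8.0), (9.0, 14.0), (15.0, 21.0), (22.0, 24.0)]
spans = ear_voice_spans(aligned)

print("median EVS:", median(spans))
print("p90 EVS:", percentile_nearest_rank(spans, 90))
```

Reporting both numbers matters because a system can have a comfortable median and still leave listeners waiting on its slowest phrases.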

Five platforms, head-to-head

We ran our evaluation on five systems using the same harness:

  • VoiceFrom Pro
  • LiveVoice Pro
  • Palabra
  • OpenAI GPT-Realtime-Translate
  • Google Meet

Each system translated eight sessions of around six minutes each, one per language pair, with the eight pairs spanning five languages (English, Spanish, French, German, Portuguese). Same audio in, scored the same way. Google Meet does not support translating from Spanish into Portuguese, so it has seven sessions in the chart instead of eight.

Translation accuracy

Grouped bar chart of normalized GEMBA error magnitude (lower is better) for VoiceFrom, OpenAI, Google, LiveVoice, and Palabra across eight language pairs. A star marks the winning bar in each pair: VoiceFrom takes six, OpenAI wins PT→EN, and LiveVoice wins EN→ES. Google has no bar for ES→PT. Across pairs the average score is −51 for VoiceFrom, −107 for OpenAI, −115 for LiveVoice, −164 for Palabra, and −333 for Google.

| System | Average GEMBA score |
| --- | --- |
| VoiceFrom | −51 |
| OpenAI | −107 |
| LiveVoice | −115 |
| Palabra | −164 |
| Google | −333 |

VoiceFrom leads in average accuracy by a wide margin.
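To put a number on that margin, here is a small sketch that expresses each system's average error magnitude as a multiple of the leader's. The scores are the averages reported above; the function name is hypothetical, and the ratios are only as comparable as the averages themselves (Google's covers seven sessions, not eight).

```python
# Average GEMBA scores as reported above (closer to zero is better).
average_score = {
    "VoiceFrom": -51,
    "OpenAI": -107,
    "LiveVoice": -115,
    "Palabra": -164,
    "Google": -333,
}


def error_ratio_vs_leader(scores):
    """Each system's error magnitude divided by the best system's."""
    leader = max(scores, key=scores.get)  # score closest to zero
    return {
        name: round(abs(score) / abs(scores[leader]), 1)
        for name, score in scores.items()
    }


for name, ratio in error_ratio_vs_leader(average_score).items():
    print(f"{name}: {ratio}x")
```

On these averages the runner-up carries roughly twice VoiceFrom's weighted error, and the last-place system roughly six and a half times.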

Speech latency

Bar chart of median Ear-Voice Span, sorted fastest to slowest: 3.4 seconds for Google, 5.4 seconds for OpenAI, 7.3 seconds for VoiceFrom, 7.9 seconds for Palabra, 10.0 seconds for LiveVoice.

Google and OpenAI are both faster than VoiceFrom at the median. The accuracy chart above shows what each pays for that speed.
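One way to read the two charts together is to ask which systems are beaten on both axes at once. The sketch below does that with the reported medians and averages; the dominance rule is an illustrative assumption, not part of our harness, and it treats the session-level averages as directly comparable.

```python
# (median EVS in seconds, average GEMBA score), as reported in the two charts.
results = {
    "Google": (3.4, -333),
    "OpenAI": (5.4, -107),
    "VoiceFrom": (7.3, -51),
    "Palabra": (7.9, -164),
    "LiveVoice": (10.0, -115),
}


def dominates(a, b):
    """a dominates b if it is no slower and no less accurate, and not identical."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b


def pareto_front(points):
    """Systems that no other system beats on both latency and accuracy."""
    return [
        name
        for name, point in points.items()
        if not any(dominates(other, point) for other in points.values())
    ]


print(pareto_front(results))  # ['Google', 'OpenAI', 'VoiceFrom']
```

The three systems on the front each trade speed for accuracy in a different proportion; none of them is both faster and more accurate than another.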

Accuracy is where the platforms separate. The three sessions below show what that gap sounds like.

A French physics joke

Caption: French physics talk → English. Original alongside each system's live translation. The wordplay on 'ponctuel' is around the 4:25 mark of the source.

The speaker is explaining cosmology in French. They reach the moment of the Big Bang and describe it as “un univers ponctuel”: a point-like universe. Then they make a joke: “ponctuel, ça veut pas dire qu’il est arrivé à l’heure” (punctual doesn’t mean it arrived on time). The joke survives translation only if two things hold. First, ponctuel means both point-like (the physics term) and punctual (the everyday word), so the setup has to land both meanings in the same word. Second, the punchline has to keep the physics right: the universe is the size of a point; it did not come from a point.

Palabra and OpenAI both reach for the technical translation, point-like, and lose the wordplay outright. Google loses it too, choosing one-off, which has no double meaning to anchor the punchline. LiveVoice keeps the word punctual but stretches the punchline across three sentences:

“…a punctual universe. Now, punctual doesn’t mean it arrived on time. It means it arrived from a point. It’s the size of a dot.”

The joke wants two beats. LiveVoice gets there, but only after a wrong detour. Only VoiceFrom Pro keeps the setup-and-punchline timing intact:

“…a punctual universe. Punctual does not mean that it arrived on time. It means that it is the size of a point.”

The same word, punctual, carries from the setup into the punchline, and the physics lands the way the speaker intended.

A Sinek contrast in German

Caption: Simon Sinek's 'Start With Why' (English) → German. Original alongside each system's live translation. The 'goal is not / goal is' contrast comes around the 5:25 mark of the source.

The talk is Simon Sinek’s “Start With Why,” translated into German. Late in the talk, Sinek lands on his central contrast: “The goal is not to do business with everybody who needs what you have. The goal is to do business with people who believe what you believe.” The whole argument hinges on both halves landing. The structure is parallel by design: the goal is not… the goal is…, and an audience listening in German needs that parallel to carry across the language boundary.

LiveVoice inverts the first half, slipping in sondern (“but rather”) to affirm the very thing the source rules out. Palabra splits the negation across three sentences and lands in the same place — “Es ist mit jedem zu Geschäfte zu machen, der das braucht…” — affirming the people the source excludes. Google keeps the parallel structure but truncates the second half into three short fragments where one sentence should land. OpenAI keeps the structure and the meaning, with awkward German in the first half (“mit allen Geschäften zu machen” reads closer to “with all stores” than “with everyone”). Only VoiceFrom lands the parallel, the meaning, and the phrasing in one clean beat:

“Das Ziel ist es nicht, Geschäfte mit jedem zu machen, der das braucht, was sie haben. Das Ziel ist es, Geschäfte mit Menschen zu machen, die glauben, was sie glauben.”

The two halves stay distinct, the negation falls on the right side of the contrast, and the rhetorical beat lands the way Sinek wrote it.

A Spanish aside about anti-anxiety medication

Caption: Spanish science talk → English. Original alongside each system's live translation. The Rivotril aside comes around the 1:35 mark of the source.

About a minute and a half into the talk, the speaker asks listeners to imagine cosmic infinity, then breaks the tension with an aside: “…algo que no termina, y no termina, y no termina, y no termina, y ¡un Rivotril, por favor!” Rivotril is a brand-name anti-anxiety medication sold in Spanish-speaking countries; asking for one is shorthand for I need to calm down. The line only works if the listener understands that a real drug is being named. Without that, the speaker just sounds like she’s listing random words.

Every other system misses the drug. LiveVoice hears “a rivet roll, please.” Palabra hears “a river drill, please.” OpenAI hears “a river trail, please.” Google hears “arrive at real, please.” Each one turns the aside into an absurd non-sequitur. VoiceFrom Pro lands within one letter of the real drug name, and the joke survives:

“…something that never ends and never ends and never ends and never ends. And also, could I please have a ribotril? Because that thought gave me a lot of anxiety, since I was a child.”
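The “within one letter” claim can be checked with a character-level edit distance between the drug name and each system's rendering. This is a rough sketch: Levenshtein distance on lowercase text is only a proxy for how close a rendering sounds, and it is not part of our scoring.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


target = "rivotril"
# Renderings quoted above, lowercased.
heard = ["ribotril", "rivet roll", "river drill", "river trail", "arrive at real"]

for phrase in heard:
    print(f"{phrase}: {levenshtein(target, phrase)}")
```

Only “ribotril” comes out at a distance of 1; every other rendering needs more than one edit to reach the real name.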

Try it at your next event

If accuracy is what your audience needs, VoiceFrom is the system to put on stage. Schedule a call and we’ll set up a pilot at your next event.

For the full methodology behind the numbers above, see the metric deep-dives: GEMBA-MQM v2 for accuracy, and Ear-Voice Span for latency.

Yahya Saleh

Applied ML Engineer

Yahya is an applied ML engineer at VoiceFrom. He builds the production-grade live speech-to-speech translation pipeline, turning recent research into systems that actually ship.