Five platforms, one harness: a head-to-head live translation benchmark
We put five live translation platforms head-to-head. Here's the methodology behind the comparison, and where our system lands in the industry.
Have you ever used an end-to-end live translation system, or seen one in action at an event? Ever wondered how anyone decides which one is best? Here’s how we evaluate ours, and how it stacks up against four of the competitors we hear about most often: Palabra, LiveVoice Pro, OpenAI GPT-Realtime-Translate, and Google Meet.
What we score, and how
Two scores capture what listeners actually experience: how accurate the translation is, and how long they wait for it. Both line up with how a careful human listener would grade the same audio.
GEMBA-MQM v2
GEMBA-MQM v2 uses a large language model to read what the speaker said and what the system translated side by side, and returns a list of the mistakes the system made. Each mistake is tagged by type (mistranslation, omission, hallucination, and so on) and severity. Severity sets the weight each error contributes to the session’s final score:
| Severity | Weight | Meaning |
|---|---|---|
| CRITICAL | 25 | Comprehension-blocking: the listener cannot understand what was meant |
| MAJOR | 5 | Disrupts flow: the meaning is recoverable but with effort |
| MINOR | 1 | Awkward but understandable |
| PUNCTUATION | 0.1 | Minor formatting (subset of minor) |
The session score is the negated sum of those weights, so closer to zero is better. A worked example, from a Spanish science talk:
Speaker (Spanish): “…si pudiéramos vivir sin agua…”
System’s translation: “…if we could live with water…”
GEMBA returns this:
| Error | Type | Severity | Weight |
|---|---|---|---|
| “sin agua” (“without water”) rendered as “with water”, reversing the meaning entirely | mistranslation | CRITICAL | 25 |
Score: −25. A single LLM judgment is noisy, so we run the scoring 10 times and aggregate the results. For the full methodology, see our GEMBA-MQM v2 deep-dive.
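Here is a minimal sketch of that arithmetic in Python. The weights come from the severity table above; everything else is an assumption for illustration: the `session_score` and `aggregate_runs` names, the shape of the error records, and the use of a plain mean to combine repeated runs (the actual aggregation is covered in the deep-dive, not here).

```python
from statistics import mean

# Severity weights, copied from the table above.
SEVERITY_WEIGHTS = {
    "CRITICAL": 25.0,
    "MAJOR": 5.0,
    "MINOR": 1.0,
    "PUNCTUATION": 0.1,
}


def session_score(errors):
    """Negated sum of severity weights for one scoring run (closer to zero is better)."""
    return -sum(SEVERITY_WEIGHTS[error["severity"]] for error in errors)


def aggregate_runs(runs):
    """Combine repeated scoring runs into one number.

    Assumption: a simple mean. The production aggregation may differ.
    """
    return mean([session_score(errors) for errors in runs])


# The worked example: a single CRITICAL mistranslation.
example_run = [{"type": "mistranslation", "severity": "CRITICAL"}]
print(session_score(example_run))  # -25.0
```

Averaging over runs means that if the judge catches the error in one run and misses it in another, the session lands between the two scores rather than at either extreme.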
Ear-Voice Span
For each phrase the speaker says, we find the matching translated phrase, then measure the time gap between when the source phrase started and when the translation started. That gap, in seconds, is the Ear-Voice Span (EVS) for that phrase.

Across a session we report the median (the typical delay) and the 90th percentile (the delay that all but the slowest 10% of phrases stay under). For the full methodology, see our EVS deep-dive.
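The two summary numbers can be sketched in a few lines of Python. This assumes the phrases have already been matched, so each pair holds a source start time and a translation start time in seconds. The timestamps, the function names, and the nearest-rank percentile definition are all illustrative choices, not the harness itself.

```python
import math
from statistics import median


def ear_voice_spans(phrase_pairs):
    """Per-phrase EVS: translation start minus source start, in seconds."""
    return [target_start - source_start for source_start, target_start in phrase_pairs]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical (source_start, translation_start) timestamps, in seconds.
pairs = [(0.0, 2.0), (4.0, 7.0), (9.0, 11.5), (15.0, 19.0), (21.0, 26.0)]
spans = ear_voice_spans(pairs)

print(median(spans))                       # 3.0
print(percentile_nearest_rank(spans, 90))  # 5.0
```

Reporting both numbers matters because a system can feel quick on most phrases and still leave listeners waiting on the long ones. The median hides that; the 90th percentile does not.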
Five platforms, head-to-head
We ran our evaluation on five systems using the same harness:
- VoiceFrom Pro
- LiveVoice Pro
- Palabra
- OpenAI GPT-Realtime-Translate
- Google Meet
Eight sessions of around six minutes each, across eight language pairs spanning five languages (English, Spanish, French, German, Portuguese). Same audio in, scored the same way. Google does not support translating from Spanish into Portuguese, so it has seven sessions in the chart instead of eight.
Translation accuracy

| System | Average GEMBA score |
|---|---|
| VoiceFrom | −51 |
| OpenAI | −107 |
| LiveVoice | −115 |
| Palabra | −164 |
| Google | −333 |
VoiceFrom leads in average accuracy by a wide margin.
Speech latency

Google and OpenAI are both faster than VoiceFrom at the median. The accuracy chart above shows what each pays for that speed.
Accuracy is where the platforms separate. The three sessions below show what that gap sounds like.
A French physics joke
The speaker is explaining cosmology in French. They reach the moment of the Big Bang and describe it as “un univers ponctuel”: a point-like universe. Then they make a joke: “ponctuel, ça veut pas dire qu’il est arrivé à l’heure” (punctual doesn’t mean it arrived on time). The joke only works if two things survive translation. First, ponctuel means both point-like (the physics term) and punctual (the everyday word), so the setup needs to land both meanings in the same word. Second, the punchline has to keep the physics right: the universe is the size of a point; it did not come from a point.
Palabra and OpenAI both reach for the technical translation, point-like, and lose the wordplay outright. Google does the same with one-off, which has no double meaning to anchor the punchline. LiveVoice keeps the word punctual but stretches the punchline across three sentences:
“…a punctual universe. Now, punctual doesn’t mean it arrived on time. It means it arrived from a point. It’s the size of a dot.”
The joke wants two beats. LiveVoice gets there, but only after a wrong detour. Only VoiceFrom Pro keeps the setup-and-punchline timing intact:
“…a punctual universe. Punctual does not mean that it arrived on time. It means that it is the size of a point.”
The same word, punctual, carries from the setup into the punchline, and the physics lands the way the speaker intended.
A Sinek contrast in German
The talk is Simon Sinek’s “Start With Why,” translated into German. Late in the talk, Sinek lands on his central contrast: “The goal is not to do business with everybody who needs what you have. The goal is to do business with people who believe what you believe.” The whole argument hinges on both halves landing. The structure is parallel by design: the goal is not… the goal is…, and an audience listening in German needs that parallel to carry across the language boundary.
LiveVoice inverts the first half, slipping in sondern (“but rather”) to affirm the very thing the source rules out. Palabra splits the negation across three sentences and lands in the same place — “Es ist mit jedem zu Geschäfte zu machen, der das braucht…” — affirming the people the source excludes. Google keeps the parallel structure but truncates the second half into three short fragments where one sentence should land. OpenAI keeps the structure and the meaning, with awkward German in the first half (“mit allen Geschäften zu machen” reads closer to “with all stores” than “with everyone”). Only VoiceFrom lands the parallel, the meaning, and the phrasing in one clean beat:
“Das Ziel ist es nicht, Geschäfte mit jedem zu machen, der das braucht, was sie haben. Das Ziel ist es, Geschäfte mit Menschen zu machen, die glauben, was sie glauben.”
The two halves stay distinct, the negation falls on the right side of the contrast, and the rhetorical beat lands the way Sinek wrote it.
A Spanish aside about anti-anxiety medication
A few minutes into the talk, the speaker is asking listeners to imagine cosmic infinity, then breaks the tension with an aside: “…algo que no termina, y no termina, y no termina, y no termina, y ¡un Rivotril, por favor!” Rivotril is a brand-name anti-anxiety medication in Spanish-speaking countries; asking for one is shorthand for I need to calm down. The line only works if the listener understands that a real drug is being named. Without that, the speaker just sounds like she’s listing random words.
Every other system misses the drug. LiveVoice hears “a rivet roll, please.” Palabra hears “a river drill, please.” OpenAI hears “a river trail, please.” Google hears “arrive at real, please.” Each one turns the aside into an absurd non-sequitur. VoiceFrom Pro lands within one letter of the real drug name, and the joke survives:
“…something that never ends and never ends and never ends and never ends. And also, could I please have a ribotril? Because that thought gave me a lot of anxiety, since I was a child.”
Try it at your next event
If accuracy is what your audience needs, VoiceFrom is the system to put on stage. Schedule a call and we’ll set up a pilot at your next event.
For the full methodology behind the numbers above, see the metric deep-dives: GEMBA-MQM v2 for accuracy, and Ear-Voice Span for latency.