Practical MT Evaluation For Translators
Subjective judgements reveal nothing about a translator’s experience

Thank you, Isabella Massardo, for your article, Who Is A Translator’s New Best Friend?, as part of our Blogging Translator Review Program. When she invited me to write a guest post, I decided to clarify Slate’s evaluation scores and demonstrate how they help you compare your engine’s performance to another engine’s, like Google’s. This article is a re-post of my guest post on her blog: http://massardo.com/blog/mt-evaluation/

This re-post also includes two appendices that made the article too long for Isabella’s blog. The first is a glossary of our score terms. The second shows twelve (12) example source-target segments plus the output translations from Isabella’s engine and Google.


The Need For MT Evaluation

Subjective observations of machine translation (MT) linguistic quality are quick and easy for a few example segments of 35-40 words, but they reveal nothing about long-term translation quality or the translator’s experience across several projects of 10,000 words each.

A truly objective, accurate and automated evaluation of MT linguistic quality is beyond today’s state of the art. In fact, this deficit is what leads to the poor quality of MT output in the first place. That doesn’t mean MT is useless; translators use MT every day.

What are MT evaluations good for if they can’t accurately report a translation’s quality?

Slate’s evaluation scores do not tell you about the quality of an engine’s translations. Instead, Slate focuses on describing engine criteria that can be measured objectively. Here, I generically refer to these criteria as an engine’s “linguistic performance.” The scores indicate how an engine might reduce or increase a translator’s workload compared to another engine. With objective evaluation scores, you can better predict how an engine might affect your work efficiency in the long term.

So, let’s look at the best practices of MT evaluation. Then, I’ll review Isabella’s engine scores with a focus on how they relate to her client’s work. Finally, I’ll compare Google’s output from the same evaluation segments with Isabella’s engine results.

Evaluation Best Practices

Current MT evaluation best practices require an evaluation set with 2,000-3,000 source-target segment pairs. The source segments represent the variety of work that the translator is likely to encounter. The target segments represent the desired reference translations.

The evaluation process uses the MT engine you’re evaluating to create “test” segments from the evaluation set’s source segments. It then measures each “test” segment against its respective “reference” and assigns a “closeness” score. These are like fuzzy match scores, but they compare test segments to target (reference) segments rather than source segments to TM segments. The process accumulates the individual scores, like an average, to describe how the engine performed with that evaluation set. A performance description for one engine has some value, but it’s much more valuable to compare descriptions from different engines on the same evaluation set, because the comparison tells us which engine performs better.
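
To make the process concrete, here is a minimal sketch of that scoring loop in Python. It assumes a hypothetical engine object with a translate() method and uses the open-source sacrebleu package for the closeness scores; it is only an illustration of the workflow, not Slate’s actual code.

```python
# Minimal sketch of an MT evaluation loop (an illustration, not Slate's code).
# Assumes a hypothetical `engine` object with a translate() method and the
# open-source sacrebleu package (pip install sacrebleu).
import sacrebleu

def evaluate(engine, eval_set):
    """eval_set is a list of (source, reference) segment pairs."""
    sources = [src for src, ref in eval_set]
    references = [ref for src, ref in eval_set]

    # 1. Use the engine to create a "test" segment for every source segment.
    tests = [engine.translate(src) for src in sources]

    # 2. Measure the closeness of each test segment to its reference.
    per_segment = [
        sacrebleu.sentence_bleu(test, [ref]).score
        for test, ref in zip(tests, references)
    ]

    # 3. Accumulate the individual scores into one corpus-level description.
    corpus_score = sacrebleu.corpus_bleu(tests, [references]).score
    return corpus_score, per_segment
```

Running the same function with two different engines over the same evaluation set gives you directly comparable corpus scores.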

Measuring Isabella’s Engine

Isabella reported she started with three .tmx files and 250,768 segment pairs from the same client since 2003. Her Engine Summary (image below) shows Slate built Isabella’s engine from 119,053 segments after it removed 131,715 segment pairs (53%) for technical reasons. You can learn more about translation memory preparation on our support site.

Slate randomly removed and set aside 2,353 segment pairs that represent Isabella’s 14 years of work as the evaluation set, leaving 116,700 pairs to create the engine’s statistical models. During the evaluation process, the source segments are like a new project from the engine’s viewpoint. That is, the engine is not recalling segments that were used to build it. This evaluation strategy gives 95% confidence that the engine will perform similarly when Isabella gets a new project from this client.
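
As an illustration of that holdout step (the counts are Isabella’s, but the code is only a sketch of the general technique, not Slate’s implementation):

```python
# Sketch of a random evaluation holdout (illustrative, not Slate's code).
import random

def split_evaluation_set(segment_pairs, eval_size=2353, seed=42):
    """Randomly set aside eval_size pairs for evaluation; build with the rest."""
    random.seed(seed)
    held_out = set(random.sample(range(len(segment_pairs)), eval_size))
    eval_set = [pair for i, pair in enumerate(segment_pairs) if i in held_out]
    build_set = [pair for i, pair in enumerate(segment_pairs) if i not in held_out]
    return build_set, eval_set

# With Isabella's 119,053 prepared pairs, this leaves 116,700 pairs for the
# statistical models and 2,353 unseen pairs for the evaluation.
```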

Isabella’s Engine vs Google

Before I could compare the performance of Isabella’s engine to Google’s, Isabella graciously granted me permission to translate her evaluation set’s 2,353 source segments with Google Translate. Here are Google’s evaluation scores side-by-side with Isabella’s.

Evaluation Set
  Segment count                                2,353
  Average segment length (words per segment)   16.5

Evaluation Scores                              Google Translate   Isabella’s
                                               (en-it)            (en-it-ns_test)
  Date                                         2017-08-11         2017-07-29
  Evaluation BLEU score (all)                  33.07              69.33
  Evaluation BLEU score (1.0 filtered)         32.47              61.82
  Quality quotient                             4.33%              29.75%
  Edit Distance per line (non-zero)            42                 32
  Exact matches count                          102                700
  Edit Distance entire project                 93,605             52,856
  Average segment length (exact matches)       4.7                11.4

The table above includes a variety of scores, but these are the three that I rely on the most: the Average segment length, the Quality quotient, and the Evaluation BLEU score (1.0 filtered).

The average segment length of source segments in the evaluation set tells us whether Isabella’s translation memories are heavily weighted with terms, such as entries from a termbase. Isabella’s average of 16.5 words is normal, and her translation memories likely include a good balance of short and long segments. If the average were very small (for example, 4 or 5 words), the engine would work poorly with long sentences.

The quality quotient (QQ) score means it’s likely that Isabella will simply review up to 30% of segments as exact matches when she works with her engine on her client’s future projects. Exact matches with this engine are 7 times more likely than if she did the same work with Google.

The evaluation BLEU score (filtered) represents the amount of typing and/or dictation work Isabella will need to do when her engine fails to suggest an exact match. Her engine’s score of 61.8 indicates that its segments are likely to require less work than segments from Google, which scored 32.5. It’s important to note that this evaluation set’s Google BLEU score is comparable to Google’s scores in other published evaluations.

Putting It All Together

Isabella described her translation memories as client-specific, containing mostly her own translations, those of a trusted colleague, and some from unknown colleagues. She said, “All in all, a great mess” because they contain some terminological discrepancies, long convoluted segments, and one-word segments. She created her engine on her 4-year-old laptop computer in less than a day without any specialized training.

Isabella’s evaluation set is a representative subset of the TMs that Slate used to build the engine. The evaluation set’s scores show that her engine significantly outperforms Google Translate in every measured category. Furthermore, because of how Slate created the evaluation set, and because her translation memories are primarily her own work for this client, she has a 95% likelihood of experiencing similar performance with future work from that client.

When Isabella works on projects with Slate, her engine is likely to give her 7 out of every 10 segments that require changes (the converse of the QQ). Like many users, she might find these suggestions overwhelming because she’s accustomed to her CAT tool hiding the suggestions from poor fuzzy matches. Still, 70% represents much less work than the 96% she would likely receive from Google. With a little practice, it’s easy and fast to trash segments that require radical changes and start from scratch.

There’s no way to predict how her engine will perform with work from other clients or other subject matter. The nature of statistical machine translation technology tells us that performance degrades as a project’s linguistic content diverges from the engine corpus’ content. Isabella’s engine’s performance could drop significantly for projects with disparate linguistic content. Fortunately, Isabella controls her engine, and Slate gives her some tools to clean up the “great mess,” for example forced terminology files to resolve the terminological discrepancies.

This was her first engine and she can experiment to her heart’s content. She can create as many engines as she likes. She can mix various translation memories and compare their performance, much like I compared her engine to Google in this article. Furthermore, she can experiment without any additional cost. If she has translation memories for five clients, she can create one engine for each of them or one that combines all. I look forward to hearing about her experiments.

When using Google Translate, Isabella needs to wait for Google to update and improve their engine. For example, her Google results reflect Google’s recent update of its en-it engine to neural machine translation (NMT), and the scores include those improvements. To Google’s credit, it handles variations across different subjects better than Isabella’s engine likely will. As Isabella pointed out, Google “has been constantly improving since inception.” So, across many different subjects, Google will continue to deliver 4% to 5% exact matches.

Fortunately, Isabella doesn’t face an either-or decision. Isabella’s first Slate Desktop engine performs well with her client’s projects, but we don’t know how it will perform with other projects. It costs her nothing to try it or improve it. Finally, she can also use Google whenever she feels it might be beneficial.

 

Appendix I – Scores Glossary

Slate’s Engine Summary and the table above present standard and customized machine translation scores. These scores are designed to help you predict the performance you can expect when using the engine.

Evaluation BLEU score (all)

The Evaluation BLEU score (all) is a standard score for the accumulative closeness (like an average) of all segments in the entire evaluation set. Higher scores, like the 69.33 from Isabella’s engine, mean its test segments more closely match the evaluation set’s human references. Lower scores, like the Google engine’s 33.07 score, mean the engine’s test segments are less of a match to the references.

Let’s look at BLEU scores in light of your translation memory experience. A segment’s fuzzy score indicates how closely a project’s source segment matches a source segment in your translation memories. A BLEU score is a closeness score that indicates how closely the MT engine’s test segment matches the reference segment. Higher scores mean a closer match. A score of 100 (also 1.0) means the two are an exact match.

Evaluation BLEU score (1.0 filtered)

The Evaluation BLEU score (1.0 filtered) is a customized accumulated BLEU score that represents the amount of typing or speech dictation (work) required to correct the test segments to match the references in the evaluation set. This accumulative BLEU score is calculated after removing the BLEU 100 (exact match) segments. It drops relative to the unfiltered score and more accurately represents the translator’s extra workload for the non-exact-match test segments.

Isabella’s results demonstrate this effect. The Google BLEU score dropped from 33.07 to 32.47 (0.6 points) because there are 102 exact matches out of the evaluation set’s 2,353 segments. Isabella’s engine dropped from 69.33 to 61.82 (7.51 points) because there are 700 exact matches in the set.
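
As a rough sketch of how such a filtered score could be computed (again assuming the sacrebleu package; Slate’s exact filtering may differ):

```python
# Sketch: recompute corpus BLEU after dropping the exact-match (BLEU 100) segments.
# Assumes the sacrebleu package; this is not Slate's actual implementation.
import sacrebleu

def filtered_bleu(tests, references):
    """Corpus BLEU over only the segments that are not exact matches."""
    remaining = [
        (test, ref)
        for test, ref in zip(tests, references)
        if sacrebleu.sentence_bleu(test, [ref]).score < 100.0
    ]
    if not remaining:
        return 100.0  # every segment was an exact match
    kept_tests = [test for test, ref in remaining]
    kept_refs = [ref for test, ref in remaining]
    return sacrebleu.corpus_bleu(kept_tests, [kept_refs]).score
```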

Quality quotient

The Quality quotient (QQ) is a customized score that reports the percentage of exact match test segments for a given engine with that evaluation set. Like an intelligence quotient (IQ), the QQ represents the engine’s capacity to create segments that require zero editing. When a translation project closely matches the evaluation set, the translator using that engine will experience nearly this percentage of exact match suggestions and a representative number of close matches. When a project’s content diverges from the evaluation set, the translator will receive fewer exact and close matches.
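
The arithmetic behind the QQ is simple. A quick sketch using the counts from the table above:

```python
# Quality quotient = exact-match test segments / total evaluation segments.
def quality_quotient(exact_matches, total_segments):
    return 100.0 * exact_matches / total_segments

print(quality_quotient(700, 2353))   # Isabella's engine: ~29.75%
print(quality_quotient(102, 2353))   # Google Translate:  ~4.33%
```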

Edit Distance per line (non-zero)

The Edit Distance per line (non-zero) score is a customized score that reports the accumulated edit distance of test segments divided by the number of segments that have non-zero edit distance scores. That is, it’s the average edit distance per segment for non-exact matches.

Edit distance (aka Levenshtein distance) is an alternative scoring system to BLEU. The score represents the number of character changes that were necessary to transform a test segment into its reference. It’s like BLEU because you can calculate a score per segment or accumulate scores for the evaluation set. It’s the opposite of BLEU because low scores represent less work, and exact-match test segments score zero, not 100. Some CAT tools, like memoQ, can report edit distance scores for your projects.
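
For readers who want to see the calculation, here is a minimal Levenshtein implementation and the per-line (non-zero) average described above (a sketch of the standard algorithm, not Slate’s code):

```python
# Sketch: character-level Levenshtein distance and the per-line (non-zero) average.
def levenshtein(a, b):
    """Character insertions, deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete a character
                            curr[j - 1] + 1,            # insert a character
                            prev[j - 1] + (ca != cb)))  # substitute a character
        prev = curr
    return prev[-1]

def edit_distance_per_line_nonzero(tests, references):
    distances = [levenshtein(t, r) for t, r in zip(tests, references)]
    nonzero = [d for d in distances if d > 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0
```

Dividing the accumulated totals from the table (93,605 for Google, 52,856 for Isabella’s engine) by the number of non-exact-match segments (2,251 and 1,653) gives the 42 and 32 shown above.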

Exact matches count

The Exact matches count is simply the number of exact-match test segments that the engine created. This number is in the table for convenience, but it is not shown in the screenshot’s Engine Summary.

Edit Distance entire project

The Edit Distance entire project is a standard score that reports the accumulative edit distance for the entire evaluation set. This score is in the table for convenience, but it is not in the screenshot’s Engine Summary.

Average segment length (exact matches)

Statistical machine translation performance degrades with longer sentences. The table above shows the Average segment length for exact matches by engine. Isabella’s engine does a much better job than Google at creating exact matches for longer sentences.


 

Appendix II – Evaluation Set Examples

The table below shows where Isabella’s engine and Google agree and disagree. I selected the three longest segments from each of the four categories to share as examples. The number in parentheses is the test segment’s BLEU score.

                             Google
                             Not Matches    Exact Matches   Total
  Isabella’s  Not Matches    1,627 (69%)    26 (1%)         1,653 (70%)
  Engine      Exact Matches  624 (27%)      76 (3%)         700 (30%)
              Total          2,251 (96%)    102 (4%)        2,353 (100%)

1. Neither Isabella’s nor Google’s is an exact match (1,627 of 2,353 = 69%)

EN Source: Unless otherwise specified in these Terms, all information and screens appearing on this App, including documents, services, App design, text, graphics, logos, images and icons, as well as the arrangement thereof, are the sole property of TRADEMARK.
IT Reference: Se non diversamente specificato nelle presenti Condizioni, tutte le informazioni e le schermate che appaiono sull’Applicazione, tra cui documenti, servizi, design dell’Applicazione, testo, grafica, loghi, immagini e icone, come pure la disposizione stessa, sono di proprietà esclusiva di TRADEMARK.
Isabella’s (82.5 BLEU): Se non diversamente specificato nelle presenti Condizioni, tutte le informazioni e le schermate che appare sull’Applicazione, tra cui documenti, servizi, design dell’Applicazione, testo, grafici), loghi, immagini e icone, nonché con la stessa, sono di proprietà esclusiva di TRADEMARK.
Google’s (52.0 BLEU): Salvo diversa indicazione nei presenti Termini, tutte le informazioni e le schermate che appaiono su questa applicazione, inclusi documenti, servizi, disegno di app, testo, grafica, loghi, immagini e icone, nonché la loro disposizione, sono la sola proprietà di TRADEMARK.

EN Source: Such information might contain personalized or general information about TRADEMARK, its products, its business opportunity; but also newsletters or alerts, invitations to upcoming events; or generally, information about any other related matters that the Distributor may deem to be of interest to you.
IT Reference: Tali dati potrebbero contenere dati personalizzati o generali su TRADEMARK, i suoi prodotti, la sua opportunità commerciale; ma anche le newsletter o avvisi, inviti a eventi imminenti; o, in generale, informazioni su eventuali altre questioni relative che l’Incaricato può ritenere siano importanti per voi.
Isabella’s (81.4 BLEU): Tali dati potrebbero contenere dati personalizzati o generali su TRADEMARK, i suoi prodotti, la sua opportunità commerciale; ma anche le newsletter o annunci, inviti a eventi di prossima; o, in generale, informazioni su eventuali altre questioni relative che l’Incaricato può essere ritenutiinteresse.
Google’s (31.7 BLEU): Tali informazioni potrebbero contenere informazioni personalizzate o generali su TRADEMARK, sui suoi prodotti, sulla sua opportunità di business; Ma anche newsletter o avvisi, inviti a eventi prossimi; O generalmente, informazioni su qualsiasi altra questione relativa che il Distributore possa ritenere interessante.

EN Source: This prestigious partnership, combined with the new partnership with the UNIVERSITY, will deepen the company’s understanding of genes and their impact on skin ageing, so we can use this knowledge to continuously develop new, innovative science far into the future.
IT Reference: Grazie a questa collaborazione prestigiosa, unitamente al nuovo rapporto di collaborazione con la UNIVERSITY, la Società potrà approfondire la propria comprensione delle componenti genetiche e del loro impatto sull’invecchiamento della pelle. Potremo quindi utilizzare queste conoscenze per sviluppare una nuova scienza innovativa per molti anni a venire.
Isabella’s (76.7 BLEU): questa collaborazione prestigiosa, unitamente al nuovo rapporto di collaborazione con la UNIVERSITY, la Società potrà approfondire la propria comprensione delle componenti genetiche e del loro impatto sull’invecchiamento della pelle. Potremo quindi utilizzare queste conoscenze per sviluppare una nuova scienza innovativa per molti anni a venire.
Google’s (23.1 BLEU): Questo prestigioso partenariato, unitamente alla nuova partnership con UNIVERSITY, approfondirà la comprensione della società sui geni e sul loro impatto sull’invecchiamento cutaneo, in modo da poter utilizzare questa conoscenza per sviluppare continuamente nuove e innovative scienze in futuro.

2. Google’s is an exact match; Isabella’s is not (26 of 2,353 = 1%)

EN Source: Butter, olive oil, margarine, canola oil, sunflower oil, sesame oil.
IT Reference: Burro, olio d’oliva, margarina, olio di canola, olio di semi di girasole, olio di sesamo.
Isabella’s (80.3 BLEU): Burro, olio d’oliva, margarina, olio di colza, olio di girasole, olio di sesamo.
Google’s (100 BLEU): Burro, olio d’oliva, margarina, olio di canola, olio di semi di girasole, olio di sesamo.

EN Source: The number of variants selected exceeds the maximum.
IT Reference: Il numero di varianti selezionate supera il massimo.
Isabella’s (65.8 BLEU): Il numero di variants selezionate supera il massimo.
Google’s (100 BLEU): Il numero di varianti selezionate supera il massimo.

EN Source: Many scan results are above 40,000.
IT Reference: Molti risultati di scansione sono superiori a 40.000.
Isabella’s (66.9 BLEU): Molti risultati di scansione sono superiori 40.000.
Google’s (100 BLEU): Molti risultati di scansione sono superiori a 40.000.

3. Isabella’s is an exact match; Google’s is not (624 of 2,353 = 27%)

EN Source: If in the opinion of the Company, you are buying and returning Products merely to maintain your Executive status level, the Company may at its discretion change your distributor title to reflect the level of your activity as set forth in the SCP.
IT Reference: Qualora la Società ritenga che l’Incaricato stia acquistando e restituendo prodotti esclusivamente per conservare la propria qualifica di Executive, essa potrà a sua discrezione modificare il titolo dell’Incaricato in modo da rispecchiare il suo livello di attività secondo quanto indicato nel Piano dei Compensi.
Isabella’s (100 BLEU): Qualora la Società ritenga che l’Incaricato stia acquistando e restituendo prodotti esclusivamente per conservare la propria qualifica di Executive, essa potrà a sua discrezione modificare il titolo dell’Incaricato in modo da rispecchiare il suo livello di attività secondo quanto indicato nel Piano dei Compensi.
Google’s (15.9 BLEU): Se, a giudizio della Società, stai acquistando e restituendo Prodotti solo per mantenere il tuo livello di Esecutivo, la Società potrà a sua discrezione modificare il tuo titolo di distributore per riflettere il livello della tua attività come indicato nella SCP.

EN Source: As the company continues to develop comprehensive health and skin care solutions for the anti-ageing industry, the TRADEMARK® Research Centre will become a beacon of innovation and integrity, setting the standard for others to follow.
IT Reference: Mentre la Società continua a sviluppare soluzioni complete per la salute e la cura dermatologica destinate all’industria dei prodotti anti-età, il TRADEMARK® Research Centre diventerà un punto di riferimento in termini di innovazione e integrità, e definirà gli standard che altri seguiranno.
Isabella’s (100 BLEU): Mentre la Società continua a sviluppare soluzioni complete per la salute e la cura dermatologica destinate all’industria dei prodotti anti-età, il TRADEMARK® Research Centre diventerà un punto di riferimento in termini di innovazione e integrità, e definirà gli standard che altri seguiranno.
Google’s (27.9 BLEU): Poiché la società continua a sviluppare soluzioni complete per la cura della pelle e della pelle per l’industria anti-invecchiamento, TRADEMARK® Business Center per la ricerca antinvecchiamento diventerà un fulcro di innovazione e integrità, impostando lo standard per gli altri a seguire.

EN Source: BRAND_NAME – used in conjunction with TRADEMARK®, this comb-like conductor glides easily through the hair and maintains close contact with the scalp to deliver the benefits of the treatment product to the hair and hair follicles.
IT Reference: BRAND NAME – Utilizzato in combinazione con TRADEMARK®, questo conduttore, simile a un pettine, scivola facilmente fra i capelli e mantiene il contatto con il cuoio capelluto per distribuire i benefici del prodotto ai capelli e ai follicoli.
Isabella’s (100 BLEU): BRAND NAME – Utilizzato in combinazione con TRADEMARK®, questo conduttore, simile a un pettine, scivola facilmente fra i capelli e mantiene il contatto con il cuoio capelluto per distribuire i benefici del prodotto ai capelli e ai follicoli.
Google’s (34.8 BLEU): NAME – utilizzato in combinazione con il trattamento di capelli di TRADEMARK®, questo conduttore pettine scivola facilmente attraverso i capelli e mantiene stretto contatto con il cuoio capelluto per offrire i benefici del prodotto di trattamento ai capelli e ai follicoli dei capelli.

4. Both Isabella’s and Google’s are exact matches (76 of 2,353 = 3%)

EN Source: Green tea and aqueous extracts of green tea are a natural source of polyphenols, which are large organic molecules.
IT Reference: Il tè verde e gli estratti acquosi di tè verde sono una fonte naturale di polifenoli, che sono grandi molecole organiche.
Isabella’s (100 BLEU): Il tè verde e gli estratti acquosi di tè verde sono una fonte naturale di polifenoli, che sono grandi molecole organiche.
Google’s (100 BLEU): Il tè verde e gli estratti acquosi di tè verde sono una fonte naturale di polifenoli, che sono grandi molecole organiche.

EN Source: The number of free radicals in the outermost layer of the skin
IT Reference: Il numero di radicali liberi nello strato più esterno della pelle
Isabella’s (100 BLEU): Il numero di radicali liberi nello strato più esterno della pelle
Google’s (100 BLEU): Il numero di radicali liberi nello strato più esterno della pelle

EN Source: A reputation for excellence and a history of success.
IT Reference: Una reputazione di eccellenza e una storia di successo.
Isabella’s (100 BLEU): Una reputazione di eccellenza e una storia di successo.
Google’s (100 BLEU): Una reputazione di eccellenza e una storia di successo.

 
